
Degree project in Computer Science, Second cycle

Identifying Patterns in User Behavior in a Music Streaming Service: A Cluster Analysis Approach

Fredrik Göthner


DEGREE PROJECT AT CSC, KTH

Identifying Patterns in User Behavior in a Music Streaming Service: A Cluster Analysis Approach

Göthner, Fredrik

E-mail address at KTH: fgothner@kth.se
Degree project in: Computer Science

Supervisor: Herman, Pawel
Examiner: Lansner, Anders
Commissioned by: Spotify AB
Date: 2013-06-03


Identifying Patterns in User Behavior in a Music Streaming Service: A Cluster Analysis Approach

Abstract

Logged user data has become a highly valued asset to many Internet-based services with large user bases. Being able to draw insight from this data is considered a key to gaining competitive advantages for the companies behind the services. This study aims to identify patterns in the behavior of users when interacting with Spotify, a music streaming service, by studying automatically logged data. In the study, we examine several methods to perform such analyses using machine learning techniques. We identify six different types of behavior through k-means cluster analysis, each representing between 51.4% and 0.5% of all user sessions. We also identify five factors partly explaining the differences in behavior between different sessions. These are found through factor analysis and account for 39% of the variance in the data. Finally, we demonstrate how factors and clusters can be translated from numeric representations to linguistic interpretations.

Identifying Patterns in User Behavior for a Music Streaming Service: A Cluster Analysis

Summary

Logged user data has become a highly valued asset for many Internet-based services with large numbers of users. Finding insights in these data is considered a key to gaining competitive advantages for the companies behind the services. This study aims to identify patterns in the behavior of users of Spotify, a music streaming service, by studying logged data. The study examines several methods for performing this type of analysis using machine learning techniques. We identify six different types of behavior through k-means cluster analysis, each representing the behavior in between 51.4% and 0.5% of all sessions. We also identify five factors that explain part of the differences in behavior between users' different sessions. These are found through factor analysis and together explain 39% of the variance in the study's data. Finally, we go through how clusters and factors can be translated from numeric representations to semantic interpretations.


Acknowledgements

I would like to thank my supervisor at KTH, Pawel Herman, for his solid support and engagement in this study. His support has truly contributed to the quality of the study and to the enjoyment of the work.

I would also like to thank my supervisor at Spotify, Henrik Landgren, for his support and for the opportunity to do my degree project for, in my opinion, one of the most interesting Internet services in the world. Henrik has been a constant source of inspiration and feedback.

Finally, I would like to thank the members of the Spotify Analytics Insights team, who have provided constant support with technical issues as well as feedback regarding all conceivable aspects of the study.

Contents

1 Introduction
1.1 Background
1.2 Related Work
1.3 Problem Statement
1.4 Scope of Study
1.5 Potential Approaches
1.5.1 Factor Analysis
1.5.2 Principal Component Analysis
1.5.3 Cluster Analysis
1.5.4 Association Analysis
1.6 Report Outline
2 Method
2.1 Overview & Experimental Setting
2.2 Dataset
2.2.1 Sessions
2.2.2 Data Attributes
2.2.3 Population
2.3 Data Preprocessing
2.3.1 Outlier Removal
2.3.2 Session Filtering
2.3.3 Ceiling Values
2.3.4 Normalization
2.4 Exploratory Factor Analysis
2.5 Cluster Analysis
2.5.1 Algorithms
2.5.2 Model Selection
2.5.3 Model Evaluation
3 Results
3.1 Factor Analysis
3.1.1 Factor Analysis Model Selection
3.1.2 Factor Loadings
3.1.3 Interpretations
3.2 Cluster Analysis
3.2.3 Interpretation
3.2.4 Evaluation of Generalization and Robustness
4 Discussion & Conclusions
4.1 Conclusions
4.1.1 Identifying Behavior Patterns
4.1.2 Applicability of Machine Learning
4.2 Discussion of Results
4.2.1 Model Validation
4.2.2 Distribution of Use Cases
4.2.3 Uncertainty of Classification
4.3 Challenges and Emerging Issues
4.3.1 Data Collection and Preprocessing
4.3.2 Attribute Selection
4.3.3 Factor Analysis
4.3.4 Cluster Analysis
4.4 Future Work
4.4.1 Other Learning Methods
4.4.2 Analyzing Users by Behavior
4.4.3 Expanding Behavior
References
Appendix A
Appendix B

1 Introduction

1.1 Background

Spotify is an Internet-based music streaming service, offering music on several stationary and mobile platforms. The service is currently offered through a free, ad-supported subscription and two premium subscriptions based on monthly payments. In Q1 of 2013, Spotify reported 24 million active users globally, and 6 million paying subscribers.

Behavior related to music consumption changes with technological development. During the last century, technological innovations like vinyl record players, portable tape recorders, CD players and digital distribution of music have changed the processes by which we select music and control playback, as well as the situations in which we consume music (Guberman, 2011). It is also possible to distinguish a shift in priorities amongst music consumers from fidelity to convenience (Guberman, 2011). The emergence of music streaming services like Spotify during the last decade is likely to impact the way users consume and listen to music. At the same time, these services provide great opportunities to study user behavior through logged user data. In this study, behavior refers to how the user acts in order to select music, control playback and discover new music, rather than what music the user is listening to and what he or she is doing while listening.

The study of user behavior through logged data is a well-known research topic, for example within the fields of web search ranking (Agichtein et al., 2006), user-adaptive systems (Frias-Martinez et al., 2005) or telecommunication networks (Zhu et al., 2011).

For a music streaming service, insight into user behavior is useful for many purposes, for example product design and development, optimizing in-service advertising, and tracking growth and market penetration. The ability to segment the user population based on behavior rather than traditional means such as gender, age and socioeconomic status could provide valuable consumer insight. Additionally, the ability to obtain this information through analysis of logged data in addition to user surveys, focus groups or other traditional methods could yield practical advantages.

The size of the data collected by Spotify on a daily basis can be considered very large. Several of the company's data sources have dimensionalities in the hundreds, and the size of the data collected is in the magnitude of terabytes. Being able to automate analytic processes is vital to leverage this amount of information.

The main goal of this study is to conceive a structured method to analyze user behavior through logged data. The method should be possible to apply to other, similar applications. A natural first question in a data-driven study of user behavior is simply whether there are any patterns or stereotypes in the way users behave when they use the service, and if so, what they are. The analytic goal of this study will be focused on identifying potential patterns in user behavior.


1.2 Related Work

Many studies have performed quantitative analysis of people based on their behavior in a certain domain. Behavior in this sense can have various meanings, but is typically connected to analyzing the decision-making processes of individuals or groups of people in certain situations. During the last decade, this type of problem has been approached with various statistical methods and machine learning techniques.

Within the field of behavior related to music consumption, Boer et al. (2012) assessed how the use of music is “underpinned by psychological processes”. The study identified ten different psychological functions of music, found through principal component analysis of survey data and demonstrated through multi-dimensional scaling. The study also examined systematic differences in the importance of these functions across different gender groups or cultures.

In another study, more focused on exploring differences between different groups of people, Chamorro-Premuzic, Swami & Cermakova (2010) performed statistical hypothesis testing regarding the associations between individuals' motives for listening to music, music consumption habits and personal attributes such as demographics, “Big Five” personality traits and emotional intelligence. The study concluded that age and motives for listening better explained differences in music consumption than personality traits.

Several studies have worked with identifying groups of individuals, based on habits or behavior in isolated situations, using various cluster analysis approaches. For instance, Jiang, Ferreira & Gonzales (2012) studied daily activity patterns of urban inhabitants through principal component analysis and cluster analysis. Brandtzaeg et al. (2010) performed a typology study of Internet usage and identified five types of European Internet users using k-means cluster analysis of survey data. Primack et al. (2012) used a two-step cluster analysis approach based on k-means and hierarchical agglomerative cluster analysis to segment U.S. university students based on substance abuse habits.

Data related to behavior has also been used to improve the performance of algorithms working mainly on heuristics and other types of data. Agichtein et al. (2006) showed how user behavior can be used as implicit feedback to improve the performance of web-ranking algorithms based on artificial neural networks.

Generally, machine learning offers several attractive advantages for this type of study: it can be applied to very large data sources, its methods can aid in distinguishing patterns that are too complex to distinguish manually, and it allows us to rely on computers for heavy computations (Marsland, 2009).

1.3 Problem Statement

The goal of this study can be expressed in two separate parts:


1. Identifying the most common patterns in user behavior. These patterns will be referred to as use cases. This objective can be expressed as two sub-objectives:

a. Obtain numeric representations of use cases through quantitative analysis of automatically logged user data.

b. Conceive a method to translate the numeric representations to comprehensible, semantic descriptions.

2. Evaluating the usefulness of machine learning methods for this type of analysis. More specifically, compare the applicability of various, specific techniques, and demonstrate the results of one or several techniques.

Using the taxonomic description suggested by Frias-Martinez et al. (2005) and considering the stated purpose of the study, the task at hand can be described as collaborative modeling (modeling a population of many users) for classification purposes. However, it should be noted that although the ultimate purpose is to be able to classify a user or an instance of usage by the type of behavior the user exhibits, this task is of an unsupervised nature, meaning that we do not have any data conveying the “ground truth” behavior type (Marsland, 2009).

Taking this into account, the following description of the task is suggested:

1. Collaborative modeling. Finding patterns in user behavior.

2. Interpretation. Translating these patterns into use cases by explaining what they represent.

3. Classification. Using the resulting model and interpretation to classify and describe previously unseen data.

The study examines behavior in discrete instances of use, rather than the aggregated behavior of each user. This is further discussed in the Method chapter.

1.4 Scope of Study

As the purpose of the study is to identify patterns in user behavior, it is necessary to decide on a level of detail of the behavior that is studied. The aim of the study is to identify the most common behavior of the users, rather than to identify every type of behavior that a user can exhibit.

Considering the purpose of the study, the data available and certain practical implications, we define a set of criteria for the data attributes that are used to describe user behavior:

• Discriminability. The attribute should provide information about the user's behavior that can help distinguish one type of behavior from another.

• Platform invariance. The attribute should be as robust as possible with respect to what type of platform is used to access the service.

• Robustness over time. The meaning and values of the attribute should be robust to changes in the service and software updates in the clients.

Based on these criteria, the definition of user behavior for the purpose of this study is limited to data concerning the following:

• The actions taken by the user to manipulate the sequence of tracks played,

• The distinct parts of the client, or views, visited by the user when navigating the client, or

• The creation and acquisition of playlists.

In addition to the data related to behavior that is used for modeling, other user-related data is used to assist the interpretation and analysis of the resulting patterns.

1.5 Potential Approaches

The following section covers state-of-the-art machine learning techniques that have been used for similar applications.

1.5.1 Factor Analysis

Factor analysis is a method of determining if covariance among data attributes can be expressed as the result of common, underlying factors, and how those relate to the observed attributes. Factor analysis is commonly used to explore or confirm underlying structures of data sets within a number of different fields, for example psychology (Lecavalier & Norris, 2010; Mavor & Louis, 2010) or athletics (Ertel, 2011).

For the purpose of this study, factor analysis can be used to reveal structure in the data and to better understand tendencies in user behavior.

1.5.2 Principal Component Analysis

Principal component analysis (PCA) refers to projecting the observed data onto a different base, such that the components of the base are orthogonal and represent the directions in which the observed data has the greatest amounts of variance. This representation can be found by extracting the eigenvectors of the covariance matrix of the observed data. The order of size of the eigenvalues of each eigenvector will correspond to the order of size of variance accounted for by that component (Marsland, 2009). Through PCA, a representation of the data can often be obtained that represents most of the variance in the data with fewer dimensions (Marsland, 2009). For this reason, PCA is often used for dimensionality reduction (Lewis-Beck, 1994). Dimensionality reduction does not have an obvious advantage for this study, since each attribute conveys interpretative value. Coercing several attributes means risking losing some of the interpretative value, and although it has proven useful in many other studies (Lewis-Beck, 1994), PCA is not used in this study.
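Although PCA is not part of this study's pipeline, the eigen-decomposition projection described above is straightforward to sketch. The following is a minimal illustration, assuming X is a samples-by-attributes NumPy array:

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto the top principal components: the eigenvectors of
    the covariance matrix, ordered by eigenvalue (variance explained)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return X_centered @ eigenvectors[:, order]
```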


1.5.3 Cluster Analysis

Cluster analysis has been used in several previous studies to identify groups of individuals who are similar in some aspect (Jiang, Ferreira & Gonzales, 2012; Brandtzaeg et al., 2010; Primack et al., 2012). Cluster analysis refers to partitioning the data set; in other words, finding groups, or clusters, of samples in the observed data such that samples within the same cluster are similar in some sense, and samples from different clusters are dissimilar (Marsland, 2009). Once groups like this are found, they can be used to classify new data by assigning new samples to the most appropriate group (Marsland, 2009).

Finding such clusters in a dataset representing separate instances of listening could reveal patterns in user behavior, if we can argue that each cluster corresponds to one behavior type.

There are a number of known algorithms that perform clustering, working in different ways. Several taxonomies categorizing clustering algorithms have been suggested (Tran et al., 2013; Cios et al., 1998). In this study, we categorize algorithms by three main types:

• Objective function-based optimization methods, in which the algorithm tries to find a partition of the samples that optimizes the value of an objective function (Cios et al., 1998).

• Hierarchical methods, where the samples are aggregated or divided into different groups (Cios et al., 1998).

• Density-based methods, where a cluster is defined as a region with high density (Tran et al., 2013).

In this study, several algorithms of each of these categories are considered and described by two characteristics, namely similarity metric and model arguments. Similarity metric refers to how similarity (or dissimilarity) between two samples is determined. Common ways to determine similarity include basing it on a distance norm or probability. Model arguments refer to parameters that need to be specified before using the algorithm, and for which a good setting has to be found for each application. An overview of the algorithm types and a few examples are presented in Table 1.


Table 1. Overview of clustering algorithms.

| Category | Algorithm(s) | Most important model arguments | Similarity metric |
|---|---|---|---|
| Objective function-based optimization | K-means | Number of clusters (k), initial assignments, distance norm | Distance (typically Euclidean norm) |
| Objective function-based optimization | Fuzzy c-means | Number of clusters, initial assignments, distance norm, fuzzification parameter | Distance |
| Objective function-based optimization | K-medoids | Number of clusters (k), distance norm | Distance |
| Objective function-based optimization | Gaussian Mixture Models | Number of clusters, covariance shape | Probability |
| Hierarchical | Agglomerative clustering (e.g. AGNES) | Linkage, distance norm | Distance |
| Density based | DBSCAN | Neighborhood size, core point limit | Distance |

In order to evaluate the usefulness of different clustering algorithms for this study, the following criteria are used:

1. The algorithm should yield a result that can easily be interpreted and used to explain user behavior.

2. The algorithm should have a feasible computation time (<1 h) for large data sets (>100,000 samples and >20 attributes) on conventional personal computers.

3. The algorithm should yield a model that can be used to classify samples that were not used for training.

The algorithms' relations to these criteria are listed in Table 2.

Table 2. Review of six different clustering algorithms.

| Algorithm | Criterion 1 | Criterion 2 | Criterion 3 | References |
|---|---|---|---|---|
| K-means | Yes | Yes | Yes | Marsland (2009) |
| Fuzzy c-means | Medium | Yes | Yes | Meyer et al. (2013) |
| K-medoids | Medium | No | Yes | Maechler et al. (2013) |
| Agglomerative clustering | Medium | No | Yes | Maechler et al. (2013) |
| Gaussian Mixture Models | Medium | Yes | Yes | Marsland (2009) |
| DBSCAN | No | No | No | Tran et al. (2013) |

K-means clustering is a commonly used clustering technique. It outputs a measure of distance for each sample to a number (k) of prototypes (means). Each sample is then assigned to its closest prototype (Zhu et al., 2011; Jiang, Ferreira & Gonzales, 2012). K-means meets the above criteria.

Fuzzy c-means clustering offers an interesting feature in that it assigns a degree of membership between each sample and each cluster, rather than strictly assigning each sample to a cluster. However, it requires specifying a fuzzification value, which specifies the degree to which samples affect far-away cluster centers (Meyer et al., 2013). This parameter needs to be configured to a suitable value in parallel with the number of cluster centers, and its impact on the end result is not obvious before training the model.

K-medoids is computationally heavier than k-means, although implementations exist that can handle large datasets (Maechler et al., 2013). In k-medoids, each cluster is represented by a prototype sample, whereas in k-means each cluster is represented by the mean vector, or centroid. This makes k-medoids less sensitive to outliers (Marsland, 2009). However, this also means that k-medoids has trouble representing binary attributes well, since the prototypes cannot take any intermediate value.

Gaussian Mixture Models (GMM) use a procedure similar to that of k-means to fit a statistical model to the data through Expectation Maximization (Marsland, 2009). GMM offers several mathematical advantages over k-means and k-medoids; it is generally more flexible on the shape and size of the clusters (Marsland, 2009). Consequently, the resulting model is more complex, including means, prior probabilities and covariance matrices of the obtained mixture model (Marsland, 2009). Overall, it meets the criteria and is considered an appropriate candidate for this study.

Based on the above criteria (interpretability, computational feasibility, predictive possibility), k-means and GMM are selected as appropriate methods for cluster analysis.

1.5.4 Association Analysis

Association analysis refers to extracting association rules between items in a transactional data set. These frequent item sets can be used to predict items in future transactions (Frank & Witten, 2000). Association analysis has been used for the purpose of identifying behavioral patterns in previous studies (Ros, Delgado & Amparo Vila, 2009; 2011). However, the resulting association rules are typically used for prediction purposes, and do not offer explanatory value equivalent to the results of the other methods discussed. Association analysis is therefore not covered further in this study.

1.6 Report Outline

The second chapter of this report, Method, covers details about data gathering and preprocessing. It also covers theoretical descriptions of the machine learning techniques used to find patterns in user behavior, as well as descriptions of how these methods are used in this study.

The third chapter, Results, covers the measurements made for model selection, numeric descriptions of the behavior patterns found and an explanation of how these numeric descriptions can be used to interpret user behavior. It also contains a brief analysis of the generalizing ability of the final model and its robustness.

The fourth chapter, Discussion & Conclusions, concludes the findings of the study and discusses the challenges and issues which emerged throughout the work. It also covers suggestions for future work.


2 Method

2.1 Overview & Experimental Setting

The data used for the study is collected through Spotify's logged and aggregated data sources, via a distributed computing and data storage framework called Apache Hadoop. The datasets used for this study are obtained by running scripts conforming to the MapReduce programming paradigm (Dean & Ghemawat, 2004).

The data is preprocessed using the NumPy (NumPy, 2013) and Scikit-Learn (Scikit-Learn, 2013) Python libraries, which are used to plot the initial distributions of each attribute, remove outliers, filter the dataset and normalize the data.

The first analytic step involves performing Exploratory Factor Analysis to obtain an understanding of the underlying structure of the data. This is done using the fa() function from the psych R library.

Lastly, the data is clustered using two different algorithms, k-means clustering and Gaussian Mixture Models. Both implementations are from the Scikit-Learn Python library.

2.2 Dataset

2.2.1 Sessions

To isolate instances of use, we introduce the concept of sessions. A session is defined as a user continuously using the service (by playing songs or browsing the client) with at most 15 minutes of inactivity and using one platform only. Any activity after more than 15 minutes of inactivity or a platform change is assigned to a different session than any previous activity.
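The sessionization code itself is not shown in the thesis, but the definition above translates directly into a splitting rule. A minimal sketch, assuming each user's activity is available as a chronologically sorted list of (timestamp, platform) events:

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=15)

def split_into_sessions(events):
    """Split a user's chronologically sorted (timestamp, platform) events
    into sessions: a new session starts after more than 15 minutes of
    inactivity or when the platform changes."""
    sessions = []
    current = []
    for timestamp, platform in events:
        if current:
            prev_ts, prev_platform = current[-1]
            if timestamp - prev_ts > SESSION_GAP or platform != prev_platform:
                sessions.append(current)
                current = []
        current.append((timestamp, platform))
    if current:
        sessions.append(current)
    return sessions
```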

2.2.2 Data Attributes

For each session, 21 data attributes are collected:

• 3 attributes related to the timing of the session.
• 10 attributes related to how the user browses the client.
• 6 attributes related to music selection.
• 2 attributes related to playlist maintenance.

In this study, a stream is defined as a user playing a track for at least 30 seconds. A playlist is a collection of songs compiled by a user.

The attributes are evaluated based on the criteria stated in the Introduction chapter (discriminability, platform invariance and robustness over time) and through discussions with internal Spotify staff. The following data attributes were considered but rejected due to limited interpretability or poor performance in pilot tests:

• 2 attributes related to the state of the client during playback.
• 2 attributes related to the time of the session.
• 1 attribute related to the platform used.
• 14 attributes related to the source of the streams.

2.2.3 Population

The study is limited to users from Sweden. The user base is sampled to obtain a dataset of feasible size. The dataset used for training of the models consists of 179,748 sessions. The data is collected for the period January 28th – February 24th, 2013.

2.3 Data Preprocessing

2.3.1 Outlier Removal

Many machine learning applications, including k-means clustering, are sensitive to outliers (Marsland, 2009). Outliers are samples taking unusual values for one or more attributes.

In this study, we define an outlier as a sample for which one attribute takes a value that falls outside the 99th percentile of the distribution for that attribute. Any sample defined as an outlier is removed. In other words, each attribute (excluding those covered in section 2.3.3) is limited to the range of values containing 99% of the data. The rationale behind this procedure is that any data outside the 99th percentile does not fit the definition of “the most common behavior” stated in section 1.4, and is thus outside the scope of the study.
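As an illustration, this one-sided rule can be expressed with NumPy, assuming X is the samples-by-attributes matrix (the exclusion of the ceiling-value attributes from section 2.3.3 is omitted for brevity):

```python
import numpy as np

def remove_outliers(X, percentile=99.0):
    """Drop every sample (row) for which any attribute exceeds that
    attribute's 99th-percentile value."""
    limits = np.percentile(X, percentile, axis=0)  # one limit per attribute
    keep = np.all(X <= limits, axis=1)             # keep rows within all limits
    return X[keep]
```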

2.3.2 Session Filtering

Sessions containing fewer than 2 streams or that are shorter than 90 seconds are filtered from the dataset. The rationale is that these sessions are too short to be used to analyze behavior. After filtering and outlier removal, 138,920 samples remain.

2.3.3 Ceiling Values

Some attributes, conveying information about events that only happen in a small subset of samples, have distributions that are very sparse. These attributes have zero values for the vast majority of the samples, but typically have very variable values among the non-zero samples. Instead of removing samples falling outside of the 99th percentile, a threshold for the attribute is defined and values over this threshold are simply truncated to the threshold. This allows us to retain the information conveyed by the attribute while avoiding outliers caused by certain high attribute values.
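A sketch of this truncation; the attribute names and thresholds below are illustrative only, as the thesis does not publish the actual values:

```python
import numpy as np

# Hypothetical ceilings for two sparse attributes (illustrative values).
ceilings = {"playlists_created": 5, "playlists_subscribed": 5}

def apply_ceilings(X, column_names, ceilings):
    """Truncate sparse attributes at a fixed ceiling instead of removing
    the samples that exceed it."""
    X = X.copy()
    for name, limit in ceilings.items():
        j = column_names.index(name)
        X[:, j] = np.minimum(X[:, j], limit)
    return X
```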

2.3.4 Normalization

For the methods used in the study (discussed in sections 2.4 and 2.5), it is important that each attribute is weighted equally, regardless of the magnitude of its values (Marsland, 2009). The dataset is normalized by subtracting the mean value and dividing by the standard deviation of each attribute. This procedure is known as z-score normalization.
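In code, with NumPy or the Scikit-Learn library used in the study, z-score normalization amounts to:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def z_score(X):
    """Subtract each attribute's mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent, using Scikit-Learn:
# X_norm = StandardScaler().fit_transform(X)
```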


2.4 Exploratory Factor Analysis

Exploratory factor analysis is a method for revealing structure in the data, working under the assumption that the covariance among attributes is explained by common underlying factors, or latent variables (Lewis-Beck, 1994).

Figure 1. Factor analysis model.

Figure 1 demonstrates how each observed data attribute, $x_i$, is modeled as a random variable, for which each observation is a weighted sum of observations from a lower number of latent, random variables and one unknown random variable $u_i$:

$$x_i = \sum_{j=1}^{k} b_{ji} F_j + d_i u_i$$

where $b_{ji}$ is the factor loading between latent variable $F_j$ and attribute $x_i$, and $d_i$ is the noise component of $x_i$.

An observation of the data attributes is usually expressed in vector form, x, and the factor loadings in matrix form, B.

In this study, an initial solution to the factor loadings is found by obtaining the minimum residual solution through the Ordinary Least Squares method (Revelle, 2013). The initial solution is rotated to the Varimax rotation, which maximizes the variance in the factor loadings for each factor (Lewis-Beck, 1994). In practice, this means that each factor will typically have loadings with high absolute values for a few attributes, and close to zero for the rest.

The number of latent variables to use is decided by studying the amount of variance explained with different numbers of latent variables. This approach is discussed by Lewis-Beck (1994). In addition, a scree plot is used, in which the eigenvalues of the covariance matrix are used to estimate the number of latent variables in the data. In a scree plot, the eigenvalues are ranked in descending order, and the number of variables is estimated at the first point (if one can be found) in the plot where the descent between two eigenvalues can be considered small (Lewis-Beck, 1994; Ertel, 2005). This point is referred to as an “elbow point”. A generic example is demonstrated in Figure 2.
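The study itself fits the model with fa() from the psych R library. As a rough Python analogue, the scree eigenvalues and a Varimax-rotated factor model can be obtained as sketched below; note that Scikit-Learn's FactorAnalysis is fitted by maximum likelihood rather than the minimum-residual method used in the study, so the loadings are only approximately comparable. X_norm is assumed to be the normalized data matrix from section 2.3.4:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Scree data: eigenvalues of the correlation matrix, in descending order.
corr = np.corrcoef(X_norm, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Five-factor model with Varimax rotation (requires scikit-learn >= 0.24).
fa = FactorAnalysis(n_components=5, rotation="varimax", random_state=0)
fa.fit(X_norm)
loadings = fa.components_.T   # attributes x factors
```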



Figure 2. Demonstration of an “elbow point”. After the 4th value on the x-axis, the metric does not decrease much.

2.5 Cluster Analysis

2.5.1 Algorithms

Two clustering algorithms are used to perform cluster analysis: k-means and Gaussian Mixture Models (GMM). These are selected based on the criteria discussed in the Introduction chapter: interpretability, computational feasibility and predictive possibility.

K-means

K-means is a prototype-based model that seeks to minimize the sum of squared distances from each sample to its respective cluster center (Marsland, 2009). The approach is discussed further below. The algorithm works in three major steps (Zhu et al., 2011):

1. It first initializes a user-specified number of cluster centers (k) by assigning samples to centers (randomly or systematically).

2. It then updates the values of each cluster center to the mean value of its corresponding samples.

3. Next, each sample is reassigned to its closest cluster center.

Steps 2 and 3 are repeated until no sample changes assignment or until a fixed maximum number of iterations is reached.

The result is k cluster centers, with values in the sample space, and a label for each sample used in training. K-means is susceptible to local minima (Marsland, 2009). To mitigate this problem, the algorithm is run 30 times for each k with different initial assignments, and the best-scoring solution is selected for each k.
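This procedure maps directly onto the Scikit-Learn implementation used in the study. A sketch, assuming X_norm is the normalized training data; the choice of six clusters is illustrative, matching the number of behavior types ultimately reported:

```python
from sklearn.cluster import KMeans

# n_init=30 runs the algorithm with 30 different initial assignments and
# keeps the solution with the lowest SSE, mirroring the procedure above.
kmeans = KMeans(n_clusters=6, n_init=30, random_state=0)
labels = kmeans.fit_predict(X_norm)
centers = kmeans.cluster_centers_   # one prototype per cluster
```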



GMM

GMM is a parametric method that estimates the parameters of a user-defined number (M) of normally distributed multivariate random variables, under the assumption that each sample is an observation from one of the variables. The objective is to maximize the log-likelihood of the data under the model. GMM estimates three parameters for each variable model (Marsland, 2009):

• $\mu_m$, which represents the mean of the mth variable,
• $\Sigma_m$, which represents the covariance matrix of the mth variable, and
• $\alpha_m$, which is the weight or prior probability of the mth variable.

The parameters are estimated through Expectation Maximization, which, similarly to k-means, involves three steps (Marsland, 2009):

1. Initialize the random variables, for example by setting $\mu_m$ to the values of randomly selected data points, setting $\Sigma_m$ to the covariance matrix of the entire data set, and setting $\alpha_m$ to 1/M.

2. Calculate the posterior probabilities $Z_{i,m}$ for each sample $x_i$ under each random variable. This is called the E-step:

$$Z_{i,m} = p(m \mid x_i) = \frac{\alpha_m\,\mathcal{N}(x_i \mid \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \alpha_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$$

3. Update $\mu_m$, $\Sigma_m$ and $\alpha_m$ using $Z_{i,m}$ as a weight between each sample and variable. This is called the M-step:

a. $\mu_m = \dfrac{\sum_i Z_{i,m}\, x_i}{\sum_i Z_{i,m}}$

b. $\Sigma_m = \dfrac{\sum_i Z_{i,m}\,(x_i - \mu_m)(x_i - \mu_m)^t}{\sum_i Z_{i,m}}$

c. $\alpha_m = \dfrac{1}{N}\sum_i Z_{i,m}$

Steps 2 and 3 are repeated until no changes in the parameters occur or until a fixed number of iterations is reached. To avoid local maxima, the algorithm is run 5 times for each value of M, and the model with the highest log-likelihood score is selected for that GMM.
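With the Scikit-Learn implementation used in the study, the corresponding sketch is short; M = 6 is illustrative, and X_norm is again the normalized training data:

```python
from sklearn.mixture import GaussianMixture

# n_init=5 restarts EM five times and keeps the model with the highest
# log-likelihood, as described above.
gmm = GaussianMixture(n_components=6, covariance_type="full",
                      n_init=5, random_state=0)
gmm.fit(X_norm)
posteriors = gmm.predict_proba(X_norm)  # the Z_{i,m} from the E-step
avg_loglik = gmm.score(X_norm)          # average log-likelihood per sample
```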

2.5.2 Model Selection

The algorithms above have certain parameters that need to be specified. The most central one is the number of clusters; the problem of selecting the number of clusters is a common one in cluster analysis (Jiang, Ferreira & Gonzales, 2012; Zhu et al., 2011).

To determine a suitable number of clusters, the algorithms are run with different settings, and the resulting models are evaluated first against a set of numeric measures and then against a set of heuristic criteria.


The numeric measures used are the silhouette coefficient for both algorithms, and the sum of squared error and the Bayesian Information Criterion (BIC) for k-means and GMM, respectively. They are described below.

Silhouette Coefficient

The silhouette coefficient is a measure of the proximity of a sample to other samples in the same cluster, compared to the proximity of the sample to other samples in the closest neighboring cluster.

It is defined for one sample $x_i$ as

$$s_i := \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the average distance between the ith sample and other samples in the same cluster and $b_i$ is the lowest average distance to the samples in any other cluster. The silhouette coefficient ranges between -1 and 1. A value close to 1 indicates that the sample is significantly closer to samples within its own cluster than samples in other clusters, a value of 0 indicates that the sample is right in between two clusters, while a negative value means that the point is closer to samples in a different cluster than its own (Maechler, 2013).

To evaluate the models, the average silhouette coefficient for 3,000 randomly selected samples is used.
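A sketch of this evaluation, assuming labels holds the cluster assignments from the fitted model:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Average silhouette coefficient over 3,000 randomly selected samples.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_norm), size=3000, replace=False)
avg_silhouette = silhouette_score(X_norm[idx], labels[idx])
```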

Sum of Squared Error (SSE)

The objective function for k-means is defined as

$$\mathit{SSE} = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2$$

where k is the number of clusters, N the sample size, $S_j$ denotes the jth cluster, $x_i$ denotes the ith sample and $\mu_j$ denotes the cluster mean of cluster $S_j$. A low value of SSE indicates that all samples are close to their respective cluster centers. However, the measure does not account for the number of clusters, meaning that SSE will in theory reach zero if k ≈ N. Thus, the objective should be to find a value where increasing k does not lead to a big decrease in SSE (Kile & Uhlen, 2012). This is done by plotting SSE values for different values of k, and trying to find an elbow point. In this case, an elbow point is a value of k after which SSE only decreases marginally (demonstrated in Figure 2).
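In Scikit-Learn, the SSE of a fitted k-means model is exposed as inertia_, so the elbow curve can be produced by sweeping k; the range below is illustrative:

```python
from sklearn.cluster import KMeans

# SSE for a range of k; an elbow in this curve suggests a suitable k.
sse = {}
for k in range(2, 13):
    model = KMeans(n_clusters=k, n_init=30, random_state=0).fit(X_norm)
    sse[k] = model.inertia_
```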

Bayesian Information Criterion (BIC)

Evaluating statistical models like GMM based only on log-likelihood values for the data under the model often leads to the highest score for the model with the highest dimensionality (Schwarz, 1978), with over-fitting and low generalizing ability as a result. Schwarz (1978) suggested the Bayesian Information Criterion, which is a measure for model evaluation that penalizes models with higher dimensionality. For GMM, dimensionality translates to means, covariance matrices and weights for each variable (3*M). It can be expressed as follows (in this form, it is optimized towards a low value):

$$\mathit{BIC} = -2 \ln L + k \ln N$$

where L is the likelihood value of the data under the model, k is the dimensionality of the model and N is the sample size. The likelihood value is defined as the dataset's summed score of the probability density function of the mixture model.
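A corresponding sweep for GMM; note that Scikit-Learn's bic() computes the penalty from the exact number of free parameters rather than the 3*M approximation above:

```python
from sklearn.mixture import GaussianMixture

# BIC for a range of mixture sizes M; lower is better.
bic = {}
for m in range(2, 13):
    model = GaussianMixture(n_components=m, n_init=5,
                            random_state=0).fit(X_norm)
    bic[m] = model.bic(X_norm)
```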

Heuristic Criteria

In addition to the numeric measures, a number of heuristic criteria are established to assess the validity of a model. These are meant to reflect the overall objectives of the study:

1. Each of the resulting clusters should represent a significant portion of the samples (cardinality) and a significant portion of the users. Any cluster representing only a few sessions or a few users arguably falls outside the scope of identifying the most common behavior.

2. The number of clusters should not be too large to overview and grasp. The results of the study should be meaningful also to persons not involved in the study. They should not require excessive studying to overview.

3. The number of clusters should not be too small to have analytic value. The study should ideally provide a partition that is meaningful and interesting.

4. Each cluster should have “outstanding” (higher or lower than average) values for several attributes. Different types of behavior are expected to affect several attributes. A partition that captures extreme values for one attribute per cluster and shows average values for other attributes is not considered meaningful.

5. Each cluster or prototype should differ from all other clusters in several attributes. A partition in which only one or two attributes differ between clusters is arguably too granular.

2.5.3 Model Evaluation

Generalizing Ability

In order to evaluate the extent to which the resulting models can be applied to different data, the models are used to classify different datasets. Measures similar to those used in the model selection procedure are monitored to assess the generalizing ability of the models.

For k-means, the average SSE per sample and the average silhouette coefficient are measured. For GMM, the average log-likelihood per sample and the average silhouette coefficient are measured.
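A sketch of how these per-sample measures might be computed for one evaluation dataset, assuming kmeans and gmm are the models fitted earlier and X_new holds the new, identically preprocessed sessions:

```python
import numpy as np
from sklearn.metrics import silhouette_score

labels_new = kmeans.predict(X_new)
# Average SSE per sample: squared distance to the assigned cluster center.
sse_per_sample = np.mean(
    np.sum((X_new - kmeans.cluster_centers_[labels_new]) ** 2, axis=1))
# Average log-likelihood per sample under the mixture model.
loglik_per_sample = gmm.score(X_new)
avg_silhouette = silhouette_score(X_new, labels_new)
```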

The models are evaluated using four separate datasets:

1. Different user group. The dataset is from the same time period and country, but with a different set of users.

2. Different user group and different time period. The data is collected during the four weeks prior to that of the training data; December 31st (2012) – January 27th (2013).


3. Different time period and country. The data is collected for the same time period as in 2, but with users from Great Britain instead of Sweden.

4. Random data. The data is randomly generated. Each attribute is uniformly distributed in [0, μ + 2σ], where μ is the mean of the attribute and σ is its standard deviation (a sketch of this generation is shown below).
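The random baseline can be generated in one line with NumPy; X here is assumed to be the (unnormalized) training matrix:

```python
import numpy as np

# Each attribute drawn uniformly from [0, mu + 2*sigma], using the
# training data's per-attribute mean and standard deviation.
rng = np.random.default_rng(0)
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_random = rng.uniform(0, mu + 2 * sigma, size=X.shape)
```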

Robustness

In order to test the robustness of the model, the training data is classified several times with different amounts of noise added. The noise is meant to represent variation that may occur naturally between different observations of the same use case. The same measures are used as in the generalization tests.

The added noise is normally distributed around zero, with varying variance. For each trial, the relative portion of samples assigned the same label as in the original data is measured.
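A sketch of one such trial for the k-means model; the noise levels are illustrative, since the thesis does not list the variances used:

```python
import numpy as np

def label_stability(model, X, noise_std, rng):
    """Fraction of samples keeping their original cluster label after
    adding zero-mean Gaussian noise with the given standard deviation."""
    original = model.predict(X)
    noisy = model.predict(X + rng.normal(0.0, noise_std, size=X.shape))
    return np.mean(original == noisy)

rng = np.random.default_rng(0)
for noise_std in (0.05, 0.1, 0.2, 0.5):   # illustrative noise levels
    print(noise_std, label_stability(kmeans, X_norm, noise_std, rng))
```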


3 Results

3.1 Factor Analysis

This section covers the results of the factor analysis.

3.1.1 Factor Analysis Model Selection

To select an appropriate number of latent variables, several factor analysis models are trained and evaluated with a scree plot and by studying the amount of variance explained.

The scree plot in Figure 3 is derived from the Pearson correlation coefficient matrix of the normalized data set of 21 attributes and 138,920 samples. It shows the eigenvalues of the correlation coefficient matrix in descending order. The sizes of the eigenvalues indicate that 4 or 5 components would be a suitable choice for the number of latent variables. Figure 4 also shows that the amount of variance explained only increases marginally with 6 or more components. Based on the scree plot and the increase in explained variance between 4 and 5 components, the factor analysis model with 5 components is selected.

Figure 3. Scree plot. Eigenvalues of the correlation coefficient matrix of the data in decreasing order.
