
Licentiate Thesis in Engineering

Online Dimensionality Reduction

KAITO ARIU

KTH Royal Institute of Technology
Stockholm, Sweden 2021

Academic Dissertation which, with due permission of the KTH Royal Institute of Technology, is submitted for public defence for the Degree of Licentiate of Engineering on Wednesday the 10th of 2021, at 10:00 a.m. in Harry Nyquist, Malvinas väg 10, Stockholm.


© Kaito Ariu, Jungseul Ok, Alexandre Proutiere, Se-Young Yun (Paper 2)
ISBN 978-91-7873-780-2

TRITA-EECS-AVL-2021:12


Abstract

In this thesis, we investigate online dimensionality reduction methods, where the algorithms learn by sequentially acquiring data. We focus on two specific algorithm design problems in (i) recommender systems and (ii) heterogeneous clustering from binary user feedback. (i) For recommender systems, we consider a system consisting of m users and n items. In each round, a user, selected uniformly at random, arrives to the system and requests a recommendation. The algorithm observes the user id and recommends an item from the item set. A notable restriction here is that the same item cannot be recommended to the same user more than once, a constraint referred to as the no-repetition constraint. We study this problem as a variant of the multi-armed bandit problem and analyze regret under various structures pertaining to items and users. We derive fundamental limits of regret and devise algorithms that achieve these limits order-wise. The analysis explicitly highlights the importance of each component of regret: for example, we can distinguish the regret due to the no-repetition constraint, that generated when learning the statistics of users' preferences for items, and that generated when learning the low-dimensional structure of users and items. (ii) In the clustering with binary feedback problem, the objective is to classify items solely based on limited user feedback. More precisely, users are just asked simple questions with binary answers. A notable difficulty stems from the heterogeneity in the difficulty of classifying the various items (some items require more feedback to be classified than others). For this problem, we derive fundamental limits of the cluster recovery rates for both offline and online algorithms. For the offline setting, we devise a simple algorithm that achieves the limit order-wise. For the online setting, we propose an algorithm inspired by the lower bound.
For both problems, we evaluate the proposed algorithms by inspecting their theoretical guarantees and through numerical experiments performed on synthetic and real-world datasets.


Sammanfattning

Denna avhandling studerar algoritmer för datareduktion som lär sig från sekventiellt inhämtad data. Vi fokuserar speciellt på frågeställningar som uppkommer i utvecklingen av rekommendationssystem och i identifieringen av heterogena grupper av användare från data. För rekommendationssystem betraktar vi ett system med m användare och n objekt. I varje runda observerar algoritmen en slumpmässigt vald användare och rekommenderar ett objekt. En viktig begränsning i vår problemformulering är att rekommendationer inte får upprepas: samma objekt kan inte rekommenderas till samma användare mer än en gång. Vi betraktar problemet som en variant av det flerarmade banditproblemet och analyserar systemprestanda i termer av "ånger" under olika antaganden. Vi härleder fundamentala gränser för ånger och föreslår algoritmer som är (ordningsmässigt) optimala. En intressant komponent av vår analys är att vi lyckas karaktärisera hur vart och ett av våra antaganden påverkar systemprestandan. T.ex. kan vi kvantifiera prestandaförlusten i ånger på grund av att rekommendationer inte får upprepas, på grund av att vi måste lära oss statistiken för vilka objekt en användare är intresserad av, och för kostnaden för att lära sig den lågdimensionella rymden för användare och objekt. För problemet med hur man bäst identifierar grupper av användare härleder vi fundamentala gränser för hur snabbt det går att identifiera kluster. Vi gör detta för algoritmer som har samtidig tillgång till all data och för algoritmer som måste lära sig genom sekventiell inhämtning av data. Med tillgång till all data kan vår algoritm uppnå den optimala prestandan ordningsmässigt. När data måste inhämtas sekventiellt föreslår vi en algoritm som är inspirerad av den nedre gränsen på vad som kan uppnås.

För båda problemen utvärderar vi de föreslagna algoritmerna numeriskt och jämför den praktiska prestandan med de teoretiska garantierna.


Acknowledgement

First and foremost, I would like to express my gratitude to my advisors, Prof. Alexandre Proutière and Prof. Mikael Johansson. I cannot thank you enough for your great support and for giving me freedom in pursuing my academic work. In particular, I wish to express my deepest gratitude to Prof. Alexandre Proutière, my main supervisor. His attitude toward conducting great research, his inspirational and knowledgeable approach, and his willingness to engage in active discussions, no matter how busy he was, all contributed positively to my life as a researcher. Without his assistance, I would not have known how to proceed and would not have been able to do interesting research. I am further indebted to Prof. Mikael Johansson for his help in writing the abstract in Swedish.

I would also like to thank Prof. Se-Young Yun, Prof. Jungseul Ok, and Narae Ryu for their research collaboration. The ingenious inspiration of Prof. Se-Young Yun helped me a lot and gave me motivation. The ability of Prof. Jungseul Ok to defy the odds has helped me many times. The knowledge and theoretical rigor of Narae Ryu also helped me on many occasions.

I appreciate the round-the-clock discussions and congenial time spent with our group members, Yassir Jedra, Simon Lindståhl, Alessio Russo, Damianos Tranos, Filippo Vannella, and Po-An Wang. It was a pleasure to work with such a talented and passionate group of people. At KTH, I was fortunate to meet great people, in particular, Rodrigo González, Takuya Iwaki, Yuchao Li, Othmane Mazhar, Xiaoqiang Ren, Ruo-Chun Tzeng, Yu Xing, Ingvar Ziemann, and too many others to list here.

The work done during my studies has been financially supported by the Nakajima Foundation Scholarship. The author conducted part of the research using the facilities of the Masason Foundation.

Finally, I want to show my gratitude to my family for supporting me in whatever decisions I made and whatever interests I had, throughout my entire life.


Contents

Acknowledgement
Contents

1 Introduction
1.1 Thesis outline
1.2 Offline Dimensionality Reduction and Clustering
1.3 Online Dimensionality Reduction: Problem formulation
1.4 Related work
1.5 Contribution

2 Publications

3 Discussion

References

A Paper I
A.1 Introduction
A.2 Related Work
A.3 Models and Preliminaries
A.4 Regret Lower Bounds
A.5 Algorithms
A.6 Conclusion
A.7 Table of Notations
A.8 Algorithms and experiments
A.9 Preliminaries: Properties of the user arrival process
A.10 Justifying the regret definitions
A.11 Fundamental limits for Model A
    Proof of Lemma 7
A.12 Fundamental limits for Model B
A.13 Fundamental limits for Model C
    Examples
    Proof
A.14 Performance guarantees of ECT
A.15 Performance guarantees of ET
A.16 Performance guarantees of EC-UCS
A.17 Performance guarantees of ECB and Item Clustering

B Paper II
B.1 Introduction
    Feedback model
    Main contributions
B.2 Related work
B.3 Information-theoretical limits
    Uniform selection strategy
    Adaptive selection strategy
    Proof of Theorem 10
    Proof of Theorem 11
B.4 Algorithms
    Uniform selection strategy
    Adaptive selection strategy
    Proof of Theorem 12
B.5 Numerical experiments: Synthetic data
B.6 Numerical experiments: Real-world data
B.7 Conclusion
B.8 Table of notations
B.9 Proof of Proposition 1
B.10 Proof of Lemma 22
B.11 Proof of Corollary 1
B.12 Proof of Lemma 23


Chapter 1

Introduction

Today, the development of technologies has made it possible to store and access large amounts of data. In order to make business decisions from data, it is necessary to extract meaningful or relatively influential information from it. This kind of problem has been addressed as a dimensionality reduction problem in the machine learning community, most often in the offline learning setting, i.e., the acquisition of data (mostly i.i.d. samples from a distribution) is clearly separated from the learning process. However, in applications such as online advertisement and recommendation systems, online learning, which involves learning by sequentially acquiring high-dimensional data, is becoming increasingly important.

In this thesis, we study online dimensionality reduction methods. We provide fundamental limits for some online dimensionality reduction problems and propose algorithms achieving these limits. In Paper I, the regret minimization problem in recommender systems is studied. We consider three latent structures over large numbers of users and items: (i) clustered items and statistically identical users, (ii) unclustered items and statistically identical users, and (iii) clustered items and clustered users. In Paper II, the problem of large-scale labeling tasks from binary user feedback is studied. Items are clustered using the binary answers to a finite set of questions. Importantly, we account for the fact that some items are inherently more difficult to classify than others (the hardness of classifying each item is captured through a parameter that differs from one item to another).

1.1 Thesis outline

This thesis is based on two papers. Chapter 1 is the introduction and covers the background, related work, and the main contributions of this thesis. In Chapter 2, the list of publications is presented. In Chapter 3, we discuss the results and propose possible future research directions. Appendix A contains the first paper. In Appendix B, the second paper is presented.


1.2 Offline Dimensionality Reduction and Clustering

Dimensionality reduction (see, e.g., [58]) refers to techniques designed to map data in a high-dimensional space into a low-dimensional space. Such a reduction has advantages: for example, it makes the data easier for people to interpret.

One of the most well-known methods is Principal Component Analysis (PCA). PCA can be understood as the process of finding the best possible linear projection of the data. If the original data lies in R^n and we map it into R^d (d < n), the mapping is linear and represented by a matrix W ∈ R^{d×n}. PCA finds both the reduction matrix W and the recovery matrix U simultaneously by solving a least-squares problem on the projected data (see, for example, [53], Chapter 23).
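To make this least-squares view concrete, here is a minimal sketch (not from the thesis; the function name and synthetic data are illustrative) that obtains W from the singular value decomposition of the centered data; for PCA, the recovery matrix is simply the transpose of W:

```python
import numpy as np

def pca_reduce(X, d):
    """Project the rows of X (N x n) onto the top-d principal directions."""
    mu = X.mean(axis=0)
    Xc = X - mu                                 # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d]                                  # reduction matrix, d x n
    Z = Xc @ W.T                                # low-dimensional representation
    X_hat = Z @ W + mu                          # least-squares reconstruction
    return W, Z, X_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))  # data near a 2-dim subspace
X += 0.01 * rng.normal(size=X.shape)                       # small noise
W, Z, X_hat = pca_reduce(X, d=2)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Since the data is generated near a two-dimensional subspace, the rank-2 reconstruction error is small, which is exactly the least-squares optimality of PCA for centered data.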

Another well-known method is compressed sensing. It selects elements based on the assumption that the data is sparse, that is, only d out of the n elements are useful. The LASSO [54] and its regularized variants are the classical methods for solving such problems. Theoretical guarantees have also been derived [8].
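To make the regularization idea concrete, the following sketch (an illustration, not an algorithm from the thesis) minimizes the LASSO objective 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft-thresholding (ISTA); the problem sizes and parameters are synthetic:

```python
import numpy as np

def lasso_ista(A, y, lam, n_iters=500):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        g = A.T @ (A @ x - y)                # gradient of the smooth part
        z = x - g / L                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(1)
n, p, d = 50, 100, 5
A = rng.normal(size=(n, p)) / np.sqrt(n)     # sensing matrix, ~unit-norm columns
x_true = np.zeros(p); x_true[:d] = 1.0       # d-sparse signal
y = A @ x_true + 0.01 * rng.normal(size=n)
x_hat = lasso_ista(A, y, lam=0.02)
```

Despite observing only n = 50 linear measurements of a 100-dimensional vector, the l1 penalty recovers the 5-sparse signal up to a small bias.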

The last method we introduce is clustering. Clustering categorizes similar data points into a known (or sometimes unknown) number of groups. Though there is a wide range of applications, such as image segmentation, web services, and bioinformatics, the objective behind clustering is often vague. In many cases, it is impossible to know the true category of the data, or it may not exist. Furthermore, there is no clear standard definition of "similarity", which can be designated or interpreted arbitrarily. Numerous clustering methods have been proposed, including K-means, K-medians, Gaussian mixture models (GMMs), spectral clustering, and nonparametric methods.
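As one representative method, here is a minimal sketch (illustrative only; the function name and toy data are assumptions) of Lloyd's algorithm for K-means, alternating nearest-centroid assignment with centroid updates:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from the data
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster went empty.
        new_centers = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
             for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3, 0.5, size=(50, 2)),   # two well-separated groups
               rng.normal(3, 0.5, size=(50, 2))])
labels, centers = kmeans(X, k=2)
```

On this well-separated toy data the two groups are recovered exactly; in general K-means only finds a local optimum of its objective, which reflects the vagueness of the clustering goal discussed above.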

All the aforementioned methods are offline: we first collect the data (most often i.i.d. data points) and then perform dimensionality reduction. The process of dimensionality reduction is independent of the process of data acquisition.

1.3 Online Dimensionality Reduction: Problem formulation

In this thesis, we study online dimensionality reduction methods. In each round or step, the decision maker decides on the next data point to collect based on the already collected data. This freedom in collecting data sequentially may enhance the performance compared to offline methods.

First, we theoretically study the performance of recommender systems in an online setting. The system consists of large numbers of items and users. In each round, a user, chosen uniformly at random, requests a recommendation. The decision-maker observes the user id and selects an item to be presented to the user. An important constraint is that an item cannot be recommended twice to the same user. After the recommendation, the user immediately rates the item as +1 if she likes it and 0 otherwise. We consider the problem under various structural assumptions: (i) first when only the items are clustered, (ii) then when users are clustered and items' success rates are drawn i.i.d. (over items) from a distribution ζ on [0, 1], and (iii) finally when both items and users are clustered. We define the reward as the expected number of +1 ratings over the given number of rounds. The objective is to maximize the reward.
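The interaction protocol can be sketched as follows, with a naive greedy rule standing in for the optimal algorithms developed in Paper I; all parameters are illustrative, and users are statistically identical as in structures (i)-(ii):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, T = 10, 20, 100                  # users, items, rounds (toy sizes)
p = rng.uniform(size=n)                # unknown prob. that a user likes item j
counts, wins = np.zeros(n), np.zeros(n)
recommended = [set() for _ in range(m)]    # items already shown to each user
reward = 0
for t in range(T):
    u = rng.integers(m)                    # a uniformly random user arrives
    allowed = [j for j in range(n) if j not in recommended[u]]
    if not allowed:                        # user has exhausted the catalogue
        continue
    means = (wins + 1) / (counts + 2)      # smoothed empirical success rates
    j = max(allowed, key=lambda i: means[i])   # greedy, respects no-repetition
    r = int(rng.random() < p[j])           # binary (+1/0) feedback
    recommended[u].add(j)
    counts[j] += 1; wins[j] += r; reward += r
```

The `recommended` sets enforce the no-repetition constraint; an Oracle knowing p would always pick the best remaining item for the arriving user, and regret measures the reward gap to that Oracle.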

Second, we investigate large-scale labeling tasks from binary user feedback. Specifically, we study the problem of clustering a large set of items from noisy binary answers provided by users. Items are initially assigned to unknown non-overlapping clusters. Each time a user arrives at the system, the algorithm presents to that user a set of items together with a question with binary answers, selected from a finite question set. Each item has a different clustering hardness. The objective is to cluster the items with a minimal error rate. We consider two cases: (i) a uniform sampling strategy (offline algorithm), where the number of users assigned to each (question, item) pair is constant over the pairs, and (ii) an adaptive sampling strategy (online algorithm), where (question, item) pairs are selected sequentially.
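A toy instance of the uniform strategy can be sketched as follows (purely illustrative: the answer probabilities, the hardness model, and the final spectral split are assumptions of this sketch, not the algorithms of Paper II):

```python
import numpy as np

rng = np.random.default_rng(5)
n_items, n_questions, T = 60, 3, 40       # T users per (question, item) pair
true_cluster = np.arange(n_items) % 2     # two hidden clusters
p = np.array([[0.9, 0.2, 0.6],            # "yes" probabilities, cluster 0
              [0.1, 0.8, 0.5]])           # "yes" probabilities, cluster 1
h = rng.uniform(0.7, 1.0, size=n_items)   # per-item hardness (1 = easy)
q = h[:, None] * p[true_cluster] + (1 - h[:, None]) * 0.5  # harder -> closer to 1/2
answers = rng.binomial(T, q) / T          # empirical "yes" rate per (item, question)

# Cluster items by the sign of the top singular direction of the centered rates.
Xc = answers - answers.mean(axis=0)
u = np.linalg.svd(Xc, full_matrices=False)[0][:, 0]
est = (u > 0).astype(int)
err = min(np.mean(est != true_cluster), np.mean(est == true_cluster))
```

The hardness parameter h pulls an item's answer statistics toward the uninformative rate 1/2, so harder items need more feedback; this is the heterogeneity that drives the error-rate analysis.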

1.4 Related work

Clustering in the Stochastic Block Model

Stochastic Block Models (SBMs) [1] are random graph models with non-overlapping clusters. When probing two nodes, the probability that an edge appears is determined solely by the cluster indexes of the two nodes. In the offline setting, each pair of nodes is probed, and the data is simply a realization of a random graph. This offline setting has received a lot of attention: researchers have studied the problems of cluster recovery and detection, e.g., [1, 2, 15, 61, 63]. Most of this work provides guarantees holding with high probability (not in expectation).
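For concreteness, a two-cluster SBM can be sampled as follows (a generative sketch, not a recovery algorithm; the function name and parameters are illustrative):

```python
import numpy as np

def sample_sbm(sizes, p, q, seed=0):
    """Sample an undirected SBM adjacency matrix: edge probability p inside
    a cluster and q across clusters, no self-loops."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    N = labels.size
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    upper = np.triu(rng.random((N, N)) < probs, k=1)   # sample upper triangle
    return (upper | upper.T).astype(int), labels       # symmetrize

A, labels = sample_sbm([30, 30], p=0.8, q=0.1)
```

Cluster recovery is the inverse problem: given only A, estimate `labels`; the offline setting observes the whole matrix, while adaptive schemes choose which node pairs to probe.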

There are very few online analyses of clustering in SBMs. [62] analyzed the adaptive clustering problem in simplified SBMs where the edge probability can take only two values: p (intra-cluster) and q (inter-cluster). The paper also proposed an adaptive algorithm with a vanishing proportion of misclassified nodes (with high probability) under some conditions. [64] further generalized the analysis to cover exact cluster recovery (a vanishing number of misclassified nodes). The guarantees hold with high probability (not in expectation).

Degree-Corrected Block Models (DCBMs) [31] are variants of SBMs that allow for per-node heterogeneity. In [16], offline cluster recovery algorithms for DCBMs are studied. The analysis holds with high probability, and no adaptive algorithms are studied.

Low rank matrix completion

Low-rank matrix completion [10, 32, 49] aims to recover a low-rank matrix from partially observed entries (possibly with noise). For the offline setting, numerous methods have been proposed, such as convex-optimization-based methods [10, 49] and spectral methods [32, 33]. Scenarios with online data collection have not been extensively studied for this problem; see, however, [13, 27, 36, 52]. Most of the analyses provide guarantees holding with high probability.
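A minimal spectral sketch in the spirit of such methods (an assumption-laden illustration, not the exact estimators of [32, 33]): zero-fill the unobserved entries, rescale by the inverse sampling rate so the observed matrix is unbiased for the truth, and truncate the SVD at the target rank:

```python
import numpy as np

def spectral_complete(M_obs, mask, rank):
    """Rank-r estimate from partial observations (zero-filled in M_obs)."""
    rate = mask.mean()                            # fraction of observed entries
    U, S, Vt = np.linalg.svd(M_obs / rate, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]   # truncated SVD estimate

rng = np.random.default_rng(4)
M = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 100))  # rank-2 ground truth
mask = rng.random(M.shape) < 0.7                           # observe ~70% of entries
M_hat = spectral_complete(M * mask, mask, rank=2)
rel_err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
```

Rescaling by the sampling rate makes E[M_obs / rate] = M entrywise, so truncating to the true rank filters out most of the sampling noise.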

1.5 Contribution

The main contributions of this thesis are described as follows.

For the first problem (recommender systems), we derive instance-specific regret lower bounds for each of the models. The derivations are based on novel online optimization methods, which could be key to other online stochastic optimization problems. Furthermore, for each of the models, we present optimal algorithms whose regret scalings match the lower bounds. The algorithms make use of clustering, bandit, and hypothesis-testing techniques.

For the second problem (heterogeneous clustering from binary feedback), we first derive error-rate lower bounds for both uniform and adaptive selection strategies. For the uniform strategy, we devise a K-means-based algorithm whose performance is proved to match the lower bound. For the adaptive strategy, we propose an adaptive algorithm that explicitly exploits the derived error-rate lower bound formula.


Chapter 2

Publications

This thesis is based on two papers.

Paper I

K. Ariu, N. Ryu, S. Yun, and A. Proutiere, "Regret in Online Recommendation Systems," in 34th Conference on Neural Information Processing Systems (NeurIPS), 2020.

Paper II

K. Ariu, J. Ok, A. Proutiere, S. Yun, “Optimal Clustering from Noisy Binary Feedback,” arXiv preprint arXiv:1910.06002, 2020.

The following publications are not included in this thesis.

Journal articles

K. Ariu, T. Inamori, R. Funase, and S. Nakasuka, "A dimensionless relative trajectory estimation algorithm for autonomous imaging of a small astronomical body in a close distance flyby," Advances in Space Research, Volume 58, Issue 4, pp. 528-540, 2016.

S. Ikari, T. Inamori, T. Ito, K. Ariu, K. Oguri, M. Fujimoto, S. Sakai, Y. Kawakatsu and R. Funase, "Attitude Determination and Control System for the Micro Spacecraft PROCYON," Transactions of the Japan Society for Aeronautical and Space Sciences, 2018.

B. Sarli, K. Ariu and H. Yano, "PROCYON's Probability Analysis of Accidental Impact on Mars," Advances in Space Research, Volume 57, Issue 9, pp. 2003-2012, 2016.


Conference Presentations

P. Wang, A. Proutiere, K. Ariu, Y. Jedra, A. Russo, "Optimal Algorithms for Multiplayer Multi-Armed Bandits," in The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Preprints

K. Ariu, K. Abe, A. Proutiere, “Thresholded LASSO Bandit,” arXiv preprint arXiv:2010.11994, 2020

M. Kato, K. Abe, K. Ariu, S. Yasui, “A Practical Guide of Off-Policy Evaluation for Bandit Policy Evaluation,” arXiv preprint arXiv:2010.12470, 2020


Chapter 3

Discussion

Some challenging issues remain as future work and are explained below.

For the first problem, our structures are limited to SBM structures and continuous structures. It may also be possible to extend our results to the case of low-rank structures. For the continuous structures, we assume that the users are homogeneous; it could be possible to assume clustered structures among users, as in the setting of [5]. Furthermore, the algorithms could be made more practical by turning them into anytime algorithms. However, the analysis of such algorithms may be fraught with difficulties.

For the second problem, the adaptive algorithm has not been analyzed. This is because of the hardness of estimating the parameter h_i: it is difficult to derive an upper bound unless we assume strong symmetry for the parameters p_{kℓ}. Even for the uniform strategy, improving the clusters by hypothesis testing would improve the constant coefficient of the error rate; even in that case, the poor accuracy of the estimation of h_i still makes it difficult to apply the likelihood-ratio test as usual. It may be possible for a large company to apply the methods to a CAPTCHA-like system and demonstrate clustering in live experiments.

In this thesis, we have discussed recommendation systems and heterogeneous clustering problems as instances of online dimensionality reduction. However, the scope of application of online dimensionality reduction methods is not limited to these two fields. It would be interesting to generalize the methods to other applications such as online advertisement, reinforcement learning, and auctions.


References

[1] Emmanuel Abbe. Community detection and stochastic block models. Foundations and Trends in Communications and Information Theory, 14(1-2):1–162, 2018.

[2] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2015.

[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, pages 235–256, 2002.

[4] Thomas Bonald and Alexandre Proutiere. Two-target algorithms for infinite-armed bandits with Bernoulli rewards. In Advances in Neural Information Processing Systems 26, pages 2184–2192, 2013.

[5] Guy Bresler, George H Chen, and Devavrat Shah. A latent source model for online collaborative filtering. In Advances in Neural Information Processing Systems, pages 3347–3355, 2014.

[6] Guy Bresler and Mina Karzand. Regret bounds and regimes of optimality for user-user and item-item collaborative filtering. In 2018 Information Theory and Applications Workshop (ITA), pages 1–37, 2018.

[7] Sebastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, pages 122–134, 2013.

[8] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

[9] Loc Bui, Ramesh Johari, and Shie Mannor. Clustered bandits. arXiv preprint arXiv:1206.4169, 2012.

[10] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.


[11] Richard Combes, Chong Jiang, and Rayadurgam Srikant. Bandits with budgets: Regret lower bounds and optimal algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 245–257, 2015.

[12] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.

[13] Jicong Fan and Madeleine Udell. Online high rank matrix completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8690–8698, 2019.

[14] Chao Gao, Yu Lu, and Dengyong Zhou. Exact exponent in optimal rates for crowdsourcing. In International Conference on Machine Learning, pages 603–611, 2016.

[15] Chao Gao, Zongming Ma, Anderson Y Zhang, and Harrison H Zhou. Achieving optimal misclassification proportion in stochastic block models. The Journal of Machine Learning Research, 18(1):1980–2024, 2017.

[16] Chao Gao, Zongming Ma, Anderson Y Zhang, Harrison H Zhou, et al. Community detection in degree-corrected block models. The Annals of Statistics, 46(5):2153–2185, 2018.

[17] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 2018.

[18] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 1253–1262, 2017.

[19] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 757–765, 2014.

[20] Ryan G Gomes, Peter Welinder, Andreas Krause, and Pietro Perona. Crowdclustering. In Advances in Neural Information Processing Systems, pages 558–566, 2011.

[21] Aditya Gopalan, Odalric-Ambrym Maillard, and Mohammadi Zaki. Low-rank bandits with latent mixtures. arXiv preprint arXiv:1609.01508, 2016.

[22] Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996, 2019.


[23] Reinhard Heckel and Kannan Ramchandran. The sample complexity of online one-class collaborative filtering. In Proceedings of the 34th International Conference on Machine Learning, pages 1452–1460, 2017.

[24] Chien-Ju Ho, Shahin Jabbari, and Jennifer Wortman Vaughan. Adaptive task assignment for crowdsourced classification. In International Conference on Machine Learning, pages 534–542, 2013.

[25] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[26] Matthieu Jedor, Vianney Perchet, and Jonathan Louedec. Categorized bandits. In Advances in Neural Information Processing Systems, pages 14399–14409, 2019.

[27] Chi Jin, Sham M Kakade, and Praneeth Netrapalli. Provable efficient online matrix completion via non-convex stochastic gradient descent. arXiv preprint arXiv:1605.08370, 2016.

[28] Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In Proceedings of the 36th International Conference on Machine Learning, pages 3163–3172, 2019.

[29] O. Kallenberg. Random Measures, Theory and Applications. Probability Theory and Stochastic Modelling. Springer International Publishing, 2017.

[30] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953–1961, 2011.

[31] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107, 2011.

[32] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010.

[33] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. The Journal of Machine Learning Research, 11:2057–2078, 2010.

[34] Ashish Khetan and Sewoong Oh. Achieving budget-optimality with adaptive schemes in crowdsourcing. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4844–4852. Curran Associates, Inc., 2016.

[35] Robert D. Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. In Proceedings of the 21st Annual Conference on Learning Theory, pages 425–436, 2008.


[36] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling. arXiv preprint arXiv:1304.4672, 2013.

[37] Joon Kwon, Vianney Perchet, and Claire Vernade. Sparse stochastic bandits. In Proceedings of the 30th Conference on Learning Theory, pages 1269–1270, 2017.

[38] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[39] Shuai Li, Wei Chen, and Kwong-Sak Leung. Improved algorithm on online clustering of bandits. arXiv preprint arXiv:1902.09162, 2019.

[40] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548, 2016.

[41] Shuai Li and Shengyu Zhang. Online clustering of contextual cascading bandits. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[42] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 136–144, 2014.

[43] Jérémie Mary, Romaric Gaudel, and Philippe Preux. Bandits and recommender systems. In International Workshop on Machine Learning, Optimization and Big Data, pages 325–336, 2015.

[44] Jonas W Mueller, Vasilis Syrgkanis, and Matt Taddy. Low-rank bandit methods for high-dimensional dynamic pricing. In Advances in Neural Information Processing Systems, pages 15442–15452, 2019.

[45] Jungseul Ok, Sewoong Oh, Jinwoo Shin, and Yung Yi. Optimality of belief propagation for crowdsourced classification. In International Conference on Machine Learning, pages 535–544, 2016.

[46] Jungseul Ok, Se-Young Yun, Alexandre Proutiere, and Rami Mochaourab. Collaborative clustering: Sample complexity and efficient algorithms. In International Conference on Algorithmic Learning Theory, pages 288–329, 2017.

[47] Martin Raab and Angelika Steger. Balls into Bins - A Simple and Tight Analysis. In Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science, pages 159–170, 1998.

[48] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.


[49] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12), 2011.

[50] Paul Resnick and Hal R Varian. Recommender systems. Communications of the ACM, 40(3):56–58, 1997.

[51] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. arXiv preprint arXiv:1803.02855, 2018.

[52] Yun Se-Young, Marc Lelarge, and Alexandre Proutière. Fast and memory optimal low-rank matrix approximation. In NIPS, 2015.

[53] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[54] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[55] Long Tran-Thanh, Matteo Venanzi, Alex Rogers, and Nicholas R Jennings. Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 901–908. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[56] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[57] A.B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008.

[58] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. Dimensionality reduction: a comparative review. J Mach Learn Res, 10(66-71):13, 2009.

[59] Ramya Korlakai Vinayak and Babak Hassibi. Crowdsourced clustering: Querying edges vs triangles. In Advances in Neural Information Processing Systems, pages 1316–1324, 2016.

[60] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, pages 2424–2432, 2010.

[61] Se-Young Yun and Alexandre Proutiere. Accurate community detection in the stochastic block model via spectral algorithms. arXiv preprint arXiv:1412.7335, 2014.

[62] Se-Young Yun and Alexandre Proutiere. Community detection via random and adaptive sampling. In Conference on Learning Theory, pages 138–175, 2014.


[63] Se-Young Yun and Alexandre Proutiere. Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems, pages 965–973, 2016.

[64] Se-Young Yun and Alexandre Proutière. Optimal sampling and clustering in the stochastic block model. In Advances in Neural Information Processing Systems, pages 13422–13430, 2019.

[65] Se-Young Yun, Alexandre Proutiere, et al. Streaming, memory limited algorithms for community detection. In Advances in Neural Information Processing Systems, pages 3167–3175, 2014.

[66] Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1260–1268, 2014.

[67] Dengyong Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pages 2195–2203, 2012.


Appendix A

Paper I

Regret in Online Recommendation Systems

Kaito Ariu, Narae Ryu, Se-Young Yun, Alexandre Proutière

Abstract

This paper proposes a theoretical analysis of recommendation systems in an online setting, where items are sequentially recommended to users over time. In each round, a user, randomly picked from a population of m users, requests a recommendation. The decision-maker observes the user and selects an item from a catalogue of n items. Importantly, an item cannot be recommended twice to the same user. The probabilities that a user likes each item are unknown. The performance of the recommendation algorithm is captured through its regret, considering as a reference an Oracle algorithm aware of these probabilities. We investigate various structural assumptions on these probabilities: we derive for each structure regret lower bounds, and devise algorithms achieving these limits. Interestingly, our analysis reveals the relative weights of the different components of regret: the component due to the constraint of not presenting the same item twice to the same user, that due to learning the chances users like items, and finally that arising when learning the underlying structure.

A.1 Introduction

Recommendation systems [50] have over the last two decades triggered important research efforts (see, e.g., [6, 18, 19, 40] for recent works and references therein), mainly focused towards the design and analysis of algorithms with improved efficiency. These algorithms are, to some extent, all based on the principle of collaborative filtering: similar items should yield similar user responses, and similar users have similar probabilities of liking or disliking a given item. In turn, efficient


recommendation algorithms need to learn and exploit the underlying structure tying the responses of the users to the various items together.

Most recommendation systems operate in an online setting, where items are sequentially recommended to users over time. We investigate recommendation algorithms in this setting. More precisely, we consider a system of n items and m users, where m ≥ n (as is typically the case in practice). In each round, the algorithm needs to recommend an item to a known user, picked randomly among the m users. The response of the user is noisy: the user likes the recommended item with an a priori unknown probability depending on the (item, user) pair. In practice, it does not make sense to recommend an item twice to the same user (why should we recommend an item to a user who already considered or even bought the item?). We restrict our attention to algorithms that do not recommend an item twice to the same user, a constraint referred to as the no-repetition constraint. The objective is to devise algorithms maximizing the expected number of successful recommendations over a time horizon of T rounds. We investigate different system structures. Specifically, we first consider the case of clustered items and statistically identical users – the probability that a user likes an item depends on the item cluster only. We then study the case of unclustered items and statistically identical users – the probability that a user likes an item depends on the item only. The third investigated structure exhibits clustered items and clustered users – the probability that a user likes an item depends on the item and user clusters only. In all cases, the structure (e.g., the clusters) is initially unknown and has to be learnt to some extent. This paper aims at answering the question: How can the structure be optimally learnt and exploited?

To this aim, we study the regret of online recommendation algorithms, defined as the difference between their expected number of successful recommendations and that obtained under an Oracle algorithm aware of the structure and of the success rates of each (item, user) pair. We are interested in regimes where n, m, and T grow large simultaneously, with T = o(mn) (see §A.3 for details). For the aforementioned structures, we first derive non-asymptotic and problem-specific regret lower bounds satisfied by any algorithm.

(i) For clustered items and statistically identical users, as T (and hence m) grows large, the minimal regret scales as K max{ log(m) / log(m log(m)/T), T/(∆m) }, where K is the number of item clusters, and ∆ denotes the minimum difference between the success rates of items from the optimal and sub-optimal clusters.

(ii) For unclustered items and statistically identical users, the minimal satisficing regret¹ scales as max{ log(m) / log(m log(m)/T), T/(mε) }, where ε denotes the threshold defining the satisficing regret (recommending an item in the top ε percent of the items is assumed to generate no regret).

(iii) For clustered items and users, the minimal regret scales as m/∆ or as (m/∆) log(T/m), depending on the values of the success rate probabilities.

We also devise algorithms that provably achieve these limits (up to logarithmic factors), and whose regret exhibits the right scaling in ∆ or ε. We illustrate the performance of our algorithms through experiments presented in the appendix.

Our analysis reveals the relative weights of the different components of regret. For example, we can explicitly identify the regret induced by the no-repetition constraint (this constraint imposes us to select unrecommended items and induces an important learning price). We may also characterize the regret generated by the fact that the item or user clusters are initially unknown. Specifically, fully exploiting the item clusters induces a regret scaling as KT/(∆m), whereas exploiting user clusters has a much higher regret cost, scaling at least as m.

In our setting, deriving regret lower bounds and devising optimal algorithms cannot be tackled using existing techniques from the abundant bandit literature. This is mainly due to the no-repetition constraint, to the hidden structure, and to the specificities introduced by the random arrivals of users. Getting tight lower bounds is particularly challenging because of the non-asymptotic nature of the problem (items cannot be recommended infinitely often, and new items have to be assessed continuously). To derive these bounds, we introduce novel techniques that could be useful in other online optimization problems. The design and analysis of efficient algorithms also present many challenges. Indeed, such algorithms must include both clustering and bandit techniques, that should be jointly tuned.

Due to space constraints, we present the pseudo-codes of our algorithms, all proofs, numerical experiments, as well as some insightful discussions in the appendix.

A.2 Related Work

The design of recommendation systems has been framed into structured bandit problems in the past. Most of the works there consider a linear reward structure (in the spirit of the matrix factorization approach), see e.g. [19], [18], [39], [41], [40], [21]. These papers ignore the no-repetition constraint (a usual assumption there is that when a user arrives, a set of fresh items can be recommended). In [43], the authors try to include this constraint but do not present any analytical result. Furthermore, notice that the structures we impose in our models are different from those considered in the low-rank matrix factorization approach.

Our work also relates to the literature on clustered bandits. Again, the no-repetition constraint is not modeled. In addition, most often, only the user clusters [9], [42] or only the item clusters [37], [26] are considered. Low-rank bandits extend clustered bandits by modeling the (item, user) success rates as a low-rank matrix, see [28], [44], still without accounting for the no-repetition constraint, and without a complete analysis (no precise regret lower bounds).


One may think of other types of bandits to model recommendation systems. However, none of them captures the essential features of our problem. For example, if we think of contextual bandits (see, e.g., [22] and references therein), where the context would be the user, it is hard to model the fact that when the same context appears several times, the set of available arms (here items) changes depending on the previous arms selected for this context. Budgeted and sleeping bandits [11], [35] model scenarios where the set of available arms changes over time, but in our problem, this set changes in a very specific way not covered by these papers. In addition, studies on budgeted and sleeping bandits do not account for any structure.

The closest related work can be found in [5] and [23]. There, the authors explicitly model the no-repetition constraint but consider user clusters only, and do not provide regret lower bounds. [6] extends the analysis to account for item clusters as well, but studies a model where users in the same cluster deterministically give the same answers to items in the same cluster.

A.3 Models and Preliminaries

We consider a system consisting of a set I = [n] := {1, . . . , n} of items and a set U = [m] of users. In each round, a user, chosen uniformly at random from U, needs a recommendation. The decision-maker observes the user id and selects an item to be presented to the user. Importantly, an item cannot be recommended twice to a user. The user immediately rates the recommended item +1 if she likes it, or 0 otherwise. This rating is observed by the decision-maker, which helps subsequent item selections.

Formally, in round t, the user u_t ∼ unif(U) requires a recommendation. If item i is recommended, the user u_t = u likes the item with probability ρ_iu. We introduce the binary r.v. X_iu to indicate whether the user likes the item, X_iu ∼ Ber(ρ_iu). Let π denote a sequential item selection strategy or algorithm. Under π, the item i^π_t is presented to the t-th user. The choice of i^π_t depends on the past observations and on the identity of the t-th user, namely, i^π_t is F^π_{t−1}-measurable, with F^π_{t−1} = σ(u_t, (u_s, i^π_s, X_{i^π_s u_s}), s ≤ t − 1) (σ(Z) denotes the σ-algebra generated by the r.v. Z). Denote by Π the set of such possible algorithms. The reward of an algorithm π is defined as the expected number of positive ratings received over T rounds: E[Σ_{t=1}^T ρ_{i^π_t u_t}]. We aim at devising an algorithm with maximum reward.

We are mostly interested in scenarios where (m, n, T ) grow large under the constraints (i) m ≥ n (this is typically the case in recommendation systems), (ii) T = o(mn), and (iii) log(m) = o(n). Condition (ii) complies with the no-repetition constraint and allows some freedom in the item selection process. (iii) is w.l.o.g. as explained in [5], and is just imposed to simplify our definitions of regret (refer to Appendix A.10 for a detailed discussion).
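The interaction protocol above can be sketched in a few lines of Python; the environment and policy interfaces below are our own illustration (not part of the paper), but they make the no-repetition constraint and the expected-reward accounting concrete:

```python
import random

def simulate(select, rho, T, seed=0):
    """Run T rounds: a user u_t ~ unif(U) arrives, the policy `select`
    picks an item it has never shown to that user (the no-repetition
    constraint), and a Bernoulli(rho[i][u]) rating is observed.
    Returns the cumulated expected reward sum_t rho_{i_t, u_t}."""
    rng = random.Random(seed)
    m = len(rho[0])
    used = [set() for _ in range(m)]      # items already shown per user
    reward = 0.0
    for _ in range(T):
        u = rng.randrange(m)              # u_t ~ unif(U)
        i = select(u, used[u])            # F_{t-1}-measurable choice
        assert i not in used[u], "no-repetition constraint violated"
        used[u].add(i)
        _x = int(rng.random() < rho[i][u])  # rating X_iu ~ Ber(rho_iu)
        reward += rho[i][u]
    return reward

# Toy instance: 6 items, 2 users; always pick the best unused item.
rho = [[0.9, 0.9], [0.8, 0.8], [0.5, 0.5], [0.4, 0.4], [0.2, 0.2], [0.1, 0.1]]
r = simulate(lambda u, used: min(set(range(6)) - used), rho, T=6)
```

An Oracle policy in this toy environment would select items in the same decreasing-rate order; the regret definitions below compare an arbitrary π against such Oracles.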



Problem structures and regrets

We investigate three types of systems, depending on the structural assumptions made on the success rates ρ = (ρ_iu)_{i∈I, u∈U}.

Model A. Clustered items and statistically identical users. In this case, ρ_iu depends on the item i only. Items are classified into K clusters I_1, . . . , I_K. When the algorithm recommends an item i for the first time, i is assigned to cluster I_k with probability α_k, independently of the cluster assignments of the other items. When i ∈ I_k, then ρ_i = p_k. We assume that both α = (α_k)_{k∈[K]} and p = (p_k)_{k∈[K]} do not depend on (n, m, T), but are initially unknown. W.l.o.g., assume that p_1 > p_2 ≥ p_3 ≥ . . . ≥ p_K. To define the regret of an algorithm π ∈ Π, we compare its reward to that of an Oracle algorithm aware of the item clusters and of the parameters p. The latter would mostly recommend items from cluster I_1. Due to the randomness in the user arrivals and the cluster sizes, recommending items not in I_1 may be necessary. However, we define regret as if recommending items from I_1 were always possible. Using our assumptions T = o(mn) and log(m) = o(n), we can show that the difference between our regret and the true regret (accounting for the possible need to recommend items outside I_1) is always negligible. Refer to Appendix A.10 for a formal justification. In summary, the regret of π ∈ Π is defined as: R^π(T) = T p_1 − Σ_{t=1}^T E[ Σ_{k=1}^K 1{i^π_t ∈ I_k} p_k ].

Model B. Unclustered items and statistically identical users. Again here, ρ_iu depends on the item i only. When a new item i is recommended for the first time, its success rate ρ_i is drawn according to some distribution ζ over

the first time, its success rate ρi is drawn according to some distribution ζ over

[0, 1], independently of the success rates of the other items. ζ is arbitrary and initially unknown, but for simplicity assumed to be absolutely continuous w.r.t. Lebesgue measure. To represent ζ, we also use its inverse distribution function: for any x ∈ [0, 1], µx := inf{γ ∈ [0, 1] : P[ρi ≤ γ] ≥ x}. We say that an item i

is within the ε-best items if ρi≥ µ1−ε. We adopt the following notion of regret:

for a given ε > 0, Rπ ε(T ) =

PT

t=1Emax{0, µ1−ε− ρiπt} . Hence, we assume that

recommending items within the ε-best items does not generate any regret. We also assume, as in Model A, that an Oracle policy can always recommend such items (refer to Appendix A.10). This notion of satisficing regret [51] has been used in the bandit literature to study problems with a very large number of arms (we have a large number of items). For such problems, identifying the best arm is very unlikely, and relaxing the regret definition is a necessity. Satisficing regret is all the more relevant in our problem that even if one would be able to identify the best item, we cannot recommend it (play it) more than m times (due to the no-repetition constraint), and we are actually forced to recommend sub-optimal items. A similar notion of regret is used in [5] to study recommendation systems in a setting similar to our Model B.
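As a concrete illustration of this accounting (our own toy example, taking ζ uniform on [0, 1] so that µ_x = x), the satisficing regret of a sequence of recommendations is:

```python
def mu(x):
    """Quantile function mu_x of zeta; for zeta = Uniform[0, 1], mu_x = x."""
    return x

def satisficing_regret(recommended_rhos, eps):
    """R^pi_eps(T) = sum_t max{0, mu_{1-eps} - rho_{i_t}}."""
    thr = mu(1.0 - eps)
    return sum(max(0.0, thr - r) for r in recommended_rhos)

# Recommending within the eps-best items generates no regret:
assert satisficing_regret([0.95, 0.99], eps=0.1) == 0.0
# An item below the threshold mu_{0.9} = 0.9 costs the gap:
assert abs(satisficing_regret([0.8], eps=0.1) - 0.1) < 1e-12
```

The uniform ζ is only for readability; any absolutely continuous ζ fits by replacing the `mu` quantile function.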

Model C. Clustered items and clustered users. We consider the case where both items and users are clustered. Specifically, users are classified into L clusters U_1, . . . , U_L, and when a user arrives to the system for the first time, she is assigned to cluster U_ℓ with probability β_ℓ, independently of the other users. There are K item clusters I_1, . . . , I_K. When the algorithm recommends an item i for the first time, it is assigned to cluster I_k with probability α_k, as in Model A. Now ρ_iu = p_kℓ when i ∈ I_k and u ∈ U_ℓ. Again, we assume that p = (p_kℓ)_{k,ℓ}, α = (α_k)_{k∈[K]} and β = (β_ℓ)_{ℓ∈[L]} do not depend on (n, m, T). For any ℓ, let k*_ℓ = arg max_k p_kℓ be the best item cluster for users in U_ℓ. We assume that k*_ℓ is unique. In this scenario, we assume that an Oracle algorithm, aware of the item and user clusters and of the parameters p, would only recommend items from cluster I_{k*_ℓ} to a user in U_ℓ (refer to Appendix A.10). The regret of an algorithm π ∈ Π is hence defined as: R^π(T) = T Σ_ℓ β_ℓ p_{k*_ℓ ℓ} − Σ_{t=1}^T E[ Σ_{k,ℓ} 1{u_t ∈ U_ℓ, i^π_t ∈ I_k} p_kℓ ].

Preliminaries – User arrival process

The user arrival process is out of the decision maker's control and strongly impacts the performance of the recommendation algorithms. To analyze the regret of our algorithms, we will leverage the following results. Let N_u(T) denote the number of requests of user u up to round T. From the literature on "Balls and Bins" processes, see e.g. [47], we know that if n̄ := E[max_{u∈U} N_u(T)], then

n̄ = (log(m) / log(m log(m)/T)) (1 + o(1))   if T = o(m log(m)),
n̄ = log(m) (d_c + o(1))                     if T = c m log(m),
n̄ = (T/m) (1 + o(1))                        if T = ω(m log(m)),

where d_c is a constant depending on c only. We also establish the following concentration result controlling the tail of the distribution of N_u(T) (refer to Appendix A.9):

Lemma 1. Define N̄ = 4 log(m) / log(m log(m)/T + e) + e² T/m. Then, ∀u ∈ U, E[max{0, N_u(T) − N̄}] ≤ 1/((e−1)m).

The quantities n̄ and N̄ play an important role in our regret analysis.
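The scaling of n̄ is easy to probe numerically with a quick balls-and-bins simulation (our own experiment, not from the paper; the sample sizes are illustrative):

```python
import math
import random

def max_load(m, T, seed):
    """Throw T arrivals (balls) into m users (bins) uniformly at random
    and return the max load max_u N_u(T)."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(T):
        counts[rng.randrange(m)] += 1
    return max(counts)

# Regime T = o(m log m): the prediction is log(m) / log(m log(m) / T).
m, T = 10_000, 1_000
est = sum(max_load(m, T, s) for s in range(20)) / 20
pred = math.log(m) / math.log(m * math.log(m) / T)
# Both quantities are a handful of units here and grow together as
# T approaches m log(m).
```

For T = ω(m log m) the same experiment instead tracks T/m, matching the third regime of the display above.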

A.4 Regret Lower Bounds

In this section, we derive regret lower bounds for the three envisioned structures. Interestingly, we are able to quantify the minimal regret induced by the specific features of the problem: (i) the no-repetition constraint, (ii) the unknown success probabilities, (iii) the unknown item clusters, (iv) the unknown user clusters. The proofs of the lower bounds are presented in Appendices A.11-A.12-A.13.



Clustered items and statistically identical users

We denote by ∆_k = p_1 − p_k the gap between the success rates of items from the best cluster and of items from cluster I_k, and introduce the function: φ(k, m, p) = (1 − e^{−m γ(p_1, p_k)}) / (8(1 − e^{−γ(p_1, p_k)})), where γ(p, q) = KL(p, q) + KL(q, p) and KL(p, q) = p log(p/q) + (1 − p) log((1−p)/(1−q)). Using the fact that KL(p, q) ≤ (p − q)²/(q(1 − q)), we can easily show that as m grows large, φ(k, m, p) scales as η/(16∆²_k) when ∆_k is small, where η := min_k p_k(1 − p_k).
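For reference, the quantities γ and φ can be written out directly (function names and the numerical sanity checks are ours):

```python
import math

def kl(p, q):
    """Bernoulli KL divergence KL(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def gamma(p, q):
    """Symmetrised divergence gamma(p, q) = KL(p, q) + KL(q, p)."""
    return kl(p, q) + kl(q, p)

def phi(p_k, p_1, m):
    """phi(k, m, p) = (1 - e^{-m gamma(p_1, p_k)}) / (8 (1 - e^{-gamma(p_1, p_k)}))."""
    g = gamma(p_1, p_k)
    return (1 - math.exp(-m * g)) / (8 * (1 - math.exp(-g)))

# Sanity checks: KL is nonnegative, zero iff p = q, and for a small gap
# and large m, phi blows up like 1 / Delta_k^2.
assert kl(0.5, 0.5) == 0.0 and kl(0.3, 0.6) > 0
assert phi(0.49, 0.5, 10**6) > 100   # gap 0.01 => large phi
```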

We derive problem-specific regret lower bounds, and as in the classical stochastic bandit literature, we introduce the notion of uniformly good algorithm. π is uniformly good if its expected regret R^π(T) is O(max{T/m, log(m)/log(m log(m)/T + e)}) for all possible system parameters (p, α) when T, m, n grow large with T = o(nm) and m ≥ n. As shown in the next section, uniformly good algorithms exist.

Theorem 1. Let π ∈ Π be an arbitrary algorithm. The regret of π satisfies: for all T ≥ 1 such that m ≥ c/∆²_2 (for some constant c large enough), R^π(T) ≥ max{R_nr(T), R_ic(T)}, where R_nr(T) and R_ic(T), the regrets due to the no-repetition constraint and to the unknown item clusters, respectively, are defined by R_nr(T) := n̄ Σ_{k≠1} α_k ∆_k and R_ic(T) := (T/m) Σ_{k≠1} α_k φ(k, m, p) ∆_k.

Assume that π is uniformly good; then we have²:

R^π(T) ≳ R_sp(T) := log(T) Σ_{k≠1} ∆_k / (2 KL(p_k, p_1)),

where R_sp(T) refers to the regret due to the unknown success probabilities.

From the above theorem, analyzing the way R_nr(T), R_ic(T), and R_sp(T) scale, we can deduce that:

(i) When T = o(m log(m)), the regret arises mainly due to either the no-repetition constraint or the need to learn the success probabilities, and it scales at least as max{ log(m)/log(m log(m)/T), log(T) }.

(ii) When T = c m log(m), the three components of the regret lower bound scale in the same way, and the regret scales at least as log(T).

(iii) When T = ω(m log(m)), the regret arises mainly due to either the no-repetition constraint or the need to learn the item clusters, and it scales at least as T/m.

Unclustered items and statistically identical users

In this scenario, the regret is induced by the no-repetition constraint, and by the fact that the success rate of an item when it is first selected and the distribution ζ are unknown. These two sources of regret lead to the terms R_nr(T) and R_i(T), respectively, in our regret lower bound.

²We write a ≳ b if lim inf a/b ≥ 1.


Theorem 2. Assume that the density of ζ satisfies, for some C > 0, ζ(µ) ≤ C for all µ ∈ [0, 1]. Let π ∈ Π be an arbitrary algorithm. Then its satisficing regret satisfies: for all T ≥ 1 such that m ≥ c/ε² (for some constant c ≥ 1 large enough), R^π_ε(T) ≥ max{R_nr(T), R_i(T)}, where R_nr(T) := n̄ ∫_0^{µ_{1−ε}} (µ_{1−ε} − µ) ζ(µ) dµ and R_i(T) := (T/m) · ((1−ε)²/(2C)) · (1 − εC/(1−ε))² / (min{1, (1+C)ε} + 1/m).

Clustered items and clustered users

To state regret lower bounds in this scenario, we introduce the following notations. For any ℓ ∈ [L], let ∆_kℓ = p_{k*_ℓ ℓ} − p_kℓ be the gap between the success rates of items from the best cluster I_{k*_ℓ} and of items from cluster I_k. We also denote by R_ℓ = {r ∈ [L] : k*_ℓ ≠ k*_r}. We further introduce the functions:

φ(k, ℓ, m, p) = (1 − e^{−m γ(p_{k*_ℓ ℓ}, p_kℓ)}) / (8(1 − e^{−γ(p_{k*_ℓ ℓ}, p_kℓ)}))  and  ψ(ℓ, k, T, m, p) = (1 − e^{−(T/m) γ(p_{k*_ℓ ℓ}, p_kℓ)}) / (8(1 − e^{−γ(p_{k*_ℓ ℓ}, p_kℓ)})).

Compared to the case of clustered items and statistically identical users, this scenario requires the algorithm to actually learn the user clusters. To discuss how this induces additional regret, assume that the success probabilities p are known. Define L_⊥ = {(ℓ, ℓ′) ∈ [L]² : p_{k*_ℓ ℓ} ≠ p_{k*_{ℓ′} ℓ′}}, the set of pairs of user clusters whose best item clusters differ. If L_⊥ ≠ ∅, then there isn't a single optimal item cluster for all users, and when a user u first arrives, we need to learn her cluster. If p is known, this classification generates at least a constant regret (per user) – corresponding to the term R_uc(T) in the theorem below. For specific values of p, we show that this classification can even generate a regret scaling as log(T/m) (per user). This happens when L_⊥(ℓ) = {ℓ′ ≠ ℓ : k*_ℓ ≠ k*_{ℓ′}, p_{k*_ℓ ℓ} = p_{k*_ℓ ℓ′}} is not empty – refer to Appendix A.13 for examples. In this case, we cannot distinguish users from U_ℓ and U_{ℓ′} by just presenting items from I_{k*_ℓ} (the greedy choice for users in U_ℓ). The corresponding regret term in the theorem below is R′_uc(T).

To formalize this last regret component, we define uniformly good algorithms as follows. An algorithm is uniformly good if for any user u, R^π_u(N) = o(N^α) as N grows large for all α > 0, where R^π_u(N) denotes the accumulated expected regret under π for user u when the latter has arrived N times.

Theorem 3. Let π ∈ Π be an arbitrary algorithm. Then its regret satisfies: for all T ≥ 2m such that m ≥ c/min_{k,ℓ} ∆²_kℓ (for some constant c large enough), R^π(T) ≥ max{R_nr(T), R_ic(T), R_uc(T)}, where R_nr(T), R_ic(T), and R_uc(T) are regrets due to the no-repetition constraint, to the unknown item clusters, and to the unknown user clusters, respectively, defined by:

R_nr(T) := n̄ Σ_ℓ β_ℓ Σ_{k≠k*_ℓ} α_k ∆_kℓ,
R_ic(T) := (T/m) Σ_ℓ β_ℓ Σ_{k≠k*_ℓ} α_k φ(k, ℓ, m, p) ∆_kℓ,
R_uc(T) := (m/K) Σ_{ℓ∈[L]} β_ℓ Σ_{k∈R_ℓ} ∆_kℓ ψ(ℓ, k, T, m, p).



In addition, when T = ω(m), if π is uniformly good,

R^π(T) ≳ R′_uc(T) := c(β, p) m log(T/m), where c(β, p) = inf_{n∈F} Σ_ℓ β_ℓ Σ_{k≠k*_ℓ} ∆_kℓ n_kℓ with

F = {n ≥ 0 : ∀ℓ, ∀ℓ′ ∈ L_⊥(ℓ), Σ_{k≠k*_ℓ} KL(p_kℓ, p_kℓ′) n_kℓ ≥ 1}.

Note that we do not include in the lower bound the term R_sp(T) corresponding to the regret induced by the lack of knowledge of the success probabilities. Indeed, it would scale as log(T), and this regret would be negligible compared to R_uc(T) (remember that T = o(m²)), should L_⊥ ≠ ∅. Under the latter condition, the main component of regret is, for any time horizon, due to the unknown user clusters. When L_⊥ ≠ ∅, the regret scales at least as m if for all ℓ, L_⊥(ℓ) = ∅, and as m log(T/m) otherwise.

A.5 Algorithms

This section presents algorithms for our three structures and an analysis of their regret. The detailed pseudo-codes of our algorithms and numerical experiments are presented in Appendix A.8. The proofs of the regret upper bounds are postponed to Appendices A.14-A.15-A.16.

Clustered items and statistically identical users

To achieve a regret scaling as in our lower bounds, the structure needs to be exploited. Even without accounting for the no-repetition constraint, the KL-UCB algorithm would, for example, yield a regret scaling as (n/∆₂) log(T). Now we could first sample T/m items and run KL-UCB on this restricted set of items – this would yield a regret scaling as (T/(m∆₂)) log(T), without accounting for the no-repetition constraint. Our proposed algorithm, Explore-Cluster-and-Test (ECT), achieves a better regret scaling and complies with the no-repetition constraint. Refer to Appendix A.8 for numerical experiments illustrating the superiority of ECT.

The Explore-Cluster-and-Test algorithm. ECT proceeds in the following phases:

(a) Exploration phase. This first phase consists in gathering samples for a subset S of randomly selected items so that the success probabilities and the clusters of these items are learnt accurately. Specifically, we pick |S| = ⌊(log T)²⌋ items, and for each of these items, gather roughly log(T) samples.

(b) Clustering phase. We leverage the information gathered in the exploration phase to derive an estimate ρ̂_i of the success probability ρ_i for each item i ∈ S; the items in S are then grouped based on these estimates using the K-means algorithm. In turn, we extract from this phase accurate estimates p̂_1 and p̂_2 of the success rates of items in the two best item clusters, and a set V ⊂ S of items believed to be in the best cluster: V := {i ∈ S : ρ̂_i > (p̂_1 + p̂_2)/2}.

(c) Test phase. The test phase corresponds to an exploitation phase. Whenever this is possible (i.e., the no-repetition constraint is not violated), items from V are recommended. When an item outside V has to be selected due to the no-repetition constraint, we randomly sample and recommend an item outside V. This item is appended to V. To ensure that any item i in the (evolving) set V is from the best cluster with high confidence, we keep updating its empirical success rate ρ̂_i, and periodically test whether ρ̂_i is close enough to p̂_1. If this is not the case, i is removed from V.

In all phases, ECT is designed to comply with the no-repetition constraint: for example, in the exploration phase, when a user arrives, if we cannot recommend an item from S due to the constraint, we randomly select an item not violating it. In the regret analysis of ECT, we upper bound the regret generated in rounds where a random item selection is imposed. Observe that ECT does not depend on any parameter (except for the choice of the number of items initially explored in the first phase).
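A compressed sketch of phases (a) and (b) follows; all constants, the deterministic feedback, and the crude two-point split standing in for K-means are our own simplifications of ECT, not its full pseudo-code:

```python
import math
import random

def ect_sketch(recommend, n_items, T, seed=0):
    """Toy rendition of ECT's phases (a)-(b).  `recommend(i)` plays item
    i once and returns a 0/1 rating.  All thresholds are illustrative."""
    rng = random.Random(seed)
    # (a) Exploration: roughly log(T)^2 items, log(T) samples each.
    size = min(n_items, int(math.log(T) ** 2))
    S = rng.sample(range(n_items), size)
    budget = max(1, int(math.log(T)))
    rho_hat = {i: sum(recommend(i) for _ in range(budget)) / budget for i in S}
    # (b) Clustering: a crude two-point split standing in for K-means,
    # giving estimates of the two best cluster rates and the set V.
    p1_hat, p2_hat = max(rho_hat.values()), min(rho_hat.values())
    V = {i for i, r in rho_hat.items() if r > (p1_hat + p2_hat) / 2}
    # (c) The test phase would then exploit V, periodically re-testing
    # each member's empirical rate against p1_hat (omitted here).
    return V, p1_hat, p2_hat

# Items 0..4 are "good" (always liked), the rest always disliked.
V, p1_hat, p2_hat = ect_sketch(lambda i: int(i < 5), n_items=50, T=1000)
```

With noiseless feedback, V recovers exactly the good items present in S; the real algorithm's periodic re-tests in phase (c) handle the noisy Bernoulli case.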

Theorem 4. We have: R^ECT(T) = O( (2N̄/α_1) Σ_{k=2}^K α_k(p_1−p_k)/(p_1−p_2)² + (log T)³ ).

The regret lower bound of Theorem 1 states that for any algorithm π, R^π(T) = Ω(N̄), and if π is uniformly good, R^π(T) = Ω(max{N̄, log(T)}). Thus, in view of the above theorem, ECT is order-optimal if N̄ = Ω((log T)³), and order-optimal up to a (log T)² factor otherwise. Furthermore, note that when R_ic(T) = Ω(T/(∆₂²m)) is the leading term in our regret lower bound, the regret of ECT also has the right scaling in ∆₂: R^ECT(T) = O(T/(∆₂²m)).

Unclustered items and statistically identical users

When items are not clustered, we propose ET (Explore-and-Test), an algorithm that consists of two phases: an exploration phase that aims at estimating the threshold level µ_{1−ε}, and a test phase where we apply to each item a sequential test to determine whether the item is above the threshold.

The Explore-and-Test algorithm. The ET algorithm proceeds as follows.

(a) Exploration phase. In this phase, we randomly select a set S consisting of ⌊(8/ε²) log T⌋ items and recommend each selected item to ⌊(4/ε²) log T⌋ users. For each item i ∈ S, we compute its empirical success rate ρ̂_i. We then estimate µ_{1−ε/2} by μ̂_{1−ε/2}, defined so that: (ε/2)|S| = |{i ∈ S : ρ̂_i ≥ μ̂_{1−ε/2}}|. We also initialize the set V of candidate items to exploit as V = {i ∈ S : ρ̂_i ≥ μ̂_{1−ε/2}}.

(b) Test phase. In this phase, we recommend items in V, and update the set V. Specifically, when a user u arrives, we recommend the item i ∈ V that has been recommended the least recently among the items that would not break the no-repetition constraint. If no such item exists in V, we randomly recommend an item outside V and add it to V.

Now, to ensure that items in V are above the threshold, we perform the following sequential test, which is reminiscent of the sequential tests used in optimal algorithms for infinite bandit problems [4]. For each item, the test is applied when the item has been recommended for the ⌊2^ℓ log log_2(2em²)⌋-th time, for any positive integer ℓ. For the ℓ-th test, we denote by ρ̄^(ℓ) the real number such that KL(ρ̄^(ℓ), μ̂_{1−ε/2}) = 2^{−ℓ}. If ρ̄^(ℓ) ≤ μ̂_{1−ε/2}, the item is removed from V.

Theorem 5. Assume that the density of ζ satisfies ζ(µ) ≤ C for all µ ∈ [0, 1]. For any ε ≥ C√(2π/log T), we have: R^ET_ε(T) = O( N̄ log(1/ε) log log(m)/ε + (log T)²/ε² ).

In view of Theorem 2, the regret of any algorithm scales at least as Ω(N̄/ε). Hence, the above theorem states that ET is order-optimal at least when N̄ = Ω((log T)²).

Clustered items and clustered users

The main challenge in devising an algorithm in this setting stems from the fact that we do not control the user arrival process. In turn, clustering users with low regret is delicate. We present Explore-Cluster with Upper Confidence Sets (EC-UCS), an algorithm that essentially exhibits the same regret scaling as our lower bound. The idea behind the design of EC-UCS is as follows. We estimate the success rates (p_kℓ)_{k,ℓ} using small subsets of items and users. Then, based on these estimates, each user is optimistically associated with an Upper Confidence Set (UCS), a set of clusters the user likely belongs to. The UCS of a user then shrinks as the number of requests made by this user increases (just as the UCB index of an arm in bandit problems gets closer to its average reward). The design of our estimation procedure and of the various UCSs is made so as to get an order-optimal algorithm. In what follows, we assume that m² ≥ T(log T)³ and T ≥ m log(T).

The Explore-Cluster-with-Upper-Confidence-Sets algorithm. (a) Exploration and item clustering phase. The algorithm starts by collecting data to infer the item clusters. It randomly selects a set S consisting of min{n, ⌊m/(log T)²⌋} items. For the 10m first user arrivals, it recommends items from S uniformly at random. These 10m recommendations and the corresponding user responses are recorded in the dataset D. From the dataset D, the item clusters are extracted using a spectral algorithm (see Algorithm 4 in the appendix). This algorithm is taken from [65], and considers the indirect edges between items created by users. Specifically, when a user appears more than twice in D, she creates an indirect edge between the items recommended to her for which she provided the same answer (1 or 0). Items with indirect edges are more likely to belong to the same cluster. The output of this phase is a partition of S into item clusters Î_1, . . . , Î_K. We can show that, with an exploration budget of 10m, w.h.p. at least m/2 indirect edges are created, and that in turn, the spectral algorithm does not make any clustering error w.p. at least 1 − 1/T.

(b) Exploration and user clustering phase. During the (10 + log(T))m next user arrivals, EC-UCS clusters a subset of users using a Nearest-Neighbor algorithm. The algorithm selects a subset U* of users to cluster, and recommendations to the remaining users will be made depending on some distance to the inferred clusters in U*. Users from all clusters must be present in U*. To this aim, EC-UCS first randomly selects a subset U_0 of ⌊m/log(T)⌋ users, from which it extracts the set U* of ⌊log(T)²⌋ users who have been observed the most. The extraction and the clustering of U* are performed several times, until the ⌊(10 + log(T))m⌋-th user arrives, so as to update and improve the user clusters. From these clusters, we deduce estimates p̂_kℓ of the success probabilities.

(c) Recommendations based on Optimistic Assignments. After the 10m-th arrival, recommendations are made based on the estimated p̂_{kℓ}'s. For a user u_t ∉ U₀, the item selection further depends on the ρ̂_{k u_t}'s, the empirical success rates of user u_t for items in the various clusters. A greedy recommendation for u_t would consist in assigning u_t to the cluster ℓ minimizing ‖p̂_{·ℓ} − ρ̂_{·u_t}‖ over ℓ, and then in picking an item from the cluster Î_k with maximal p̂_{kℓ}. Such a greedy recommendation would not work well: when u_t has not been observed many times, the cluster she belongs to remains uncertain. To address this issue, we apply the Optimism in the Face of Uncertainty principle, often used in bandit algorithms, to foster exploration. Specifically, we build a set L(u_t) of clusters u_t is likely to belong to. L(u_t) is referred to as the Upper Confidence Set of u_t. As we get more observations of u_t, this set shrinks. Specifically, we let x_{kℓ} = max{|p̂_{kℓ} − ρ̂_{k u_t}| − ε, 0}, for some well-defined ε > 0 (essentially scaling as √(log log(T)/log(T)), see Appendix A.8 for details), and define L(u_t) = {ℓ ∈ [L] : Σ_k x_{kℓ}² n_{k u_t} < 2K log(n_{u_t})} (n_{u_t} is the number of times u_t has arrived, and n_{k u_t} is the number of times u_t has been recommended an item from cluster Î_k). After optimistically composing the set L(u_t), u_t is assigned to a cluster ℓ chosen uniformly at random in L(u_t), and recommended an item from the cluster Î_k with maximal p̂_{kℓ}.
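The construction of the Upper Confidence Set can be sketched as follows (a simplified illustration with hypothetical names; in the algorithm, ε is set as described above). It also shows the shrinking behaviour: with few observations both clusters survive, with more observations the wrong one is ruled out.

```python
import numpy as np

def upper_confidence_set(p_hat, rho_u, n_ku, eps):
    """Upper Confidence Set L(u): the user clusters l that the observations
    of user u cannot yet confidently rule out.

    p_hat : (K, L) estimated success probabilities p_hat[k, l]
    rho_u : (K,) empirical success rates of user u per item cluster
    n_ku  : (K,) number of recommendations u received from each item cluster
    eps   : slack epsilon > 0
    """
    K, L = p_hat.shape
    n_u = n_ku.sum()
    # x[k, l] = max(|p_hat[k, l] - rho_u[k]| - eps, 0)
    x = np.maximum(np.abs(p_hat - rho_u[:, None]) - eps, 0.0)
    # statistic sum_k x[k, l]^2 * n_ku[k], compared to 2 K log(n_u)
    stats = (x ** 2 * n_ku[:, None]).sum(axis=0)
    return [l for l in range(L) if stats[l] < 2 * K * np.log(n_u)]

p_hat = np.array([[0.9, 0.1], [0.1, 0.9]])
rho_u = np.array([0.85, 0.15])  # user u looks like cluster 0
wide = upper_confidence_set(p_hat, rho_u, np.array([2, 2]), 0.05)    # few samples
narrow = upper_confidence_set(p_hat, rho_u, np.array([20, 20]), 0.05)  # more samples
```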

Theorem 6. For any ℓ, let σ_ℓ be the permutation of [K] such that p_{σ_ℓ(1)ℓ} > p_{σ_ℓ(2)ℓ} ≥ ··· ≥ p_{σ_ℓ(K)ℓ}. Let R_ℓ = {r ∈ [L] : k*_ℓ ≠ k*_r}, S_{ℓr} = {k ∈ [K] : p_{kℓ} ≠ p_{kr}}, y_{ℓr} = min_{k∈S_{ℓr}} |p_{kℓ} − p_{kr}|, δ = min_ℓ (p_{σ_ℓ(1)ℓ} − p_{σ_ℓ(2)ℓ}), and φ(x) := x/log(1/x). Then, we have:
\[
R^{\mathrm{EC\text{-}UCS}}(T) = O\!\left( m \sum_{\ell} \beta_\ell \left(p_{\sigma_\ell(1)\ell} - p_{\sigma_\ell(K)\ell}\right) \left( \max\!\left\{ \frac{K^3\log K}{\phi(\min(y_{\ell r},\delta)^2)}, \frac{\sqrt{K}}{\min_\ell \beta_\ell} \right\} + \sum_{r\in R_\ell\setminus L^{\perp}(\ell)} \frac{K^2\log K}{\phi(|p_{k^*_\ell r}-p_{k^*_\ell \ell}|^2)} + \sum_{r\in L^{\perp}(\ell)} \sum_{k\in S_{\ell r}} \frac{K\log N}{|S_{\ell r}|\,|p_{k\ell}-p_{kr}|^2} \right) \right).
\]

EC-UCS blends clustering and bandit algorithms, and its regret analysis is rather intricate. The above theorem states that, remarkably, the regret of the EC-UCS algorithm matches our lower bound order-wise. In particular, the algorithm manages to get a regret (i) scaling as m whenever it is possible, i.e., when L⊥(ℓ) = ∅ for all ℓ, and (ii) scaling as m log(N) otherwise.

In Appendix A.8, we present ECB, a much simpler algorithm than EC-UCS, but whose regret upper bound, derived in Appendix A.17, always scales as m log(N).

A.6 Conclusion

This paper proposes and analyzes several models for online recommendation systems. These models capture both the fact that items cannot repeatedly be recommended to the same users and some underlying user and item structure. We provide regret lower bounds and algorithms approaching these limits for all models. Many interesting and challenging questions remain open. We may, for example, investigate other structural assumptions for the success probabilities (e.g., soft clusters) and adapt our algorithms. We may also try to extend our analysis to the very popular linear reward structure, while accounting for the no-repetition constraint.


A.7 Table of Notations

Notations common to all models

n                        Number of items
m                        Number of users
I                        Set of items
U                        Set of users
u_t                      User requesting a recommendation at round t
T                        Time horizon
X_{iu}                   Binary random variable indicating whether user u likes item i
ρ = (ρ_{iu})_{i∈I,u∈U}   Probability that user u likes item i
π                        Algorithm for sequential item selection
Π                        Set of all algorithms for sequential item selection
i^π_t                    Item selected at round t under π
F_{t−1}                  σ-algebra generated by (u_t, (u_s, i^π_s, X_{i^π_s u_s}), s ≤ t − 1)
N                        Term E[max_{u∈U} N_u(T)]
N̄                        Term 4 log(m)/log(m log(m)/T + e) + e²T/m

Generic notations

â                        Estimated value of a
σ(A)                     σ-algebra generated by A
KL(p, q)                 Kullback–Leibler divergence from a Bernoulli random variable with parameter p to that with parameter q
≳                        We write a ≳ b if lim inf_{T→∞} a/b ≥ 1


Model A: Clustered items and statistically identical users

I_k                      Set of items in item cluster k
α = (α_k)_{k∈[K]}        Probability that an item is assigned to item cluster k
K                        Number of item clusters
∆                        Minimum difference between the success rates of items in the optimal cluster and of items in a sub-optimal cluster
p = (p_k)_{k∈[K]}        Probability that the user likes an item i ∈ I_k
R^π(T)                   Regret of an algorithm π
∆_k                      Term p_1 − p_k
φ(k, m, p)               Term (1 − e^{−mγ(p_1, p_k)}) / (8(1 − e^{−γ(p_1, p_k)}))
γ(p, q)                  Term KL(p, q) + KL(q, p)
η                        Term min_k p_k(1 − p_k)
S                        Set of initially sampled items
V                        Set of items believed to be in the best cluster

Table A.2: Table of notations: Model A

Model B: Unclustered items and statistically identical users

ζ                        Distribution over [0, 1]
µ_x                      Term inf{γ ∈ [0, 1] : P[ρ_i ≤ γ] ≥ x}
ε                        Constant that specifies the ε-best items
R^π_ε(T)                 Satisficing regret of algorithm π with a given ε > 0
C                        Constant that regularizes the distribution ζ(µ)
S                        Set of initially sampled items
V                        Set of items believed to be in the best cluster


Model C: Clustered items and clustered users

I_k                      Set of items in item cluster k
α = (α_k)_{k∈[K]}        Probability that an item is assigned to item cluster k
U_ℓ                      Set of users in user cluster ℓ
β = (β_ℓ)_{ℓ∈[L]}        Probability that a user is assigned to user cluster ℓ
K                        Number of item clusters
L                        Number of user clusters
p = (p_{kℓ})_{k∈[K],ℓ∈[L]}   Probability that user u likes item i such that i ∈ I_k and u ∈ U_ℓ
k*_ℓ                     Term arg max_k p_{kℓ}
∆_{kℓ}                   Term p_{k*_ℓ ℓ} − p_{kℓ}
∆                        Minimum difference between the success rates of items in the optimal cluster and of items in a sub-optimal cluster
δ_ℓ                      Term min_{k:∆_{kℓ}>0} ∆_{kℓ}
φ(k, ℓ, m, p)            Term (1 − e^{−mγ(p_{k*_ℓ ℓ}, p_{kℓ})}) / (8(1 − e^{−γ(p_{k*_ℓ ℓ}, p_{kℓ})}))
ψ(ℓ, k, T, m, p)         Term (1 − e^{−(T/m)γ(p_{k*_ℓ}, p_k)}) / (8(1 − e^{−γ(p_{k*_ℓ}, p_k)}))
L⊥                       Set {(ℓ, ℓ′) ∈ [L]² : p_{k*_ℓ ℓ} ≠ p_{k*_{ℓ′} ℓ′}}
L⊥(ℓ)                    Set {ℓ′ ≠ ℓ : k*_ℓ ≠ k*_{ℓ′}, p_{k*_ℓ ℓ} = p_{k*_ℓ ℓ′}}
R^π_u(N)                 Accumulated expected regret under π for user u when the user has arrived N times
R^π(T)                   Regret of an algorithm π
S                        Set of initially sampled items
U₀                       Set of initially sampled users
U*                       Set of (log T)² users in U₀ who have arrived the most
L(u_t)                   Upper Confidence Set of user u_t
σ_ℓ                      Permutation of [K] such that p_{σ_ℓ(1)ℓ} > p_{σ_ℓ(2)ℓ} ≥ ··· ≥ p_{σ_ℓ(K)ℓ}
R_ℓ                      Set {r ∈ [L] : k*_ℓ ≠ k*_r}
S_{ℓr}                   Set {k ∈ [K] : p_{kℓ} ≠ p_{kr}}
y_{ℓr}                   Term min_{k∈S_{ℓr}} |p_{kℓ} − p_{kr}|
δ                        Term min_ℓ (p_{σ_ℓ(1)ℓ} − p_{σ_ℓ(2)ℓ})
ε                        Term K√((8K m̂/t) log t)
m̂                        Estimated value of m (updated only when the user clustering is executed)

References
