16:20 - 16:40
Distributed Learning of Non-Convex Linear Models with One Round of Communication (55)
Mike Izbicki (Claremont McKenna College), Christian R. Shelton (UC Riverside)
We present the optimal weighted average (OWA) distributed learning algorithm for linear models. OWA achieves statistically optimal learning rates,uses only one round of communication,works on non-convex problems,and supports a fast cross validation procedure.The OWA algorithm first trains local models on each of the compute nodes;then a master machine merges the models using a second round of optimization. This second optimization uses only a small fraction of the data, and so has negligible computational cost.Compared with similar distributed estimators that merge locally trained models,OWA either has stronger statistical guarantees, is applicable to more models,or has a more computationally efficient merging procedure.
17:00 - 17:20
Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning (628)
Robin Vogel (Telecom Paris, LTCI, Institut Polytechnique de Paris; IDEMIA), Aurélien Bellet (INRIA), Stephan Clémençon (Telecom Paris, LTCI, Institut Polytechnique de Paris), Ons Jelassi (Telecom Paris, LTCI, Institut Polytechnique de Paris), Guillaume Papa (Telecom Paris, LTCI, Institut Polytechnique de Paris)
The development of cluster computing frameworks hasallowed practitioners to scale out various statistical estimation and machinelearning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression.In contrast, statistical learning tasks involving pairs (ormore generally tuples) of data points - such as metric learning,clustering or ranking - do not lend themselves as easily todata-parallelismand in-memory computing.In this paper, we investigate how to balance between statistical performanceand computational efficiency in such distributed tuplewise statisticalproblems. We first propose a simple strategy based on occasionallyrepartitioningdata across workers between parallel computation stages, where the number ofrepartitioning steps rules the trade-off between accuracy and runtime. We thenpresent some theoretical results highlighting the benefits brought by theproposed method in terms of variance reduction, and extend our results todesigndistributed stochastic gradient descent algorithms for tuplewiseempiricalrisk minimization.Our results are supported by numerical experiments in pairwisestatistical estimation and learning on synthetic and real-world datasets.