Federated ML Model Evaluation

A method for sending a trained ML model you wish to evaluate to the data, rather than requiring the data to be centralised first.

Matt Siegel

Published on

August 3, 2022

What is Federated ML evaluation?

Federated ML model evaluation allows you to send a trained ML model you wish to evaluate to the data, rather than requiring the data to be centralised first. Once the model is in place, predictions can be generated using the model on the data where the data resides. These predictions can be used locally for analysis by the data custodian. Additionally, performance metrics can be derived from the same predictions, and these can be returned to the data scientist for their own model quality and performance assessments.

Why Federated ML evaluation?

Consider the situation where you (a data scientist) have a powerful new ML model, which has shown decent accuracy on some small, curated datasets you’ve been able to collect centrally. There are a few things you may want to do next, such as:

1. Evaluate that model on as many datasets as you can, to gather further evidence for the general applicability of your model (e.g. to answer the question “does it still work well on data from different sources?”). Evaluations like this are critical for learning how you may be able to improve your model further (e.g. “I need more data” or “I need more layers”), or even for understanding whether it is working as you had expected on real-world data.

2. Use that model to generate predictions for new datapoints, thereby putting it into practice to solve a real problem ‘in the wild’. In some circumstances, this may enable you to monetise your model, or possibly just collaborate with others on solving interesting problems.

Federated ML model evaluation makes it easy for you to achieve these two goals. Model evaluation and deployment are critical steps in any real-world ML development cycle. In many scenarios they are also impossible to carry out without federated ML model evaluation, for example when the datasets are sensitive (containing private information), widely distributed, or simply very large and difficult to centralise.

How is Federated ML Model Evaluation a PET?

Federated ML model evaluation qualifies as a privacy-enhancing technology (PET) for two reasons. Firstly, the fact that the data doesn’t need to be centralised reduces the attack surface (i.e. the number of parties that need to be trusted with the data). Furthermore, to strengthen privacy protection, controls can be put in place (as they are in the Bitfount platform) to:

1. Enforce strict access controls on the data,

2. Ensure that only aggregate results are ever returned to the data scientist running the evaluation, and

3. Apply appropriate levels of differential privacy to ensure private information cannot be leaked from the results.
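
To make the third control more concrete, here is a minimal sketch of the Laplace mechanism applied to a single aggregate statistic (the number of correct predictions). It is an illustration under stated assumptions, not a description of how the Bitfount platform implements differential privacy: the function names, the example count and the value of epsilon are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count satisfying epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example: 912 correct predictions, privacy budget epsilon = 0.5.
rng = random.Random(0)
noisy_correct = dp_count(true_count=912, epsilon=0.5, rng=rng)
print(noisy_correct)
```

A smaller epsilon gives stronger privacy but a noisier result, so the data scientist sees a less precise metric.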

Federated ML Model Evaluation Overview

Let’s see what this looks like conceptually with a couple of animations.

“Classic” ML model evaluation

The figure above shows the “classic” paradigm for ML model evaluation, where a number of organisations hold the data. This data must then be sent to a central server (shown in the middle of the figure). This server holds a trained ML model (the brain), which can be run on the centralised data to produce the desired predictions and evaluation results. This is quite clearly sub-optimal from a privacy perspective, as all this data must be collected and stored on the central server.

Federated ML Model Evaluation

In this figure, we see the methodology employed by federated evaluation: the data remains where it is, and the centrally-held ML model is sent to each of the remote sites or nodes. Statistics and predictions are then computed using the model and data at each node, and these are transmitted centrally for aggregation. This aggregation can be carried out using secure aggregation, a protocol which ensures the data scientist can only see the aggregated result. To further preserve privacy, statistical noise may also be added to the centrally collected statistics using differential privacy.
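
The flow described here can be sketched in a few lines of Python. This is a toy illustration, not a production protocol: the record schema, the threshold model and the node data are hypothetical, and the masking step only mimics the core idea of secure aggregation (pairwise masks that cancel in the sum). In a real protocol each pair of nodes would derive its mask from a shared secret, so that no single party ever sees all the masks.

```python
import random
from typing import Callable, List, Sequence, Tuple

Record = Tuple[float, int]  # (feature, true label): a hypothetical toy schema
MODULUS = 2**32             # masked values are integers modulo 2**32


def local_counts(model: Callable[[float], int], data: Sequence[Record]) -> List[int]:
    """Runs at a node: score the local data, return only [TP, FP, TN, FN]."""
    tp = fp = tn = fn = 0
    for x, y in data:
        pred = model(x)
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return [tp, fp, tn, fn]


def mask_updates(updates: List[List[int]], rng: random.Random) -> List[List[int]]:
    """Add pairwise masks: node i adds m, node j subtracts m, so sums are unchanged."""
    masked = [list(u) for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            for k in range(len(updates[0])):
                m = rng.randrange(MODULUS)
                masked[i][k] = (masked[i][k] + m) % MODULUS
                masked[j][k] = (masked[j][k] - m) % MODULUS
    return masked


def aggregate(masked: List[List[int]]) -> List[int]:
    """Runs centrally: only the element-wise sum is recoverable."""
    return [sum(column) % MODULUS for column in zip(*masked)]


def model(x: float) -> int:
    # Stand-in for the trained model that is sent to every node.
    return 1 if x >= 0.5 else 0


node_a = [(0.9, 1), (0.2, 0), (0.7, 0), (0.4, 1)]
node_b = [(0.8, 1), (0.1, 0), (0.6, 1)]
node_c = [(0.3, 0), (0.55, 1), (0.45, 0)]

raw = [local_counts(model, data) for data in (node_a, node_b, node_c)]
masked = mask_updates(raw, random.Random(42))
totals = aggregate(masked)
tp, fp, tn, fn = totals
accuracy = (tp + tn) / sum(totals)
print(totals, accuracy)  # [4, 1, 4, 1] 0.8
```

Each masked vector looks random on its own, yet the sum equals the sum of the true counts, which is all the data scientist needs to compute accuracy, precision or recall.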

Federated Evaluation in Practice

In addition to the single-model evaluation use-case introduced already, there are two more common practical applications of federated model evaluation.

The first of these extends the notion of single-shot model evaluation to a paradigm in which many evaluations can be run using the same model, but with different hyper-parameter or decision rule configurations. By doing so, the results returned for each configuration can be analysed, thereby allowing the data scientist to “tune” any parameters of interest to eke out the best performance from the model.
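
As a minimal sketch of this tuning loop, assume the decision rule being tuned is a classification threshold applied to model scores. Each node reports confusion counts for every candidate threshold, the counts are summed centrally, and the threshold with the best pooled F1 score is selected. The data, threshold grid and function names below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Scored = Tuple[float, int]           # (model score, true label): toy schema
Counts = Tuple[int, int, int, int]   # (TP, FP, TN, FN)


def counts_per_threshold(
    scored: Sequence[Scored], thresholds: Sequence[float]
) -> Dict[float, Counts]:
    """Runs at a node: confusion counts for each candidate threshold."""
    results: Dict[float, Counts] = {}
    for t in thresholds:
        tp = fp = tn = fn = 0
        for score, label in scored:
            pred = 1 if score >= t else 0
            if pred == 1 and label == 1:
                tp += 1
            elif pred == 1 and label == 0:
                fp += 1
            elif pred == 0 and label == 0:
                tn += 1
            else:
                fn += 1
        results[t] = (tp, fp, tn, fn)
    return results


def merge(node_results: List[Dict[float, Counts]]) -> Dict[float, Counts]:
    """Runs centrally: sum the counts from all nodes, per threshold."""
    merged: Dict[float, Counts] = {}
    for result in node_results:
        for t, counts in result.items():
            prev = merged.get(t, (0, 0, 0, 0))
            merged[t] = tuple(a + b for a, b in zip(prev, counts))
    return merged


def f1(tp: int, fp: int, tn: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


thresholds = [0.3, 0.5, 0.7]
node_1 = [(0.9, 1), (0.6, 1), (0.4, 0), (0.35, 1)]
node_2 = [(0.8, 1), (0.55, 0), (0.3, 0), (0.2, 0)]

merged = merge([counts_per_threshold(n, thresholds) for n in (node_1, node_2)])
best_threshold = max(thresholds, key=lambda t: f1(*merged[t]))
print(best_threshold)  # 0.5
```

Because every configuration is evaluated in a single pass over the local data, the nodes only need to be contacted once per round of tuning.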

The second of these is relevant if there are multiple candidate ML models to evaluate, and not just a single model. In these cases each model can be evaluated separately, making it possible to effectively carry out A/B experiments on real data. Armed with the results of the A/B experiments (or trials), the data scientist simply picks the best “lab-built” model possible for application to real-world data.
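
A comparison of this kind might look like the sketch below, which pools correct-prediction counts across nodes for each candidate model and picks the one with the highest pooled accuracy. The two candidate models, the site names and the data are hypothetical. Pooling counts, rather than averaging per-node accuracies, keeps larger sites from being under-weighted.

```python
from typing import Callable, Dict, Sequence, Tuple

Record = Tuple[float, int]  # (feature, true label): a hypothetical toy schema


def evaluate_on_node(
    model: Callable[[float], int], data: Sequence[Record]
) -> Tuple[int, int]:
    """Runs at a node: return (number correct, number of records)."""
    correct = sum(1 for x, y in data if model(x) == y)
    return correct, len(data)


# Two stand-in candidate models that differ only in their decision threshold.
candidates: Dict[str, Callable[[float], int]] = {
    "model_a": lambda x: 1 if x >= 0.5 else 0,
    "model_b": lambda x: 1 if x >= 0.3 else 0,
}

nodes: Dict[str, Sequence[Record]] = {
    "site_1": [(0.9, 1), (0.6, 1), (0.4, 0), (0.2, 0)],
    "site_2": [(0.55, 0), (0.45, 1), (0.35, 1)],
}

pooled: Dict[str, Tuple[int, int]] = {}
for name, model in candidates.items():
    per_node = [evaluate_on_node(model, data) for data in nodes.values()]
    pooled[name] = (sum(c for c, _ in per_node), sum(n for _, n in per_node))

accuracy = {name: correct / total for name, (correct, total) in pooled.items()}
winner = max(accuracy, key=accuracy.get)
print(pooled, winner)
```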

Real-world examples

As one example, Bitfount is currently collaborating with researchers at Moorfields Eye Hospital to use federated ML model evaluation to run their groundbreaking Biomarker models for predicting certain eye conditions on remote datasets held by hospitals. Doing so will make it possible to flag patients for clinical trial recruitment, without impacting privacy, and reduce recruitment costs by not requiring a ‘human in the loop’.

In the wider industry, Apple and Google (see references) have shown that the methodologies of federated ML evaluation can be put into practice successfully at scale. In the work described by Apple, the authors show how Apple’s virtual assistant (Siri) could be personalised. This was achieved by improving the speech recognition language model over multiple rounds of federated evaluation and tuning across hundreds of thousands of iOS devices. They also showed how the relevance of news articles in the Stocks app could be improved by using federated evaluation over large numbers of parameter sets to select optimal parameters for the algorithm that selects personalised news articles.


Apple’s approach to federated evaluation and tuning:

Matthias Paulik, Matt Seigel, Henry Mason, Dominic Telaar, Joris Kluivers, Rogier van Dalen, Chi Wai Lau, Luke Carlson, Filip Granqvist, Chris Vandevelde, Sudeep Agarwal, Julien Freudiger, Andrew Byde, Abhishek Bhowmick, Gaurav Kapoor, Si Beaumont, Áine Cahill, Dominic Hughes, Omid Javidbakht, Fei Dong, Rehan Rishi, Stanley Hung



Federated Optimization: Distributed Machine Learning for On-Device Intelligence

Jakub Konečný, H. Brendan McMahan, Daniel Ramage


Towards federated learning at scale: system design

Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander