Federated Learning: Machine learning without data collection

Centralised machine learning raises data privacy and security concerns, as personal data is not controlled by individuals and unauthorised access, use, or destruction of data is a risk. These concerns are prevalent across industries:

right at the close of 2022, Chinese electric car maker Nio Inc. announced that its computer systems were hacked, with hackers accessing data on users and vehicle sales and demanding a $2.25 million bitcoin ransom for the company’s internal data.
In November 2022, Medibank, an Australian health insurance company, also suffered a data breach in which cybercriminals accessed the personal and medical data of around 9.7 million customers after their refusal to pay demanded random.

Anonymising user data is a widely adopted approach to address security and privacy concerns, however, it remains prone to re-identification attacks. Such attacks can re-identify individuals from anonymised data by matching anonymised data to publicly available information: for example matching names, addresses, or other identifying information found in the anonymised data with information found in public records or social media.

Another crucial issue with centralised machine learning is the potential for bias in centralised machine learning systems. If the data used to train these systems is not representative of the entire population, the resulting model may be biased and make unfair or inaccurate predictions about certain groups of people.

In 2020, Google AI researchers Margaret Mitchell and Timnit Gebru criticised the way the company’s AI system was being developed. Their paper on the potential ethical ramifications of the technology was asked to be retracted, and they were both dismissed by Google.

Federated Learning is a promising solution where “source data stays at source”. It allows a model to be trained in local devices, rather than a central server; updates from local models are aggregated to create a global model.

Federated Learning is a promising solution where “source data stays at source”. It allows a model to be trained in local devices, rather than a central server; these local models are aggregated to create a global model.

However, Federated Learning has its own pitfalls, as currently, only the tech giants can implement federated learning at scale due to model security concerns and coordination efforts.

Central coordination: it’s a complex process to coordinate the training process across a decentralised network of devices, monitor participation, and distribute rewards to incentivise participants — the manual process is a main blocker, and using Blockchain can automate it.
Model security: Federated Learning can be vulnerable to malicious attacks intending to harm the learning process or the privacy of other participants. Bad participants can deliberately send incorrect or inconsistent data or gradients to other participants during the learning process, causing the model to be inaccurate; they can also poison the data by injecting malicious or misleading data into the shared model, causing it to make incorrect predictions.

In the next sections, we will discuss how blockchain technology can mitigate the coordination and security challenges in Federated Learning by 1) enabling decentralised consensus and 2) implementing a Proof of Stake system to incentivise participants.

Use Blockchain for Decentralised Consensus

Blockchain technology can mitigate the coordination challenges in centralised Federated Learning by enabling decentralised consensus.

A blockchain is a decentralised computer system that keeps a record of digital transactions and is not controlled by any one person or organisation. The concept was first popularised by Bitcoin, the first and most widely used cryptocurrency, and now a plethora of decentralised applications run on Ethereum.

Blockchain can function as a “trust engine” without centralised coordination. No single entity controls it and all users have access to the same information cryptography ensures that the information stored on it is secure and cannot be altered or tampered with. Users can securely and transparently exchange information and conduct transactions without needing to trust each other.

Decentralised consensus is a way of making sure that everyone agrees on the information being used and shared. It gives more people control over their data and makes it harder for others to misuse or steal it. This approach also allows more people to take part in training computer models, which makes the models more diverse and better.

In traditional Federated Learning, a central server receives local updates, aggregates them into a global model update, and then sends updates to local clients.

Centralised FL

In traditional FL set-ups, models are trained locally and only model updates are shared with the server. Raw data stay local. The server receives local updates and sends back aggregated updates.

On-chain FL

Our proposed on-chain solution uses smart contracts to aggregate updates. A smart contract is a computer program that can automatically execute the terms of a contract when certain conditions are met. It is self-executing because it can automatically enforce the rules and penalties of an agreement, without the need for a middleman.

With the on-chain solution, the model does not need to trust a centralised server. The proposers conduct local training and send model updates on-chain, and then the on-chain smart contract runs a FedAvg (a weighted average of local updates).

On-chain Security Guarantee Using Game Theory

The quality of local outputs is determined in the peer-to-peer validation process. Participants partition their data for training and testing before the training process started. In each round of training, participants are randomly selected to be proposers or voters. If the latter, the voters run the aggregated update on their test data to assess whether the update improves their performance. The collective vote determines whether the aggregated update gets finalised or not — for example if over half of the voters voted against it, the update isn’t finalised and the system restarts this round of training (goes back to the proposer/voter selection stage). The decentralised validation mechanism is not only ‘democratic’, but more secure. If there is only a central party determining the quality of the output, it is easier for malicious proposers to learn its criteria and incorporate them into the loss function.

Based on the collective vote, proposers are rewarded/slashed, which enables the system to filter out malicious clients. All proposers are required to put down enough stake to participate. If the global update lowers model performance, all proposers will be slashed; in the multi-round training process, malicious clients are far more likely to appear in rounds with deteriorating performance than benign ones. In other words, if a proposer wants to bring down model performance, they are more likely to get slashed. We adjust the system parameters (staking threshold, reward/slashing ratio) such that malicious clients are filtered out (stake drops below zero) while the expected return for benign clients is greater than zero. The multi-round training process is essentially a multi-round game, using the right parameters(determined in simulation), it filters out malicious clients in an automated way.

Defending Malicious Behaviours in Federated Learning with Blockchain

Federated Learning: Machine learning without data collection

Use Blockchain for Decentralised Consensus

On-chain Security Guarantee Using Game Theory