Reducing AI bias in a world of decentralized data



One question I’m often asked is: As data generation shifts towards the edge, how will AI evolve to function effectively in distributed environments?

It’s a good, and very relevant, question. Estimates suggest that within the next year, as much as 75% of enterprise-generated data will be created and processed at the edge. By the edge, we mean the place where data first lands as people, devices, and IoT sensors connect to the network.

As this trend continues, traditional, centralized machine learning approaches may struggle to keep up. Centralized approaches rely on data being consolidated in one location, such as a data center or cloud, for model training, tuning, and inferencing. That means moving huge volumes of data, with minimal delay, from thousands of edge devices back to a single location.

The challenges of centralized machine learning

As you can imagine, repeatedly shuttling data back and forth between the edge and a single location, especially when delays are costly to productivity, raises several challenges:

  • Latency and high costs. Continuous transfer requires high bandwidth, which is expensive in terms of energy, money, and time. Time delays undermine the ability to learn from and act on data in real time. And while backhauling is merely problematic today, it will soon become unsustainable as the growth of data outstrips the growth of bandwidth.
  • Security and privacy risks. Moving data exposes it to security and privacy risks in transit. There are ways to shield the data, such as encryption or even quantum communication, but these add cost and complexity.
  • Data privacy regulations. Privacy regulations, such as GDPR or HIPAA, restrict the movement or sharing of private data externally between organizations, and even internally across sovereign geographies.

Starting to overcome challenges with machine learning at the edge

Moving machine learning to the edge has helped to resolve some of these challenges. Instead of model training and inferencing happening in a centralized location, it occurs directly at the data source, tackling the issues that come with moving data.

But moving learning to the edge creates a new challenge of its own, and one it doesn’t solve by itself.

Unlike a central cloud or data center that uses consolidated data, edge devices see only their own portion of data. So drones, industrial robots, scanners, X-ray machines—any connected devices you can think of—aren’t able to learn from each other.

When a machine learns only from one organization’s data, or an edge device sees only its own data, it learns in isolation. The results it generates are, in turn, limited: less detailed and less accurate.

In fact, it can even feed into bias.

An example of bias due to isolated learning

Consider this scenario. A local hospital wants to use AI to detect abnormalities on MRI scans more quickly, so it starts training its machine learning model on its own scans. In one year, the hospital sees many cases of brain tumors but very few multiple sclerosis lesions. As a result, the neural network model becomes very effective at detecting tumors but less effective at detecting the abnormalities it has had less exposure to. The model is also skewed toward the specific demographic that visits that hospital.

Across the border, however, is a hospital that has seen many multiple sclerosis cases, and from a more diverse demographic. Imagine how much better the first hospital’s model would be if it could also learn from the second, filling in its own gaps.

The introduction—and growing importance—of swarm learning

There actually is a way these two models can learn from each other—without compromising data privacy. Google kicked off this capability with federated learning, but at HPE, we have evolved this further into what we call swarm learning.

Swarm learning is a decentralized machine learning solution, meaning model training happens closer to or at the data source. It uses edge computing and blockchain technology to collect and share only the insights learned from the data (insights being the crucial word here) among the participating locations, not the data itself.

Because swarm learning leaves the raw data where it is and shares only the insights, learnings derived from protected data can be safely exchanged across locations, and even across organizations. Hospital one’s model can now learn from hospital two’s, helping to achieve accuracy comparable to centralized learning, but with reduced latency, security risk, and bias.
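To make “sharing only the insights” a little more concrete, here is a minimal sketch in plain Python with NumPy. The node names, toy linear model, and simple averaging step are illustrative assumptions, not the actual swarm learning implementation (which also involves blockchain coordination, omitted here). Each node trains on its own private data and exchanges only model parameters.

```python
import numpy as np

def local_training_step(weights, X, y, lr=0.01):
    """One gradient-descent step on a node's own (private) data.
    A toy linear model, used purely for illustration."""
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Hypothetical nodes: each keeps its raw data local and never shares it.
rng = np.random.default_rng(0)
node_data = {
    "hospital_1": (rng.normal(size=(100, 5)), rng.normal(size=100)),
    "hospital_2": (rng.normal(size=(100, 5)), rng.normal(size=100)),
}
local_weights = {name: np.zeros(5) for name in node_data}

for sync_round in range(10):
    # Local training: only the node itself touches its raw data.
    for name, (X, y) in node_data.items():
        local_weights[name] = local_training_step(local_weights[name], X, y)

    # Only the learned parameters (the "insights") are exchanged and merged.
    merged = np.mean(list(local_weights.values()), axis=0)
    local_weights = {name: merged.copy() for name in local_weights}
```

In this sketch, the raw arrays in node_data never leave their node; only a handful of weight values cross the (imaginary) network boundary each round.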

Wondering where the name ‘swarm learning’ comes from? Well, if you watch a swarm of birds in nature, you’ll see hundreds of birds flying in a highly coordinated way without depending on a single flock leader. In a similar way, and unlike federated learning, swarm learning doesn’t rely on a central custodian (or “leader”) to collect and share learnings. Rather, a blockchain smart contract selects a different node to act as collector each time, enabling structured sharing of data insights, as sketched below.
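As a rough illustration of that leaderless coordination, the sketch below is again hypothetical Python: the smart-contract election is reduced to a simple rotation, and the function names are my own. The point is that a different node takes the merger role each round, so no single party permanently holds everyone else’s updates.

```python
import numpy as np

def elect_merger(node_names, sync_round):
    """Stand-in for the blockchain smart contract: no fixed leader;
    a different node coordinates each round (simple rotation here)."""
    ordered = sorted(node_names)
    return ordered[sync_round % len(ordered)]

def merge_round(local_weights, sync_round):
    """The elected node averages all parameter sets and shares the result back."""
    merger = elect_merger(local_weights.keys(), sync_round)
    merged = np.mean(list(local_weights.values()), axis=0)
    print(f"round {sync_round}: parameters merged by {merger}")
    return {name: merged.copy() for name in local_weights}

# Example: three nodes with toy parameter vectors.
weights = {f"node_{i}": np.random.default_rng(i).normal(size=4) for i in range(3)}
for r in range(3):
    weights = merge_round(weights, r)
```

Rotating the merger role is what distinguishes this pattern from a classic federated setup, where a fixed central server always performs the aggregation.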

Are organizations ready for a decentralized approach?

Swarm learning offers great potential, but it also requires enterprises to have a certain level of data maturity. They need to remove data silos and be able to access data in real time from edge devices.
