Behind Twitter’s Biased AI Cropping and How to Fix It.

david ayman shamma
Published in The Startup
10 min read · Sep 29, 2020


Twitter’s AI crop has a bias. When given a large photo containing press photos of Mitch McConnell and Barack Obama, the AI picks Mitch. Swap the positions of the two pictures, and the AI picks the white guy again. This behavior has led to a lot of random experimentation online. Still, it’s important to cover what we know, what Twitter’s response was, what caused the problem, and how to fix it.

The History

It’s a well-known fact that people like to click on stories with images. It’s also known that images that are uniform, consistent, and canonical get more clicks. For example, take this photo of the bent pyramid at Dahshur. First, I’ll give you a square center crop.

A tight crop of a pyramid that looks like a pile of bricks.

Are you likely to click on that? Not a chance. Let’s do another square crop, but this time around the salient object.

Wide crop of the whole pyramid.

Boom, it’s click city. This effect gets amplified when it comes to faces. Years ago, Saeideh Bakhshi, Eric Gilbert, and I published findings that showed photos with people’s faces get more engagement on Instagram and Flickr. The Internet is full of tools trying to get better crops and has been for decades because better crops lead to better engagement and more dollars. In this case, it will come down to the Image and the AI.

Exhibit A: The Image

A vertical image with a photo of Mitch up top and Barack on the bottom, with about five times as much whitespace between them.
The test image, shrunk down to 480 pixels high from 3000.

Take a JPEG of two people: one black man, and one white guy up top. Space them out with about 3000 pixels of vertical white space. Upload it to Twitter. The crop goes to the white guy. Put the black man on the top, and the crop again goes to the white guy. Duplicate the black man in the photo; the crop still goes to the white guy. You can follow the long thread of random tests as you like. This contrived example forces the AI to make a decision. It’s not as complicated as the Trolley problem, but it’s also not an ordinary image one would use for anything else.

Exhibit B: The AI

We can call it the AI or the Algorithm; either way, we don’t know much about it. I’m reusing parts of an explanatory thread by expert game designer Rob Zubek, which provides an excellent walk-through. Looking for faces and cropping around them is a classic approach. This heuristic doesn’t work for images like my pyramid photo. If the image has 20 faces, an algorithm designer would have to adjust things to pick the most prominent face or randomly pick one. Also, face datasets have known biases, and face detectors often miss faces. Following the hot trend in AI, people instead look to train a general, learned method.
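As a rough illustration of that classic face-then-crop heuristic (my own sketch, not anything Twitter has published), here is what it might look like with OpenCV’s stock Haar cascade detector. The padding factor is arbitrary, and, as noted above, detectors like this miss faces and carry their own dataset biases.

```python
import cv2

def crop_largest_face(image_bgr, pad=0.5):
    """Classic heuristic: detect faces and crop around the most prominent one."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; fall back to some other crop strategy
    # Pick the largest detection and pad the box out before cropping.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    px, py = int(w * pad), int(h * pad)
    return image_bgr[max(y - py, 0):y + h + py, max(x - px, 0):x + w + px]
```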

In 2018, Twitter published a blog post, Speedy Neural Networks for Smart Auto-Cropping of Images, and posted a paper, Faster Gaze Prediction With Dense Networks and Fisher Pruning, on arXiv. We can only guess that this is their system. Twitter has promised to open source the algorithm, too, but more on that later. Both the blog post and the technical report point to how they use DeepGaze II (an eye-gaze prediction method) to identify salient areas of the image. This method does not look for faces or objects; it only predicts eye gaze to find regions of interest based on low-level features and textures. Once they find a salient area, they crop around it.

Five sample photos and heatmaps of salient regions below each photo.
Some sample photos and hot spots (saliency maps) from DeepGaze II.
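Once a saliency map is in hand, the cropping step itself can be simple. Here is a minimal sketch (my own illustration, not Twitter’s published code) that crops a square around the single most salient pixel; a production system would smooth the map and weigh whole regions rather than one peak.

```python
import numpy as np

def crop_around_peak(image, saliency, crop_size=480):
    """Square-crop `image` around the most salient point.

    `image` is an H x W x 3 array and `saliency` an H x W map of
    predicted gaze density (e.g., from a DeepGaze-style model).
    """
    h, w = saliency.shape
    # Locate the single most salient pixel.
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    half = crop_size // 2
    # Clamp the crop window so it stays inside the image bounds.
    top = min(max(y - half, 0), max(h - crop_size, 0))
    left = min(max(x - half, 0), max(w - crop_size, 0))
    return image[top:top + crop_size, left:left + crop_size]
```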

Easy, ya? We can see the before-and-after tests in the report.

Four photos of Twitter’s former poor cropping mechanism.
Before: Some really bad image crops.
Four photos of Twitter’s new cropping mechanism with proper cuts to the salient object.
After: Gaze-based cropping with vastly superior results.

Now, this idea of computationally predicting eye tracking predates today’s deep learning methods. Deep learning advances this pixels-in, gaze-prediction-out approach. When we look at the new approach versus the old one, there is a ‘hot damn’ magical moment. I’m sure they ran people of all races and sizes through the training, validation, and testing process.

Eye Tracking, Gaze Prediction, and Salience

There’s no shortage of tools that people use to track or predict gaze to measure engagement. Many eye-tracking studies have verified that people will look at faces, and it’s quite common to use eye tracking to measure where people look on your website, video game, or advertisement. Many factors come into play. Take the example of a diaper ad. If the baby in the ad is facing forward, eye-tracking software will show that people look at the baby’s face. If the baby is looking towards the text, more of the words get attention from the viewer, as does the logo in the bottom right. Using expensive and often clunky equipment, one can run these experiments and get results.

Two photos on the left, two eye tracking photos center, two DeepGaze II predictions right. AI predictions match eye tracking.
Two diaper ads (left), Eye-tracking heatmaps (middle), DeepGaze II predictions (right)

As luck would have it, DeepGaze II has an online tool to demo predictions. By comparison, the AI’s predictions are mostly identical to the eye-tracking results; the one absence is the forward-facing face. The auto-cropping method would likely ignore the baby facing to the side.

Deeper analysis of DeepGaze II

If we go back to Exhibit A, one will notice it’s not a typical photograph from a point-and-shoot camera: it has an odd orientation and a lot of white space. This challenging example likely was not a tested case. The ‘always the white guy’ result that emerges is striking. From their arXiv paper, they build on DeepGaze II, which uses VGG-19 to extract feature maps. I’m not sure exactly what VGG-19 was trained on, but a recent IBM study showed that the faces in Flickr’s copyleft photo dataset are pretty much white males aged 30. The DeepGaze II online demo requires the image to have a maximum dimension of 2048, so I shrunk it down from 3000 and generated the prediction map. Just a reminder, Barack is on the bottom.
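For anyone reproducing this, the shrink is a one-liner with Pillow; the filename here is a placeholder for the collage from Exhibit A.

```python
from PIL import Image

# The DeepGaze II demo caps the longest side at 2048 pixels, so shrink
# the 3000-pixel-tall test image before uploading it.
img = Image.open("mitch_barack_collage.jpg")  # placeholder filename
img.thumbnail((2048, 2048))  # resizes in place, preserving aspect ratio
img.save("mitch_barack_2048.jpg")
```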

Prediction map of example Mitch/Barack Photo. Both faces look identical here for prediction.
DeepGaze II Prediction Map. Barack is on the bottom.
Overlay of Barack and Mitch’s prediction maps. Obama’s is larger but less dense than Mitch’s.
Difference Layers between Mitch and Barack. Barack has a larger perimeter circle, but Mitch has the larger inner ring.

If we look at the prediction map, Barack’s face has a larger perimeter (must be the ears, sorry Barack 😀 can’t blame the AI), but the high-density center region is smaller by about 14%. Can we assume they are only using the inner density, or is this bias from VGG-19’s training? Looking at the actual pixels, we see that Barack’s photo is 10.25% shorter in height than Mitch’s. You’ll also notice Barack’s face is blurrier as well as smaller.

Overlay of Mitch and Obama photos. Obama’s photo is shorter but same width.
The two images used in the text are equal width but not height.
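To make that 14% comparison concrete, here is a small sketch of how one might sum predicted gaze density inside each face’s region of the prediction map. The box coordinates are hypothetical placeholders, not measurements from the actual collage.

```python
import numpy as np

def region_saliency(saliency, box):
    """Sum predicted gaze density inside a (top, left, height, width) box."""
    top, left, height, width = box
    return float(saliency[top:top + height, left:left + width].sum())

# Hypothetical face boxes for the tall test image (placeholder coordinates).
mitch_box = (40, 60, 400, 400)
barack_box = (2500, 60, 400, 400)

# Given a prediction map `pred_map` from DeepGaze II, a ratio below 1 means
# the model assigns less predicted gaze to Barack's face than to Mitch's:
# ratio = region_saliency(pred_map, barack_box) / region_saliency(pred_map, mitch_box)
```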

So here we can play some tricks. We can blur Mitch and make Barack bigger, which levels the field. These stop-gap solutions don’t scale. The core issue likely lies in the underlying features and in the product Twitter built. The former is clear: VGG-19 carries a bias from ImageNet’s millions of images, which were recently shown to be problematic. The bias is toward certain textures and offensive/racist tags, and there is probably a lack of diversity in faces. There’s a common misconception that throwing more images and data at the problem will fix these issues. However, if one doubles or triples the dataset size with a similar pool carrying the same biases, those biases only get stronger.

Photo of two politicians on the left, center is AI prediction map with bias, right is a sAUC map with less visible bias.
Mitch blurred and resized to match Barack: the saliency map for sAUC metrics (right) shows two somewhat more even regions, while the probability distribution (center) still leans to the white person.

So, can we patch this one example? Let’s blur Mitch (Gaussian blur, 1.5 pixels) and make the two photos the same size. The saliency map for sAUC metrics shows two somewhat even regions, but the probability distribution still leans to the white person. One can’t help but point to the VGG-19 features. Even using the sAUC map, you could have two mostly equal regions and still have no good crop of a 3000 x 583 pixel image. It’s up to the product now.
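For completeness, here is roughly what that patch looks like with Pillow; the filenames are placeholders, and the blur radius comes straight from the experiment above.

```python
from PIL import Image, ImageFilter

# Load the two press photos (placeholder filenames).
mitch = Image.open("mitch.jpg")
barack = Image.open("barack.jpg")

# Resize Mitch to Barack's dimensions so neither face wins on size alone,
# then soften it with a 1.5-pixel Gaussian blur to match Barack's sharpness.
mitch_patched = mitch.resize(barack.size).filter(
    ImageFilter.GaussianBlur(radius=1.5)
)
mitch_patched.save("mitch_blurred.jpg")
```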

The Problem with Product

We have to look at the whole system, which we know even less about. Let’s say a user uploads an image, and let’s assume the algorithm says “I can’t decide” and picks one. It might randomly select the white face and then cache that crop. Subsequent uploads by other users might reuse that crop. The system might even fingerprint that crop and then find close matches for similar images (so flipping the face order would still break ties with the white face). There are a lot of ways to engineer the product. I’m just speculating at this point, but a base algorithm cannot be isolated from the system when judging fairness.
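To be clear, this is pure speculation, but a crop cache keyed on an image fingerprint could look something like the sketch below; `perceptual_hash` and `compute_crop` stand in for whatever fingerprinting and cropping functions such a system would actually use.

```python
# Speculative sketch only: we do not know that Twitter caches crops this way.
crop_cache = {}  # fingerprint -> previously chosen crop box

def get_crop(image_bytes, compute_crop, perceptual_hash):
    """Reuse a cached crop for near-duplicate uploads, else compute a new one."""
    key = perceptual_hash(image_bytes)  # near-identical images share a key
    if key not in crop_cache:
        crop_cache[key] = compute_crop(image_bytes)
    return crop_cache[key]
```

If tie-breaking happened once and was then cached or fingerprinted like this, a single arbitrary choice would look like a consistent, deliberate bias in every later test.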

Academic papers tend to run evaluations in isolation; a product feature gets tested inside a complete system, with different metrics. Products typically run A/B-style tests to show an uptick in clicks, engagement, and retention. That must have happened here. If it didn’t, I would have loved to be in the room when some scientists told a project manager they could make crops 60% better at the cost of a drop in engagement on the site. All kidding aside, behavior on the site could be used to reinforce the bias. When Twitter open-sources the method, it won’t be enough to hand us a research paper or a Jupyter notebook. We will need to know if those clicks on crops were used to reinforce their method.

Can this be fixed?

First, academic papers need to deal with products and product metrics. Solid partnerships between industry and academia can help in this arena. These collaborations work best with in-kind working relationships, contractors, data sharing, and gifts. This is not typically something companies want to share, but there are some examples.

I’m optimistic we can find a better path, but the road is arduous. Equity and fairness are challenging problems because AI systems are layers of AI systems. Here, we see Twitter using DeepGaze II, built on VGG-19, trained on ImageNet. A texture bias, a lack of diversity, and contrast preferences in the base training set have significant implications. It’s unlikely that many people could even train VGG-19 themselves on potentially less biased data, given the magnitude of computing needed. The DeepGaze II authors argue it’s best to reuse what’s already there; while a new idea at the time, that’s now common practice. Many AI systems build on existing feature networks, and they all carry the same biases forward.
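That reuse is nearly a one-liner in practice, which is part of why the same base features, and their biases, show up everywhere. A sketch using torchvision’s ImageNet-pretrained VGG-19 as a frozen feature extractor:

```python
import torch
from torchvision import models

# Pull down VGG-19 pretrained on ImageNet and keep only its convolutional
# feature extractor, the same kind of reuse DeepGaze II is built on.
vgg19 = models.vgg19(pretrained=True)
features = vgg19.features.eval()

with torch.no_grad():
    # A dummy 224x224 RGB image; any downstream saliency head would be
    # trained on top of these feature maps, ImageNet biases and all.
    x = torch.randn(1, 3, 224, 224)
    feature_maps = features(x)
```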

Heuristics as editorial decisions can help but won’t solve the problem. People tend to shy away from adding heuristics to AI systems, or at least they like to pretend they don’t use them. In this case, we saw that blurring the image adds some fairness in our example, but it’s a balancing trick. Too much blur and nothing works; a little blur fixes some issues, but not all.

If Twitter plans to open source the method to absolve itself, it should tell us all more about the data, the algorithm, and the whole system. Even if we had an AI method that was 100% free of biases and diversity issues, when plugged into a product where people click on only one race group, and those clicks reinforce the AI, that system would start exhibiting community bias. Open sourcing the method gives researchers some insight, but we still need to know how it is deployed and tested in production to make sense of what happened.

As with many AI systems, the algorithm only knows how to do what it knows how to do. Give it a photo; it will crop it. What decides when not to crop? Could we build a system that looks at this oddly shaped image with too much white space and says, “no, I’m not cropping this”? In this case, where a user is forcing the AI to decide, why is there only one outcome? We can build systems that analyze multiple alternatives to determine what to do. In a sense, forcing one crop doesn’t make sense, so why force the AI?
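As a sketch of what such a refusal could look like, here is a simple guard with arbitrary thresholds; a real system would need much more care, but the idea is just to detect extreme shapes and mostly blank images and decline to auto-crop them.

```python
import numpy as np

def should_auto_crop(image, max_aspect=3.0, max_white_fraction=0.8):
    """Decline to crop images that are extreme in shape or mostly blank.

    `image` is an H x W x 3 uint8 array; both thresholds are arbitrary examples.
    """
    h, w = image.shape[:2]
    aspect = max(h, w) / min(h, w)
    white_fraction = float((image.mean(axis=2) > 245).mean())
    # A 3000 x 583 collage that is mostly white space fails both checks.
    return aspect <= max_aspect and white_fraction <= max_white_fraction
```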

And while it might be easy, and very correct, to say just let people crop their photos, it’s more likely that nobody would ever manually crop anything. Many sites already have to nudge people to give them an image, let alone crop it. It’s a problem of sites fighting to drive up engagement. A human-in-the-loop system might suggest a few crops and ask the user to pick one, but even then, there’s an effect on the product: making it take longer to push a post, even by a few seconds, can negatively impact engagement. Then again, having a racial bias in an AI crop can too. I believe that editorial choice is a good idea and is worth that cost.

Many thanks to Rob Zubek at SomaSim Games for the excellent starter thread on Twitter and for spending time with me experimenting with DeepGaze II. The baby diaper examples are well known across the industry; I believe they come from Tobii, who makes eye-tracking hardware and software. The photograph of the bent pyramid at Dahshur is my own, and those two example crops (as well as all the heat maps in this post) can be used under a CC BY-NC-ND license. Exhibit A is a collage of two press photos.
