Mark J. Panaggio

What do classification algorithms reveal about whether black lives matter?

Updated: Jun 19, 2020

Classification problems are hard. There is no way around it. We face them every day. The reality is that mistakes are inevitable and the consequences of those mistakes can be significant. Some solutions are better than others, but there is no perfect solution.

What is a classification problem, you ask? Well, let me explain. A classification problem is a statistical problem where one seeks to use previously identified patterns to determine the unobserved label for an individual based on other observable characteristics. These problems arise in a field called “statistical learning” or “machine learning” that is concerned with finding patterns in data.


Spam classification is one of the most well-known examples of this. When you receive an email, your email client must make a decision. Is this message spam or not? The problem is that spam messages do not declare themselves as such, so the true label of the email is unknown unless the user manually assigns it a label. Instead, your email client uses an algorithm to determine whether the message is likely to be spam based on the relevant characteristics of the message. These characteristics, commonly referred to as features, might include information about the source (Did it come from a known spammer?), the time the message was sent (Why was this sent at 4 AM local time?), the subject matter (Did a distant and unknown relative just inherit a large sum of money and ask for your help?), the style (Is it filled with typos and inconsistent capitalization?), etc. You can determine which features are relevant by comparing legitimate messages to messages that people have marked as spam in the past and figuring out which features distinguish the two classes.


This same sort of problem arises in many different contexts. Banks use classification algorithms to determine whether it is likely that a loan applicant will default or that a given transaction is fraudulent. Doctors use classification algorithms to determine whether a person has a particular disease. Advertisers use classification algorithms to determine whether specific people are likely to buy a given product. The list goes on and on.


Classification problems also arise in law enforcement. Police officers often need to determine whether to pull someone over, whether someone should be arrested, or whether deadly force is required. Similarly, district attorneys need to determine whether to pursue charges. They may not use a complicated computer algorithm to make these determinations, but they are still trying to assign labels that are not directly observable based on observable characteristics and they make those determinations based on their past experiences.


Let’s look at a simplified version of one of these decisions, the decision whether to stop someone over “suspicious” activity, in order to see why these problems can be so difficult and why it is so hard for society as a whole to agree on the best solution. As most of you know, my background is in math, not law enforcement, so this hypothetical scenario is not intended to be an accurate portrayal of how these decisions are actually made in practice, but rather an illustration to demonstrate the inevitable trade-offs.


Suppose that the police have received a call about suspicious activity and after arriving at the scene, they spot a suspect driving in the vicinity. They now need to make a determination about whether to pull this individual over to question them. In practice, this would be a complicated decision that would need to account for a variety of factors such as the type of activity, the credibility of the caller, and the perceived threat to safety, but for illustration purposes let’s suppose that all of this information can be boiled down into a single number that I will call the “appearance of guilt”. If that number is high, then the person appears guilty and if it is low, then the person appears innocent. The problem is that appearances can be deceiving. Some innocent people may appear guilty and some guilty people may appear innocent.


How should the police officer determine whether to intervene? An experienced cop would use prior experience as a guide. Presumably, they would already have developed a sense of how innocent and suspicious people behaved in the past and then they would use that sense to determine whether this particular suspect is likely to be innocent or guilty.


We might illustrate this (hypothetical) scenario as follows.



Here each dot represents an individual suspect and the color indicates whether they are innocent (green) or guilty (red). The horizontal axis represents the appearance of guilt. Notice that the innocent people tend to fall on the low end of the spectrum and the guilty people tend to fall on the high end of the spectrum, but there is some overlap in the middle.


Since we will later want to look at the role that race might play in these decisions, I have divided these points into two groups, one representing the majority race (no border) and the other representing the minority (black border). As currently constructed, the average guiltiness of both groups is the same.


When confronted with a suspect, the officer might proceed by (metaphorically) using a line through the data to try to divide the innocent from the guilty based on past experiences. If the suspect falls in the shaded region to the left of the threshold represented by the line (called the classification boundary) then the officer would assume they are innocent and let them go and if the suspect falls to the right of this line, then the officer would stop them for questioning. You can see two possible lines below with the misclassifications labelled in a darker shade:



Notice that no matter where we draw the line, mistakes here are inevitable! These mistakes can fall into one of two types: false positives (the green dots on the right where the innocent are treated as if they are guilty) and false negatives (the red dots on the left where the guilty are treated as if they are innocent). In scenario 1 (left), the threshold (black line) for stopping the suspect is low and therefore many of the innocent will be stopped, but relatively few guilty suspects will get away. In scenario 2 (right), the situation is reversed. In other words, there is a trade-off between false negatives and false positives. As you move the black line to the right, the number of false positives decreases, but the number of false negatives increases, and the opposite happens if you move the black line to the left. You can visualize this trade-off as follows:



Here I have plotted the false positive and false negative rates (normalized so their maximum value is one) as we move the threshold from left to right. I have also labeled the points corresponding to the two preceding examples for reference. If we move the classification boundary all the way to the left (which corresponds to the top left of the curve here), then the false negative rate goes to 0, meaning that no guilty people get away. Unfortunately, the result is that the false positive rate goes to 1, meaning that all of the innocent people are stopped in the process. If we move the black line all the way to the right (which corresponds to the bottom right here), we don’t stop any innocent people, but all of the guilty people get away too. Neither seems like a great solution.
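For readers who want to experiment with this trade-off, here is a minimal Python sketch. It is not the code behind the plots above: the data are synthetic (scores drawn from two overlapping bell curves), and the function name `error_rates` and the specific thresholds are illustrative assumptions.

```python
import random

def error_rates(scores, guilty, threshold):
    """Return (false_positive_rate, false_negative_rate) for the rule
    'stop anyone whose appearance-of-guilt score exceeds the threshold'."""
    innocent_scores = [s for s, g in zip(scores, guilty) if not g]
    guilty_scores = [s for s, g in zip(scores, guilty) if g]
    # Innocent people to the right of the line are stopped by mistake.
    false_positives = sum(s > threshold for s in innocent_scores)
    # Guilty people to the left of the line are let go by mistake.
    false_negatives = sum(s <= threshold for s in guilty_scores)
    return (false_positives / len(innocent_scores),
            false_negatives / len(guilty_scores))

# Hypothetical data: 100 innocent and 100 guilty suspects whose scores
# overlap in the middle (an assumption made purely for illustration).
rng = random.Random(0)
guilty = [i % 2 == 1 for i in range(200)]
scores = [rng.gauss(1.0 if g else -1.0, 1.0) for g in guilty]

# Sweep the threshold from far left to far right.
for threshold in (-10.0, -1.0, 0.0, 1.0, 10.0):
    fpr, fnr = error_rates(scores, guilty, threshold)
    print(f"threshold={threshold:6.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

At the far-left threshold every innocent person is stopped and no guilty person gets away; at the far-right threshold the reverse holds. In between, one rate can only be lowered by raising the other.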


So, what is the best solution? The perfect solution would be the one where both the false positive rate and the false negative rate are zero, but that is not attainable. One could try to find the nearest point, but that might not be the best solution either. To find the best solution, one must specify a method for comparing solutions. How do you measure the quality of a solution? The simplest way to do this would be to count up the number of mistakes, but that assumes all mistakes are equally problematic, which need not be the case. If this is just a matter of questioning a suspect, then perhaps it is more problematic to let the guilty go free than to question the innocent, making false negatives more significant than false positives. In this case scenario 1 would be more desirable. If instead the decision is about whether it is necessary to use lethal force, then perhaps the scales would tip in the other direction with false positives (i.e. shooting someone who is innocent) being more problematic. Under these conditions, scenario 2 would be more desirable.


One can quantify this by assigning some sort of cost to each of the two types of mistakes and then seeking the solution with the smallest total cost. If the costs are equal, then the best solution would be the one with the fewest total mistakes. On the other hand, if the cost for false positives is twice as high, then we would be willing to accept up to two additional false negatives if it allows us to reduce the number of false positives by 1.
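The cost idea above can be sketched in a few lines of Python. The eight scores, the candidate thresholds, and the helper names `count_errors` and `best_threshold` are all made up for illustration. The point is only that changing the relative costs moves the cheapest threshold.

```python
def count_errors(scores, guilty, threshold):
    """Count (false_positives, false_negatives) for the rule
    'stop anyone whose score exceeds the threshold'."""
    fp = sum(1 for s, g in zip(scores, guilty) if not g and s > threshold)
    fn = sum(1 for s, g in zip(scores, guilty) if g and s <= threshold)
    return fp, fn

def best_threshold(scores, guilty, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the smallest total cost,
    where total cost = cost_fp * (#false positives) + cost_fn * (#false negatives)."""
    def total_cost(threshold):
        fp, fn = count_errors(scores, guilty, threshold)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)

# Hypothetical toy data: four innocent and four guilty suspects.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
guilty = [False, False, True, False, False, True, True, True]
candidates = [x + 0.5 for x in range(9)]  # 0.5, 1.5, ..., 8.5

equal_costs = best_threshold(scores, guilty, candidates, cost_fp=1, cost_fn=1)
fn_costly = best_threshold(scores, guilty, candidates, cost_fp=1, cost_fn=3)
print(equal_costs, fn_costly)
```

With equal costs the cheapest line sits at 5.5 (one mistake in total). When a false negative is counted as three times as costly, the cheapest line moves left to 2.5, accepting two false positives in order to avoid any false negatives.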


This may sound like a satisfactory resolution of the problem, but it raises more questions. What is the right cost to use? How would you enforce this sort of uniform criterion in practice? The reality is that every single point along the black curve is the “best solution” for some cost, so different people could adopt very different decision-making criteria that are still optimal in some sense. And if they are all best solutions according to some criterion, then unless we can agree on the right criterion, there is no universal best solution! Perhaps this is what economist Thomas Sowell had in mind when he said:

“There are no solutions. There are only trade-offs.”


Returning to the issue of deadly force, an immediate conflict becomes apparent. To the police, false negatives are extremely problematic. Failing to use deadly force when a suspect is a real threat means putting police officers’ lives in jeopardy. However, to others using deadly force when it is not warranted means unnecessary civilian deaths. This means that choosing any cost structure involves making a choice about the implicit trade-offs between the lives of the police and the lives of civilians. It involves a statement about how much different types of lives matter.


Although it is difficult to know precisely where the classification boundary should go, some cases are clear cut. In the case of George Floyd, whether he was guilty of a crime is beside the point: he had already been subdued, so the force used by Derek Chauvin and his colleagues was completely gratuitous. In the context of classification, George Floyd was a green dot way on the left of the plot. There was no way the threshold for lethal force should have been so low as to warrant viewing him as a threat at that point.

Other cases are less clear. The tragic story of Tamir Rice comes to mind. He was a young boy who was shot by police officers while playing with a pellet gun that the officers mistook for a real weapon. In the context of the example above, he was a green dot much further to the right. He did not deserve to die, and there is evidence to suggest that his death was avoidable, but ultimately deaths like his are the result of difficult choices that police officers make about those trade-offs. In that split second where officers think they see a gun, they must decide whether to risk a false positive (shooting an innocent person) or a false negative (potentially being shot themselves). I don’t envy having to make that call.


In cases like Tamir’s, it seems appropriate to consider whether the police are drawing the lines in the right places. How many innocents should we be willing to sacrifice to save one cop? How many cops should we be willing to sacrifice to save one innocent person? There are no easy answers.


Up to this point, I have skirted around the topic of race and instead focused on the complexity of the decision. But the unfortunate reality is that race seems to be a significant factor in some of these decisions. The dataset below shows the number of people shot to death by police in the US by race over the last 4 years.

The largest numbers of deaths correspond to whites, followed by blacks, then Hispanics. However, the raw numbers are misleading. White people make up 60% of the population followed by Hispanics at 18% and blacks 13%. So, all else being equal, one would expect there to be over 4 times as many white people shot to death by police as black people. We actually see about 54% as many black deaths as white instead of the 22% we would expect if deaths were proportional. That is a massive discrepancy!
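The arithmetic behind that comparison is short enough to check directly. The sketch below uses only the rounded figures quoted above (60%, 13%, and the observed 54%). The variable names are illustrative, and the final ratio is simply what those rounded numbers imply rather than an independently sourced statistic.

```python
# Population shares quoted in the text (rounded).
white_share = 0.60
black_share = 0.13

# If deaths were proportional to population, black deaths would be this
# fraction of white deaths.
expected_ratio = black_share / white_share

# Observed ratio of black deaths to white deaths, as quoted in the text.
observed_ratio = 0.54

# How many times larger the observed ratio is than the proportional one.
disparity = observed_ratio / expected_ratio

print(f"expected: {expected_ratio:.0%}, observed: {observed_ratio:.0%}, "
      f"disparity factor: {disparity:.1f}")
```

Equivalently, the expected ratio of white to black deaths is 60/13, or about 4.6, which is the “over 4 times” figure above. The observed ratio is roughly two and a half times what proportionality would predict.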


There are (at least) two possible explanations for that discrepancy. One explanation is that African Americans are more likely to have violent confrontations with the police because they commit more crimes. In the context of our example, this would mean that there are proportionally more red (guilty) dots among minorities than among the majority. The higher incarceration rate in the African American community has been touted as evidence that this might be the case. If this is true, then even if you use the same criteria for people of every race, it will result in more searches, more arrests and more deaths of African Americans.


It is worth mentioning that there is a chicken and egg question there. Are African Americans incarcerated more because they commit more crimes (which might be partially explained by disparities in income and education among other factors) or because they are targeted more by police and therefore caught more often? Or perhaps the answer is both.


This leads into the second possibility: that cops have a different threshold for detaining and/or using force against black and white suspects. Framed another way, this would mean that black people are viewed more suspiciously or as more threatening than white people under otherwise identical circumstances. In the context of our classification problem, that would be like shifting all of the points for black suspects to the right as we see in the hypothetical scenario below:



In this case, no matter where you draw the line, more black people would be classified as guilty, including a disproportionately large number of people who are innocent.
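To make the effect of that shift concrete, here is a small Python sketch. Both groups are given identical underlying scores (an assumption made so that any difference comes from the shift alone). The function name `stop_stats`, the threshold of 4.5, and the shift of 1.5 are arbitrary illustrative choices.

```python
def stop_stats(scores, guilty, threshold, shift=0.0):
    """Apply the rule 'stop if perceived score > threshold', where the
    perceived score is the true score plus a bias `shift`.

    Returns (false_positive_rate, share_of_stops_that_are_guilty)."""
    stopped = [g for s, g in zip(scores, guilty) if s + shift > threshold]
    innocent_total = sum(1 for g in guilty if not g)
    false_positives = sum(1 for g in stopped if not g)
    hits = sum(1 for g in stopped if g)
    return false_positives / innocent_total, hits / len(stopped)

# Hypothetical data: five innocent and five guilty people per group,
# with the same true scores in both groups.
scores = [1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
guilty = [False] * 5 + [True] * 5

majority = stop_stats(scores, guilty, threshold=4.5)             # no bias
minority = stop_stats(scores, guilty, threshold=4.5, shift=1.5)  # shifted right

print("majority:", majority)
print("minority:", minority)
```

In this toy example the shifted group has twice the false positive rate (40% of its innocent members are stopped versus 20%), and a smaller share of its stops involve someone who is actually guilty (5 of 7 versus 4 of 5), even though the two groups are identical underneath. Shifting the points to the right by 1.5 is the same as lowering that group's threshold by 1.5.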


So which is it? One of the ways to check is to look at arrest rates and false positive rates in tandem. If police apply a uniform threshold and African Americans are more likely to commit crimes, then there would be more African American arrests, but the false positive rate would be lower than the rate for whites. If this is merely the result of prejudice, then both the number of African American arrests and the false positive rate would increase. And if both factors are in play, then it is possible that the false positive rate would be comparable with a higher arrest rate.


In Stanford’s Open Policing Project, researchers looked at millions of stops by the police and tabulated how often people of each race were stopped, how often they were searched, and how often those searches turned up contraband (called the hit rate). Unsurprisingly, they found that black Americans and Hispanics were more likely to be stopped and searched than whites. Remember that this could be evidence of discrimination or evidence that they are simply more likely to be guilty.


However, when they adjusted for the hit rates (for nerds like me: by using a hierarchical Bayesian model) they found evidence that a majority of police departments discriminate against both African Americans and Hispanics by using a lower threshold for determining whether they needed to be searched. This is equivalent to viewing them more suspiciously because of their race, and this could not be explained by the higher probability of possessing contraband alone.


Therein lies the two-fold problem that we are currently facing:


1. There seems to be a low threshold for the use of lethal force that results in too many injuries and innocent deaths. This affects people of all races and especially the groups that are most likely to be involved in tense confrontations with police. This can be reduced by raising the bar for the use of lethal force, but, in light of the trade-offs between false positives and false negatives, it is important to recognize that this could increase the number of police injuries and deaths. Currently, there are around 1000 people (guilty and innocent) shot to death by police each year, while the number of police deaths is around 100. Sad though it may be, it is worth considering: is that the right ratio? Is it the best we can hope for?


2. The second problem is that the threshold for police intervention seems to be significantly lower for African Americans than for other racial groups, especially when compared with whites. This may be due in part to increased crime rates, but that cannot fully explain the discrepancy. As I discussed in a previous post, prejudice can emerge even in the absence of a legitimate statistical basis, and those types of biases are likely to be a factor in the higher stop rates, incarceration rates and death rates of African Americans.


To some extent the issues of police brutality and racial discrimination are separate issues. However, they become intertwined in cases like George Floyd and Tamir Rice. The first problem, police brutality, has no clear-cut “best” solution. We may all agree that some types of force are excessive and that the threshold needs to be changed. But we may never be able to agree on the right place to draw the line. Given the trade-offs, it makes sense why we are unable to do so. For the second problem, racial discrimination in policing, the solution seems more straightforward in principle: police departments need to apply the same standard of evidence to people of all races. This does not mean that the number of arrests will be proportional, but it does mean eliminating a double standard. Unfortunately, actually ensuring that they behave that way poses a significant challenge.


The implications of these different standards of evidence are profound. If, given the same evidence, African Americans are more likely to be stopped, searched, and arrested than their white counterparts, then statistically speaking it is far more likely that they will be victims of police brutality. In addition, given the pervasiveness of this trend, it would be surprising if there were not a similar discrepancy in the thresholds used when deciding whether to use force (such decisions are less common and therefore harder to analyze). Even if one were to concede that some level of racial profiling is warranted, the type of bias that results in lower standards of evidence hardly seems justifiable.


Inevitably, a lower threshold for minorities will result in increased false-positive rates. This is more than just an inconvenience. False positives mean more arrests and incarcerations of the innocent, and more deaths of unarmed men, women and children at the hands of police. As I discussed earlier, these discrepancies make an implicit statement about the value of life. If black Americans are policed with a lower threshold of evidence than white Americans, even after adjusting for other factors, then that is equivalent to placing a lower value on their lives when evaluating the risks. So, regardless of whether you agree with all of the methods or policies advocated by the Black Lives Matter movement, the evidence suggests that African Americans (and Hispanics too) are policed in a way that devalues their lives. And those who care about the value of human life should be able to say without reservation that this is wrong. Black lives should matter just as much as other lives.


PS. I will explain more about the model used by the research group at Stanford and dig into some of this open policing data for myself in the next post.
