The main goal of this project is to create and deploy an app that shows interactive results from a statistical model that detects players who are cheating on lichess. It is currently a work-in-progress and can be viewed on GitHub.

Introduction: the state of the art for cheat detection in online chess is typically a model that looks at the individual moves of a player’s games and examines features such as the player’s average centipawn loss, time usage on certain moves, and performance in critical positions. However, it is not feasible to run these models on every player’s games, so it is valuable to have a simpler statistical model that first flags players who are likely to be cheating.

Background on rating systems: for those new to online rating systems, ratings on lichess are calculated using the Glicko-2 rating system. This is a slight refinement of the Glicko rating system, and a major refinement of the Elo rating system. Both Glicko and Elo lead to approximately normally distributed ratings once enough players reach a steady state, but Glicko additionally tracks a rating deviation (RD), which measures the uncertainty in a player’s rating (roughly, its standard deviation). A player’s RD decreases after every game they play, because each game provides more evidence about their true strength, and it increases over time if they don’t play any games. In both rating systems, after every game, both players’ ratings change depending on how their performance compares to its expected value. In the Glicko rating system, the RD is also used as a weight for how much to change a player’s rating: the more certain we are about a rating, the less it moves. This makes players’ ratings more stable over time, which intuitively seems right, because after long enough most players’ ratings stabilize and the amount their rating can change should reflect that.
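
For concreteness, here is a rough sketch (not code from this project) of the Glicko expected score calculation; Glicko-2 works on a rescaled internal scale but the idea is the same, and the function names here are my own.

```python
import math

Q = math.log(10) / 400  # Glicko scale constant


def g(rd: float) -> float:
    """Discount factor: the higher the opponent's RD, the less the rating
    difference is trusted when predicting the result."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)


def expected_score(rating: float, opp_rating: float, opp_rd: float = 0.0) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) against one opponent.
    With opp_rd = 0 this reduces to the familiar Elo expectation."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))


# A 1500 player facing a 1700 opponent is expected to score roughly 0.24.
print(round(expected_score(1500, 1700), 2))
```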

One simple but useful set of features that can be extracted from millions of games is the players’ ratings and the results of those games. From the players’ ratings we can calculate the expected result of each game and compare it to the actual result. A player who decides to start cheating will play above their expected strength, and we can flag such players if they play too far above their expected strength for some minimum period of time, which lets us differentiate cheaters from players who are legitimately improving. It is worth noting that the thresholds should be different for different rating bins, because lower rated players are more likely to be underrated, while higher rated players are more likely to be overrated. [Why we might prefer this metric over average rating gain is discussed in the model development section.]
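
As an illustration of how this could be applied (my own sketch, not the project’s code; the threshold and minimum game count are placeholder values), a player is flagged only if their average over-performance is sustained over enough games:

```python
def over_performance(results, expected):
    """Average of (actual score - expected score) per game."""
    return sum(a - e for a, e in zip(results, expected)) / len(results)


def is_suspicious(results, expected, threshold=0.15, min_games=20):
    """Flag a player only if the over-performance is sustained over at least
    `min_games`, to separate likely cheating from a short hot streak."""
    if len(results) < min_games:
        return False
    return over_performance(results, expected) > threshold


# Example: 20 games scoring 14.0/20 while the expected total was 10.0/20,
# i.e. 0.2 points above expectation per game.
results = [1, 1, 0.5, 1, 0] * 4
expected = [0.5] * 20
print(is_suspicious(results, expected))  # True
```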

In the exploratory plot below, we can see that generally speaking, players in lower rating bins are over-performing, while players in higher rating bins are underperforming. I believe this is due to a combination of two effects: (1) improvement is easier for lower rated players than for higher rated players, and (2) lower rated and higher rated players are at extreme ends of the rating distribution so the lowest rated players almost exclusively face higher rated players and any result that isn’t a loss exceeds the expected result by a lot (with the opposite effect for the highest rated players).

Background on Model Development: The idea for the model is straightforward: we set a threshold on improvement, and too much improvement is flagged as suspicious. Average rating gain is one way to quantify improvement, but it is important to distinguish between a player having a good run and a player who is likely cheating (and there is also the issue that a player might gain a lot of rating points simply because of a high RD after a period of inactivity). A metric that takes this into consideration is how much a player exceeds their expected score on average. For example, consider the following situation:

  • player A scores 18.0/20 against similarly rated players, so their expected score was 10.0/20

  • player B scores 4.5/5 against much higher rated players, so their expected score was 0.0/5

  • player C hasn’t played in a while, returns with a high RD, and scores 4/4 against 4 weaker opponents, so their expected score was 3.5/4

  • all three players gain 50 rating points

Which player seems the most suspicious?

An average rating gain metric would tell us that player A gained 50 pts / 20 games = 2.5 points per game, player B gained 10 points per game, and player C gained 12.5 points per game. However, if we take expected score into account, player A scored (18.0-10.0)/20 = 0.4 points more than expected per game, player B scored (4.5-0.0)/5 = 0.9 points more than expected per game, and player C scored (4.0-3.5)/4 = 0.125 points more than expected per game. So player B is the most suspicious, followed by player A and then player C.
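
The same arithmetic in code, using the numbers from the example above (the helper structure is mine, just to make the comparison explicit):

```python
# (actual score, expected score, games played, rating points gained)
players = {
    "A": (18.0, 10.0, 20, 50),
    "B": (4.5, 0.0, 5, 50),
    "C": (4.0, 3.5, 4, 50),
}

for name, (actual, expected, games, gain) in players.items():
    rating_gain_per_game = gain / games
    excess_score_per_game = (actual - expected) / games
    print(f"{name}: {rating_gain_per_game:.2f} rating pts/game, "
          f"{excess_score_per_game:.3f} above expectation/game")

# A: 2.50 rating pts/game, 0.400 above expectation/game
# B: 10.00 rating pts/game, 0.900 above expectation/game  <- most suspicious
# C: 12.50 rating pts/game, 0.125 above expectation/game  <- least suspicious
```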

A slight refinement to this metric is to take the rating bin into account. Since players’ rating behaviors (such as under-performing or over-performing) vary by bin, the thresholds are set separately for each bin. How the thresholds are tuned is discussed in the next section, which describes how model training and labeling the data are tied together.
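
One way this per-bin structure could look (the bin boundaries and threshold values below are hypothetical, shown only to illustrate the idea):

```python
# Thresholds on average excess score, keyed by the lower bound of each rating bin.
# Lower bins get looser thresholds because those players over-perform on average.
BIN_THRESHOLDS = {800: 0.20, 1400: 0.17, 2000: 0.15}


def threshold_for(rating: float) -> float:
    """Return the threshold for the bin that a player's rating falls into."""
    bin_lower = max((b for b in BIN_THRESHOLDS if b <= rating),
                    default=min(BIN_THRESHOLDS))
    return BIN_THRESHOLDS[bin_lower]
```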

Building and Fitting the Model: The main idea is to set a threshold on how much a player’s actual score against their opponents exceeds their expected score, on average. This gives us a feature with a range of 0.00 to 1.00 that is, in theory, relatively easy to tune. However, we now have the issue of fitting the model to real observations, which carry inherent uncertainty: how do we know who actually cheated? lichess.org has its own cheat detection, and players’ accounts are flagged as having violated the terms of service. However, it is worth noting that rating manipulation and abusive behavior are two examples of terms of service violations that aren’t due to cheating. There are also edge cases such as a closed account, which can be done voluntarily by a player who isn’t cheating, but is “often” the signature of an account whose owner was caught cheating by lichess and closed the account (presumably to avoid having their username publicly labeled as having violated the terms of service). In addition, not every player who has cheated has been caught, so in that sense we are working with limited, conservative labels.

I estimate the number of cheaters as X = 0.00 * N_open + 0.75 * N_closed + 1.00 * N_violation, where N_open is the number of open accounts, N_closed is the number of closed accounts, and N_violation is the number of accounts that have violated the terms of service (so the total number of accounts is N = N_open + N_closed + N_violation). The choice of 0.75 for closed accounts is arbitrary and based on personal experience, but it is an important hyperparameter that changes how selective the model is in flagging cheaters.
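
In code, this estimate is a direct translation of the formula above (the variable names are mine):

```python
# Weight per account status: open accounts count as 0 cheaters, terms-of-service
# violations count fully, and closed accounts count as 0.75 (the hand-tuned guess
# described above).
STATUS_WEIGHTS = {"open": 0.00, "closed": 0.75, "violation": 1.00}


def estimated_cheaters(statuses):
    """Estimate X from a list of account statuses for the flagged players."""
    return sum(STATUS_WEIGHTS[s] for s in statuses)
```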

The metric used to fit the model is then log(X+1) * X / N, where N is the number of accounts above the threshold being evaluated and X is the estimated number of cheaters among them. The reasoning behind using this metric instead of accuracy = X / N is that optimizing accuracy alone may lead to a very high threshold that catches 1 out of 1 cheaters, whereas it is preferable to capture more cheaters even if the accuracy decreases. The additional multiplier term log(X+1) therefore prioritizes capturing a high number of cheaters, but not such a high number that the thresholds flag an inflated share of players (e.g. in excess of 10% of a particular rating bin). There isn’t any particular mathematics behind this multiplier term, and perhaps an inverse normal distribution would be more appropriate given the normal distribution of ratings. However, so far this metric has led to reasonable thresholds and reasonable numbers of flagged players during training runs.
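
As a sketch, the objective being maximized looks like this (my own translation of the formula above):

```python
import math


def objective(x: float, n: int) -> float:
    """x: estimated cheaters among the n accounts flagged at a candidate threshold.
    log(x + 1) rewards catching many cheaters; x / n rewards precision."""
    if n == 0:
        return 0.0
    return math.log(x + 1) * x / n


# Catching 1 of 1 flagged accounts scores lower than catching 8 of 12,
# even though the former is "more accurate".
print(objective(1, 1))   # ~0.69
print(objective(8, 12))  # ~1.46
```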

This leads to the final part of model fitting: the process of labeling the data. lichess has a rate-limited API, so it is not feasible to label the entire dataset. Instead, we keep a dictionary that tracks the account status of each player so that we only look up a given account once, and we search for the optimal threshold starting at 0.15, in increments of +0.01, until the metric log(X+1) * X / N is maximized, as shown in the model fitting plots below:

[Model fitting plots]
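
Under these assumptions, the overall fitting loop might look like the sketch below; the lichess lookup (fetch_status_from_lichess) is a hypothetical placeholder for the rate-limited API call, and the stopping point of the sweep is my own simplification:

```python
import math

STATUS_WEIGHTS = {"open": 0.00, "closed": 0.75, "violation": 1.00}
account_status_cache = {}  # username -> "open" | "closed" | "violation"


def get_status(username):
    """Look up an account's status at most once and cache the result."""
    if username not in account_status_cache:
        # hypothetical helper standing in for the rate-limited lichess API call
        account_status_cache[username] = fetch_status_from_lichess(username)
    return account_status_cache[username]


def fit_threshold(players, start=0.15, step=0.01, max_threshold=0.50):
    """Sweep thresholds upward from `start` and return the one that maximizes
    log(X+1) * X / N. `players` maps username -> average excess score per game."""
    best_threshold, best_score = start, 0.0
    t = start
    while t <= max_threshold:
        flagged = [u for u, excess in players.items() if excess > t]
        n = len(flagged)
        if n == 0:
            break
        x = sum(STATUS_WEIGHTS[get_status(u)] for u in flagged)
        score = math.log(x + 1) * x / n
        if score > best_score:
            best_threshold, best_score = t, score
        t += step
    return best_threshold
```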
