Introduction

One of the main challenges with football data is identifying which records refer to the same player across different providers. Websites follow different naming conventions (especially for middle names and accents), and the data itself can be inconsistent: the same player may be listed with different dates of birth on different sites.

For example, Juventus’s #3 Gleison Bremer appears differently on different websites. **SofaScore** reports him as Bremer. **TransferMarkt** reports him with both his short name Bremer and his full name Gleison Bremer Silva Nascimento. **FBRef** has him as Gleison Bremer.

This is just one player, and we already have five different name variations across three platforms.

In this article, we’ll talk you through our solution for matching player records using fuzzy string matching and bidirectional validation, based on the player names and dates of birth.

<aside> 💡

Our code is available at this link: https://github.com/parmacalcio1913/players-matcher.

</aside>
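
To make the idea concrete, here is a minimal sketch of bidirectional matching on names and dates of birth. It is not the code from our repository. It swaps in Python's standard-library `difflib.SequenceMatcher` for the similarity score, and it requires dates of birth to match exactly, which is a simplification given that providers sometimes disagree on them. The helper names (`normalize`, `best_match`, `match_bidirectional`), the 0.5 threshold, and the sample records are all illustrative assumptions.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(record, candidates, min_similarity):
    """Index of the most similar candidate with the same birth date, or None."""
    best_idx, best_score = None, -1.0
    for idx, cand in enumerate(candidates):
        if cand["dob"] != record["dob"]:  # assumption: exact date-of-birth match
            continue
        score = similarity(record["name"], cand["name"])
        if score >= min_similarity and score > best_score:
            best_idx, best_score = idx, score
    return best_idx


def match_bidirectional(left, right, min_similarity=0.5):
    """Keep a pair (i, j) only if i's best match is j AND j's best match is i."""
    pairs = []
    for i, rec in enumerate(left):
        j = best_match(rec, right, min_similarity)
        if j is not None and best_match(right[j], left, min_similarity) == i:
            pairs.append((i, j))
    return pairs


# Illustrative records; "José Example" is a made-up player.
left = [
    {"name": "Bremer", "dob": "1997-03-18"},
    {"name": "José Example", "dob": "2000-01-01"},
]
right = [
    {"name": "Gleison Bremer", "dob": "1997-03-18"},
    {"name": "Jose Example", "dob": "2000-01-01"},
    {"name": "Another Player", "dob": "1997-03-18"},
]

print(match_bidirectional(left, right))  # [(0, 0), (1, 1)]
```

The reverse check is what makes the match bidirectional: a record on one side is only paired with its best candidate if that candidate, searched the other way, picks the same record back. This discards one-sided matches where a short name is merely the least bad option.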

Previous Work and Acknowledgments

Joris Bekkers has already published similar work in his article Designing a Player ID Matching System.

Our solution is largely based on string_grouper by Chris van den Berg and sparse_dot_topn by ING Bank.

Dataset

Our repository also contains a toy dataset that you can use to try out our solution. We manually collected player names and birth dates from two popular football websites:

The dataset deliberately includes some challenging scenarios: