What Makes the Best Defensive Footballers?

Analyzing player attributes from FIFA '17 player rankings.

Author: Brian Daisey

Introduction

In football (soccer) as in most sports, the lion's share of the glory goes to the top offensive players. This project shifts the focus onto the defensive element that actually wins championships.

Using detailed player data from the FIFA '17 video game we will examine how defenders match up against the top 1000 players and one another. Then we take a closer look at the defensive players by examining individual attributes and their contribution to the overall player ratings. Hopefully after reading you will have a better idea of which skills are most crucial for the success of defensive footballers!

Outline

  1. Getting Started
    • 1.1 Required Libraries
    • 1.2 Dataset Source
    • 1.3 Load and View Data
    • 1.4 Isolate Top 1000 Players
    • 1.5 Isolate Defenders
  2. Explore Top 1000 Players
    • 2.1 Top 1000 Player Breakdown by Position
    • 2.2 How Many Defenders in Top 1000?
  3. Explore Defenders
    • 3.1 Defenders by Position
    • 3.2 Defenders by Rating
  4. Begin Defender Analysis
    • 4.1 Selection of Attributes
    • 4.2 Attributes vs. Rating, Visually
  5. Reveal Key Attributes with Multiple Linear Regression
    • 5.1 Null Hypothesis
    • 5.2 Regression with scikit-learn
    • 5.3 Regression with statsmodels
  6. Predict Rating From Attributes with Machine Learning
    • 6.1 Predictions with Train/Test Split
  7. Conclusions
    • 7.1 Key Attributes
    • 7.2 Prediction of Overall Rating

1. Getting Started

1.1 Required Libraries

  • Pandas: used for data display and partitioning
  • Matplotlib - pyplot: used for plotting Pandas data into graphs and charts
  • Seaborn: provides a high-level interface for graphics on top of Matplotlib
  • scikit-learn: very popular machine learning library
    • linear-model: used to calculate models for multiple linear regression
    • model_selection: used to split up dataset into test and training data and evaluate predictions
  • statsmodels - api: used to calculate models and statistics with multple linear regression
In [280]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn import model_selection
from statsmodels import api as sm

1.2 Dataset Source

The dataset used is the ranking for all 17,588 players present in the popular video game FIFA 2017. It contains detailed rankings for all feasible aspects of a players performance, most of which are a ranking out of 100. There is also qualitative data such as Position, Nationality and Club Name. In this project we will be concerned with the quantitave rankings of the players and their postitions.

The dataset was found on Kaggle. There is not much metadata on the Kaggle page, so FIFA Index was used to gain a greater understanding of the attributes present in the dataset and their meaning. Luckily, the FIFA Index pages for individual player statistics very closely match the terms used in the Kaggle dataset.

  • Kaggle dataset here.
  • Example of individual player page on Fifa Index here.

1.3 Load and View Data

Load .csv file and display the first 5 rows to get a sense of what is contained in the dataset.

In [281]:
fifa = pd.read_csv("FullData.csv")
fifa.head()
Out[281]:
Name Nationality National_Position National_Kit Club Club_Position Club_Kit Club_Joining Contract_Expiry Rating ... Long_Shots Curve Freekick_Accuracy Penalties Volleys GK_Positioning GK_Diving GK_Kicking GK_Handling GK_Reflexes
0 Cristiano Ronaldo Portugal LS 7.0 Real Madrid LW 7.0 07/01/2009 2021.0 94 ... 90 81 76 85 88 14 7 15 11 11
1 Lionel Messi Argentina RW 10.0 FC Barcelona RW 10.0 07/01/2004 2018.0 93 ... 88 89 90 74 85 14 6 15 11 8
2 Neymar Brazil LW 10.0 FC Barcelona LW 11.0 07/01/2013 2021.0 92 ... 77 79 84 81 83 15 9 15 9 11
3 Luis Suárez Uruguay LS 9.0 FC Barcelona ST 9.0 07/11/2014 2021.0 92 ... 86 86 84 85 88 33 27 31 25 37
4 Manuel Neuer Germany GK 1.0 FC Bayern GK 1.0 07/01/2011 2021.0 92 ... 16 14 11 47 11 91 89 95 90 89

5 rows × 53 columns

Display some basic statistics about the dataset. Since summary statistics are used in the pandas describe() method, only the columns with quantitative data are displayed. Use the 'count' row to get a sense of how complete each column is. There are 17588 players in the dataset in total, so we know that the rows for the specific skill attributes and overall rating are completely populated.

In [282]:
fifa.describe()
Out[282]:
National_Kit Club_Kit Contract_Expiry Rating Age Weak_foot Skill_Moves Ball_Control Dribbling Marking ... Long_Shots Curve Freekick_Accuracy Penalties Volleys GK_Positioning GK_Diving GK_Kicking GK_Handling GK_Reflexes
count 1075.000000 17587.000000 17587.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 ... 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000 17588.000000
mean 12.219535 21.294536 2018.899414 66.166193 25.460314 2.934103 2.303161 57.972766 54.802877 44.230327 ... 47.403173 47.181146 43.383443 49.165738 43.275586 16.609620 16.823061 16.458324 16.559814 16.901183
std 6.933187 19.163741 1.698787 7.083012 4.680217 0.655927 0.746156 16.834779 18.913857 21.561703 ... 19.211887 18.464396 17.701903 15.871735 17.710839 17.139904 17.798052 16.600741 16.967256 18.034485
min 1.000000 1.000000 2017.000000 45.000000 17.000000 1.000000 1.000000 5.000000 4.000000 3.000000 ... 4.000000 6.000000 4.000000 7.000000 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 6.000000 9.000000 2017.000000 62.000000 22.000000 3.000000 2.000000 53.000000 47.000000 22.000000 ... 32.000000 34.000000 31.000000 39.000000 30.000000 8.000000 8.000000 8.000000 8.000000 8.000000
50% 12.000000 18.000000 2019.000000 66.000000 25.000000 3.000000 2.000000 63.000000 60.000000 48.000000 ... 52.000000 48.000000 42.000000 50.000000 44.000000 11.000000 11.000000 11.000000 11.000000 11.000000
75% 18.000000 27.000000 2020.000000 71.000000 29.000000 3.000000 3.000000 69.000000 68.000000 64.000000 ... 63.000000 62.000000 57.000000 61.000000 57.000000 14.000000 14.000000 14.000000 14.000000 14.000000
max 36.000000 99.000000 2023.000000 94.000000 47.000000 5.000000 5.000000 95.000000 97.000000 92.000000 ... 91.000000 92.000000 93.000000 96.000000 93.000000 91.000000 89.000000 95.000000 91.000000 90.000000

8 rows × 41 columns

Two columns that will be very important for further analysis are 'Rating' and 'Club_Position', so we need to keep only the rows where those columns both have actual values.

In [249]:
fifa = fifa[pd.notnull(fifa['Rating'])]
fifa = fifa[pd.notnull(fifa['Club_Position'])]
fifa.describe()
Out[249]:
National_Kit Club_Kit Contract_Expiry Rating Age Weak_foot Skill_Moves Ball_Control Dribbling Marking ... Long_Shots Curve Freekick_Accuracy Penalties Volleys GK_Positioning GK_Diving GK_Kicking GK_Handling GK_Reflexes
count 1075.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 ... 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000 17587.000000
mean 12.219535 21.294536 2018.899414 66.165349 25.459544 2.934042 2.303122 57.971513 54.801785 44.231592 ... 47.401376 47.179394 43.381134 49.163757 43.273725 16.610223 16.823449 16.458805 16.560130 16.901348
std 6.933187 19.163741 1.698787 7.082330 4.679237 0.655896 0.746159 16.834438 18.913841 21.561664 ... 19.210956 18.463458 17.699756 15.870012 17.709623 17.140204 17.798483 16.601091 16.967686 18.034984
min 1.000000 1.000000 2017.000000 45.000000 17.000000 1.000000 1.000000 5.000000 4.000000 3.000000 ... 4.000000 6.000000 4.000000 7.000000 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 6.000000 9.000000 2017.000000 62.000000 22.000000 3.000000 2.000000 53.000000 47.000000 22.000000 ... 32.000000 34.000000 31.000000 39.000000 30.000000 8.000000 8.000000 8.000000 8.000000 8.000000
50% 12.000000 18.000000 2019.000000 66.000000 25.000000 3.000000 2.000000 63.000000 60.000000 48.000000 ... 52.000000 48.000000 42.000000 50.000000 44.000000 11.000000 11.000000 11.000000 11.000000 11.000000
75% 18.000000 27.000000 2020.000000 71.000000 29.000000 3.000000 3.000000 69.000000 68.000000 64.000000 ... 63.000000 62.000000 57.000000 61.000000 57.000000 14.000000 14.000000 14.000000 14.000000 14.000000
max 36.000000 99.000000 2023.000000 94.000000 47.000000 5.000000 5.000000 95.000000 97.000000 92.000000 ... 91.000000 92.000000 93.000000 96.000000 93.000000 91.000000 89.000000 95.000000 91.000000 90.000000

8 rows × 41 columns

These columns were nearly complete originally. From the 'count' row in the description of the data above, it appears that only one row was dropped from the dataset.

1.4 Isolate Top 1000 Players

Create a dataframe with only the top 1000 players. This will be based on the 'Rating' field, which contains an overall ranking of each player from 1-100.

In [283]:
# sort descending by Rating
sorted = fifa.copy().sort_values('Rating', ascending=False)

# get top 1000 players
top1000 = sorted.head(1000)
top1000.head()
Out[283]:
Name Nationality National_Position National_Kit Club Club_Position Club_Kit Club_Joining Contract_Expiry Rating ... Long_Shots Curve Freekick_Accuracy Penalties Volleys GK_Positioning GK_Diving GK_Kicking GK_Handling GK_Reflexes
0 Cristiano Ronaldo Portugal LS 7.0 Real Madrid LW 7.0 07/01/2009 2021.0 94 ... 90 81 76 85 88 14 7 15 11 11
1 Lionel Messi Argentina RW 10.0 FC Barcelona RW 10.0 07/01/2004 2018.0 93 ... 88 89 90 74 85 14 6 15 11 8
2 Neymar Brazil LW 10.0 FC Barcelona LW 11.0 07/01/2013 2021.0 92 ... 77 79 84 81 83 15 9 15 9 11
3 Luis Suárez Uruguay LS 9.0 FC Barcelona ST 9.0 07/11/2014 2021.0 92 ... 86 86 84 85 88 33 27 31 25 37
4 Manuel Neuer Germany GK 1.0 FC Bayern GK 1.0 07/01/2011 2021.0 92 ... 16 14 11 47 11 91 89 95 90 89

5 rows × 53 columns

If you view more than just the head of this dataset, lowest rating among top 1000 is 77. There are more players ranked 77 beyond the top 1000, so this is not a perfect sampling. Since there is no other 'tell-all' attribute, no clear secondary ranking criteria is evident. However, this is not the main focus of the project and this top 1000 set will be suitable for our analysis.

1.5 Isolate Defenders

To isolate the defenders from this dataset, let's first view all the unique position names from the column 'Club_Position'. This column is used over National Position because it has values for all players in the dataset, while National Position only has slightly over 1000 values filled.

In [312]:
fifa['Club_Position'].unique()
Out[312]:
array(['LW', 'RW', 'ST', 'GK', 'Sub', 'RCM', 'CAM', 'LCB', 'LCM', 'RS',
       'RB', 'RCB', 'LM', 'LDM', 'RM', 'LB', 'CDM', 'RDM', 'LF', 'CB',
       'LAM', 'Res', 'CM', 'LS', 'RF', 'RWB', 'RAM', 'LWB', nan, 'CF'], dtype=object)

To see what these abbreviations actually correspond to, here is an image of football positions on a field:

The 7 defensive positions are circled in red above. The abbreviations stand for Left Wingback, Left Back, Left Center Back, Center Back, Right Center Back, Right Back, and Right Wingback.

Now we can make a list of the defensive position acronyms and filter the dataset to return only the defensive players.

In [284]:
def_positions = ['LWB', 'LB', 'LCB', 'CB', 'RCB', 'RB', 'RWB']
defenders = fifa.copy()

#filter based on only records that have the positions present in def_positions above
defenders = defenders[defenders['Club_Position'].isin(def_positions)]
defenders.head()
Out[284]:
Name Nationality National_Position National_Kit Club Club_Position Club_Kit Club_Joining Contract_Expiry Rating ... Long_Shots Curve Freekick_Accuracy Penalties Volleys GK_Positioning GK_Diving GK_Kicking GK_Handling GK_Reflexes
15 Thiago Silva Brazil Sub 14.0 PSG LCB 2.0 07/01/2012 2020.0 89 ... 71 61 73 71 63 9 9 5 12 10
16 Sergio Ramos Spain LCB 15.0 Real Madrid LCB 4.0 08/01/2005 2020.0 89 ... 55 73 67 68 66 7 11 9 8 11
24 Diego Godín Uruguay LCB 3.0 Atlético Madrid LCB 2.0 08/01/2010 2019.0 88 ... 43 49 51 50 47 5 6 15 8 15
25 Mats Hummels Germany LCB 5.0 FC Bayern LCB 5.0 07/01/2016 2021.0 88 ... 51 65 53 68 60 5 15 10 6 6
27 Giorgio Chiellini Italy LCB 3.0 Juventus LCB 3.0 07/01/2005 2018.0 88 ... 49 60 31 50 45 4 3 2 3 3

5 rows × 53 columns

Verify we still have all 7 defensive positions represented.

In [313]:
defenders['Club_Position'].unique()
Out[313]:
array(['LCB', 'RB', 'RCB', 'LB', 'CB', 'RWB', 'LWB'], dtype=object)

From the counts in Rating and individual attributes below, it appears that 2,534 out of the total 17,587 players are listed as defensive players. This yields a large enough dataset for further analysis.

In [285]:
defenders.describe()
Out[285]:
National_Kit Club_Kit Contract_Expiry Rating Age Weak_foot Skill_Moves Ball_Control Dribbling Marking ... Long_Shots Curve Freekick_Accuracy Penalties Volleys GK_Positioning GK_Diving GK_Kicking GK_Handling GK_Reflexes
count 198.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 ... 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000 2534.000000
mean 10.070707 14.911208 2018.826756 68.732044 26.938043 2.842541 2.123520 59.636543 53.398185 66.851618 ... 42.773481 45.623520 41.534333 45.276243 37.470008 10.714680 10.694949 10.770324 10.747040 10.715864
std 7.000004 14.730465 1.651298 5.989564 4.093811 0.610532 0.338559 9.757831 13.485955 7.735976 ... 15.472043 15.660997 14.874850 11.765115 12.713013 3.193751 3.124549 3.309674 3.184599 3.192558
min 2.000000 2.000000 2017.000000 49.000000 18.000000 1.000000 2.000000 25.000000 14.000000 23.000000 ... 11.000000 11.000000 10.000000 11.000000 11.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 4.000000 4.000000 2017.250000 65.000000 24.000000 2.000000 2.000000 54.000000 44.000000 62.000000 ... 30.000000 33.000000 30.000000 37.000000 28.000000 8.000000 8.000000 8.000000 8.000000 8.000000
50% 6.000000 12.000000 2018.000000 69.000000 27.000000 3.000000 2.000000 61.000000 56.000000 67.000000 ... 42.000000 44.000000 39.000000 44.000000 35.000000 11.000000 11.000000 11.000000 11.000000 11.000000
75% 16.000000 23.000000 2020.000000 72.000000 30.000000 3.000000 2.000000 66.000000 64.000000 72.000000 ... 55.000000 58.000000 52.000000 53.000000 46.000000 13.000000 13.000000 13.000000 13.000000 13.000000
max 23.000000 98.000000 2023.000000 89.000000 39.000000 5.000000 4.000000 87.000000 84.000000 92.000000 ... 88.000000 86.000000 87.000000 90.000000 80.000000 32.000000 32.000000 41.000000 33.000000 34.000000

8 rows × 41 columns

2. Explore Top 1000 Players

2.1 Top 1000 Player Breakdown by Position

To find the position breakdown of the top 1000 players we will need a list of all positions and a count of how many players are listed at each. First, we will use Pandas groupby() function to combine all the rows with the same position. Chaining this with count() is what yields our position counts. Finally, sort these counts in a descending manner to view which positions are the most prevalent among the top 1000 players.

In [286]:
by_position = top1000.groupby(top1000['Club_Position']).count().reset_index()
by_position = by_position[['Club_Position','Rating']]
by_position = by_position.sort_values('Rating', ascending=False)
by_position.columns = ['Club_Position','Count']
by_position.head(13)
Out[286]:
Club_Position Count
27 Sub 295
4 GK 83
26 ST 65
17 RCB 64
7 LCB 59
0 CAM 37
16 RB 36
8 LCM 35
18 RCM 35
21 RM 32
11 LM 31
6 LB 30
25 Res 26

Unexpected Results!

This is not an ideal result because of the 'Sub' and 'Res' players that are in the dataset. Substitutes and Reserves represent a variety of positions, so this is not informative for our analysis of the breakdown of the top 1000 players by position.

To rectify, we drop all 'Sub' and 'Res' players from the dataset. Then, resample and re-group the top 1000 players.

In [287]:
# drop substitute and reserve players
drop_sub_res = fifa.copy()
drop_sub_res = drop_sub_res[drop_sub_res['Club_Position'] != 'Sub']
drop_sub_res = drop_sub_res[drop_sub_res['Club_Position'] != 'Res']

# sort descending by Rating
sorted = drop_sub_res.copy().sort_values('Rating', ascending=False)

# get new top 1000 players by position
new_top1000 = sorted.copy().head(1000)

by_position = new_top1000.groupby(new_top1000['Club_Position']).count().reset_index()
by_position = by_position[['Club_Position','Rating']]
by_position = by_position.sort_values('Rating', ascending=False)
by_position.columns = ['Club_Position','Count']
by_position.head()
Out[287]:
Club_Position Count
4 GK 119
17 RCB 99
25 ST 88
7 LCB 81
0 CAM 55

Verify that all 'Sub' and 'Res' players have indeed been dropped from our top 1000 players.

In [257]:
by_position['Club_Position'].unique()
Out[257]:
array(['GK', 'RCB', 'ST', 'LCB', 'CAM', 'LB', 'RB', 'RM', 'LCM', 'LM',
       'RCM', 'RDM', 'LW', 'LDM', 'RW', 'LS', 'CDM', 'RS', 'CB', 'CM',
       'RAM', 'LWB', 'RWB', 'LAM', 'RF', 'LF'], dtype=object)

Much better. Let's now plot a histogram of the breakdown of the positions among the top 1000 players for a more visual overview of the distribution.

In [258]:
plt.figure(figsize=(12,6))
plt.title("Top 1000 Players, By Position", fontsize=16)
sns.barplot(data=by_position, x='Club_Position', y='Count')
plt.show()

Obsevations:

  • Goalkeepers (GK) are the most represented position among the top 1000 players.
  • Strikers (ST), at 3rd most represented, are the only offensive players in the top 12 positions.
  • The remaining 10 of the top 12 spots are split between defensive and midfield players.

2.2 How Many Defenders in Top 1000?

To answer this question, use the list of defensive positions generated in Section 1.5. Simply return a count of the number of rows comprising these positions from within the top 1000 players data.

In [259]:
def1000 = new_top1000.copy()
def1000 = def1000[def1000['Club_Position'].isin(def_positions)]
res = def1000.Rating.count()
print("There are {} defenders among the top 1000 players.".format(res))
There are 306 defenders among the top 1000 players.

There are 3 main categories of positions in football: defensive, midfield, and offensive. Subtracting the 119 Goalkeeprs found above, defensive players make up very close to 1/3 of the remaining top 1000 players, so this result is not surprising.

3. Explore Defenders

3.1 Defenders by Position

Let's now take a look at the breakdown of total defensive players by position. Follow the same paradigm as player breakdown in Section 2.1: Group by position, get a count, sort descending, plot!

In [260]:
def_grouped = defenders.groupby(defenders['Club_Position']).count().reset_index()
def_grouped = def_grouped[['Club_Position','Rating']]
def_grouped = def_grouped.sort_values('Rating', ascending=False)
def_grouped.columns = ['Club_Position','Count']
def_grouped
Out[260]:
Club_Position Count
5 RCB 633
2 LCB 631
1 LB 549
4 RB 548
0 CB 83
3 LWB 45
6 RWB 45
In [261]:
plt.figure(figsize=(12,6))
plt.title("Defensive Player Position Distribution", fontsize=16)
sns.barplot(data=def_grouped, x='Club_Position', y='Count')
plt.show()

Center Backs, Left Wingbacks, and Right Wingbacks have far fewer players listed than the other positions. Perhaps this is due to differing nomenclature across the world for the same positions?

There is also undoubtedly overlap between the RCB, LCB, and CB in that Right and Left Center Backs surely fill in at Center Back from time to time. Also, the positions names may change based on different formations being used throughout the course of a match. This is quickly getting out of depth of my football knowledge, let's proceed.

3.2 Defenders by Rating

Follow the (now familiar) group, count, sort, plot paradigm to get an idea of the ratings distribution for all defensive players.

In [262]:
# group by rating, count, sort
def_by_rating = defenders.groupby(defenders['Rating']).count().reset_index()
def_by_rating = def_by_rating[['Club_Position','Rating']]
def_by_rating = def_by_rating.sort_values('Rating', ascending=False)
def_by_rating.columns = ['Count','Rating']

# plot rating distribution
plt.figure(figsize=(12,6))
plt.title("Defensive Player Rating Distribution", fontsize=16)
sns.barplot(data=def_by_rating, x='Rating', y='Count')
plt.show()

This result is about what you'd expect: many average ratings, with a few very lowly and very highly ranked players. The majority of the players are rated between 60 and 77. Also looks like the data is fairly normally distributed.

4. Begin Defender Analysis

4.1 Selection of Attributes

Now that we have explored the data a bit, we can get down to analysis. The first thing to do is view all of the columns present in the dataset.

In [263]:
# get a list of all the columns
cols = defenders.columns.tolist()

#display all results in a more compact manner
for i,col in enumerate(cols):
    if i%5 ==0:
        print(" ") #new line
    print(col,"   ", end=" ") #do not go to new line after printing each item
 
Name     Nationality     National_Position     National_Kit     Club      
Club_Position     Club_Kit     Club_Joining     Contract_Expiry     Rating      
Height     Weight     Preffered_Foot     Birth_Date     Age      
Preffered_Position     Work_Rate     Weak_foot     Skill_Moves     Ball_Control      
Dribbling     Marking     Sliding_Tackle     Standing_Tackle     Aggression      
Reactions     Attacking_Position     Interceptions     Vision     Composure      
Crossing     Short_Pass     Long_Pass     Acceleration     Speed      
Stamina     Strength     Balance     Agility     Jumping      
Heading     Shot_Power     Finishing     Long_Shots     Curve      
Freekick_Accuracy     Penalties     Volleys     GK_Positioning     GK_Diving      
GK_Kicking     GK_Handling     GK_Reflexes     

This is somewhat overwhelming, especially because there are so many attributes for specific individual player skills.

Luckily, the website FIFA Index maintains detailed information about each player using these same attributes, as well as additional player information such as bios and club membership. FIFA Index can provide us with some guidance moving forward. By viewing one of the individual player pages, we can see that the 34 specific player attributes are grouped into 7 categories. The 5 categories we will focus on are highlighted in red below.

Eliminate Shooting attributes because that does not factor heavily into a defensive player's game. Eliminate Goalkeeper attributes because it is unclear what this category means.

Use the Fifa index to get a list of all the attributes we will be interested in for defensive players.

In [288]:
# lists of attributes by category
ball_skills = ['Ball_Control', 'Dribbling']
passing = ['Crossing','Short_Pass','Long_Pass']
defence = ['Marking', 'Sliding_Tackle', 'Standing_Tackle']
mental = ['Aggression','Reactions','Attacking_Position','Interceptions','Vision','Composure']
physical = ['Acceleration','Stamina','Strength','Balance','Speed','Agility','Jumping']

# list of all defensive attributes
def_group_list = [ball_skills, passing, defence, mental, physical]
# flattens a list of lists
def_attrs_list = [item for sublist in def_group_list for item in sublist]
def_attrs_list
Out[288]:
['Ball_Control',
 'Dribbling',
 'Crossing',
 'Short_Pass',
 'Long_Pass',
 'Marking',
 'Sliding_Tackle',
 'Standing_Tackle',
 'Aggression',
 'Reactions',
 'Attacking_Position',
 'Interceptions',
 'Vision',
 'Composure',
 'Acceleration',
 'Stamina',
 'Strength',
 'Balance',
 'Speed',
 'Agility',
 'Jumping']

Build a new dataframe from our defensive attributes list, the player ratings overall, position, and player name. Dataframe should have 24 columns: 21 for the attributes and 3 additional for the fields newly added to the list.

In [265]:
new_df_list = ['Name','Club_Position','Rating'] + def_attrs_list

def_with_attrs = defenders.copy()
def_with_attrs = def_with_attrs[new_df_list]
def_with_attrs.head()
Out[265]:
Name Club_Position Rating Ball_Control Dribbling Crossing Short_Pass Long_Pass Marking Sliding_Tackle ... Interceptions Vision Composure Acceleration Stamina Strength Balance Speed Agility Jumping
15 Thiago Silva LCB 89 80 68 60 79 81 90 89 ... 91 74 80 72 74 81 68 76 75 93
16 Sergio Ramos LCB 89 83 61 66 76 70 85 90 ... 88 63 76 77 84 81 60 78 80 92
24 Diego Godín LCB 88 76 53 55 79 70 87 89 ... 88 52 80 62 69 80 58 67 63 89
25 Mats Hummels LCB 88 77 68 64 78 82 85 87 ... 89 79 91 62 66 85 58 65 64 68
27 Giorgio Chiellini LCB 88 55 56 58 55 59 92 90 ... 88 50 78 69 69 91 65 79 59 89

5 rows × 24 columns

Now we will reduce the overall number of attributes by taking the average of the attributes within each category.

Average is used rather than sum because the categories have different numbers of attributes, so this will allow the "category attributes" to all remain on the same scale (out of 100).

In [266]:
# could have used a loop here, but this makes it very clear what is happening
def_with_attrs['Ball_Skills'] = def_with_attrs[ball_skills].mean(axis=1)
def_with_attrs['Passing'] = def_with_attrs[passing].mean(axis=1)
def_with_attrs['Defence'] = def_with_attrs[defence].mean(axis=1)
def_with_attrs['Mental'] = def_with_attrs[mental].mean(axis=1)
def_with_attrs['Physical'] = def_with_attrs[physical].mean(axis=1)
def_with_attrs.head()
Out[266]:
Name Club_Position Rating Ball_Control Dribbling Crossing Short_Pass Long_Pass Marking Sliding_Tackle ... Strength Balance Speed Agility Jumping Ball_Skills Passing Defence Mental Physical
15 Thiago Silva LCB 89 80 68 60 79 81 90 89 ... 81 68 76 75 93 74.0 73.333333 90.000000 77.500000 77.000000
16 Sergio Ramos LCB 89 83 61 66 76 70 85 90 ... 81 60 78 80 92 72.0 70.666667 88.000000 74.166667 78.857143
24 Diego Godín LCB 88 76 53 55 79 70 87 89 ... 80 58 67 63 89 64.5 68.000000 87.333333 73.166667 69.714286
25 Mats Hummels LCB 88 77 68 64 78 82 85 87 ... 85 58 65 64 68 72.5 74.666667 88.000000 77.666667 66.857143
27 Giorgio Chiellini LCB 88 55 56 58 55 59 92 90 ... 91 65 79 59 89 55.5 57.333333 91.333333 68.666667 74.428571

5 rows × 29 columns

View the right-most columns of the above dataframe to see the added categories. Spot-checking the new categories for a couple players reveals that our averages were computed correctly.

Create a new dataframe with just the grouped categories and descriptor columns.

In [289]:
def_grouped_df_list = ['Name','Club_Position','Rating','Ball_Skills','Passing','Defence','Mental','Physical']
def_grouped = def_with_attrs[def_grouped_df_list].reset_index()
def_grouped.head()
Out[289]:
index Name Club_Position Rating Ball_Skills Passing Defence Mental Physical
0 15 Thiago Silva LCB 89 74.0 73.333333 90.000000 77.500000 77.000000
1 16 Sergio Ramos LCB 89 72.0 70.666667 88.000000 74.166667 78.857143
2 24 Diego Godín LCB 88 64.5 68.000000 87.333333 73.166667 69.714286
3 25 Mats Hummels LCB 88 72.5 74.666667 88.000000 77.666667 66.857143
4 27 Giorgio Chiellini LCB 88 55.5 57.333333 91.333333 68.666667 74.428571

Now we have a concise dataframe with the attributes grouped and averaged by their more general categories.

4.2 Attributes vs. Rating, Visually

Create a correlation heatmap to see how the various attributes correlate to Rating and to one another. First need to create a dataframe with just the columns we want to correlate. Then use the pandas function .corr() to create a corellation matrix.

In [290]:
just_cats = def_grouped[['Rating','Ball_Skills','Passing','Defence','Mental','Physical']]
just_cats = just_cats.corr()
just_cats
Out[290]:
Rating Ball_Skills Passing Defence Mental Physical
Rating 1.000000 0.467982 0.540002 0.889679 0.763568 0.275776
Ball_Skills 0.467982 1.000000 0.839580 0.295610 0.707785 0.640836
Passing 0.540002 0.839580 1.000000 0.380827 0.763691 0.536702
Defence 0.889679 0.295610 0.380827 1.000000 0.593483 0.140040
Mental 0.763568 0.707785 0.763691 0.593483 1.000000 0.440334
Physical 0.275776 0.640836 0.536702 0.140040 0.440334 1.000000

Plot the correlation matrix onto a heatmap. The heatmap plot within Seaborn makes this very easy to do. Only the first parameter within sns.heatmap is required, the remainder just relate to the display of the plot.

In [291]:
plt.figure(figsize=(12,8))
plt.title("Correlation Heatmap of Rating and Defensive Attributes", fontsize=18)
sns.heatmap(just_cats, cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, annot=True)
plt.show()

From the heatmap above, we can see that ALL attributes and overall rating are positively correlated to one another using the scale on the right or the annotations in each cell. The more red a box is, the stronger the correlation, and the more blue a box is, the weaker the correlation.

Observations:

  • Rating is most strongly correlated with the Defence attribute. This is expected because, after all, we are looking at defensive players only.
  • Rating and Mental are also highly correlated.
  • Lowest correlation on the whole map is between Physical and Defence. It's more of a mental game to be a defender!

The remainder of our analysis will focus on the top row of this heatmap: how the different attributes affect overall player Rating.

5. Reveal Key Attributes with Multiple Linear Regression

5.1 Null Hypothesis

We are looking at the impacts of 5 different (aggregate) defensive attributes on overall player Rating.

Null Hypothesis: None of the attributes have a noticeable impact on the Rating.

To test the null hypothesis, we will perform Multiple Linear Regression on the dataset using scikit-learn.

5.2 Regression with scikit-learn

Use the dataframe with the grouped attributes for the regression model. Create 2 new dataframes for the features and the target of the regression.

  • The features (independent variables) are the attributes Ball Skills, Passing, Defence, Mental, and Physical.
  • The target (dependent variable) is the overal player Rating.
In [270]:
grouped_attrs = ['Ball_Skills','Passing','Defence','Mental','Physical']
features = def_grouped[grouped_attrs]
target = def_grouped[['Rating']]

Define X and y for use in scikit-learn's LinearRegression() function. Fit the model.

In [271]:
X = features
y = target['Rating']
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

Get an R-squared score to test how well the variance is explained by the model. Values range from 0 to 1, so the ~0.88 we returned means that almost all the variance is explained by the model.

In [272]:
lm.score(X,y)
Out[272]:
0.87779893416824695

Moment of truth! Find the coefficients from the model to see which attributes had the largest impact overall.

In [273]:
sk_coeffs = lm.coef_.tolist()
for attr, coef in zip(grouped_attrs, sk_coeffs):
    print("Attribute: {}, Coefficient: {}".format(attr,coef))
Attribute: Ball_Skills, Coefficient: 0.0064515377877897295
Attribute: Passing, Coefficient: -0.0029818216181492843
Attribute: Defence, Coefficient: 0.5759816559211581
Attribute: Mental, Coefficient: 0.2758208335929364
Attribute: Physical, Coefficient: 0.01979911271776172

Findings from the heatmap are confirmed. Defence has the most impact on Rating, followed by Mental! Passing actually has a negative impact on Rating, which may or may not be small enough that it is insignificant.

To find out which attributes actually have a meaningful impact (and to test the null hypothesis) we need to observe the p-values. This is beyond the capability of scikit-learn.

5.3 Regression with statsmodels

It is possible to reuse the features (X) and target (y) variables created in Section 5.2, so setting up a model in statsmodels is very simple. The only thing to note is that we have to add a constant in statsmodels manually. We want to use one because we know that our Rating data does not go all the way to zero.

This model uses the method of Ordinary Least Squares (OLS) to estimate our parameters. OLS basically aims to minimize the sum of the squared distances between the actual values in the dataset and the predicted values generated by the regression line. More information about Ordinary Least Squares can be found here.

In [292]:
sm_y = y
sm_X = X

#add a constant to the features
sm_X = sm.add_constant(X)

# use Ordinary Least Squares
OLS_model = sm.OLS(y,X).fit()
OLS_model.summary()
Out[292]:
OLS Regression Results
Dep. Variable: Rating R-squared: 0.999
Model: OLS Adj. R-squared: 0.999
Method: Least Squares F-statistic: 4.588e+05
Date: Sat, 16 Dec 2017 Prob (F-statistic): 0.00
Time: 08:33:40 Log-Likelihood: -5694.7
No. Observations: 2534 AIC: 1.140e+04
Df Residuals: 2529 BIC: 1.143e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Ball_Skills -0.0244 0.008 -2.876 0.004 -0.041 -0.008
Passing -0.0144 0.009 -1.637 0.102 -0.032 0.003
Defence 0.6623 0.007 94.045 0.000 0.648 0.676
Mental 0.3000 0.011 26.758 0.000 0.278 0.322
Physical 0.1274 0.006 20.080 0.000 0.115 0.140
Omnibus: 1549.719 Durbin-Watson: 1.977
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60762.831
Skew: 2.287 Prob(JB): 0.00
Kurtosis: 26.549 Cond. No. 39.2

Looking at the coefficients above ('coef' column in the middle chart), it is clear that Defence and Mental again have the most impact on player Rating.

It is interesting to note that the impact from Passing is indeed negligible. From the p-values above ('P>|t|' column in the middle chart) we can see that it is the only one above a critical value of 5% (p-value of 0.05). The remainder of the p-values are well below 0.05, which means they have significance within the model.

This means that we reject the null hypothesis because clearly the player attributes to contribute to the overall player Rating.

A point of concern is the R-squared result of 0.999. This indicates that we may have overfit the model. We will see how our data works with training and test sets in the next section, which will lend some insight about overfitting.

Note: This tutorial was invaluable in performing these two models of Multiple Linear Regression.

6. Predict Rating From Attributes with Machine Learning

6.1 Predictions with Train/Test Split

Reuse X and y again from scikit-learn regression for our model selection. Here we will split up the dataset into training and testing data for both variables. X_train and Y_train are used to generate (train) the multiple linear regression model. Then, the X_test data (grouped attributes) is used with the model to make predictions for the expected Ratings. These predictions are then compared with the actual Rating values (found in y_test).

Test/train split is up to the user, but the majority of the dataset should be used to train the model. Here we will use 80% of the data for training and 20% for testing results.

In [310]:
# create training and testing data from same X and y used in regression above
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# create and fit the model
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)

# generate predictions for player Rating to compare with y_test data
predictions = lm.predict(X_test)

# display first 10 results of our predicted player Ratings.
predictions[0:10]
Out[310]:
array([ 66.45653207,  62.86344931,  82.57320142,  71.37339697,
        72.79713737,  66.25004822,  63.19825994,  65.62886779,
        72.47635851,  64.73423796])

Plot the predicted values from the model against the actual values from the dataset (y_test).

Add an identity line for predictions = y_test. This identity line will inform us how closely the predictions are to the actual rating values. The better the predictions, the more closely the plotted points will follow the identity line.

In [311]:
plt.figure(figsize=(8,8))
plt.title("Predicted vs. Actual Values for Player Rating", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()

From the plot above it is clear that the model predicted the player Ratings very well for the test data.

The model also provides a score of the actual accuracy (1 is perfect).

In [278]:
print("Accuracy (scale of 0 to 1): {}".format(model.score(X_test, y_test)))
Accuracy (scale of 0 to 1): 0.8858971655190202

This is a suspiciously high accuracy score for the model. Rather than overfitting the model, as suspected in section 5.3, it likely has to do with the nature of the dataset itself. This is discussed further in Section 7 below.

Note: This tutorial was followed to carry out the train/test split and plotting of the results. It is by the same author as the regression tutorial above and both are very highly recommended for learning more about these topics.

7. Conclusions

7.1 Key Attributes

From the heatmap (Section 4.2) and two multiple linear regression models (Section 5.2 and 5.3) we discovered that player attributes do indeed affect the overall player Rating. Not surprisingly because we looked at defensive players, the Defence attribute plays the biggest role in determining overall Rating. More interestingly, the Mental category is the second-strongest contributor to overall Rating. From the OLS model and p-values in Section 5.3, we see that Physical also contributes as does Ball Skills (to a much lesser extent). Passing does not contribute to the model in a significant way.

All the virtual football managers out there should take note of how much more important the Mental attributes are than Physical ones!

It is surprising that Passing does not contribute to the model because you would expect that passing well to clear the ball out of the defensive portion of the field would be an important contributor to the overall rating of a defensive player.

7.2 Prediction of Overall Rating

The predictions of the model so closely matched the actual player ratings that it raises suspicion. After doing some research about why the R-squared value can be too high, overfitting the model is not likely the cause. Overfitting occurs when there are too many features (player attributes) in the model when compared with the number of observations. When this happens, the regression model "bends" too closely to the actual points and reflects the noise in the data rather than actually generalizing the overall population. It is always discretionary, but using 5 attributes for a sample size of over 2500 observations seems reasonable.

The cause of the very accurate model (and suspiciously high R-squared value) is likely due more to the structure of the dataset itself. Rather than being generated from real-world observations, or calculated from joining disparate datasets, the overall player Rating is DERIVED from the attributes themselves. While we do not know the actual weights and algorithms applied to the individual attributes to generate player Rating, it is probable that the creators of FIFA '17 mostly used these attributes when computing overall player rating. This likely explains why virtually all of the variance is explained by our linear regression model and why the predicted Ratings were so close to the actual Ratings.


Thank you for reading! Any comments/criticisms of methodology, analysis, display, clarity of writing, or lack of actual football knowledge are warmly welcomed!