CASE STUDY

Predicting future crises as a classification problem



In this section, we want to show the potential and versatility of our application.

This case study is based on two main uses of our application:

  1. The possibility of combining databases from different sources.

  2. The ability to apply machine learning concepts to build predictive models.

Our application framework is Shiny Server, but for machine learning Python and scikit-learn are the state of the art, so we have a flexible platform that lets us use different data science approaches.

In [1]:
# Library import
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns;
sns.set()

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import warnings
warnings.simplefilter('ignore')

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
In [2]:
# Functions to obtain missing values
def checkMissingValues(dataframe, othervalue):

    # Build a table with the count and percentage of missing values per column
    def missing_values_table(df):
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
            columns={0: 'Missing Values', 1: '% of Total Values'})
        return mis_val_table_ren_columns

    # Print, per column, how many entries are equal to zero
    # (zeros can hide missing information in numeric features)
    def check_numeric_Values(df, othervalue):
        print('TypeValue : ' + str(othervalue))
        print('---------------------')
        for i in list(df.columns):
            print(str(i) + ' have ' + str(len(df[df[i] == 0])) + ' values')

    check_numeric_Values(dataframe, othervalue)
    mvt = missing_values_table(dataframe)

    return mvt
In [3]:
# Function for metrics visualization
def eval_model(dataframe, prediction, c_positive):

    (tn, fp, fn, tp) = metrics.confusion_matrix(dataframe, prediction).ravel()

    p = tp + fn
    n = fp + tn

    # accuracy = np.float64(tp + tn) / (p + n)
    accuracy = metrics.accuracy_score(dataframe, prediction)

    # recall = np.float64(tp) / p
    recall = metrics.recall_score(dataframe, prediction, pos_label = c_positive, average = "binary")

    # Specificity (true negative rate)
    specify = np.float64(tn) / n

    # False positive rate
    error = np.float64(fp) / n

    # precision = np.float64(tp) / (tp + fp)
    precision = metrics.precision_score(dataframe, prediction, pos_label = c_positive, average = "binary")

    # Macro average of recall and specificity (balanced accuracy), reported below as "Macromedia"
    macromedia = (recall + specify) / 2

    # f1 = np.float64(2 * precision * recall) / (precision + recall)
    f1 = metrics.f1_score(dataframe, prediction, pos_label = c_positive, average = "binary")

    print("True Positives:" + str(tp))
    print("True Negatives:" + str(tn))
    print("False Positives:" + str(fp))
    print("False Negatives:" + str(fn))
    print("Positives:" + str(p))
    print("Negatives:" + str(n))

    # ROC curve generation: the single operating point joined to the corners (0,0) and (1,1)
    def curveRoc(tpr, fpr):
        plt.title('ROC Curve Analysis')
        x1 = [0, fpr]
        x2 = [fpr, 1]
        y1 = [0, tpr]
        y2 = [tpr, 1]
        plt.plot(x1, y1, x2, y2, color='r', linewidth=3.0)
        plt.plot(fpr, tpr, 'o')
        plt.show()

    curveRoc(recall, error)
    output_df = pd.DataFrame(
        data = [accuracy, recall, specify, error, precision, macromedia, f1],
        index = ["Accuracy", "True Positives Rate (recall)", "True Negatives Rate (specify)", "False Positives Rate (negative error)", "Positive Predictive Value (precision)", "Macromedia", "F1 score"],
        columns = ["Value"]
    )
    return output_df
In [4]:
# Seed assignment for reproducible results
seed = 0

We are going to load two datasets generated by our application. The first one corresponds to the European Financial Crises database and will be named df_crises (source: https://www.esrb.europa.eu/pub/financial-crises/html/index.en.html).

The second one is a combination of the databases downloaded from the discover section of the first version of our application. In this case, we want to predict future crises by joining these two databases and framing a classification problem. After that, we predicted the future values of the features in order to estimate crises for a set of countries in 2020. That dataset is called df_crises_fut.

In [5]:
# Data loading 
df_crises = pd.read_csv('../Data/discover_crises.csv', na_values = 'NA')
df_crises_fut = pd.read_csv('../Data/prediccion_sin_imputar_antes.csv' , na_values = 'NA')

First of all, we explored our data in the discover section and found that there are no crises registered before 1973, so we filter the data accordingly before the modeling phase.

In [6]:
# Selection of data
df_crises = df_crises.loc[df_crises['TIME'] >= 1973]
In [7]:
# Define time we want to predict for future cases
df_crises_fut['TIME'] = 2020
In [8]:
# Fill NA values = No crises
df_crises['crises'] = df_crises['crises'].fillna(0)    
In [9]:
# Dimension of training dataset
print("Size of the Dataset:  %d" % df_crises.shape[0])
print("Number of features: %d" % df_crises.shape[1])

# Visualizing first instances
df_crises.head()
Size of the Dataset:  1291
Number of features: 52
Out[9]:
GEO TIME dcp shi_hcom shi_vacr prc_hpi_a_index_total prc_hpi_a_pc_total prc_hpi_a_index_new_dwellings prc_hpi_a_pc_new_dwellings shi_htra ... ilc_mdes06_arrears_total_total ilc_mdes06_arrears_total_bellow ilc_mded01_share_housing_cost_total_total ilc_mded01_share_housing_cost_total_below ilc_lvho05a ilc_lvho02_tenure_total_total_owner ilc_lvho02_tenure_total_total_owner_mortgage ilc_lvho02_tenure_total_total_tenant ilc_lvho01_cities ilc_lvho01_rural
0 AT 2018 222.0 NaN NaN 119.64 4.7 113.54 3.2 NaN ... 3.5 12.3 18.1 39.5 13.5 NaN NaN NaN 31.0 38.2
1 BE 2018 212.0 NaN NaN 109.42 2.9 111.41 3.2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 BG 2018 97.0 NaN NaN 123.96 6.6 120.04 5.7 NaN ... 1.7 2.3 26.8 45.1 41.6 NaN NaN NaN 45.3 31.9
3 HR 2018 146.0 NaN NaN 111.14 6.1 99.95 3.6 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 CY 2018 507.0 NaN NaN 104.39 1.8 105.59 2.5 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 52 columns

The main problem with this type of database is that there are a lot of missing values. We have to discuss them briefly because they matter a great deal when analyzing results: the score is not real knowledge by itself, and we have to know where it comes from. So we are going to do a shallow exploration of the missing values.

In [10]:
# Calling function to see missing values
missing = checkMissingValues(df_crises, None)
TypeValue : None
---------------------
GEO have 0 values
TIME have 0 values
dcp have 0 values
shi_hcom have 0 values
shi_vacr have 0 values
prc_hpi_a_index_total have 0 values
prc_hpi_a_pc_total have 2 values
prc_hpi_a_index_new_dwellings have 0 values
prc_hpi_a_pc_new_dwellings have 3 values
shi_htra have 0 values
prc_hpi_hs_index have 0 values
prc_hpi_hs_pc have 0 values
crises have 895 values
gov_10dd_edpt1_gross_debt_general_government_million_euro have 0 values
gov_10dd_edpt1_gross_debt_general_government_gdp have 0 values
gov_10dd_edpt1_gross_debt_local_government_million_euro have 0 values
gov_10dd_edpt1_gross_debt_local_government_gdp have 0 values
gov_10dd_edpt1_gross_debt_entral_government_million_euro have 0 values
gov_10dd_edpt1_gross_debt_central_government_gdp have 0 values
gov_10dd_edpt1_gross_debt_secial_security_million_euro have 29 values
gov_10dd_edpt1_gross_debt_social_security_funds_gdp have 103 values
gov_10dd_edpt1_gross_debt_state_government_million_euro have 0 values
gov_10dd_edpt1_gross_debt_state_government_gdp have 0 values
gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro have 0 values
gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro have 0 values
gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp have 9 values
gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro have 0 values
gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp have 0 values
gov_10dd_edpt1_interest_payable_general_government_million_euro have 0 values
gov_10dd_edpt1_interest_payable_general_government_gdp have 2 values
gov_10dd_edpt1_intergovernmental_lending_general_government_million_euro have 117 values
gov_10dd_edpt1_intergovernmental_lending_general_government_gdp have 120 values
gov_10dd_edpt1_liabilities_general_government_million_euro have 0 values
gov_10dd_edpt1_liabilities_general_government_gdp have 3 values
gov_10dd_edpt1_debt_securities_general_government_million_euro have 0 values
gov_10dd_edpt1_debt_securities_general_government_gdp have 1 values
gov_10dd_edpt1_currency_deposits_general_government_million have 42 values
gov_10dd_edpt1_currency_deposits_general_government_gdp have 57 values
nama_10_pc_final_consumption_expenditure_chain_linked_volumes have 0 values
ilc_di04_total_household_euro_median_net_income have 0 values
lfsi_jhh_a have 0 values
lfsa_urgan_20_to_64_total have 0 values
ilc_mdes06_arrears_total_total have 0 values
ilc_mdes06_arrears_total_bellow have 0 values
ilc_mded01_share_housing_cost_total_total have 0 values
ilc_mded01_share_housing_cost_total_below have 0 values
ilc_lvho05a have 0 values
ilc_lvho02_tenure_total_total_owner have 0 values
ilc_lvho02_tenure_total_total_owner_mortgage have 1 values
ilc_lvho02_tenure_total_total_tenant have 0 values
ilc_lvho01_cities have 0 values
ilc_lvho01_rural have 0 values
In [11]:
# Table visualization of missing values
missing
Out[11]:
Missing Values % of Total Values
GEO 0 0.000000
TIME 0 0.000000
dcp 1263 97.831139
shi_hcom 407 31.525949
shi_vacr 978 75.755229
prc_hpi_a_index_total 1023 79.240899
prc_hpi_a_pc_total 1026 79.473277
prc_hpi_a_index_new_dwellings 1028 79.628195
prc_hpi_a_pc_new_dwellings 1037 80.325329
shi_htra 836 64.756003
prc_hpi_hs_index 1141 88.381100
prc_hpi_hs_pc 1159 89.775368
crises 0 0.000000
gov_10dd_edpt1_gross_debt_general_government_million_euro 805 62.354764
gov_10dd_edpt1_gross_debt_general_government_gdp 805 62.354764
gov_10dd_edpt1_gross_debt_local_government_million_euro 998 77.304415
gov_10dd_edpt1_gross_debt_local_government_gdp 998 77.304415
gov_10dd_edpt1_gross_debt_entral_government_million_euro 998 77.304415
gov_10dd_edpt1_gross_debt_central_government_gdp 998 77.304415
gov_10dd_edpt1_gross_debt_secial_security_million_euro 1020 79.008521
gov_10dd_edpt1_gross_debt_social_security_funds_gdp 1020 79.008521
gov_10dd_edpt1_gross_debt_state_government_million_euro 1247 96.591789
gov_10dd_edpt1_gross_debt_state_government_gdp 1247 96.591789
gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro 805 62.354764
gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro 805 62.354764
gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp 805 62.354764
gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro 805 62.354764
gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp 805 62.354764
gov_10dd_edpt1_interest_payable_general_government_million_euro 805 62.354764
gov_10dd_edpt1_interest_payable_general_government_gdp 805 62.354764
gov_10dd_edpt1_intergovernmental_lending_general_government_million_euro 1021 79.085980
gov_10dd_edpt1_intergovernmental_lending_general_government_gdp 1021 79.085980
gov_10dd_edpt1_liabilities_general_government_million_euro 1034 80.092951
gov_10dd_edpt1_liabilities_general_government_gdp 1034 80.092951
gov_10dd_edpt1_debt_securities_general_government_million_euro 805 62.354764
gov_10dd_edpt1_debt_securities_general_government_gdp 805 62.354764
gov_10dd_edpt1_currency_deposits_general_government_million 833 64.523625
gov_10dd_edpt1_currency_deposits_general_government_gdp 833 64.523625
nama_10_pc_final_consumption_expenditure_chain_linked_volumes 900 69.713400
ilc_di04_total_household_euro_median_net_income 915 70.875290
lfsi_jhh_a 917 71.030209
lfsa_urgan_20_to_64_total 655 50.735864
ilc_mdes06_arrears_total_total 907 70.255616
ilc_mdes06_arrears_total_bellow 907 70.255616
ilc_mded01_share_housing_cost_total_total 914 70.797831
ilc_mded01_share_housing_cost_total_below 914 70.797831
ilc_lvho05a 910 70.487994
ilc_lvho02_tenure_total_total_owner 936 72.501936
ilc_lvho02_tenure_total_total_owner_mortgage 936 72.501936
ilc_lvho02_tenure_total_total_tenant 936 72.501936
ilc_lvho01_cities 1035 80.170411
ilc_lvho01_rural 1035 80.170411

As we supposed, many features have a substantial percentage of missing values. We have to select a threshold: a rule for dropping features from our dataset based on their percentage of missing values. In this case, we keep the features with less than 70% of missing values and drop the rest.

In [12]:
# Getting columns with 70% or more missing values
missing = missing.loc[missing['% of Total Values'] >= 70]
In [13]:
# List of features removed
list(missing.index)
Out[13]:
['dcp',
 'shi_vacr',
 'prc_hpi_a_index_total',
 'prc_hpi_a_pc_total',
 'prc_hpi_a_index_new_dwellings',
 'prc_hpi_a_pc_new_dwellings',
 'prc_hpi_hs_index',
 'prc_hpi_hs_pc',
 'gov_10dd_edpt1_gross_debt_local_government_million_euro',
 'gov_10dd_edpt1_gross_debt_local_government_gdp',
 'gov_10dd_edpt1_gross_debt_entral_government_million_euro',
 'gov_10dd_edpt1_gross_debt_central_government_gdp',
 'gov_10dd_edpt1_gross_debt_secial_security_million_euro',
 'gov_10dd_edpt1_gross_debt_social_security_funds_gdp',
 'gov_10dd_edpt1_gross_debt_state_government_million_euro',
 'gov_10dd_edpt1_gross_debt_state_government_gdp',
 'gov_10dd_edpt1_intergovernmental_lending_general_government_million_euro',
 'gov_10dd_edpt1_intergovernmental_lending_general_government_gdp',
 'gov_10dd_edpt1_liabilities_general_government_million_euro',
 'gov_10dd_edpt1_liabilities_general_government_gdp',
 'ilc_di04_total_household_euro_median_net_income',
 'lfsi_jhh_a',
 'ilc_mdes06_arrears_total_total',
 'ilc_mdes06_arrears_total_bellow',
 'ilc_mded01_share_housing_cost_total_total',
 'ilc_mded01_share_housing_cost_total_below',
 'ilc_lvho05a',
 'ilc_lvho02_tenure_total_total_owner',
 'ilc_lvho02_tenure_total_total_owner_mortgage',
 'ilc_lvho02_tenure_total_total_tenant',
 'ilc_lvho01_cities',
 'ilc_lvho01_rural']
In [14]:
# Dropping columns from both datasets
df_crises = df_crises.drop(columns = list(missing.index), axis = 1)
df_crises_fut = df_crises_fut.drop(columns = list(missing.index), axis = 1)

As a working assumption, the objective of the model is its use in a real environment; that is to say, the trained model will be used to process data that was not available during development.

Because of this, the dataset will be divided into two sets: a training set, used in the learning process, and a test set, used to simulate the future real scenario.

Also, we have to remember that we have another dataset generated to make predictions for 2020.

In [15]:
# Size of the training set (80% of total)
training_size = int(len(df_crises)*0.8)

# Shuffle the data (this step is very important)
df_crises = df_crises.sample(frac=1, random_state=0).reset_index(drop=True)

# Copy of the test set
df_crises_new = df_crises.iloc[training_size:].copy()

# Copy of the training set
df_crises = df_crises.iloc[:training_size].copy()

print("Size of the dataset available for model building: ", len(df_crises))
print("Size of the dataset that belongs to new data: ", len(df_crises_new))
Size of the dataset available for model building:  1032
Size of the dataset that belongs to new data:  259

Data preparation


Each feature of this dataset has to be treated according to its type. In current data science practice it is common to work with three main types of features: numerical, categorical, and binary.

In [16]:
# Preprocessing steps
prep_steps = []

# Type of features
num_features = df_crises.select_dtypes(include=np.number).columns.tolist() # Initially contains the numerical
cat_features = df_crises.select_dtypes(exclude=np.number).columns.tolist() # Initially excludes numerics
bin_features = []

print('Features (Initially)')
print('Numerical: ',num_features)
print('Categorical: ',cat_features)
print('Binaries (treatment as numerics): ', bin_features)
Features (Initially)
Numerical:  ['TIME', 'shi_hcom', 'shi_htra', 'crises', 'gov_10dd_edpt1_gross_debt_general_government_million_euro', 'gov_10dd_edpt1_gross_debt_general_government_gdp', 'gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro', 'gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro', 'gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp', 'gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro', 'gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp', 'gov_10dd_edpt1_interest_payable_general_government_million_euro', 'gov_10dd_edpt1_interest_payable_general_government_gdp', 'gov_10dd_edpt1_debt_securities_general_government_million_euro', 'gov_10dd_edpt1_debt_securities_general_government_gdp', 'gov_10dd_edpt1_currency_deposits_general_government_million', 'gov_10dd_edpt1_currency_deposits_general_government_gdp', 'nama_10_pc_final_consumption_expenditure_chain_linked_volumes', 'lfsa_urgan_20_to_64_total']
Categorical:  ['GEO']
Binaries (treatment as numerics):  []

Numerical features


In [17]:
# Get unique values per feature
print("Different values per feature: ",list(map(lambda col: "{:s}: {:d}".format(col,len(df_crises[col].value_counts())), num_features)))
df_crises.describe(include='number')
Different values per feature:  ['TIME: 47', 'shi_hcom: 651', 'shi_htra: 359', 'crises: 2', 'gov_10dd_edpt1_gross_debt_general_government_million_euro: 386', 'gov_10dd_edpt1_gross_debt_general_government_gdp: 319', 'gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro: 386', 'gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro: 386', 'gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp: 127', 'gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro: 386', 'gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp: 51', 'gov_10dd_edpt1_interest_payable_general_government_million_euro: 385', 'gov_10dd_edpt1_interest_payable_general_government_gdp: 60', 'gov_10dd_edpt1_debt_securities_general_government_million_euro: 382', 'gov_10dd_edpt1_debt_securities_general_government_gdp: 293', 'gov_10dd_edpt1_currency_deposits_general_government_million: 330', 'gov_10dd_edpt1_currency_deposits_general_government_gdp: 72', 'nama_10_pc_final_consumption_expenditure_chain_linked_volumes: 177', 'lfsa_urgan_20_to_64_total: 151']
Out[17]:
TIME shi_hcom shi_htra crises gov_10dd_edpt1_gross_debt_general_government_million_euro gov_10dd_edpt1_gross_debt_general_government_gdp gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp gov_10dd_edpt1_interest_payable_general_government_million_euro gov_10dd_edpt1_interest_payable_general_government_gdp gov_10dd_edpt1_debt_securities_general_government_million_euro gov_10dd_edpt1_debt_securities_general_government_gdp gov_10dd_edpt1_currency_deposits_general_government_million gov_10dd_edpt1_currency_deposits_general_government_gdp nama_10_pc_final_consumption_expenditure_chain_linked_volumes lfsa_urgan_20_to_64_total
count 1032.000000 704.000000 3.680000e+02 1032.000000 3.860000e+02 386.000000 3.860000e+02 386.000000 386.000000 386.000000 386.000000 386.000000 386.000000 3.860000e+02 386.000000 363.000000 363.000000 310.000000 507.000000
mean 1996.028101 81200.099432 2.610860e+05 0.317829 3.679690e+05 58.239637 5.050729e+05 -14785.061917 -2.619948 15703.168135 3.750259 12968.279793 2.234974 2.890009e+05 42.562435 16350.984022 1.947934 19392.580645 8.471992
std 13.381161 110218.610905 3.749577e+05 0.465859 5.957965e+05 34.151613 7.379153e+05 31019.914239 3.407248 20850.217195 1.124329 20266.758022 1.334846 4.812915e+05 24.932889 41861.184642 3.836883 10172.943513 4.513656
min 1973.000000 620.000000 1.030000e+03 0.000000 3.334000e+02 3.700000 6.976400e+03 -173919.700000 -15.100000 158.900000 1.500000 9.200000 0.000000 7.000000e-01 0.000000 0.000000 0.000000 4000.000000 1.700000
25% 1984.000000 12657.500000 4.504250e+04 0.000000 1.610770e+04 36.675000 4.514793e+04 -12400.900000 -4.400000 1780.525000 2.900000 762.900000 1.300000 1.130540e+04 27.400000 171.050000 0.200000 9500.000000 5.300000
50% 1996.000000 34235.000000 9.420000e+04 0.000000 1.048843e+05 51.350000 1.876430e+05 -2908.500000 -2.400000 6637.450000 3.700000 4047.350000 2.050000 7.912430e+04 39.650000 1176.400000 0.400000 19550.000000 7.400000
75% 2008.000000 98722.500000 3.222500e+05 1.000000 3.258397e+05 77.800000 4.502313e+05 -109.550000 -0.300000 20051.500000 4.400000 12139.325000 3.000000 2.602708e+05 55.400000 6936.750000 1.400000 26200.000000 10.400000
max 2019.000000 658000.000000 1.990000e+06 1.000000 2.321957e+06 178.900000 3.386000e+06 58012.000000 5.900000 84537.000000 7.700000 83566.000000 7.300000 1.992978e+06 121.500000 212686.000000 35.400000 45800.000000 27.300000

Categorical features


In [18]:
# Simple statistics for categorical features
df_crises[cat_features].describe()
Out[18]:
GEO
count 1032
unique 28
top PL
freq 41

Binary features


In [19]:
# Unique values of crises feature
df_crises['crises'].unique()
Out[19]:
array([ 0.,  1.])

Our target feature has a value problem that we solved earlier: when we made the outer join of the databases, we did not fill the missing values for rows that did not match. The joins were made on the key pair (GEO, TIME) from the crises database and (GEO, TIME) from the selected union of our other databases. The advantage of following the same data format can be seen here: we can join databases easily. So, if a registered year is not present in our databases, we assume there was no crisis in it and fill those values with 0, which means no crisis.
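The following minimal sketch illustrates the idea on two small hypothetical frames (the names and values are purely illustrative, not taken from the real databases): an outer join on the (GEO, TIME) key pair leaves NaN in the crises column for years without a registered crisis, and those values are then filled with 0.

# Hypothetical miniature of the (GEO, TIME) outer join described above
crises_raw = pd.DataFrame({'GEO': ['ES', 'ES'], 'TIME': [2008, 2012], 'crises': [1, 1]})
indicators = pd.DataFrame({'GEO': ['ES', 'ES', 'ES'], 'TIME': [2008, 2012, 2018], 'dcp': [310.0, 290.0, 250.0]})

joined = indicators.merge(crises_raw, on=['GEO', 'TIME'], how='outer')
# Years without a registered crisis appear as NaN and are interpreted as "no crisis"
joined['crises'] = joined['crises'].fillna(0)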

In [20]:
# Process to fill NA
def fill_crises(df):
    df['crises'] = df['crises'].fillna(0)
    return df
In [21]:
# Add the process to list of processes
prep_steps.append(fill_crises)

# Added feature to binary features and removed from numerical
num_features.remove('crises')
bin_features.append('crises')
In [22]:
# Drop binary features from the categorical features
cat_features= [feat for feat in cat_features if feat not in bin_features]
cat_features
Out[22]:
['GEO']

Creation of a preprocessing _pipeline_.


Pipelines are a typical way of automating data processing: depending on the type of feature, the pipeline defines the sequence of steps to apply to it. This kind of structure is convenient when we want to build an automated system.

In [23]:
# Function to apply processes over data
def preprocess_crises_data(df):
    
    # Copy of the original dataset.
    preprocessed_df = df.copy()  
    
    # Application of the defined steps
    for step in prep_steps:
        step(preprocessed_df)
        
    # Creation of sets of features and class
    y = preprocessed_df['crises']
    X = preprocessed_df.drop('crises', axis=1)
    return X, y
In [24]:
# Call and apply of preprocess function over data
X, y = preprocess_crises_data(df_crises)
X.head()
Out[24]:
GEO TIME shi_hcom shi_htra gov_10dd_edpt1_gross_debt_general_government_million_euro gov_10dd_edpt1_gross_debt_general_government_gdp gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp gov_10dd_edpt1_interest_payable_general_government_million_euro gov_10dd_edpt1_interest_payable_general_government_gdp gov_10dd_edpt1_debt_securities_general_government_million_euro gov_10dd_edpt1_debt_securities_general_government_gdp gov_10dd_edpt1_currency_deposits_general_government_million gov_10dd_edpt1_currency_deposits_general_government_gdp nama_10_pc_final_consumption_expenditure_chain_linked_volumes lfsa_urgan_20_to_64_total
0 HU 2014 8360.0 113790.0 79195.2 76.7 105547.0 -2751.0 -2.6 5635.2 5.3 4244.6 4.0 67517.1 65.4 209.7 0.2 7700.0 7.6
1 FI 1978 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 DK 1987 27900.0 93100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 CZ 1985 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 EE 2011 1920.0 17180.0 1011.1 6.1 16667.6 193.1 1.2 819.3 4.9 21.5 0.1 252.5 1.5 25.1 0.2 8300.0 12.2

Building a classification model


Now it is time for modeling. We have selected a group of models to apply. These algorithms must fit the type of problem we want to solve, in this case classification, so we are going to work with the following models:

  • Decision Trees classifier.
  • Random Forest classifier.
  • Support Vector Machine classifier.
  • Multilayer perceptron classifier.

We are going to build these models as pipelines; each pipeline will consist of a preprocessor and a model.

Transformer build


The first step of the pipeline consists of the data transformation. As a result of the previous preprocessing phase, three sets of features are available for applying transformations:

In [25]:
# Print list of vars
print('Numerical: ', num_features)
print('Categorical: ', cat_features)
print('Binaries (treatment as numerical): ', bin_features)
Numerical:  ['TIME', 'shi_hcom', 'shi_htra', 'gov_10dd_edpt1_gross_debt_general_government_million_euro', 'gov_10dd_edpt1_gross_debt_general_government_gdp', 'gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro', 'gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro', 'gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp', 'gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro', 'gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp', 'gov_10dd_edpt1_interest_payable_general_government_million_euro', 'gov_10dd_edpt1_interest_payable_general_government_gdp', 'gov_10dd_edpt1_debt_securities_general_government_million_euro', 'gov_10dd_edpt1_debt_securities_general_government_gdp', 'gov_10dd_edpt1_currency_deposits_general_government_million', 'gov_10dd_edpt1_currency_deposits_general_government_gdp', 'nama_10_pc_final_consumption_expenditure_chain_linked_volumes', 'lfsa_urgan_20_to_64_total']
Categorical:  ['GEO']
Binaries (treatment as numerical):  ['crises']

As we mentioned in the previous steps, the numeric features contain a lot of missing values. Applying simple imputers such as the median or the mean is the typical way to deal with this. However, given the percentage of missing values we have, we are going to apply a more complex imputer: a multivariate imputer that estimates each feature from all the others. This imputer is still part of scikit-learn's experimental API, but we are going to use it for this problem.

In [26]:
# Numerical transformer based on MICE with Random Forest and StandardScaler
num_transformer = Pipeline(steps=[('imputer',  IterativeImputer(random_state=0, estimator=RandomForestRegressor(n_estimators = 20))),
                                  ('scaler', StandardScaler())])

For categorical features, it is typical to apply a OneHotEncoder to encode the categories.

In [27]:
# Apply OneHotEncoder to categorical features
cat_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

Finally, we create a ColumnTransformer, which groups the different transformers, each applied to its own set of features.

In [28]:
crises_trans = ColumnTransformer(transformers=[('num', num_transformer, num_features),
                                              ('cat', cat_transformer, cat_features)])

Class frequency analysis


Analyzing the class is very important when we work on classification problems, especially when the problem is not balanced. Depending on the type of problem we want to solve there will be different class distributions. In unbalanced problems it is important to preserve the proportion between positive and negative cases when working with subsets of data, so that we do not lose the problem domain.

In [29]:
print("Classes: ", y.unique())

sns.countplot(x=y)
plt.title("Class frequency")
plt.xlabel('Crises')
plt.ylabel('samples')
plt.show()

print("The class frequencies positives are %f" % (sum(y==1) / y.shape[0]))
Classes:  [ 0.  1.]
The class frequencies positives are 0.317829
In [30]:
# Final preprocessing application
X, y = preprocess_crises_data(df_crises)
In [31]:
# Divide dataset in train and test
train_atts, test_atts, train_label, test_label = train_test_split(
    X, # Data with features
    y, # Vector/dataset with the class 
    train_size=0.8, # Training proportion
    random_state=seed, # Seed
    stratify = y # Stratification over class
)

Training of the classifiers


Now we are going to apply a process of model selection and validation. It is important to keep in mind that a single problem can be solved with different models and approaches. First of all, it is important to start with models using their default parameters: this gives us a reference for the metrics we want to improve later by building better parameter configurations.
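As a sketch of that later improvement step, the GridSearchCV class imported above could be used to tune, for example, the decision tree inside the same preprocessing pipeline. The grid values below are only an assumption for illustration; the sketch relies on the crises_trans transformer and the training split already defined.

# Illustrative hyperparameter search (grid values are assumptions, not tuned results)
search_pipe = Pipeline([('columnTrans', crises_trans),
                        ('tree', tree.DecisionTreeClassifier(random_state=seed))])
param_grid = {'tree__max_depth': [3, 5, 10, None],
              'tree__min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(search_pipe, param_grid, scoring='f1', cv=5)
search.fit(train_atts, train_label)
print(search.best_params_, search.best_score_)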

In [32]:
# Decision Tree classifier construction
clf = tree.DecisionTreeClassifier(random_state = seed)
In [33]:
# Pipeline build (transformer -> classifier)
pipe = Pipeline([('columnTrans', crises_trans), ('tree', clf)])
In [34]:
# Training process
fitted = pipe.fit(train_atts, train_label)
In [35]:
# Test evaluation
predictions = fitted.predict(test_atts)
In [36]:
# Metrics
eval_model(test_label, predictions, 1)
True Positives:50
True Negatives:123
False Positives:18
False Negatives:16
Positives:66
Negatives:141
Out[36]:
Value
Accuracy 0.835749
True Positives Rate (recall) 0.757576
True Negatives Rate (specify) 0.872340
False Positives Rate (negative error) 0.127660
Positive Predictive Value (precision) 0.735294
Macromedia 0.814958
F1 score 0.746269
In [37]:
# Feature importances for the model
importances = [(feat, importance) for feat, importance in zip(train_atts.columns, fitted.steps[1][1].feature_importances_)]
In [38]:
# Importances sorted by feature name
sorted(importances)
Out[38]:
[('GEO', 0.30156989655066657),
 ('TIME', 0.047486345529942853),
 ('gov_10dd_edpt1_currency_deposits_general_government_gdp',
  0.04554173099011951),
 ('gov_10dd_edpt1_currency_deposits_general_government_million',
  0.01694060394700183),
 ('gov_10dd_edpt1_debt_securities_general_government_gdp',
  0.03009976430337296),
 ('gov_10dd_edpt1_debt_securities_general_government_million_euro',
  0.0088213011873495041),
 ('gov_10dd_edpt1_gross_debt_general_government_gdp', 0.036623875766167484),
 ('gov_10dd_edpt1_gross_debt_general_government_million_euro',
  0.064044925287269613),
 ('gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro',
  0.025471961010029021),
 ('gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp',
  0.0076126846817530574),
 ('gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro',
  0.063253449249648916),
 ('gov_10dd_edpt1_interest_payable_general_government_gdp',
  0.061348967695015626),
 ('gov_10dd_edpt1_interest_payable_general_government_million_euro',
  0.0078223675907837884),
 ('gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp',
  0.018560630400296176),
 ('gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro',
  0.063435708331631827),
 ('lfsa_urgan_20_to_64_total', 0.0),
 ('nama_10_pc_final_consumption_expenditure_chain_linked_volumes',
  0.0094459434418486896),
 ('shi_hcom', 0.051521039987298595),
 ('shi_htra', 0.01640334187188211)]

First of all, we have to distinguish between metrics. In unbalanced problems, where the positive class to predict is not in equal proportion to the negative class, we have to use more informative metrics.

In our problem domain, imagine that our stakeholders want to anticipate crises and have, in general, two types of measures at their disposal:

  • Severe measures, which imply the application of policies with associated risks.

  • Light measures, which involve policies that do not imply a high risk in their application.

Depending on the type of measure, we need high precision for severe measures and high recall to detect the maximum number of crises.
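As a small illustration of that trade-off, scikit-learn's fbeta_score can weight precision or recall explicitly; this sketch assumes the decision tree test predictions computed above are still available, and the beta values are only examples.

from sklearn.metrics import fbeta_score

# beta < 1 favours precision (severe measures), beta > 1 favours recall (light measures)
print("F0.5:", fbeta_score(test_label, predictions, beta=0.5, pos_label=1))
print("F2  :", fbeta_score(test_label, predictions, beta=2.0, pos_label=1))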

We can observe that there are considerable risks in taking predictions from this model: it is simple, and the imputation of the missing values has a large influence on it. We cannot use this model in a real environment.
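A rough way to check that influence, sketched here under the assumption that the transformers and the training split defined above are available, is to compare the cross-validated score of the same model with a simple median imputer against the iterative one:

from sklearn.impute import SimpleImputer

# Alternative numeric transformer using a simple median imputation
simple_num = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                             ('scaler', StandardScaler())])
simple_trans = ColumnTransformer(transformers=[('num', simple_num, num_features),
                                               ('cat', cat_transformer, cat_features)])

# Cross-validated F1 of the same decision tree with each imputation strategy
for name, trans in [('iterative', crises_trans), ('median', simple_trans)]:
    scores = cross_val_score(Pipeline([('columnTrans', trans),
                                       ('tree', tree.DecisionTreeClassifier(random_state=seed))]),
                             train_atts, train_label, cv=5, scoring='f1')
    print(name, round(scores.mean(), 3))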

Still, out of interest, we want to see which countries our classifier predicts will fall into crisis in 2020.

In [39]:
# Preprocessing application
X_1, y_1 = preprocess_crises_data(df_crises_fut)
In [40]:
# Future cases prediction
predictions = fitted.predict(X_1)

These are the countries for which we make predictions with our model.

In [41]:
# Listing countries to predict
list(df_crises_fut['GEO'])
Out[41]:
['AT',
 'BE',
 'BG',
 'HR',
 'CY',
 'CZ',
 'DK',
 'EE',
 'FI',
 'FR',
 'DE',
 'GR',
 'HU',
 'IE',
 'IT',
 'LV',
 'LT',
 'LU',
 'NL',
 'NO',
 'PL',
 'PT',
 'RO',
 'SK',
 'SI',
 'ES',
 'SE',
 'UK']
In [42]:
# Join of predictions
df_crises_fut['crises_dt'] = predictions
In [43]:
# Visualization of predictions
df_crises_fut = df_crises_fut[['GEO', 'crises_dt']]
df_crises_fut
Out[43]:
GEO crises_dt
0 AT 1.0
1 BE 1.0
2 BG 1.0
3 HR 1.0
4 CY 1.0
5 CZ 1.0
6 DK 1.0
7 EE 0.0
8 FI 0.0
9 FR 0.0
10 DE 0.0
11 GR 0.0
12 HU 0.0
13 IE 0.0
14 IT 0.0
15 LV 1.0
16 LT 1.0
17 LU 1.0
18 NL 0.0
19 NO 1.0
20 PL 0.0
21 PT 0.0
22 RO 0.0
23 SK 0.0
24 SI 1.0
25 ES 1.0
26 SE 1.0
27 UK 1.0

Now we are going to train more complex models in order to compare their predictions.

In [44]:
# Ensemble Random Forest
clf = RandomForestClassifier(random_state = seed)
In [45]:
# Pipeline build (transformer -> classifier)
pipe = Pipeline([('columnTrans', crises_trans), ('rf', clf)])
In [46]:
# Training process
fitted = pipe.fit(train_atts, train_label)
In [47]:
# Test predictions
predictions = fitted.predict(test_atts)
In [48]:
eval_model(test_label, predictions, 1)
True Positives:44
True Negatives:135
False Positives:6
False Negatives:22
Positives:66
Negatives:141
Out[48]:
Value
Accuracy 0.864734
True Positives Rate (recall) 0.666667
True Negatives Rate (specify) 0.957447
False Positives Rate (negative error) 0.042553
Positive Predictive Value (precision) 0.880000
Macromedia 0.812057
F1 score 0.758621
In [49]:
# Feature importances for the model
importances = [(feat, importance) for feat, importance in zip(train_atts.columns, fitted.steps[1][1].feature_importances_)]

# Importances sorted by feature name
sorted(importances)
Out[49]:
[('GEO', 0.13632924899687066),
 ('TIME', 0.041257283457599971),
 ('gov_10dd_edpt1_currency_deposits_general_government_gdp',
  0.043310613712252269),
 ('gov_10dd_edpt1_currency_deposits_general_government_million',
  0.055448617010494125),
 ('gov_10dd_edpt1_debt_securities_general_government_gdp',
  0.040801568047302277),
 ('gov_10dd_edpt1_debt_securities_general_government_million_euro',
  0.046741542378791016),
 ('gov_10dd_edpt1_gross_debt_general_government_gdp', 0.036685577738109591),
 ('gov_10dd_edpt1_gross_debt_general_government_million_euro',
  0.03764146670919867),
 ('gov_10dd_edpt1_gross_domestic_products_total_economy_million_euro',
  0.03466738752556054),
 ('gov_10dd_edpt1_gross_fixed_capital_formation_general_government_gdp',
  0.028756062150174226),
 ('gov_10dd_edpt1_gross_fixed_capital_formation_general_government_million_euro',
  0.046519367375250163),
 ('gov_10dd_edpt1_interest_payable_general_government_gdp',
  0.048738066839794092),
 ('gov_10dd_edpt1_interest_payable_general_government_million_euro',
  0.029172698135742095),
 ('gov_10dd_edpt1_net_lending_net_borrowing_general_government_gdp',
  0.039505087016014059),
 ('gov_10dd_edpt1_net_lending_net_borrowing_general_government_million_euro',
  0.049396544318143137),
 ('lfsa_urgan_20_to_64_total', 0.0014214186577935128),
 ('nama_10_pc_final_consumption_expenditure_chain_linked_volumes',
  0.066623038219877584),
 ('shi_hcom', 0.041316317004886312),
 ('shi_htra', 0.034503108799170221)]
In [50]:
# Future cases prediction
predictions = fitted.predict(X_1)
In [51]:
# Join of predictions
df_crises_fut['crises_rf'] = predictions
In [52]:
# Visualization of predictions
df_crises_fut = df_crises_fut[['GEO', 'crises_dt', 'crises_rf']]
df_crises_fut
Out[52]:
GEO crises_dt crises_rf
0 AT 1.0 0.0
1 BE 1.0 0.0
2 BG 1.0 0.0
3 HR 1.0 0.0
4 CY 1.0 1.0
5 CZ 1.0 0.0
6 DK 1.0 1.0
7 EE 0.0 0.0
8 FI 0.0 1.0
9 FR 0.0 1.0
10 DE 0.0 1.0
11 GR 0.0 0.0
12 HU 0.0 0.0
13 IE 0.0 1.0
14 IT 0.0 1.0
15 LV 1.0 0.0
16 LT 1.0 0.0
17 LU 1.0 0.0
18 NL 0.0 0.0
19 NO 1.0 0.0
20 PL 0.0 0.0
21 PT 0.0 1.0
22 RO 0.0 0.0
23 SK 0.0 1.0
24 SI 1.0 0.0
25 ES 1.0 1.0
26 SE 1.0 1.0
27 UK 1.0 1.0
In [53]:
# Support Vector Machines
clf = SVC(random_state = seed)
In [54]:
# Pipeline build (transformer -> classifier)
pipe = Pipeline([('columnTrans', crises_trans), ('svc', clf)])
In [55]:
# Training processes
fitted = pipe.fit(train_atts, train_label)
In [56]:
# Predictions over test set
predictions = fitted.predict(test_atts)
In [57]:
# Evaluation metrics
eval_model(test_label, predictions, 1)
True Positives:32
True Negatives:137
False Positives:4
False Negatives:34
Positives:66
Negatives:141
Out[57]:
Value
Accuracy 0.816425
True Positives Rate (recall) 0.484848
True Negatives Rate (specify) 0.971631
False Positives Rate (negative error) 0.028369
Positive Predictive Value (precision) 0.888889
Macromedia 0.728240
F1 score 0.627451
In [58]:
# Future cases prediction
predictions = fitted.predict(X_1)
In [59]:
# Join of predictions
df_crises_fut['crises_svc'] = predictions
df_crises_fut
Out[59]:
GEO crises_dt crises_rf crises_svc
0 AT 1.0 0.0 0.0
1 BE 1.0 0.0 1.0
2 BG 1.0 0.0 1.0
3 HR 1.0 0.0 1.0
4 CY 1.0 1.0 0.0
5 CZ 1.0 0.0 0.0
6 DK 1.0 1.0 0.0
7 EE 0.0 0.0 0.0
8 FI 0.0 1.0 1.0
9 FR 0.0 1.0 1.0
10 DE 0.0 1.0 0.0
11 GR 0.0 0.0 0.0
12 HU 0.0 0.0 0.0
13 IE 0.0 1.0 0.0
14 IT 0.0 1.0 0.0
15 LV 1.0 0.0 0.0
16 LT 1.0 0.0 0.0
17 LU 1.0 0.0 1.0
18 NL 0.0 0.0 0.0
19 NO 1.0 0.0 1.0
20 PL 0.0 0.0 1.0
21 PT 0.0 1.0 0.0
22 RO 0.0 0.0 0.0
23 SK 0.0 1.0 0.0
24 SI 1.0 0.0 0.0
25 ES 1.0 1.0 0.0
26 SE 1.0 1.0 0.0
27 UK 1.0 1.0 1.0
In [60]:
# Multilayer perceptron classifier
clf = MLPClassifier(random_state = seed)
In [61]:
# Pipeline build (transformer -> classifier)
pipe = Pipeline([('columnTrans', crises_trans), ('nn', clf)])
In [62]:
# Training process
fitted = pipe.fit(train_atts, train_label)
In [63]:
# Test evaluation
predictions = fitted.predict(test_atts)
In [64]:
eval_model(test_label, predictions, 1)
True Positives:53
True Negatives:136
False Positives:5
False Negatives:13
Positives:66
Negatives:141
Out[64]:
Value
Accuracy 0.913043
True Positives Rate (recall) 0.803030
True Negatives Rate (specify) 0.964539
False Positives Rate (negative error) 0.035461
Positive Predictive Value (precision) 0.913793
Macromedia 0.883785
F1 score 0.854839
In [65]:
# Future cases prediction
predictions = fitted.predict(X_1)
In [66]:
# Join of predictions
df_crises_fut['crises_nn'] = predictions
df_crises_fut
Out[66]:
GEO crises_dt crises_rf crises_svc crises_nn
0 AT 1.0 0.0 0.0 1.0
1 BE 1.0 0.0 1.0 0.0
2 BG 1.0 0.0 1.0 0.0
3 HR 1.0 0.0 1.0 1.0
4 CY 1.0 1.0 0.0 1.0
5 CZ 1.0 0.0 0.0 0.0
6 DK 1.0 1.0 0.0 1.0
7 EE 0.0 0.0 0.0 0.0
8 FI 0.0 1.0 1.0 0.0
9 FR 0.0 1.0 1.0 0.0
10 DE 0.0 1.0 0.0 0.0
11 GR 0.0 0.0 0.0 0.0
12 HU 0.0 0.0 0.0 0.0
13 IE 0.0 1.0 0.0 0.0
14 IT 0.0 1.0 0.0 0.0
15 LV 1.0 0.0 0.0 0.0
16 LT 1.0 0.0 0.0 1.0
17 LU 1.0 0.0 1.0 1.0
18 NL 0.0 0.0 0.0 0.0
19 NO 1.0 0.0 1.0 0.0
20 PL 0.0 0.0 1.0 1.0
21 PT 0.0 1.0 0.0 0.0
22 RO 0.0 0.0 0.0 0.0
23 SK 0.0 1.0 0.0 0.0
24 SI 1.0 0.0 0.0 0.0
25 ES 1.0 1.0 0.0 0.0
26 SE 1.0 1.0 0.0 0.0
27 UK 1.0 1.0 1.0 1.0

As we have seen, there are many different approaches and techniques that could be applied. We have to take the risks associated with the policies into account, and selecting the correct metric to optimize is quite important when choosing a single model. Sometimes, however, taking into consideration the "opinion" of several different models is interesting from the point of view of generalization. We are going to apply this concept with the objective of evaluating, one more time, which countries will probably suffer a crisis in 2020.
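The manual vote built in the next cells could also be expressed directly in scikit-learn. As a sketch, assuming the transformer and the train/test split defined above, a hard-voting ensemble over the same four pipelines would look like this:

from sklearn.ensemble import VotingClassifier

# Majority ("hard") vote over the four classifiers used above
voting = VotingClassifier(estimators=[
    ('dt', Pipeline([('columnTrans', crises_trans), ('tree', tree.DecisionTreeClassifier(random_state=seed))])),
    ('rf', Pipeline([('columnTrans', crises_trans), ('rf', RandomForestClassifier(random_state=seed))])),
    ('svc', Pipeline([('columnTrans', crises_trans), ('svc', SVC(random_state=seed))])),
    ('nn', Pipeline([('columnTrans', crises_trans), ('nn', MLPClassifier(random_state=seed))]))],
    voting='hard')

voting.fit(train_atts, train_label)
print(eval_model(test_label, voting.predict(test_atts), 1))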

In [67]:
# Sum of votes of classifiers
df_crises_fut['crises'] = df_crises_fut['crises_dt'] \
                        + df_crises_fut['crises_rf'] \
                        + df_crises_fut['crises_svc'] \
                        + df_crises_fut['crises_nn']
In [68]:
# Sort values by classifier votes
df_crises_fut.sort_values(by=['crises'], ascending = False)
Out[68]:
GEO crises_dt crises_rf crises_svc crises_nn crises
27 UK 1.0 1.0 1.0 1.0 4.0
3 HR 1.0 0.0 1.0 1.0 3.0
4 CY 1.0 1.0 0.0 1.0 3.0
6 DK 1.0 1.0 0.0 1.0 3.0
17 LU 1.0 0.0 1.0 1.0 3.0
1 BE 1.0 0.0 1.0 0.0 2.0
26 SE 1.0 1.0 0.0 0.0 2.0
25 ES 1.0 1.0 0.0 0.0 2.0
20 PL 0.0 0.0 1.0 1.0 2.0
19 NO 1.0 0.0 1.0 0.0 2.0
16 LT 1.0 0.0 0.0 1.0 2.0
0 AT 1.0 0.0 0.0 1.0 2.0
9 FR 0.0 1.0 1.0 0.0 2.0
8 FI 0.0 1.0 1.0 0.0 2.0
2 BG 1.0 0.0 1.0 0.0 2.0
13 IE 0.0 1.0 0.0 0.0 1.0
15 LV 1.0 0.0 0.0 0.0 1.0
10 DE 0.0 1.0 0.0 0.0 1.0
21 PT 0.0 1.0 0.0 0.0 1.0
23 SK 0.0 1.0 0.0 0.0 1.0
24 SI 1.0 0.0 0.0 0.0 1.0
5 CZ 1.0 0.0 0.0 0.0 1.0
14 IT 0.0 1.0 0.0 0.0 1.0
12 HU 0.0 0.0 0.0 0.0 0.0
11 GR 0.0 0.0 0.0 0.0 0.0
18 NL 0.0 0.0 0.0 0.0 0.0
7 EE 0.0 0.0 0.0 0.0 0.0
22 RO 0.0 0.0 0.0 0.0 0.0

In conclusion, this is a simple capstone built on the power of our application to combine databases from diverse sources. We have built a simple dataset as the result of joining different databases, using our parameter predictions to fill in the single prediction for 2020 (for each parameter). Over the generated models we have applied a voting system, but many other techniques could be applied and, of course, better parameter configurations could be obtained for our models through an optimized search. Finally, we have to stress that we have worked on a complex generated dataset with a lot of missing values, and there are considerable risks behind decisions based on those predictions, especially when the policies to be applied mean drastic measures. It is obvious that sometimes the data does not lead to knowledge, but this kind of transparency is necessary in data science.