Transform category columns to boolean

10 Dec 2017

Input values for logistic (logit) regression should be numerical vectors, such as:

  • Integer - for example, the number of products sold…
  • Real - for example, the milliseconds of a user session
  • Boolean - married or not married

Categorical values are not numerical vectors: you cannot measure a distance, for example, between married and single, so in that form they are not useful as input for logistic regression. Columns of this type should be expanded into boolean columns. The pandas library has a function, get_dummies, that does this easily. The next example is retrieved from this source.

Note: Before the expansion, the example does some cleaning: categories that are too similar to be worth keeping apart are aggregated into one.
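The aggregation mentioned in the note can be sketched on a toy column. This is a minimal illustration that assumes the three "basic" education levels are the ones being merged (as in the script further down); the sample values are made up, and Series.replace is used here as an alternative to the np.where calls of the original script.

```python
import pandas as pd

# Made-up sample of the 'education' column.
education = pd.Series(
    ["basic.4y", "basic.6y", "basic.9y", "high.school", "unknown"]
)

# Map the three "basic.*" levels onto one label; other values pass through.
merged = education.replace(
    {"basic.4y": "Basic", "basic.6y": "Basic", "basic.9y": "Basic"}
)

print(merged.unique())
```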

Each categorical column is transformed into as many columns as it has categories, so every column is now a numerical vector.

marital    marital_married  marital_single  marital_divorced  marital_unknown
married    1                0               0                 0
single     0                1               0                 0
divorced   0                0               1                 0
unknown    0                0               0                 1
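The table can be reproduced with a few lines of pandas. This is only a sketch on a four-row toy frame, not the banking data; passing dtype=int is a choice made here to get 0/1 values instead of booleans. Note that get_dummies orders the new columns alphabetically, so the column order differs from the table.

```python
import pandas as pd

# Toy data: one categorical column holding the four values from the table.
df = pd.DataFrame({"marital": ["married", "single", "divorced", "unknown"]})

# One indicator column per category, named <prefix>_<value>.
# dtype=int yields 0/1 instead of True/False.
dummies = pd.get_dummies(df["marital"], prefix="marital", dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that matches its original category.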

The input data can be retrieved from here.

columns of data initial:
[‘age’ ‘job’ ‘marital’ ‘education’ ‘default’ ‘housing’ ‘loan’ ‘contact’
‘month’ ‘day_of_week’ ‘duration’ ‘campaign’ ‘pdays’ ‘previous’ ‘poutcome’
‘emp_var_rate’ ‘cons_price_idx’ ‘cons_conf_idx’ ‘euribor3m’ ‘nr_employed’ ‘y’]

columns of data final:
[‘age’ ‘duration’ ‘campaign’ ‘pdays’ ‘previous’ ‘emp_var_rate’
‘cons_price_idx’ ‘cons_conf_idx’ ‘euribor3m’ ‘nr_employed’ ‘y’

‘job_admin.’
‘job_blue-collar’
‘job_entrepreneur’
‘job_housemaid’
‘job_management’
‘job_retired’
‘job_self-employed’
‘job_services’
‘job_student’
‘job_technician’
‘job_unemployed’
‘job_unknown’

‘marital_divorced’
‘marital_married’
‘marital_single’
‘marital_unknown’

‘education_Basic’
‘education_high.school’
‘education_illiterate’
‘education_professional.course’
‘education_university.degree’
‘education_unknown’

‘default_no’
‘default_unknown’
‘default_yes’

‘housing_no’
‘housing_unknown’
‘housing_yes’

‘loan_no’
‘loan_unknown’
‘loan_yes’

‘contact_cellular’
‘contact_telephone’

‘month_apr’
‘month_aug’
‘month_dec’
‘month_jul’
‘month_jun’
‘month_mar’
‘month_may’
‘month_nov’
‘month_oct’
‘month_sep’

‘day_of_week_fri’
‘day_of_week_mon’
‘day_of_week_thu’
‘day_of_week_tue’
‘day_of_week_wed’

‘poutcome_failure’
‘poutcome_nonexistent’
‘poutcome_success’]

import numpy as np
import pandas as pd


def main():
    # Load the data and drop rows with missing values
    data = pd.read_csv('banking.csv', header=0)
    data = data.dropna()
    print(data.shape)

    print("reducing some enumerated values...")
    # Merge the three "basic" education levels into a single category
    for level in ('basic.9y', 'basic.6y', 'basic.4y'):
        data['education'] = np.where(data['education'] == level, 'Basic', data['education'])

    # format() must be applied to the string, not to the result of print()
    print("columns of data initial: {}".format(data.columns.values))

    print("creating dummy variables...")
    cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                'contact', 'month', 'day_of_week', 'poutcome']
    for var in cat_vars:
        # One boolean column per category value, named <var>_<value>
        cat_list = pd.get_dummies(data[var], prefix=var)
        data = data.join(cat_list)

    # Keep everything except the original categorical columns
    data_vars = data.columns.values.tolist()
    to_keep = [i for i in data_vars if i not in cat_vars]

    data_final = data[to_keep]
    print("columns of data final: {}".format(data_final.columns.values))


if __name__ == "__main__":
    main()
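As an alternative to joining the dummy columns one by one and filtering afterwards, get_dummies also accepts a whole DataFrame together with a columns argument. The sketch below uses a made-up three-row frame (not banking.csv) with only two categorical columns, and dtype=int is again a choice made for readability.

```python
import pandas as pd

# Made-up miniature of the data: one numeric and two categorical columns.
data = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
    "contact": ["cellular", "telephone", "cellular"],
})

cat_vars = ["marital", "contact"]

# Encodes every listed column and drops the originals in the same call.
data_final = pd.get_dummies(data, columns=cat_vars, dtype=int)

print(data_final.columns.tolist())
```

The untouched columns come first, followed by the indicator columns of each encoded variable.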

Two other ways to do the same without pandas. Here the last column of X (index D-1) is a category with 4 values:

    # create a new matrix X2 with the correct number of columns
    N, D = X.shape
    X2 = np.zeros((N, D+3))
    X2[:,0:(D-1)] = X[:,0:(D-1)] # non-categorical

    # one-hot
    for n in range(N):
        t = int(X[n,D-1])
        X2[n,t+D-1] = 1

    # method 2
    # Z = np.zeros((N, 4))
    # Z[np.arange(N), X[:,D-1].astype(np.int32)] = 1
    # # assign: X2[:,-4:] = Z
    # assert(np.abs(X2[:,-4:] - Z).sum() < 1e-10)
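The fragment above assumes that X already exists. A self-contained version could look like the sketch below, where X is a hypothetical 5x3 matrix whose last column holds category codes 0 to 3, and K is a name introduced here for the number of categories (the fragment hard-codes it as 4, hence the D+3 columns).

```python
import numpy as np

# Hypothetical input: two numeric features plus a category code in {0, 1, 2, 3}.
X = np.array([
    [0.5, 1.2, 0],
    [1.5, 0.3, 2],
    [2.5, 2.2, 3],
    [0.1, 0.9, 1],
    [3.3, 1.1, 2],
])
N, D = X.shape
K = 4  # number of categories in the last column

# New matrix: the D-1 non-categorical columns plus K indicator columns.
X2 = np.zeros((N, D - 1 + K))
X2[:, :D - 1] = X[:, :D - 1]

# Method 1: set the matching indicator row by row.
for n in range(N):
    t = int(X[n, D - 1])
    X2[n, D - 1 + t] = 1

# Method 2: integer-array indexing sets all rows at once.
Z = np.zeros((N, K))
Z[np.arange(N), X[:, D - 1].astype(np.int32)] = 1

# Both methods give the same indicator block.
assert np.array_equal(X2[:, -K:], Z)
```

Method 2 avoids the Python loop, which matters mostly when N is large.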


2017 by Martín Alcubierre Arenillas.
Content available under Creative Commons (BY-NC-SA) unless otherwise noted.