OneHotEncoder vs LabelEncoder vs pandas get_dummies — How and Why?
2 min read · Mar 29, 2022
Code First for the Quick Birds:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer

# define example
values = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'])
print("Data:", values)

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print("Label Encoder:", integer_encoded)

# one-hot encode (in scikit-learn >= 1.2 the argument is sparse_output=False)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print("OneHot Encoder:", onehot_encoded)

# binary encode
lb = LabelBinarizer()
print("Label Binarizer:", lb.fit_transform(values))
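Each of these encoders can also map codes back to the original strings via inverse_transform — a small sketch (with a shortened version of the data above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

values = np.array(['cold', 'warm', 'hot'])
le = LabelEncoder()
codes = le.fit_transform(values)

# classes are ordered alphabetically: cold=0, hot=1, warm=2
print(codes)                        # [0 2 1]
print(le.inverse_transform(codes))  # ['cold' 'warm' 'hot']
```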
Code: pandas get_dummies() vs sklearn OneHotEncoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# fit-time data contains only three categories
data = pd.DataFrame(
    pd.Series(['good', 'bad', 'worst', 'good', 'good', 'bad']))
# new data at transform time contains two unseen categories
new_data = pd.DataFrame(
    pd.Series(['good', 'bad', 'worst', 'good', 'good', 'bad', 'excellent', 'perfect']))

df = pd.get_dummies(data)
print(df)
#    0_bad  0_good  0_worst
# 0      0       1        0
# 1      1       0        0
# 2      0       0        1
# 3      0       1        0
# 4      0       1        0
# 5      1       0        0

# pd.get_dummies(new_data) would instead produce five columns
# (0_bad, 0_excellent, 0_good, 0_perfect, 0_worst), so the encoded
# shape would no longer match the training data.

# A OneHotEncoder fitted on data keeps the three known columns and
# encodes the unseen categories as all zeros
# (in scikit-learn >= 1.2 the argument is sparse_output=False)
encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)
encoder.fit(data)
encoder.transform(new_data)
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]])
Insights:
- In the first example, OneHotEncoder is applied to the integer-encoded values, while LabelBinarizer works on the string labels directly. (Since scikit-learn 0.20, OneHotEncoder can also encode string categories directly, as the second example shows.)
- Scikit-learn suggests using OneHotEncoder for the X matrix, i.e. the features you feed into a model, and LabelBinarizer for the y labels.
- They are quite similar, except that OneHotEncoder can return a sparse matrix, which saves a lot of memory, and you won't really need that for y labels.
- Even in a multi-label, multi-class problem, you can use MultiLabelBinarizer for your y labels rather than switching to OneHotEncoder for multi-hot encoding.
- For machine learning, you almost certainly want sklearn's OneHotEncoder. For other tasks like simple analyses, pd.get_dummies may be enough and is a bit more convenient.
- The crux of it is that the sklearn encoder is a fitted transformer which persists and can then be applied to new data sets that use the same categorical variables, with consistent results.
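To illustrate the MultiLabelBinarizer point: each sample can carry several labels at once, and the result is a multi-hot matrix. A minimal sketch (the genre labels below are made-up example data):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# each sample is a collection of labels, not a single label
y = [['action', 'comedy'], ['drama'], ['action', 'drama', 'thriller']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)

# one column per distinct label, ordered alphabetically
print(mlb.classes_)  # ['action' 'comedy' 'drama' 'thriller']
print(Y)
# [[1 1 0 0]
#  [0 0 1 0]
#  [1 0 1 1]]
```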
A quick summary:
LabelEncoder — for labels (response variable); codes 0, 1, 2 … [implies order]
OrdinalEncoder — for features; codes 0, 1, 2 … [implies order]
LabelBinarizer — for the response variable; codes 0 & 1 [creates multiple dummy columns]
OneHotEncoder — for feature variables; codes 0 & 1 [creates multiple dummy columns]
pd.get_dummies() — for feature variables; codes 0 & 1 [creates multiple dummy columns]
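The LabelEncoder/OrdinalEncoder split in the summary can be shown in a few lines: OrdinalEncoder works on 2-D feature matrices (and lets you fix the category order), while LabelEncoder works on a 1-D label vector and always orders alphabetically. A small sketch reusing the temperature data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

X = np.array([['cold'], ['warm'], ['hot'], ['cold']])  # 2-D feature column
y = np.array(['cold', 'warm', 'hot', 'cold'])          # 1-D label vector

# fix the category order so the integers reflect a real ordering
ord_enc = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
print(ord_enc.fit_transform(X).ravel())  # [0. 1. 2. 0.]

# LabelEncoder orders alphabetically: cold=0, hot=1, warm=2
print(LabelEncoder().fit_transform(y))   # [0 2 1 0]
```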