[100 Days of ML Code] OneHotEncoder
One-hot encoding is the process of converting categorical variables into a form that ML algorithms can use to make better predictions.
Typically, we start with a dataset like this:
╔═════════════╦═══════╗
║ CompanyName ║ Price ║
╠═════════════╬═══════╣
║ VW          ║ 20000 ║
║ Acura       ║ 10011 ║
║ Honda       ║ 50000 ║
║ Honda       ║ 10000 ║
╚═════════════╩═══════╝
Many ML algorithms cannot work directly with labels stored as strings, so we first convert each label into a numeric value:
╔═════════════╦══════════════════╦═══════╗
║ CompanyName ║ CategoricalValue ║ Price ║
╠═════════════╬══════════════════╬═══════╣
║ VW          ║ 1                ║ 20000 ║
║ Acura       ║ 2                ║ 10011 ║
║ Honda       ║ 3                ║ 50000 ║
║ Honda       ║ 3                ║ 10000 ║
╚═════════════╩══════════════════╩═══════╝
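A minimal sketch of this integer-encoding step in plain Python, assuming the same four rows as the table. The names `encode_labels` and `company_names` are made up for illustration and do not come from any library; codes start at 1 and follow order of first appearance so that they match the table.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    for label in labels:
        if label not in codes:
            codes[label] = len(codes) + 1  # start at 1 to match the table
    return [codes[label] for label in labels], codes


# Assumed input: the CompanyName column from the table
company_names = ["VW", "Acura", "Honda", "Honda"]

categorical_values, mapping = encode_labels(company_names)
print(categorical_values)  # [1, 2, 3, 3]
print(mapping)             # {'VW': 1, 'Acura': 2, 'Honda': 3}
```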
Now, CategoricalValue is a number that stands in for the CompanyName.
However, the problem with the above table is that the numbers imply an order: a model may assume that the higher the categorical value, the better the category (for example, Honda = 3 > Acura = 2 > VW = 1), and we know that is not correct.
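To make the problem concrete, here is a small illustration in Python. The `codes` mapping is the assumed one from the table, and the arithmetic only shows what a model that treats the codes as ordinary numbers could end up concluding.

```python
# Assumed integer codes, taken from the CategoricalValue column
codes = {"VW": 1, "Acura": 2, "Honda": 3}

# Treated as numbers, the "average" of VW and Honda equals Acura,
# which is meaningless for car brands.
average_code = (codes["VW"] + codes["Honda"]) / 2
print(average_code == codes["Acura"])  # True

# Honda also looks twice as far from VW as Acura does.
honda_gap = abs(codes["Honda"] - codes["VW"])
acura_gap = abs(codes["Acura"] - codes["VW"])
print(honda_gap, acura_gap)  # 2 1
```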
Therefore, we use one-hot encoding, which makes our table look like this:
╔════╦═══════╦═══════╦═══════╗
║ VW ║ Acura ║ Honda ║ Price ║
╠════╬═══════╬═══════╬═══════╣
║ 1  ║ 0     ║ 0     ║ 20000 ║
║ 0  ║ 1     ║ 0     ║ 10011 ║
║ 0  ║ 0     ║ 1     ║ 50000 ║
║ 0  ║ 0     ║ 1     ║ 10000 ║
╚════╩═══════╩═══════╩═══════╝
A one-hot encoder performs a "binarization" of the category.
The CategoricalValue column is removed and replaced with 3 columns, because we have 3 distinct labels (VW, Acura, Honda).
You can think of the columns as is_vw, is_acura, and is_honda, respectively.
While this table looks overly complicated, it works very well with most ML algorithms, because no category is treated as larger than another.
The binary variables (VW, Acura, Honda) are known as "dummy variables".
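The transformation can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the data is the four-row table used in this post; `one_hot_encode` is a hypothetical helper, not a library function.

```python
def one_hot_encode(labels):
    """Return (categories, rows); each row holds a single 1 marking its label."""
    # Distinct labels in order of first appearance
    categories = list(dict.fromkeys(labels))
    rows = [
        [1 if label == category else 0 for category in categories]
        for label in labels
    ]
    return categories, rows


# Assumed input: the CompanyName and Price columns from the table
company_names = ["VW", "Acura", "Honda", "Honda"]
prices = [20000, 10011, 50000, 10000]

categories, encoded = one_hot_encode(company_names)

print(categories + ["Price"])  # ['VW', 'Acura', 'Honda', 'Price']
for row, price in zip(encoded, prices):
    print(row + [price])
# [1, 0, 0, 20000]
# [0, 1, 0, 10011]
# [0, 0, 1, 50000]
# [0, 0, 1, 10000]
```

In practice you would usually reach for a library instead of writing this by hand, for example scikit-learn's `OneHotEncoder` or pandas' `get_dummies`.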
Resources
https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science
https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/