[100 Days of ML Code] OneHotEncoder

One-hot encoding is a process by which categorical variables are converted into a form that ML algorithms can use to make better predictions.

Normally, we will have a dataset of:
╔═════════════╦════════╗
║ CompanyName ║ Price  ║
╠═════════════╬════════╣
║ VW          ║ 20000  ║
║ Acura       ║ 10011  ║
║ Honda       ║ 50000  ║
║ Honda       ║ 10000  ║
╚═════════════╩════════╝

Many ML algorithms cannot work with label data (as a string) directly. Therefore, we need to convert these labels into some numeric value:

╔═════════════╦══════════════════╦════════╗
║ CompanyName ║ CategoricalValue ║ Price  ║
╠═════════════╬══════════════════╬════════╣
║ VW          ║ 1                ║ 20000  ║
║ Acura       ║ 2                ║ 10011  ║
║ Honda       ║ 3                ║ 50000  ║
║ Honda       ║ 3                ║ 10000  ║
╚═════════════╩══════════════════╩════════╝

Now, CategoricalValue is a numeric code that represents the CompanyName.
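
As a rough sketch of this step, the snippet below assigns each distinct company name an integer in order of first appearance. The variable names (companies, codes, categorical_values) are made up for illustration, and starting the codes at 1 is only an assumption chosen to match the table.

```python
# Sample data mirroring the CompanyName column of the table.
companies = ["VW", "Acura", "Honda", "Honda"]

# Give each distinct label an integer code in order of first appearance.
# Starting at 1 is an assumption made to match the table above.
codes = {}
for name in companies:
    if name not in codes:
        codes[name] = len(codes) + 1

# Replace every label with its integer code.
categorical_values = [codes[name] for name in companies]
print(categorical_values)  # [1, 2, 3, 3]
```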

However, the problem with the above table is that the numeric codes imply an order: a model may assume that the higher the categorical value, the better the category is (Honda > Acura > VW), and we know that is not correct.
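
To make the problem concrete, here is a small sketch using the same integer codes (the codes dictionary is just an illustrative stand-in for the CategoricalValue column). Ordinary arithmetic on the codes produces relationships that mean nothing for car brands.

```python
# Integer codes taken from the CategoricalValue column (illustrative mapping).
codes = {"VW": 1, "Acura": 2, "Honda": 3}

# The codes can be compared, which suggests Honda "ranks above" VW.
honda_greater = codes["Honda"] > codes["VW"]

# The codes can be averaged, which suggests Acura is "halfway between" VW and Honda.
midpoint = (codes["VW"] + codes["Honda"]) / 2
acura_is_midpoint = midpoint == codes["Acura"]

print(honda_greater)      # True
print(acura_is_midpoint)  # True
```

Neither result says anything real about the companies, but an algorithm that treats the column as a number has no way to know that.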

Therefore, we use OneHotEncoding, which makes our table look like this:
╔════╦═══════╦═══════╦════════╗
║ VW ║ Acura ║ Honda ║ Price  ║
╠════╬═══════╬═══════╬════════╣
║ 1  ║ 0     ║ 0     ║ 20000  ║
║ 0  ║ 1     ║ 0     ║ 10011  ║
║ 0  ║ 0     ║ 1     ║ 50000  ║
║ 0  ║ 0     ║ 1     ║ 10000  ║
╚════╩═══════╩═══════╩════════╝

A one-hot encoder performs "binarization" of the category.
The CategoricalValue column is removed and replaced with 3 columns because we have 3 different labels (VW, Acura, Honda).
You can think of the columns as is_vw, is_acura, and is_honda, respectively.
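
The binarization can be sketched in plain Python as follows. This is only a minimal illustration, not a library implementation: the helper one_hot and the variable names are hypothetical, and the column order (VW, Acura, Honda) simply follows the order in which the labels first appear.

```python
# Sample rows mirroring the original table: (CompanyName, Price).
rows = [("VW", 20000), ("Acura", 10011), ("Honda", 50000), ("Honda", 10000)]

# Collect the distinct labels in order of first appearance.
categories = []
for name, _price in rows:
    if name not in categories:
        categories.append(name)

def one_hot(name, categories):
    """Return a list with 1 in the position of `name` and 0 elsewhere."""
    return [1 if name == category else 0 for category in categories]

# Replace the label with its binary columns and keep the price as the last column.
encoded = [one_hot(name, categories) + [price] for name, price in rows]

print(categories + ["Price"])
for row in encoded:
    print(row)
```

In practice, libraries such as scikit-learn (OneHotEncoder) or pandas (get_dummies) provide this transformation; the sketch above only shows the idea behind it.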

While this table looks overly complicated, it works very well with most ML algorithms.
The binary variables (VW, Acura, Honda) are known as "dummy variables".


Resources

https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science
https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
