[100 Days of ML Code] OneHotEncoder

One-hot encoding is a process by which categorical variables are converted into a numeric form that ML algorithms can consume, which often helps them make better predictions.

Suppose we have a dataset like this:
║ CompanyName ║ Price  ║
║ VW          ║ 20000  ║
║ Acura       ║ 10011  ║
║ Honda       ║ 50000  ║
║ Honda       ║ 10000  ║

Many ML algorithms cannot work directly with labels stored as strings. Therefore, we need to convert these labels into numeric values:

║ CompanyName ║ CategoricalValue ║ Price  ║
║ VW          ║ 1                ║ 20000  ║
║ Acura       ║ 2                ║ 10011  ║
║ Honda       ║ 3                ║ 50000  ║
║ Honda       ║ 3                ║ 10000  ║

Now, CategoricalValue is a numeric value that represents the CompanyName.
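As a rough illustration of this step, here is a minimal sketch in plain Python. It assumes the integers are handed out in order of first appearance, starting at 1, which is what the table above shows; the variable names are made up for this example, and real libraries may number the categories differently (for example, alphabetically).

```python
# Company names from the example table, in row order
companies = ["VW", "Acura", "Honda", "Honda"]

# Give each distinct label an integer in order of first appearance, starting at 1
codes = {}
for name in companies:
    if name not in codes:
        codes[name] = len(codes) + 1

# Replace every label with its integer code
categorical_values = [codes[name] for name in companies]

print(codes)               # {'VW': 1, 'Acura': 2, 'Honda': 3}
print(categorical_values)  # [1, 2, 3, 3]
```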

However, the problem with the above table is that the numbers imply an order: an algorithm may assume that the higher the categorical value, the better (or "larger") the category is, as if Honda > Acura > VW. We know that is not correct.
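To make the issue concrete, here is a small sketch (the mapping is just the one from the table, hard-coded for illustration). Once the labels are integers, ordinary arithmetic applies to them, and it produces statements that make no sense for car brands:

```python
# Integer codes taken from the table above
codes = {"VW": 1, "Acura": 2, "Honda": 3}

# Arithmetic on the codes suggests that Acura sits "halfway" between VW and Honda
midpoint = (codes["VW"] + codes["Honda"]) / 2
print(midpoint == codes["Acura"])  # True

# Comparisons suggest a ranking that does not exist in the real data
print(codes["Honda"] > codes["VW"])  # True
```

Any algorithm that relies on distances or magnitudes will pick up on these accidental relationships.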

Therefore, we use one-hot encoding, which makes our table look like this:
║ VW ║ Acura ║ Honda ║ Price  ║
║ 1  ║ 0     ║ 0     ║ 20000  ║
║ 0  ║ 1     ║ 0     ║ 10011  ║
║ 0  ║ 0     ║ 1     ║ 50000  ║
║ 0  ║ 0     ║ 1     ║ 10000  ║

A one-hot encoder performs "binarization" of the category.
The CategoricalValue column is removed and replaced with 3 columns because we have 3 different labels (VW, Acura, Honda).
You can think of the columns as is_vw, is_acura, and is_honda, respectively.

While this table looks overly complicated, it works very well with most ML algorithms.
The binary variables (VW, Acura, Honda) are known as "dummy variables".
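The transformation can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the columns are ordered by first appearance, as in the table above; the helper name one_hot is made up here, and in practice you would more likely reach for a library encoder such as scikit-learn's OneHotEncoder.

```python
# Data from the example table
companies = ["VW", "Acura", "Honda", "Honda"]
prices = [20000, 10011, 50000, 10000]

# Distinct categories in order of first appearance (dicts preserve insertion order)
categories = list(dict.fromkeys(companies))  # ['VW', 'Acura', 'Honda']

def one_hot(label, categories):
    """Return a list with 1 in the position of `label` and 0 everywhere else."""
    return [1 if label == category else 0 for category in categories]

# Build the encoded rows: one dummy variable per category, then the price
rows = [one_hot(name, categories) + [price]
        for name, price in zip(companies, prices)]

print(categories + ["Price"])
for row in rows:
    print(row)
# ['VW', 'Acura', 'Honda', 'Price']
# [1, 0, 0, 20000]
# [0, 1, 0, 10011]
# [0, 0, 1, 50000]
# [0, 0, 1, 10000]
```

Each row has exactly one 1 among its dummy variables, so no category is treated as larger or smaller than another.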



