Machine learning
Chichi, 2015-12-04 15:08:55

Representing categorical and ordinal data in linear regression (machine learning)?

I'm trying to fully understand the difference between representing categorical and ordinal data when doing a regression. As I understand it, the rules are as follows:
Categorical variable, example:
Color: red, white, black
Why categorical: red < white < black is logically wrong
Ordinal variable, example:
Condition: old, restored, new
Why ordinal: old < restored < new is logically correct
Methods for converting categorical and ordinal data to numerical form:
One-hot (direct) encoding for categorical data
Ordinal (integer) encoding for ordinal data.
An example of converting categorical data into numbers:
data = {'color': ['blue', 'green', 'green', 'red']}
In numerical form:

id          Blue       Green      Red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
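
For illustration, here is a minimal sketch of producing this one-hot table in Python with pandas (assuming pandas is available; the column names follow the example above):

import pandas as pd

# Categorical example data from above
data = {'color': ['blue', 'green', 'green', 'red']}
df = pd.DataFrame(data)

# One-hot (direct) encoding: one 0/1 column per color
one_hot = pd.get_dummies(df['color']).astype(int)
print(one_hot)
#    blue  green  red
# 0     1      0    0
# 1     0      1    0
# 2     0      1    0
# 3     0      0    1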

Example of converting ordinal data into numbers:
data = {'con': ['old', 'new', 'new', 'restored']}
In numerical form, after the ordinal mapping old < restored < new → 0, 1, 2:
id  con
0    0
1    2
2    2
3    1
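
A minimal sketch of the same ordinal mapping in Python (the order old < restored < new is taken from the example above):

import pandas as pd

# Ordinal example data from above
data = {'con': ['old', 'new', 'new', 'restored']}
df = pd.DataFrame(data)

# Explicit order: old < restored < new -> 0, 1, 2
order = {'old': 0, 'restored': 1, 'new': 2}
df['con_code'] = df['con'].map(order)
print(df)
#         con  con_code
# 0       old         0
# 1       new         2
# 2       new         2
# 3  restored         1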

In my data I have a "color" property: as the color changes from white to black, the price goes up. According to the rules above, I should use one-hot (direct) encoding for this categorical variable, but I can't see why I couldn't use an ordinal representation instead. Below are the observations that led to my question.
First, let's introduce the linear regression formula:
price = θ0 + θ1·x1 + θ2·x2 + … + θn·xn
Now let's look at the different data representations for the "color" property of the first two data items. The one-hot columns are the direct encoding, and the last column is my ordinal coding:

id   Color    White   Red   Black   Ordinal code
1    white      1      0      0         10
2    red        0      1      0         20
Now let's try to predict the price for data items 1 and 2 using the formula for both representations:
Direct coding:
In this case each color gets its own theta (coefficient). For example, suppose the coefficients for the three colors are 20, 50, and 100. The predictions are:
Price (1st element) = 0 + 20*1 + 50*0 + 100*0 = $20
Price (2nd element) = 0 + 20*0 + 50*1 + 100*0 = $50
Ordinal coding:
In this case all colors share a single theta (coefficient), but with different multipliers (my ordinal codes). The predictions are:
Price (1st element) = 0 + 20*10 = $200
Price (2nd element) = 0 + 20*20 = $400
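
The same two calculations as a small numeric sketch (the thetas 20/50/100 and the ordinal codes 10/20 are the assumed numbers from above):

import numpy as np

# One-hot (direct) coding: each color has its own coefficient
theta_onehot = np.array([20, 50, 100])   # white, red, black
x1 = np.array([1, 0, 0])                 # item 1: white
x2 = np.array([0, 1, 0])                 # item 2: red
print(theta_onehot @ x1)                 # 20
print(theta_onehot @ x2)                 # 50

# Ordinal coding: one shared coefficient, different codes
theta_ordinal = 20
print(theta_ordinal * 10)                # 200 (item 1, code 10)
print(theta_ordinal * 20)                # 400 (item 2, code 20)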
In my model, white < red < black in terms of price. The relationship seems to hold in both cases, and the predictions look reasonable for both the ordinal and the one-hot representations. Does that mean that, no matter what type of data I have (ordinal or categorical), I can use either method of converting it to numerical form? Is the division into two types just a convention, more a computer-centric view than a constraint of the regression logic itself? Will the regression model be correct in both cases?
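
To check this for myself, here is a hedged sketch that fits both representations with scikit-learn on a toy dataset where the per-color price steps happen to be equal (all names and numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy prices: white, red, black contribute 20, 50, 80 (evenly spaced)
colors = ['white', 'red', 'black', 'white', 'red', 'black']
price = np.array([20, 50, 80, 20, 50, 80], dtype=float)

# One-hot features
idx = {'white': 0, 'red': 1, 'black': 2}
X_hot = np.zeros((len(colors), 3))
X_hot[np.arange(len(colors)), [idx[c] for c in colors]] = 1

# Ordinal features: white < red < black -> 0, 1, 2
X_ord = np.array([[idx[c]] for c in colors], dtype=float)

for name, X in [('one-hot', X_hot), ('ordinal', X_ord)]:
    model = LinearRegression().fit(X, price)
    print(name, model.predict(X))   # both reproduce 20/50/80 exactly here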

1 answer
SeptiM, 2015-12-04
@ChicoId

1. In fact, all of machine learning comes down to solving optimization problems. There is a set of constraints and a function that needs to be optimized (minimized or maximized). In your case, you are most likely minimizing the mean squared error. Split the sample into two parts, train on the training part, and compute the error on the hold-out part. That value is the quality criterion of your model.
2. If there are several models and it is not clear which one to choose, divide the sample into three parts: on the first, train the models; on the second, select the model with the best performance; on the third, compute the value of the optimized function for the winner of the previous step. That is the same quality criterion.
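
A minimal sketch of that three-way split, assuming scikit-learn is available (the 60/20/20 proportions are just an illustrative choice):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: X is the feature matrix, y the target (e.g. price)
X = np.arange(20).reshape(-1, 1)
y = 3 * X.ravel() + 5

# First split off the test part, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Train candidate models on (X_train, y_train), pick the best on (X_val, y_val),
# and report its error on (X_test, y_test).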
3. Conclusion: theory is good, but it's better to honestly compare models against data.
4. Theory. If you represent one category with several variables, the dimensionality grows. For example, if color contributes to the price as white = 0, red = 10, black = 20, then one model captures it as 10 * x_color (with ordinal codes 0, 1, 2), and the other as 0 * x_white + 10 * x_red + 20 * x_black. But if the situation is instead white = 0, red = 10, black = 100, the first model can no longer represent it exactly, while the second can still assign the appropriate weights.
In essence, a model with many variables is a generalization of the model with one variable. The only problem is that the number of variables grows...
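
A small sketch of that point, fitting both encodings with numpy least squares on the values white = 0, red = 10, black = 100 from the example:

import numpy as np

# Target contribution of each color: white = 0, red = 10, black = 100
y = np.array([0.0, 10.0, 100.0])

# Ordinal encoding: one column with codes 0, 1, 2, plus an intercept
X_ord = np.column_stack([np.ones(3), [0, 1, 2]])
coef_ord, *_ = np.linalg.lstsq(X_ord, y, rcond=None)
print(X_ord @ coef_ord)   # approx [-13.3, 36.7, 86.7]: cannot hit 0/10/100 exactly

# One-hot encoding: one column per color
X_hot = np.eye(3)
coef_hot, *_ = np.linalg.lstsq(X_hot, y, rcond=None)
print(X_hot @ coef_hot)   # [0., 10., 100.]: exact fit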
