strcpy, 2017-03-26 00:16:49
Machine learning

Which model to choose for estimating the salary of a technology developer?

I downloaded the Stack Overflow survey data, about 60 MB of CSV. It contains about 6K records for the Russian Federation, including developers' salaries, the technologies they use, and their age. Each record can list one or more technologies. Based on this data, I want to build a simple site where a user enters their technologies and age and gets an estimate of their market salary.
Problem: I don't know which model to fit to the data. I tried converting the technology labels to binary columns:


js   css  java  ...  salary
0    0    1     ...  2k
1    0    0     ...  2.3k
etc.
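
The conversion to binary columns described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: it assumes each record stores its technologies in a single semicolon-separated string, and the function name `one_hot` and the sample strings are made up for the example.

```python
def one_hot(records, sep=";"):
    """Turn 'js; css' style strings into rows of 0/1 flags, one column per technology."""
    # Split each record into a set of cleaned, lower-cased technology names.
    skill_sets = [
        {name.strip().lower() for name in rec.split(sep) if name.strip()}
        for rec in records
    ]
    # Fix the column order so that every row uses the same layout.
    columns = sorted(set().union(*skill_sets))
    rows = [[1 if col in skills else 0 for col in columns] for skills in skill_sets]
    return columns, rows


# Hypothetical sample records (assumption: semicolon-separated technology lists).
columns, rows = one_hot(["Java", "JS; CSS", "js; java"])
print(columns)
for row in rows:
    print(row)
```

Lower-casing and trimming the names keeps "JS" and "js" in the same column, which matters when the labels are typed inconsistently.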

There are about 30 binary columns in total. I split the sample into training and test sets 9:1 and fitted a linear regression; validating on the test data gives a huge root-mean-square error of about $1K. It seems to me that my model is wrong. How would you solve this problem?
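
Whether an RMSE of $1K is large depends on how widely the salaries themselves are spread. One way to judge it is to compare against a baseline that ignores all features and always predicts the mean training salary. The sketch below shows the 9:1 split and that baseline; the salary figures are invented for illustration, and the helper names `rmse` and `split_9_to_1` are assumptions, not part of the original question.

```python
import math
import random


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def split_9_to_1(items, seed=0):
    """Shuffle a copy of the data and split it 9:1 into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]


# Made-up monthly salaries in dollars, for illustration only.
salaries = [1500, 2000, 2300, 1800, 3200, 900, 2600, 2100, 1200, 4000,
            1700, 2500, 1900, 2800, 1100, 3500, 2200, 1600, 3000, 2400]
train, test = split_9_to_1(salaries)

# Baseline: ignore every feature and always predict the training mean.
mean_salary = sum(train) / len(train)
baseline = rmse(test, [mean_salary] * len(test))
print(f"baseline RMSE: {baseline:.0f}")
```

If the regression's RMSE is close to this baseline, the binary columns add little information; if it is clearly lower, the model is learning something even though the absolute error looks large.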
Thank you.


2 answer(s)
Sergey Sokolov, 2017-03-26
@sergiks

Maybe take the most similar skill sets and average their salaries, weighting each by its "distance" from the requested set? That is, no ML, just search.
For example, we want a salary for the skill set [A, B, C]. The database contains these records with at least 2 of the required skills:
A, B, C: $X_1 (exact match, distance 0)
A, B, C, D: $X_2 (1 extra skill, distance 1)
A, C: $X_3 (1 skill missing, distance 1)
A, C, F: $X_4 (1 extra, 1 missing = distance 2)
"Distance" (Dist) is the number of skills that differ (extra + missing). To weight the records, for example, square each set's distance to the required set and divide its salary by (1 + Dist^2):
Expected salary: ($X_1/(1+0^2) + $X_2/(1+1^2) + $X_3/(1+1^2) + $X_4/(1+2^2)) / 4
Or, to discount irrelevant data more sharply, divide by e raised to the power Dist:
($X_1/e^0 + $X_2/e^1 + $X_3/e^1 + $X_4/e^2 + ... + $X_n/e^(Dist_n)) / n
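
The search described above could be sketched like this. It is only an illustration under assumptions: the function name `estimate_salary`, its parameters, and the salary figures are invented. With `normalize=False` it reproduces the formula from the answer (divide by the number of matches). The `normalize=True` variant is an addition, not part of the answer: it divides by the sum of the weights instead, because dividing by n pulls the estimate below the matched salaries whenever any weight is less than 1.

```python
import math


def estimate_salary(query, records, min_shared=2, weight="inverse_square", normalize=False):
    """Distance-weighted salary estimate over records that share enough skills."""
    query = set(query)
    weighted = []
    for skills, salary in records:
        skills = set(skills)
        if len(query & skills) < min_shared:
            continue  # too few required skills in common
        dist = len(query ^ skills)  # extra + missing skills
        if weight == "inverse_square":
            w = 1 / (1 + dist ** 2)
        else:
            w = math.exp(-dist)  # sharper discount: divide by e^Dist
        weighted.append((w, salary))
    if not weighted:
        return None
    total = sum(w * s for w, s in weighted)
    divisor = sum(w for w, _ in weighted) if normalize else len(weighted)
    return total / divisor


# Made-up salaries for the four example records (distances 0, 1, 1, 2).
records = [
    ({"A", "B", "C"}, 2000),
    ({"A", "B", "C", "D"}, 2400),
    ({"A", "C"}, 1800),
    ({"A", "C", "F"}, 1600),
]
print(estimate_salary(["A", "B", "C"], records))
print(estimate_salary(["A", "B", "C"], records, normalize=True))
```

On these sample numbers the unnormalized formula gives about 1105, lower than every matched salary, while the normalized variant gives about 2009.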

xmoonlight, 2017-03-26
@xmoonlight

You need to solve a system of linear equations and find the technology complexity coefficients (age has nothing to do with it):

k_11*x_11 + ... + k_1N*x_1N = b_1
...
k_N1*x_N1 + ... + k_NN*x_NN = b_N

where k_ij are the complexity coefficients of the technologies x_ij, and
b_i are the developers' salaries.
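
One possible reading of this system, which is an assumption and not something the answer states, is that each technology j has a single coefficient k_j and each x_ij is 1 if developer i uses technology j and 0 otherwise. With more developers than technologies, such a system usually has no exact solution, so the sketch below finds the least-squares coefficients through the normal equations. That is the same kind of fit as a linear regression without an intercept. The function names and the sample data are invented for illustration.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so that the inputs stay untouched.
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is singular: coefficients are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_coefficients(rows, salaries):
    """Least-squares coefficients via the normal equations (X^T X) k = X^T b."""
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xtb = [sum(r[i] * b for r, b in zip(rows, salaries)) for i in range(m)]
    return solve(xtx, xtb)


# Hypothetical data: columns are js, css, java; one row per developer.
rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
salaries = [1000, 800, 1500, 1800, 2500]
print(fit_coefficients(rows, salaries))
```

The sample salaries were chosen to be exactly additive, so the fit recovers one coefficient per technology; real survey data will not be that clean, and the singular-system error appears when two technologies always occur together.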
