How can you determine a user's age from the ages of their friends?

data mining

efremovaleksey, 2015-07-15 09:35:58

There is a user base in CSV format, with rows like this:
age_of_person; age_of_his_friend_n1; age_of_his_friend_n2; age_of_his_friend_n3; age_of_his_friend_n...
The first problem is converting the data to a common form.
The number of friends, as you know, varies from user to user. You can build histograms, but that is a dead end when the number of friends is small. You can work with medians, means, etc., but the accuracy is poor.
I would like an algorithm that determines a user's age from the ages of their friends. The question is how best to approach building such an algorithm and extracting rules from the data set.
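
For the conversion step, here is a minimal sketch of reading such rows into a common form. It assumes semicolon-separated integer ages with the user's own age first; the function name `read_rows` and the sample values are made up for illustration.

```python
import csv
import io

def read_rows(text, delimiter=";"):
    """Parse 'user_age;friend_age_1;friend_age_2;...' lines into (age, [friend ages])."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        values = [int(cell) for cell in record if cell.strip()]
        if len(values) < 2:
            continue  # skip blank lines and users with no friends listed
        rows.append((values[0], values[1:]))
    return rows

# Made-up sample in the format described in the question
sample = "16; 15; 16; 17; 16; 41\n34;30;35;33\n\n52;49"
rows = read_rows(sample)
for age, friends in rows:
    print(age, friends)
```

Each row keeps its own variable-length list of friend ages, so any later step (histogram, median, mode) can decide for itself how to reduce that list to a fixed number of features.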

6 answer(s)

myfirepukan, 2015-07-15
@myfirepukan

1. Determine the magnitude of the deviations.
Some people have a bunch of friends of all ages, while others have friends of about the same age.
2. Determine the number of deviations.
Say a user has 100 friends aged 15-17 plus 10 more aged 30 and over. Prediction: this is a schoolkid, and the 10 outliers are, for example, two parents plus teachers.
3. If the deviations are small and there are few of them, take the person's age to be the average age of their friends; otherwise dig further.
In general, though, one parameter (the friends' ages) is not enough for this; something else is needed.
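
The three steps above could be sketched roughly like this. The thresholds (`max_gap`, `max_spread`, `max_outlier_share`) are arbitrary assumptions, and using the median as the centre and averaging only the non-outlier friends is my own simplification, not something the answer prescribes.

```python
from statistics import mean, median

def estimate_age(friend_ages, max_gap=5, max_spread=3, max_outlier_share=0.2):
    """Return (estimate, confident). All thresholds are illustrative guesses."""
    centre = median(friend_ages)
    # Step 1: magnitude of the deviations (median absolute deviation from the centre)
    spread = median(abs(a - centre) for a in friend_ages)
    # Step 2: number of deviations (friends further than max_gap years from the centre)
    core = [a for a in friend_ages if abs(a - centre) <= max_gap]
    outlier_share = 1 - len(core) / len(friend_ages)
    # Step 3: trust the average only if both are low, otherwise "dig further"
    confident = spread <= max_spread and outlier_share <= max_outlier_share
    if not core:
        return centre, False
    return mean(core), confident

# The example from the answer: 100 friends aged 15-17 plus 10 aged 30 and over
schoolkid = [15] * 30 + [16] * 40 + [17] * 30 + [30, 31, 33, 35, 38, 40, 42, 45, 47, 50]
print(estimate_age(schoolkid))
print(estimate_age([18, 25, 40, 60, 33, 71]))
```

When the second value comes back `False`, the friends' ages alone are too scattered and other parameters would be needed.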

Roman Mirilaczvili, 2015-07-15
@2ord

The Habr article "Data Analysis of the Facebook World" has data on this:
That is, the older the user, the wider the spread of their friends' ages.
So each age corresponds to its own histogram of friends' ages.
In any of these histograms, the amplitudes are normalized relative to the peak amplitude.
In other words, each age has its own pattern (a curve of amplitude peaks). So you compute the histogram for the user in question and then find, by approximation, which of the available histograms it matches best.
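
A rough sketch of that histogram comparison, assuming labelled rows of the form `(user_age, [friend ages])`. The squared-difference distance is just one possible way to do the "approximation"; the function names and the toy data are invented for illustration.

```python
from collections import Counter, defaultdict

def peak_normalised(ages):
    """Histogram of ages with every bar divided by the tallest bar."""
    counts = Counter(ages)
    peak = max(counts.values())
    return {age: n / peak for age, n in counts.items()}

def build_patterns(rows):
    """Pool the friends of all users with the same age into one reference histogram."""
    pooled = defaultdict(list)
    for user_age, friend_ages in rows:
        pooled[user_age].extend(friend_ages)
    return {user_age: peak_normalised(ages) for user_age, ages in pooled.items()}

def distance(h1, h2):
    """Sum of squared differences between two histograms (an assumed metric)."""
    return sum((h1.get(k, 0.0) - h2.get(k, 0.0)) ** 2 for k in set(h1) | set(h2))

def predict_age(friend_ages, patterns):
    """Return the age whose reference histogram is closest to this user's."""
    query = peak_normalised(friend_ages)
    return min(patterns, key=lambda age: distance(query, patterns[age]))

# Toy labelled data, made up for the example
labelled = [(16, [15, 16, 16, 17, 40]), (16, [16, 16, 17, 15]),
            (30, [28, 30, 31, 30, 45, 25]), (30, [29, 30, 32])]
patterns = build_patterns(labelled)
print(predict_age([16, 17, 16, 15, 16], patterns))
print(predict_age([30, 31, 29, 30], patterns))
```

With few friends per user the per-age histograms get noisy, so in practice you would probably want to bin ages or smooth the histograms before comparing.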
Besides the age histograms, additional parameters could be:

a cloud of the user's interest categories: books, movies, disco
a cloud of categories of the user's groups: society of punks, private business, serving the motherland, etc.
a cloud of categories of events the user has attended: museums, stadiums, educational institutions, entertainment venues

⚡ Kotobotov ⚡, 2015-07-16
@angrySCV

What a mess this all is.
Don't average: the friends are a mix of different people of different ages, so averaging will give strange results that are GUARANTEED not to be the age you need to predict.
People close to the user's own age form the largest cluster.
So you just need to find the most frequently occurring age.
Presumably that will be the age you are looking for.
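
In Python that is only a few lines. The tie-break (taking the lower middle value when several ages are equally frequent) is my own assumption; the answer does not say what to do in that case.

```python
from statistics import median_low, multimode

def most_common_age(friend_ages):
    """Most frequent friend age; ties are broken by the lower middle of the tied ages."""
    modes = multimode(friend_ages)  # all ages that share the highest count
    return median_low(modes)

print(most_common_age([15, 16, 16, 17, 16, 40, 41]))
print(most_common_age([20, 20, 30, 30, 25]))
```

Note that with only a handful of friends every age may occur once, in which case the "mode" says very little.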

Alexey Nikolaev, 2015-07-15
@Heian

Two camels were flying - one red, the other to the left. What is their exact total speed if the hedgehog is 24 years old?

It seems to me the most sensible option is simply to take the average of all the ages. That is the best accuracy you can get, because the problem above is not much different from yours :)

Viktor Vsk, 2015-07-15
@viktorvsk

With the problem stated this way, simply taking the average really does seem to be the best option, because you can't get much accuracy out of a formula: there are too many factors.
If you have the resources, you can try a machine learning approach: label a sample (or take ready-made data from social networks), add extra parameters (gender, min/max age, age range, number of friends, etc., everything that is available) and try to train a network. Although, of course, there are no guarantees here either: the task, especially with such input data, is rather nontrivial.
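
To show the shape of the machine learning approach without any libraries, here is a small sketch that turns each friend list into summary features and predicts with a k-nearest-neighbours average. Note this swaps the neural network for k-NN to keep the example short; the feature set, `k=3`, and the labelled sample are all assumptions made up for illustration.

```python
from statistics import mean, median, pstdev

def features(friend_ages):
    """Fixed-length summary of a variable-length friend list (all values in years)."""
    return [mean(friend_ages), median(friend_ages),
            min(friend_ages), max(friend_ages), pstdev(friend_ages)]

def knn_predict(train, friend_ages, k=3):
    """Average the known ages of the k users whose features are closest."""
    query = features(friend_ages)

    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))

    nearest = sorted(train, key=dist)[:k]
    return mean([age for _, age in nearest])

# Made-up labelled sample: (user_age, [friend ages])
labelled = [(16, [15, 16, 17, 16]), (17, [16, 17, 18, 17]), (15, [14, 15, 16]),
            (34, [30, 35, 33, 36]), (36, [34, 38, 35, 37]), (33, [31, 33, 32])]
train = [(features(friends), age) for age, friends in labelled]

print(knn_predict(train, [15, 16, 16, 17]))
print(knn_predict(train, [33, 34, 36, 31]))
```

Features on other scales (number of friends, gender codes) would need rescaling before being added to the distance, and a real model would of course need far more data and a held-out set to measure the error.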

Viktor Vsk, 2015-07-15
@viktorvsk