Software testing
tersuren, 2015-01-04 20:36:33

Evaluate the task I give job candidates. Am I too harsh?

When programmers come to me for an interview, I torture them with tasks from real life. And it occurred to me: am I too harsh? Here is one such task:
A shipping company from the States has a partner in England, a sort of local English DHL. The English company installs software on clients' computers where they can enter package data, and in return they receive a prepaid postage label. The end client (that is, the sender) fills in the recipient's address in the States, the weight and size of the parcel, and his own address in the program settings; he must also enter a description of the goods. The software sends an XML packet to the British server, they forward it to us in the States, and we return either the finished label or an error message. The problem is that some clients are too lazy to write a proper description and sometimes just type in something like "atsukkauktsuka tskutsukp u" off the top of their head. When the US company sends such data to US customs, it gets fined for obvious nonsense and for trying to shift control over data quality onto customs. The task: find a way to automatically detect such randomly filled product descriptions with high probability.
Details:
1) No, we can't monitor typing speed, since we have no control over the shipping program.
2) We don't care if a person wrote "green peas" on a package containing a table lamp. In that case customs fines the sender, since he lied. But for "kuatsuk tsuktsuk" we are fined, since the customs officer rightly points out that we should have known there was no "kuatsuk tsuktsuk" in the package and should have corrected the description before sending it to customs.
3) No perfect solution is required, only a probabilistic one. Obviously we could put a person in charge of reviewing every parcel, but that is expensive. I would like something like automatic sorting that passes about 85% of messages as probably correct (letting them through without moderation), rejects 10% as definitely wrong (returning an error message instead of a label), and sends only the remaining 5% of doubtful cases to manual moderation (a rough sketch of such routing is shown below).
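
A minimal sketch of that three-bucket routing, assuming some upstream classifier produces a probability-like score for each description; the thresholds below are placeholders that would be tuned until the buckets land near the desired 85/10/5 split:

```python
ACCEPT_THRESHOLD = 0.80   # above this: probably a real description, pass it
REJECT_THRESHOLD = 0.30   # below this: almost certainly gibberish, return an error

def route(score: float) -> str:
    """Route a description given a classifier score P(real description) in [0, 1]."""
    if score >= ACCEPT_THRESHOLD:
        return "accept"          # ideally ~85% of traffic lands here
    if score <= REJECT_THRESHOLD:
        return "reject"          # ~10%: return an error message instead of a label
    return "manual_review"       # the remaining ~5% of doubtful cases
```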
So here is my question: am I too cruel? Is this task too difficult? To be clear, I do not ask people to write code on the spot. I ask for an outline of how the problem could be solved and what the code would roughly do.
I'll answer the two most popular suggestions right away: 1) use dictionaries and spell checking, and 2) google the text to see if it recognizes the words. Both are bad, because 1) warehouse workers are often barely literate and use abbreviations, and 2) the mechanics of the hand make random character sets not truly random. For example, Google does not even consider "yvpayvp" a typo.


10 answer(s)
globuser, 2015-01-04
@globuzer

Now the candidate will read this question-post and come back for the interview again tomorrow))))

mamkaololosha, 2015-01-04
@mamkaololosha

Use a library that implements some kind of fuzzy search, or write your own. This is a rather complex task; you may well need your own hand-rolled database. It's easier to bring in some MIT department than to look for a middle developer who will knock together an Oracle for you on his knee. Levenshtein distance alone may not be enough.
habrahabr.ru/post/114997
algolist.manual.ru/search/fsearch
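
For reference, a minimal pure-Python Levenshtein distance of the kind the answer mentions as one (insufficient on its own) building block; in practice an existing library such as rapidfuzz would more likely be used:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, replace) turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

# levenshtein("lamp", "lamp") == 0; levenshtein("lamp", "lmap") == 2
```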

Sergey, 2015-01-04
@begemot_sun

I think not, the task seems elementary.
Take the distribution of letters/syllables in a reference text and compare it with the same distribution in the label text. If they coincide within a certain error margin, everything is OK and we let the label through.
Well, am I hired? :)
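
A rough sketch of that frequency-comparison idea: build a letter-frequency distribution from a corpus of known-good descriptions and measure how far a label's distribution deviates from it. The reference string here and any decision threshold are placeholders:

```python
from collections import Counter
import math

def letter_distribution(text: str) -> dict:
    """Relative frequency of each letter in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters) or 1
    return {c: n / total for c, n in Counter(letters).items()}

def distribution_distance(p: dict, q: dict) -> float:
    """Euclidean distance between two letter-frequency distributions."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

# In practice the reference is built from a large corpus of approved
# descriptions; this short string is only a stand-in for the sketch.
reference = letter_distribution("cotton t-shirt ceramic mug wooden table lamp spare parts")

def frequency_distance(description: str) -> float:
    return distribution_distance(letter_distribution(description), reference)

# Compare frequency_distance(...) against a threshold tuned on real data:
# the larger the distance, the more suspicious the description.
```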

SHVV, 2015-01-04
@SHVV

And is the option of asking Google for the number of result pages for that text considered cheating? After all, its base of words and phrases is very large.

Spetros, 2015-01-04
@Spetros

Whitelists and blacklists based on continuously updated dictionaries.
Whitelist: words and abbreviations that are already in your database, plus a regular dictionary.
Blacklist: clearly invalid letter combinations and stop words.
All this can be supplemented by some algorithm that filters by the composition of the word.
The probabilistic part is analyzing what share of the description hits each list (a rough sketch below).
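
A minimal sketch of that list-based check; the word lists, the regex patterns standing in for "clearly invalid letter combinations", and the scoring are all made-up placeholders:

```python
import re

# Placeholders: in practice the whitelist comes from your shipments database
# plus a regular dictionary, and both lists are continuously updated.
WHITELIST = {"lamp", "table", "shirt", "cotton", "spare", "parts", "pcs"}
BLACKLIST_PATTERNS = [
    r"(.)\1{3,}",                       # same character four or more times in a row
    r"\b[bcdfghjklmnpqrstvwxz]{6,}\b",  # a long run of consonants with no vowel
]

def list_score(description: str) -> float:
    """Share of tokens found in the whitelist; 0.0 if any blacklist pattern fires."""
    text = description.lower()
    if any(re.search(p, text) for p in BLACKLIST_PATTERNS):
        return 0.0
    tokens = re.findall(r"[a-z]+", text)
    if not tokens:
        return 0.0
    return sum(t in WHITELIST for t in tokens) / len(tokens)
```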

Sergey Lerg, 2015-01-04
@Lerg

It is obvious that an integrated approach is needed - the use of several methods and a combination of their results.
At the first stage, a simple dictionary check: do such words exist at all; then you can check whether there is a noun and an adjective, and look at morphology, syntax and punctuation in general.
There should be two dictionaries: a common one, and a base of all shipments that have gone through the service (the abbreviations and jargon actually in use will end up in it).
It would also be logical to negotiate with another large transport company and purchase a database of shipment descriptions from them.
The second stage is statistical: the distribution of letters on the keyboard, the length and number of words, how much the tokens resemble real words (prefixes, syllables, endings). Here you can also calculate the amount of information (entropy) in the description and compare it with the average value.
And the third stage is the help of the Internet: search for the description on Yandex Market, Amazon, eBay and so on.
Each stage assigns its own score, and the scores are then combined with weighting coefficients into a resulting score (a rough sketch of the combination is below).
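
A rough sketch of the weighted combination, including the character-entropy check mentioned above; the weights, the reference entropy value, and the stub stage functions are assumptions, not anything the answer specifies:

```python
from collections import Counter
import math

# Stubs for the other stages; real versions would do the dictionary/morphology
# checks and the marketplace lookups described in the answer.
PLACEHOLDER_DICTIONARY = {"table", "lamp", "cotton", "shirt", "spare", "parts"}

def dictionary_score(text: str) -> float:
    tokens = [t for t in text.lower().split() if t.isalpha()]
    if not tokens:
        return 0.0
    return sum(t in PLACEHOLDER_DICTIONARY for t in tokens) / len(tokens)

def web_score(text: str) -> float:
    return 0.5   # stub: a real version would query marketplaces or a search engine

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per letter) of the description's letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    total = len(letters)
    return -sum(n / total * math.log2(n / total) for n in Counter(letters).values())

def entropy_score(text: str, reference: float = 4.1) -> float:
    """Closer to 1.0 when near a typical English letter entropy; short strings
    score lower, so a real system would use a length-dependent baseline."""
    return max(0.0, 1.0 - abs(char_entropy(text) - reference) / reference)

def combined_score(text: str) -> float:
    """Weighted blend of the per-stage scores; the weights are arbitrary placeholders."""
    stages = [(0.4, dictionary_score), (0.4, entropy_score), (0.2, web_score)]
    return sum(weight * stage(text) for weight, stage in stages)
```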

Alexey, 2015-01-05
@ZeLib0ba

Why not have the sender pick the product description from a list of general categories: household appliances, electronic components, clothing, etc.?

tersuren, 2015-01-06
@tersuren

The solution actually has two levels.
The first is checking how (non-)random the entered character string is; here we look only at the string of characters. There are two main options: either we take a huge existing text and use it as a donor of good descriptions for a Bayesian classifier, or, which is essentially the same thing with a different mathematical apparatus, we use Markov chains. Roughly speaking, in both cases we exploit the fact that in English, after the letter E, say, the letters R and U follow with different probabilities, and it is exactly these probabilities that characterize the language. Short strings (when the description consists of a single word of about 3-4 letters) are handled with a dictionary, since statistical methods do not work there, but there a straightforward dictionary lookup does the job. If a person makes a typo in the word "car", there is no way to figure out what it was anyway.
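
A minimal sketch of the character-level Markov-chain variant described here: learn letter-to-letter transition probabilities from a corpus of known-good descriptions and flag strings whose average log-probability per transition is unusually low. The tiny training string and any decision threshold are placeholders:

```python
from collections import defaultdict, Counter
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def train_bigram_model(corpus: str) -> dict:
    """P(next letter | current letter), with add-one smoothing over a-z and space."""
    text = "".join(c for c in corpus.lower() if c in ALPHABET)
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values()) + len(ALPHABET)
        model[a] = {b: (counts[a][b] + 1) / total for b in ALPHABET}
    return model

def avg_log_prob(model: dict, description: str) -> float:
    """Average log-probability per letter transition; lower means less English-like."""
    text = "".join(c for c in description.lower() if c in ALPHABET)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    return sum(math.log(model[a][b]) for a, b in pairs) / len(pairs)

# The corpus would normally be a large body of real, approved descriptions.
model = train_bigram_model("cotton t shirt ceramic mug table lamp spare parts for a car")
print(avg_log_prob(model, "wooden table lamp"))   # real text scores higher (closer to zero)
print(avg_log_prob(model, "kuatsuk tsuktsuk"))    # keyboard mash tends to score lower
```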
The second level rests on the fact that although the data arrive in random order, their nature is not random to begin with. The shipping company's customers follow the same normal distribution as everyone else, or the Pareto rule, if that terminology is more familiar. A truly one-off client does not know which fields matter and which do not, so he is usually quite meticulous. Besides, typing in garbage makes little sense for him, since it would only save him a couple of seconds. The main source of garbage is a lazy employee of some online store's warehouse who sends parcels all day long, or someone similar. Firstly, he is not shipping his own goods, and secondly, for him those couple of seconds per parcel add up to a significant gain. We always have the sender's address, because this is a parcel, so we always know who ships a lot, who cheats often, and who works honestly. That helps us sort out the doubtful cases where our Bayes/Markov/frequency distribution gives us 50/50 (a rough sketch of such a tie-breaker is below).
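
A rough sketch of that tie-breaker: blend the text score with a per-sender prior built from the sender's history, giving the prior more weight when the text score is inconclusive. The storage format and blending weights are assumptions:

```python
# history maps sender address -> (accepted_count, rejected_count); in practice
# this would live in a database keyed by the sender address on the parcel.
def sender_prior(history: dict, sender: str) -> float:
    """Laplace-smoothed share of good parcels previously sent from this address."""
    accepted, rejected = history.get(sender, (0, 0))
    return (accepted + 1) / (accepted + rejected + 2)

def final_score(text_score: float, history: dict, sender: str) -> float:
    """Blend the text-based score with the sender prior near the 50/50 region."""
    uncertainty = 1.0 - abs(text_score - 0.5) * 2   # 1.0 at 0.5, 0.0 at 0 or 1
    weight = 0.5 * uncertainty                      # prior matters only when unsure
    return (1 - weight) * text_score + weight * sender_prior(history, sender)
```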

Olga Morozova, 2015-05-10
@Helga_moroz

Is this a task for a tester?

Xee, 2015-07-06
@Xee

An option: analyze by bigrams. If the share of letter combinations that are rare in the language exceeds a certain value, an error is very likely.
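
A minimal sketch of that rare-bigram check, assuming a precomputed set of letter pairs considered rare in English; the set below and the cutoff are illustrative placeholders, not real corpus statistics:

```python
# Placeholder: a handful of letter pairs that are very rare in English;
# in practice this set would be derived from corpus bigram statistics.
RARE_BIGRAMS = {"kq", "qz", "zx", "vq", "jq", "qk", "wx", "fq", "pq",
                "xj", "zq", "qg", "qy", "vj", "qv", "hx", "jz", "jx"}

def rare_bigram_ratio(text: str) -> float:
    """Share of adjacent letter pairs in the text that fall into the rare set."""
    letters = "".join(c for c in text.lower() if c.isalpha())
    bigrams = [letters[i:i + 2] for i in range(len(letters) - 1)]
    if not bigrams:
        return 1.0
    return sum(b in RARE_BIGRAMS for b in bigrams) / len(bigrams)

def looks_suspicious(text: str, cutoff: float = 0.1) -> bool:
    # more than ~10% rare pairs: very likely not a real description
    return rare_bigram_ratio(text) > cutoff
```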
