C++ / C#
Soft_touch_plastic, 2022-02-27 11:59:36

What is the best way to optimize such actions with arrays?

I'm still new to C++. I'm solving an olympiad problem. I solved it first in Python; everything works, but the speed is depressing: by my estimate it would take about two or three hours to finish.
In short, the essence of the problem: there is a file with 100,000 misspelled words, one per line, and a dictionary of 50,000 lines that contains all of these words spelled correctly. The output must be a file in which, for each misspelled word, its analogue from the dictionary is selected (the correct one, of course).

The algorithm is optimized in terms of logic, but when I rewrote it in C++ I ran into a depressing loss of speed.
For example, data preparation alone took about a second in C++, while in Python the entire algorithm on 10 lines of input ran in 800 ms.
I tried filling arrays of different types, but I could not get under 1 second just for filling an array with 100,000 lines.

The plan is that the data will be stored in dictionary-like structures where the key is a number, the length of the strings, and the value is an array of strings of that length. In theory it will be a map<int, vector<string>>, but won't that kill C++'s speed entirely?
How are such problems usually solved? Is Python really going to be faster here?
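
Roughly what I have in mind, as a sketch (grouping dictionary words by length; the function name is a placeholder):

#include <map>
#include <string>
#include <vector>

// Key: word length; value: all dictionary words of that length.
std::map<int, std::vector<std::string>> group_by_length(
        const std::vector<std::string>& words) {
    std::map<int, std::vector<std::string>> by_len;
    for (const std::string& w : words)
        by_len[static_cast<int>(w.size())].push_back(w);
    return by_len;
}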


3 answers
Wataru, 2022-02-27
@Soft_touch_plastic

First, there may be an input problem: cin and cout are slow on large amounts of data. Either read through scanf, or disable the synchronization with stdio.
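
A minimal sketch of the desynchronization trick (standard library only; the word-by-word reading loop is just an illustration):

#include <iostream>
#include <string>

int main() {
    // Stop iostreams from synchronizing with C stdio on every operation.
    std::ios_base::sync_with_stdio(false);
    // Don't flush cout before every cin read.
    std::cin.tie(nullptr);

    std::string word;
    while (std::cin >> word) {
        // ... process the word ...
    }
    return 0;
}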
Secondly, the data structure you described will not kill performance, but it is not clear to me how it helps with the task.
You will definitely need a map from strings to something else. In Python you probably used dictionaries (you indexed with strings as keys), and that is what it will be in C++ as well. Experiment: std::unordered_map can be faster than std::map. A particularly fast solution can be built on the trie data structure (a prefix tree), though you will have to write it yourself.
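
A minimal trie sketch for lowercase Latin words (an illustration only; the node layout and the 26-letter alphabet are assumptions):

#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next{};  // assumes 'a'..'z'
    bool is_word = false;
};

struct Trie {
    TrieNode root;

    void insert(const std::string& w) {
        TrieNode* cur = &root;
        for (char c : w) {
            auto& slot = cur->next[c - 'a'];
            if (!slot) slot = std::make_unique<TrieNode>();
            cur = slot.get();
        }
        cur->is_word = true;
    }

    bool contains(const std::string& w) const {
        const TrieNode* cur = &root;
        for (char c : w) {
            cur = cur->next[c - 'a'].get();
            if (!cur) return false;
        }
        return cur->is_word;
    }
};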

rPman, 2022-02-27
@rPman

What exactly counts as a wrong word, and how is the right one determined?
What makes one word "more wrong" than another: a missing letter? a transposition? a substitution? Does it matter at which position of the word the error occurred, in the first character or elsewhere?
For example, a list of words without errors:

sea
march

And here are our words:
mar - is it "sea" or "march"?
mare - is it "march" or "sea"?
so - "this"?

That is, first you need to define a function for comparing a word from the analyzed file with the words from the list of correct ones.
I would take a ready-made Levenshtein distance function (with different weights for the different kinds of edits) and, for example, take the first word from the list with the minimum error score.
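
A sketch of a weighted Levenshtein distance (the edit weights are placeholders to tune for whatever counts as "more wrong" in the problem):

#include <algorithm>
#include <string>
#include <vector>

// Illustrative weights; not part of the problem statement.
constexpr int kInsert = 1, kDelete = 1, kReplace = 1;

int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<int>(j) * kInsert;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i) * kDelete;
        for (size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : kReplace);
            cur[j] = std::min({sub, prev[j] + kDelete, cur[j - 1] + kInsert});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}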
From there, the algorithm:
* If you solve it head-on, there will not be enough resources: for each word from the input you compute the error score against the correct words one by one, stopping as soon as a score of 0 is met. The cost is the number of input words times the number of dictionary words, with each comparison itself roughly quadratic in the average word length, i.e. long.
* As a preliminary step, you can collect the analyzed data into a map of words to eliminate duplicates.
* You can optimize this slightly if only some words in the source file contain errors: before running the comparison, look the word up in a dictionary map built in advance, i.e. use the fastest possible search for the words that are already correct, excluding them from the slow comparison.
* Next, enumerate all single-character changes of the correct words (i.e. for each correct word, put every changed variant and a link to the correct word into a map; see the sketch after this list). That gives a map of all erroneous words at edit distance 1, so every word with one error can be detected at map-lookup speed. The number of variants per position is comparable to the alphabet size (times roughly 3 for the kinds of edits), and the problem is about words, i.e. few characters, so each word contributes about 3*n entries to the map.
* In the same way you can build a map of all erroneous words at 2 changes (by applying one more change to each entry from the 1-change list),
* then 3 changes, 4 changes, and so on.
Obviously, storing that much data in memory is very expensive (you can store in the map not the values themselves but only hashes, resolving collisions through the hash), and pre-filling such maps takes a long time, so it only makes sense for a small depth. That is fine in practice: most erroneous words contain a small number of errors, and words with many errors are useless anyway; in a real error-correction problem nobody cares about the case where every letter of a word is wrong, usually it is about 2-3 errors.
* A further optimization: turn the algorithm into a breadth-first search over the graph of all possible changes of the correct words (it is a graph, not a tree, since after a finite number of changes correct words turn into each other or into erroneous words derived from other correct words). We run the search and at each step compare the generated erroneous string against the words from the analyzed list, where the lookup is fast through a map.
This approach makes sense if there are a lot of analyzed words (and all of them contain errors) and the overhead of walking all error combinations is not large; memory will also be needed to maintain the breadth-first search itself.
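
A sketch of the 1-change precomputation described above (the lowercase Latin alphabet and the container choice are assumptions):

#include <string>
#include <unordered_map>
#include <vector>

// Maps every string reachable from a correct word by one edit back to that
// correct word. On ties the first dictionary word wins.
std::unordered_map<std::string, std::string> build_one_edit_index(
        const std::vector<std::string>& dictionary) {
    const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::unordered_map<std::string, std::string> index;
    for (const std::string& w : dictionary) {
        for (size_t i = 0; i < w.size(); ++i) {
            std::string del = w;                  // deletion at position i
            del.erase(i, 1);
            index.emplace(del, w);
            for (char c : alphabet) {             // substitution at position i
                std::string sub = w;
                sub[i] = c;
                index.emplace(sub, w);
            }
        }
        for (size_t i = 0; i <= w.size(); ++i)    // insertion before position i
            for (char c : alphabet) {
                std::string ins = w;
                ins.insert(ins.begin() + i, c);
                index.emplace(ins, w);
            }
    }
    return index;
}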

res2001, 2022-02-28
@res2001

Read files in large blocks (up to the whole file at once). For large blocks you can use std::vector<char> with a pre-set size (std::vector::reserve()). Then split the read block into lines manually, for example by replacing '\n' with 0, and put pointers to the found lines straight into the data structure used by the algorithm. Don't use std::string: it reallocates memory at every sneeze, which leads to allocating the same amount of memory again, cut into small pieces, plus extra copying of the lines. Use std::string_view (available since C++17), or raw C strings as the fastest option.
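
A sketch of that reading scheme (the binary-mode fopen and '\n'-only line endings are assumptions):

#include <cstdio>
#include <string_view>
#include <vector>

// Read the whole file into one buffer, then carve it into string_views.
// The buffer must outlive the views, so the caller owns it.
std::vector<std::string_view> read_lines(const char* path,
                                         std::vector<char>& buf) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    buf.resize(static_cast<size_t>(size));
    buf.resize(std::fread(buf.data(), 1, buf.size(), f));  // keep bytes actually read
    std::fclose(f);

    std::vector<std::string_view> lines;
    size_t start = 0;
    for (size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] == '\n') {
            buf[i] = '\0';  // terminate in place, as suggested above
            lines.emplace_back(buf.data() + start, i - start);
            start = i + 1;
        }
    }
    if (start < buf.size())
        lines.emplace_back(buf.data() + start, buf.size() - start);
    return lines;
}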
In general, all C++ array types (vector, array, "raw" dynamic and static arrays) are equally fast when it comes to accessing an element by index. But in a vector many other operations can trigger memory reallocation and copying of the whole array. With raw dynamic arrays you cannot simply change the size; you must do it explicitly via realloc, so you explicitly control the operation, whereas in vectors (and strings) it happens implicitly. Developers often pay no attention to this, even though calls to the memory manager are quite expensive in terms of performance.
