Where can I find out how many words start with each letter in a given language?

Mikhail Tchervonnko, 2013-02-08 16:31:16

Google

In other words, I need to distribute the words across tables depending on their first letter (to reduce table size), and I need to decide which letters get a separate table and which can be merged into a shared one (because few words start with them).

Thank you.

12 answer(s)

EvilX, 2013-02-09
@EvilX

Alternatively:
cat big_dict_en.txt | while read -r s; do for a in $s; do echo "$a"; done; done | sort -u | grep -c "^a"
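
The same count can be sketched in Python for all letters at once. This is only a minimal sketch: the function name `first_letter_counts` and the sample words are made up for illustration, and folding case before counting is an assumption the shell command above does not make.

```python
from collections import Counter


def first_letter_counts(words):
    """Count distinct words per first letter (case-insensitive)."""
    # Deduplicate first, so repeated words are counted once.
    distinct = {w.lower() for w in words if w}
    return Counter(w[0] for w in distinct)


# Hypothetical sample; in practice the words would be read from a file,
# e.g. open("big_dict_en.txt", encoding="utf-8").read().split()
sample = ["apple", "Apple", "avocado", "banana", "cherry", "cat"]
counts = first_letter_counts(sample)
print(counts["a"], counts["b"], counts["c"])
```

Only the ratio between the counts matters for deciding which letters can share a table.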

Mikhail Tchervonnko, 2013-02-08
@RusMikle

First: a very large volume of texts. Second: in the database, the search is organized roughly like Google's: each word has its own offset within the article, and the search then goes by words and finds the articles where the distance between the words is minimal, and so on (there are many criteria, not only morphological ones). With a large volume, these tables are supposed to be stored on different servers, and the search will run in parallel. In general, a lot of things (almost my own Google, lol).
As for the question: you can, of course, just look at how many pages each letter takes up in a dictionary, but I don't have all the necessary dictionaries.

Mikhail Tchervonnko, 2013-02-08
@RusMikle

Also, I would like to find a database of synonyms for each language somewhere, to tie it in with all this.
Unfortunately, I can't hand everything over to Google: the information to be searched is not public.
And the search engines I have seen are not optimized for distributing the load and the database across multiple servers.
(If someone can point me to one that is, I won't have to reinvent the wheel.)

Mikhail Tchervonnko, 2013-02-08
@RusMikle

That won't work: there are also requirements for integration into existing systems, and it is easier to write my own than to adapt Sphinx. I have looked at it too.

Mikhail Tchervonnko, 2013-02-08
@RusMikle

I bought a book on it; I'll read it again this evening. I might really give it a try.

Mikhail Tchervonnko, 2013-02-09
@RusMikle

Thank you. I just got back from the library, where I simply surrounded myself with dictionaries and counted the number of pages each letter takes up.

rin, 2013-02-13
@rin

1. Dictionaries usually do not contain every word, and they omit inflected word forms, so the result may not be exactly what you need.
2. You don't need the exact counts; the ratio is enough. To get it, take a fairly small set of documents for each language (for example, from Wikipedia) and calculate the distribution over those documents.
3. In fact, you can avoid all this if you don't split by first letter at all, but compute a hash of the word and take the remainder of dividing it by the desired number of tables.
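
Point 3 can be sketched roughly as follows. This is a minimal illustration, not a recommendation of a particular hash: the function name `table_for_word`, the table naming scheme `words_NN`, the choice of CRC32, and the table count of 8 are all assumptions made for the example.

```python
import zlib


def table_for_word(word, num_tables):
    """Map a word to a table index in [0, num_tables) using a stable hash."""
    # zlib.crc32 gives the same value on every run and every machine,
    # unlike Python's built-in hash() for strings, which is randomized per process.
    return zlib.crc32(word.lower().encode("utf-8")) % num_tables


# Hypothetical words and table count, for illustration only.
words = ["apple", "avocado", "banana", "cherry", "zebra"]
for w in words:
    print(w, "->", "words_%02d" % table_for_word(w, 8))
```

With a reasonably uniform hash the tables come out close to equal in size regardless of language, so no per-letter statistics are needed. The trade-off is that words starting with the same letter no longer live in the same table.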

iamnothing, 2014-07-10
@iamnothing

If this query returns rows:

SELECT op.parcel_cn, op.id
FROM objects_process AS op 
WHERE op.status IS NULL

And this one returns rows as well:

SELECT op.parcel_cn, op.id, o.area_value 
FROM objects_process AS op 
JOIN objects_copy AS o 
  ON op.parcel_cn = o.parcel_cn

This means that the result set is empty only when the two conditions are combined:
ON op.parcel_cn = o.parcel_cn
WHERE op.status IS NULL
i.e. there are no records that satisfy both conditions at once.

Melkij, 2014-07-10
@melkij

Strange as it may sound, perhaps there really are no such rows?

tsarevfs, 2014-07-10
@tsarevfs

Create a couple of tables with 2-3 rows each for which this should definitely work. If it doesn't work, cut pieces off the query until it becomes clear what exactly is wrong.
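
Such a minimal test could look roughly like this. The sketch uses Python's built-in sqlite3 module with an in-memory database rather than PostgreSQL, and all sample rows and column types are invented; only the table and column names are taken from the queries in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects_process (id INTEGER, parcel_cn TEXT, status TEXT);
    CREATE TABLE objects_copy (parcel_cn TEXT, area_value REAL);
    -- Invented rows: parcel 'A' has status NULL and a match in objects_copy,
    -- parcel 'B' has status NULL but no match,
    -- parcel 'C' has a match but a non-NULL status.
    INSERT INTO objects_process VALUES (1, 'A', NULL), (2, 'B', NULL), (3, 'C', 'done');
    INSERT INTO objects_copy VALUES ('A', 10.5), ('C', 20.0);
""")

rows = conn.execute("""
    SELECT op.parcel_cn, op.id, o.area_value
    FROM objects_process AS op
    JOIN objects_copy AS o ON op.parcel_cn = o.parcel_cn
    WHERE op.status IS NULL
""").fetchall()
print(rows)  # only parcel 'A' satisfies both conditions
conn.close()
```

If the combined query returns the expected row here but nothing on the real data, the query itself is fine and the real tables simply contain no rows that meet both conditions.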

Vergileey, 2014-07-10
@Vergileey

I have no experience with PostgreSQL, but from the point of view of plain SQL you can write it like this:

SELECT
  op.parcel_cn,
  op.id,
  o.area_value
FROM
  objects_process op,
  objects_copy o
WHERE
  op.parcel_cn = o.parcel_cn
  AND op.status IS NULL