Search engines
Alexander Ivanov, 2021-12-25 15:34:23

How to organize a fast search over 78 million rows?

There is a CSV file with about 78 million rows. How can I search these rows quickly?
In practice I'm only interested in one of the six columns. I could extract that column into a separate file if it makes searching faster.

I understand that, in theory, the data should be loaded into RAM and kept there so that lookups are as fast as possible.
I need to achieve at least 10-20 million lookups per second. What would you advise: what to store, how to search, and what hardware is best suited for this?
Preferred languages: Python or C#.


4 answers
Vasily Bannikov, 2021-12-25
@alexivanov77

It depends on the kind of search and the data.
Again, if there is a lot of data, it is unlikely that everything will fit into RAM.
If you search by exact match and the keys are unique, use a hash table.
If the keys are sortable, sort them and use binary search.
If you need full-text or fuzzy search, it's easier to use a third-party DBMS.
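The exact-match case above can be sketched in Python: load the one column of interest into a `set` (a hash table), after which each membership test is O(1) on average. The file name and column index below are hypothetical, purely for illustration.

```python
import csv


def build_index(path, column=0):
    """Read one CSV column into a set for O(1) average exact-match lookups."""
    keys = set()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            keys.add(row[column])
    return keys


# Usage (hypothetical file):
# index = build_index("data.csv", column=2)
# hit = "some value" in index   # constant-time on average
```

For 78 million short strings this fits in a few GB of RAM; if memory is tight, hashing the keys or using a sorted array with `bisect` trades speed for space.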

Dimonchik, 2021-12-25
@dimonchik2013

  • ClickHouse
  • Sphinx / Manticore Search
  • Reindexer
  • a competent C / C++ / Go developer

Roman Mirilaczvili, 2021-12-25
@2ord

To avoid over-engineering, it's enough to import the data into SQLite and add an index on the desired column. If needed, SQLite also offers full-text search.
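A minimal sketch of this route with Python's built-in `sqlite3` module (table and column names are made up for the example): bulk-insert the column, then create the index so lookups become B-tree searches instead of full scans.

```python
import csv
import sqlite3


def load_into_sqlite(csv_path, db_path, column=0):
    """Import one CSV column into SQLite and index it for fast lookups."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS lines (key TEXT)")
    with open(csv_path, newline="", encoding="utf-8") as f:
        con.executemany(
            "INSERT INTO lines (key) VALUES (?)",
            ((row[column],) for row in csv.reader(f)),
        )
    # Create the index after the bulk insert; building it once is cheaper
    # than maintaining it during 78 million inserts.
    con.execute("CREATE INDEX IF NOT EXISTS idx_lines_key ON lines (key)")
    con.commit()
    return con


# Lookup:
# con = load_into_sqlite("data.csv", "lines.db", column=2)
# row = con.execute("SELECT 1 FROM lines WHERE key = ?", ("value",)).fetchone()
```

For full-text search, the same data can instead go into an FTS5 virtual table (`CREATE VIRTUAL TABLE ... USING fts5(key)`).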

uvelichitel, 2021-12-26
@uvelichitel

  • Map the file into memory with a memory-mapped file (System.IO.MemoryMappedFiles), which is roughly 30x faster than plain reads from disk
  • Build and continuously update a search index search_key -> file_offset; the straightforward solution is an associative array (System.Collections.Generic.Dictionary)
