PHP
RomanovAS, 2012-11-23 07:04:03

Full-text search

Hello dear experts!

I need to implement search over a document database of 100 to 150 gigabytes. The documents come in the following formats: plain text, HTML, PDF, OpenOffice, OpenDocument, Microsoft Word/Excel, and RTF.

The search will be used by about 300 people permanently connected to the database, located in different cities across Russia.

The texts are in Russian.

What do you advise?

1. What systems exist to implement such a search?
2. Is it possible to index such a volume of information?
3. How long will it take to search such a database?
4. What server capacity should be allocated for such a task?
5. Where is the best place to store indexes?
6. Is it possible to access the search engine through PHP?
7. How long might implementing this take?

Thank you in advance!

5 answers
Roman Makarov, 2012-11-23
@vollossy

Have you looked at Sphinx or, for example, Zend Lucene? That's the first thing that came to mind, though, to be honest, I'm not sure how well either suits this particular task.
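As a rough illustration of the Zend Lucene route, here is a minimal sketch assuming Zend Framework 1's Zend_Search_Lucene component is on the include_path; the index path, file paths and field names are placeholders, and extracting plain text from PDF/DOC files is a separate step not shown here:

<?php
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Utf8Num/CaseInsensitive.php';

// Use a UTF-8, case-insensitive analyzer so Russian text is tokenized properly.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

// Create a new index directory on disk (Zend_Search_Lucene::open() reuses an existing one).
$index = Zend_Search_Lucene::create('/var/data/search-index');

// $plainText stands in for text already extracted from a document.
$plainText = file_get_contents('/var/data/docs/contract.txt');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'contract.txt', 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $plainText, 'UTF-8'));
$index->addDocument($doc);
$index->commit();

// Query the index and print scores with titles.
foreach ($index->find('договор поставки') as $hit) {
    echo $hit->score, ' ', $hit->title, PHP_EOL;
}

Keep in mind that Zend_Search_Lucene is a pure-PHP index, so on a 100-150 GB corpus indexing and merging speed would need careful testing before committing to it.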

Renat Ibragimov, 2012-11-23
@MpaK999

Solr indexes documents very well (lucene.apache.org/solr/); you will just have to tinker a bit with Russian-language support.
I can't vouch for a PHP adapter (though there should be one), but there is a REST API, so everything is easy.
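To make the "everything is easy" part concrete, here is a minimal sketch of querying Solr over HTTP from plain PHP; the host, port, core layout, field name "text" and query values are assumptions, not something specified in the answer above:

<?php
// Build a Solr select URL; wt=json asks Solr to return JSON instead of XML.
$url = 'http://localhost:8983/solr/select?' . http_build_query(array(
    'q'    => 'text:договор',
    'rows' => 20,
    'wt'   => 'json',
));

// Run the query and decode the response.
$response = json_decode(file_get_contents($url), true);

// Print the ids of the matching documents.
foreach ($response['response']['docs'] as $doc) {
    echo $doc['id'], PHP_EOL;
}

For the indexing side, Solr's extracting request handler (Solr Cell, built on Apache Tika) accepts binary files posted to /update/extract and pulls the text out of PDF, Word and similar formats, which is what makes it suitable for a corpus like this.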

sajgak, 2012-11-23
@sajgak

Instead of Solr, I'd advise using ElasticSearch. Both are built on Lucene, but elastic is much better at quickly adding and updating documents in the index, and it offers sharding out of the box. I have worked with both systems, and even subjectively elastic has a more user-friendly query language.
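For comparison with the Solr sketch above, here is a minimal example of the same kind of full-text query against ElasticSearch's REST API from PHP; the index name "docs" and the fields "content" and "title" are assumptions:

<?php
// Full-text "match" query against an assumed index called "docs".
$body = json_encode(array(
    'query' => array('match' => array('content' => 'договор поставки')),
    'size'  => 20,
));

$ch = curl_init('http://localhost:9200/docs/_search');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $body,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
));
$result = json_decode(curl_exec($ch), true);
curl_close($ch);

// Each hit carries its relevance score and the stored document source.
foreach ($result['hits']['hits'] as $hit) {
    echo $hit['_score'], ' ', $hit['_source']['title'], PHP_EOL;
}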

dali, 2012-11-23
@dali

If we're talking about Sphinx, it needs a data source: either a database (so it can pull the data with SQL) or XML (xmlpipe). So before setting Sphinx up, you would either have to load all your documents into a database or convert them to XML. Several approaches are possible here, depending on what you need. Say you don't need full-text search over the whole documents: then you can assign keywords to each document, write the keywords and the document name into the XML, search by keywords, and return the document. Or you can extract the full text from the documents, put it into the database (having designed a schema beforehand), and run a full-text search over that.
On volumes: 100-150 GB is certainly a lot, although when you extract plain text from the documents the volume may shrink, but that's not guaranteed. Keep in mind that Sphinx builds indexes, and those will take at least three times more space. So you'll need at least a terabyte of disk, and as fast a disk as possible.
On hardware: I have a search over a 1 GB table that runs quietly (and very fast) on 512 MB of RAM and a single core on hosting under Debian 5.5. The indexer's memory consumption can be capped, in which case it will index more slowly. And you can configure it to index only what you actually need.
There are many ways to implement this; it all depends on what you need to get as output, and how and for what you will be searching.
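To make the xmlpipe variant concrete, here is a minimal sketch of a PHP script that feeds Sphinx's indexer in the xmlpipe2 format; the document list and the extract_text() helper are placeholders (real PDF/DOC extraction would need an external converter such as pdftotext):

<?php
// Placeholder extraction: real PDF/DOC files would need an external tool.
function extract_text($path) {
    return file_get_contents($path);
}

$documents = array(
    1 => '/storage/docs/contract.txt',
    2 => '/storage/docs/report.txt',
);

// xmlpipe2 header and schema: which fields Sphinx should index.
echo '<?xml version="1.0" encoding="utf-8"?>', "\n";
echo '<sphinx:docset>', "\n";
echo '<sphinx:schema><sphinx:field name="title"/><sphinx:field name="content"/></sphinx:schema>', "\n";

foreach ($documents as $id => $path) {
    $title   = htmlspecialchars(basename($path), ENT_QUOTES, 'UTF-8');
    $content = htmlspecialchars(extract_text($path), ENT_QUOTES, 'UTF-8');
    echo '<sphinx:document id="', $id, '">',
         '<title>', $title, '</title>',
         '<content>', $content, '</content>',
         '</sphinx:document>', "\n";
}

echo '</sphinx:docset>', "\n";

In sphinx.conf the source section would then point at this script with type = xmlpipe2 and xmlpipe_command = php /path/to/xmlpipe.php, so the indexer pulls documents straight from the script's output instead of a database.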

cat_crash, 2012-11-23
@cat_crash

Perhaps the panacea: company.yandex.ru/technologies/server/ (Yandex Server).
Sphinx cannot index files on its own.
