Is it possible to use Apache Lucene to determine if a certain string from a set is included in the text?

Java

Sevak Avetisyan, 2015-06-28 23:43:11

Hello!
Here is the task: I have a list of words (for example, ["mother", "house", "family"]) and a set of texts ("I live with my mother", "It was cold in the house", etc.). I need to determine whether any word from the list occurs in each text, regardless of word form. In the Russian original, the first text contains an inflected form ("мамы"); reduced to its base form ("мама", "mother"), it matches an entry in the list, and the same goes for the second sentence. Can Apache Lucene help me with this? Or is there some other Java library that can handle the task?

1 answer(s)

Vladimir Smirnov, 2015-06-29
@Sevak_Avet

Here is a working example of a fuzzy match:

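import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FuzzyMatchExample {
    public static void main(String[] args) throws Exception {
        String fieldName = "myField";

        // Build a test index (in memory; a "real" system would use FSDirectory.open(dir) here)
        Directory directory = new RAMDirectory();
        RussianAnalyzer analyzer = new RussianAnalyzer(Version.LUCENE_46);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        IndexWriter writer = new IndexWriter(directory, config);
        writer.addDocument(createDocument(fieldName, "Я живу у мамы"));
        writer.addDocument(createDocument(fieldName, "В доме было холодно"));
        writer.commit();
        writer.close();

        // Search
        int startFrom = 0;
        int pageSize = 20;
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(ireader);
        // FuzzyQuery searches for inexact (approximate) term matches
        FuzzyQuery fuzzyQuery = new FuzzyQuery(new Term(fieldName, "мама"));
        TopDocs topDocs = indexSearcher.search(fuzzyQuery, startFrom + pageSize);
        ScoreDoc[] hits = topDocs.scoreDocs;
        // hits holds at most startFrom + pageSize entries, so its length bounds the page
        for (int i = startFrom; i < hits.length; i++) {
            Document hitDoc = indexSearcher.doc(hits[i].doc);
            if (hitDoc != null) {
                System.out.println(hitDoc.get(fieldName));
            }
        }
        ireader.close();
    }

    // Not shown in the original post; a minimal version, assuming an analyzed, stored text field
    private static Document createDocument(String fieldName, String text) {
        Document doc = new Document();
        doc.add(new TextField(fieldName, text, Field.Store.YES));
        return doc;
    }
}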

Parts of this code are taken from a real production system, so if something looks strange, don't overthink it, just use it as is. Once everything works "like clockwork", come back to the strange parts and decide whether they are worth redoing.
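
For a small word list, the same idea can be tried without an index. Lucene's FuzzyQuery matches terms that lie within a small edit distance of the query term. The sketch below applies plain Levenshtein distance to each token of a text. It is only an illustration of the principle, not Lucene's implementation: the class `FuzzyWordMatch`, its methods and the `maxEdits` threshold are names made up for this example, and there is no stemming, so short words such as "дом" can produce false positives.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Lucene.
final class FuzzyWordMatch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the first list word that is within maxEdits of some token
    // of the text, or null if no token is close enough.
    static String findMatch(String text, List<String> words, int maxEdits) {
        // Split on every run of non-letter characters (Unicode-aware).
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            for (String word : words) {
                if (editDistance(token, word.toLowerCase(Locale.ROOT)) <= maxEdits) {
                    return word;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("мама", "дом", "семья");
        System.out.println(findMatch("Я живу у мамы", words, 1));        // "мамы" is 1 edit from "мама"
        System.out.println(findMatch("В доме было холодно", words, 1));  // "доме" is 1 edit from "дом"
        System.out.println(findMatch("На улице идет дождь", words, 1));  // no token is close enough
    }
}
```

This compares every token with every list word, which is fine for a handful of words and short texts. For large collections the index-based search above, or an analyzer that stems both the texts and the list, is likely the better fit.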