Algorithms for parsing strings, tokenization?

OneManStartup, 2012-05-15 18:02:08

ruby

There is a Ruby project called Picky. It is a search engine that takes a single search query as input but can split it into semantic parts, written in the form firstname:vova lastname:gagarin.
Of course, there can be many categories, but I want the search query to be analyzed for semantic blocks without explicit instructions from the user, for example by matching keywords against a dictionary.
Then, if there is any doubt, the system would respond with something like “did you mean the last name or the street?”
I searched for a long time, but most tokenizers work on single words, i.e. they do not split the line into several semantic parts.
It seems that in Solr this can be done through filters. But this whole topic is new to me, so I'm hoping for hints on where to dig.
(If there are any Ruby libraries that help, that would be great.)
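
To make the dictionary idea concrete, here is a minimal sketch in plain Ruby. It does not use Picky's API; the dictionaries, the category names other than firstname and lastname, and the helper names (`guess_categories`, `clarifying_questions`, `to_qualified_query`) are all hypothetical and exist only for illustration.

```ruby
# Hypothetical dictionaries: category => known lowercase values.
# In a real system these would be loaded from your own data.
DICTIONARIES = {
  firstname: %w[vova yuri anna],
  lastname:  %w[gagarin lenin],
  street:    %w[lenin arbat]
}.freeze

# Splits a free-form query on whitespace and returns
# [token, matching_categories] pairs.
def guess_categories(query, dictionaries = DICTIONARIES)
  query.downcase.split.map do |token|
    matches = dictionaries.select { |_, words| words.include?(token) }.keys
    [token, matches]
  end
end

# Builds a "did you mean ...?" question for every token that
# matched more than one category.
def clarifying_questions(tagged)
  tagged.select { |_, cats| cats.size > 1 }
        .map { |token, cats| "#{token}: did you mean #{cats.join(' or ')}?" }
end

# Rewrites unambiguous tokens into the category:value form from the
# question; ambiguous or unknown tokens are left unqualified.
def to_qualified_query(tagged)
  tagged.map { |token, cats| cats.size == 1 ? "#{cats.first}:#{token}" : token }
        .join(' ')
end

tagged = guess_categories("Vova Lenin")
puts to_qualified_query(tagged)    # firstname:vova lenin
puts clarifying_questions(tagged)  # lenin: did you mean lastname or street?
```

Exact dictionary lookup only covers values that are already known; unseen values need something like the classifier approach suggested in the answers.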

2 answers

@ksurent, 2012-05-16

I'm not sure I understood the task correctly, but it sounds like NER (Named Entity Recognition). NER algorithms can extract personal names, geographical names, and so on from text. But this is essentially plain classification, not semantic analysis (i.e., there is no "analysis of meaning").

lightcaster, 2012-05-16
@lightcaster

It's better to use a classifier. Use regular expressions only for something very simple with a fixed pattern (phone numbers, for example). As for algorithms, CRF works best, but Naive Bayes is fine too. The main thing is a good training corpus.
You can also look at www.freebase.com/ . It is a Google project where people enter the data by hand.
And don't throw around terms like "meaning". People who work in NLP don't really like that :).
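
As a rough illustration of the classifier route, here is a minimal Naive Bayes sketch in plain Ruby (not CRF, which also models neighboring labels and needs a dedicated library). The training tokens, the labels, the suffix features, and the names `NaiveBayes`, `token_features`, and `build_toy_classifier` are assumptions made up for this example; a usable model needs a real training corpus.

```ruby
# Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
class NaiveBayes
  def initialize
    @label_counts   = Hash.new(0)                             # label => example count
    @feature_counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # label => feature => count
    @vocabulary     = {}                                      # every feature seen in training
  end

  def train(features, label)
    @label_counts[label] += 1
    features.each do |f|
      @feature_counts[label][f] += 1
      @vocabulary[f] = true
    end
  end

  # Returns the label with the highest log-probability score.
  def classify(features)
    total = @label_counts.values.sum.to_f
    scores = @label_counts.map do |label, count|
      denom = @feature_counts[label].values.sum + @vocabulary.size
      score = Math.log(count / total)
      features.each do |f|
        score += Math.log((@feature_counts[label][f] + 1.0) / denom)
      end
      [label, score]
    end
    scores.max_by { |_, score| score }.first
  end
end

# Hypothetical features: the last two and last three characters of a token.
def token_features(token)
  t = token.downcase
  ["suf2=#{t[-2, 2] || t}", "suf3=#{t[-3, 3] || t}"]
end

# Toy training data, invented for illustration only.
def build_toy_classifier
  nb = NaiveBayes.new
  %w[ivanov petrov gagarin pushkin].each { |w| nb.train(token_features(w), :lastname) }
  %w[tverskaya sadovaya arbat].each      { |w| nb.train(token_features(w), :street) }
  nb
end

CLASSIFIER = build_toy_classifier
puts CLASSIFIER.classify(token_features("Sidorov"))   # lastname
puts CLASSIFIER.classify(token_features("Polevaya"))  # street
```

This classifies each token in isolation; when two labels score almost the same, that gap could be used to trigger a "did you mean ...?" prompt.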