Parsing
Anton, 2015-04-13 10:47:42

What methods are used to extract the semantic elements of a bibliographic reference?

Hello. I am writing a term paper. The input is a line of the form:
Financial markets and neural networks [Text]: [training. allowance for specialties Applied Mathematics, Applied Informatics (according to the region) and other specialties] / V. I. Shiryaev. - M. : URSS, 2007. - 221, [1] p. ; 22 cm - ISBN 978-5-382-00330-6 : $137.59
The task is to extract the useful information (author, title, edition, pages).
The format can vary greatly from reference to reference (there is a GOST standard, but not everyone follows it).
So far, the best thing I have come up with is to use 3-grams to build a probabilistic model of the order of the semantic blocks, like this:

P(TRUE | "<start><book_title>:<additional_title>/<authors>_<publishers>_<pages>;<library_info>_<ISBN>:<cost><end>") = P(TRUE | "<start><book_title>:") * P(TRUE | "<book_title>:<additional_title>") * ... * P(TRUE | ":<cost><end>")
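
To make the chain of factors concrete, here is a minimal Python sketch. It assumes each factor can be approximated by a smoothed relative frequency of that 3-gram in a set of hand-labelled layouts. The names `trigrams`, `train` and `layout_score`, the smoothing constant `alpha` and the two sample layouts are all hypothetical and only serve the illustration.

```python
from collections import Counter
from math import prod

START, END = "<start>", "<end>"

def trigrams(symbols):
    """Return every window of 3 consecutive symbols of the padded layout."""
    padded = [START, *symbols, END]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def train(layouts):
    """Count how often each 3-gram occurs in hand-labelled layouts."""
    counts = Counter()
    for layout in layouts:
        counts.update(trigrams(layout))
    return counts

def layout_score(symbols, counts, alpha=1.0):
    """Multiply one smoothed factor per 3-gram, like the product above."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot so unseen 3-grams keep some mass
    return prod((counts[t] + alpha) / (total + alpha * vocab)
                for t in trigrams(symbols))

# Hypothetical training layouts: block labels and separators as symbols.
LAYOUTS = [
    ["<book_title>", ":", "<additional_title>", "_", "<publishers>", "_", "<pages>"],
    ["<book_title>", "_", "<publishers>", "_", "<pages>", "_", "<ISBN>", ":", "<cost>"],
]
COUNTS = train(LAYOUTS)
```

Because the score is a product of factors below 1, longer layouts always score lower, so comparing candidates is most meaningful when they have the same number of symbols.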

Then, with the text first split into tokens, I would use the same 3-grams to determine which block each piece of text belongs to:
Wwlww[W]:[w.wwwWw,Ww(ww.)lww]/LLW_L.:A,n_n,[n]l.nw_An-nnnn:n.nl.
P(<book_title> | "Wwlww[W]") = P(<book_title> | "<start>Ww") * P(<book_title> | "Wwl") * ...
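
A rough Python sketch of the token-class encoding follows. The meaning of the class letters is an assumption, since the post does not define them: `W` is a capitalized word, `w` a lowercase word, `L` and `l` single upper and lower case letters, `A` an all-caps word, `n` a number, and punctuation is kept as-is. The helper names `token_class`, `shape` and `shape_trigrams` are hypothetical, and the output will not necessarily match the hand-written string above character for character.

```python
import re

# One token per match: a run of letters, a run of digits,
# or a single punctuation character. Whitespace is skipped.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

def token_class(token):
    """Map one token to a one-character class (assumed scheme)."""
    if token.isdigit():
        return "n"
    if token.isalpha():
        if len(token) == 1:
            return "L" if token.isupper() else "l"
        if token.isupper():
            return "A"
        return "W" if token[0].isupper() else "w"
    return token  # punctuation stands for itself

def shape(text):
    """Encode a piece of text as a string of token classes."""
    return "".join(token_class(t) for t in TOKEN_RE.findall(text))

def shape_trigrams(classes):
    """Sliding 3-grams over a class string, with a start marker."""
    padded = ["<start>", *classes]
    return ["".join(padded[i:i + 3]) for i in range(len(padded) - 2)]
```

For example, `shape("M. : URSS, 2007")` gives `"L.:A,n"`, and `shape_trigrams("Wwlww")` starts with `"<start>Ww"` followed by `"Wwl"`.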

But I am not sure how exactly to determine the boundaries of the semantic blocks.
To be honest, I strongly doubt that I am thinking in the right direction. Could anyone familiar with the subject suggest which effective approaches to this problem exist? Thanks in advance.

1 answer(s)
becks, 2015-04-13
@becks

Try Yandex-Tomita (you will need to write the rules, but there is nothing complicated about it) or AOT.
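
To give a feel for what "writing the rules" can mean, here is a small sketch in plain Python regular expressions. This is not Tomita-parser grammar syntax, just a stand-in for the rule-based idea. Every pattern in the hypothetical `RULES` table is an assumption about GOST-style punctuation, tuned to the sample reference from the question, and would need extending for other layouts.

```python
import re

# Hypothetical hand-written rules: one regex per field, first group is the value.
RULES = {
    # Title: everything up to the first "[", ":" or "/".
    "title": re.compile(r"^(.+?)\s*(?:\[|:|/)"),
    # Authors: text after "/" up to the ". - " area separator.
    "authors": re.compile(r"/\s*(.+?)\.\s+-\s"),
    # Year: four digits after a comma, followed by a period.
    "year": re.compile(r",\s*(\d{4})\."),
    # Pages: number before "p.", optionally followed by ", [n]".
    "pages": re.compile(r"-\s*(\d+)(?:,\s*\[\d+\])?\s*p\."),
    # ISBN: digits and hyphens after the literal "ISBN".
    "isbn": re.compile(r"ISBN\s+([\d-]+)"),
}

def extract(reference):
    """Apply every rule and collect the fields that matched."""
    fields = {}
    for name, rule in RULES.items():
        match = rule.search(reference)
        if match:
            fields[name] = match.group(1)
    return fields

SAMPLE = (
    "Financial markets and neural networks [Text]: [training. allowance for "
    "specialties Applied Mathematics, Applied Informatics (according to the "
    "region) and other specialties] / V. I. Shiryaev. - M. : URSS, 2007. - "
    "221, [1] p. ; 22 cm - ISBN 978-5-382-00330-6 : $137.59"
)
```

On the sample reference, `extract(SAMPLE)` returns the title `Financial markets and neural networks`, the author `V. I. Shiryaev`, the year `2007`, `221` pages and the ISBN `978-5-382-00330-6`. A field whose rule does not match is simply left out of the result.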
