Getting the text of articles (posts) from a page without tags
Hello!
Can anyone recommend libraries (preferably written in Java) for extracting the main text and related images from an HTML page?
Example: given a link to the page habrahabr.ru/post/193226/ as input, the output should be the text of that post and its images.
Found it myself: code.google.com/p/boilerpipe/
It extracts the main content of a page, leaving out all the secondary blocks.
You could also use something like jQuery to search the page's markup for the block with the most text, strip the formatting tags from it, and replace each img tag with a plain link to the image. Something along those lines.
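Since the question asks for Java, here is a rough sketch of that "biggest block of text" idea using only the standard library. The class name MainTextSketch, the method extractMainText, and the list of container tags are my own assumptions for illustration. Regular expressions are a crude stand-in for a real HTML parser here, so nested containers inside the article will split it into smaller pieces.

```java
import java.util.regex.Pattern;

// Hypothetical helper: picks the container chunk with the most plain text.
class MainTextSketch {
    // Script and style bodies are not page text, so they are dropped first.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Captures the quoted src attribute of an img tag.
    private static final Pattern IMG =
            Pattern.compile("(?is)<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>");
    // Assumed set of tags that separate one block of the page from another.
    private static final Pattern CONTAINER =
            Pattern.compile("(?i)</?(div|article|section|td|body|header|footer|nav|aside)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String extractMainText(String html) {
        String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Replace each img tag with a plain link (its src value).
        cleaned = IMG.matcher(cleaned).replaceAll(" $1 ");
        String best = "";
        // Cut the page at container boundaries and keep the longest chunk.
        for (String chunk : CONTAINER.split(cleaned)) {
            String text = ANY_TAG.matcher(chunk).replaceAll(" ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (text.length() > best.length()) {
                best = text;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String html = "<body><div><a href=\"/\">Home</a></div>"
                + "<div><p>Some <b>long</b> article text.</p>"
                + "<img src=\"http://example.com/pic.png\"></div></body>";
        System.out.println(extractMainText(html));
    }
}
```

For real pages a dedicated extractor such as boilerpipe will almost certainly do better, since it looks at more than raw text length.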