Java
No_Time, 2012-05-11 16:45:30

A simple web parser

Greetings. I need to write an application that parses a particular site, stores its contents in a database, and serves that database through an API. I would like to implement everything in Java, but if there are better and simpler ideas, I will gladly listen. The main priorities are speed and reliability.

If you recommend Java, I would like an elegant solution, without ent scripts and huge XML. I am not very familiar with web Java, so I would be glad to get a link to a good tutorial!

Thanks in advance!

9 answer(s)
Anatoly, 2012-05-11
@No_Time

It is better and easier to use what you know best. But if you want to use Java “because I want to get to know it better”, then do it in Java and ignore the other advice (otherwise, read the other answers first).

egorinsk, 2012-05-11
@egorinsk

Drop Java. In PHP, the parser takes about 10 lines: curl_init(), curl_exec(), preg_match_all(), mysql_connect(), mysql_select_db(), mysql_query(). Fill in the function parameters yourself, based on your task.
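
For anyone who stays with Java as the question asks, the fetch-and-match part of this recipe can be sketched with JDK classes only. This is a rough counterpart, not a port of any real script: it assumes Java 11+ for java.net.http, and the class name RegexScraper and the link pattern are made up for illustration. The database step is left out because JDBC needs a driver on the classpath.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexScraper {
    // Hypothetical pattern: captures the href value of every <a> tag.
    private static final Pattern LINK =
            Pattern.compile("<a\\s[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Rough counterpart of curl_init()/curl_exec(): download a page as a string.
    static String fetch(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString())
                    .body();
        } catch (IOException e) {
            throw new IllegalStateException("Could not fetch " + url, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }

    // Rough counterpart of preg_match_all(): collect every captured href.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Inline sample markup so the example runs without network access.
        String html = "<p><a href=\"/post/1\">One</a> <a class=\"x\" href=\"/post/2\">Two</a></p>";
        System.out.println(extractLinks(html)); // prints [/post/1, /post/2]
    }
}
```

Regular expressions are fragile on real markup, so this only suits pages whose structure you control or know well.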

serso, 2012-05-11
@serso

If you want a proper solution, that is Spring MVC + Web, with Hibernate, OpenJPA, or something of your own as the ORM.
What kind of API do you need: AJAX or web services? In the first case, simply write a controller in Spring; in the second, use JAX-WS or something similar.
For administration, Spring Security will most likely suffice; if not, take a look at Apache Shiro.
The web interface can be anything from JSP to Ext GWT, although, as I understand it, you don't need one.
Links:
There are lots of tutorials at www.springsource.org/tutorials
www.springbyexample.org/
P.S. Java hosting will be more expensive unless you have your own virtual/dedicated server...

vimvim, 2012-05-12
@vimvim

Check out web-harvest.sourceforge.net/
This is a Java application with its own little functional language.
Here is an example of flickr parsing:

<?xml version="1.0" encoding="UTF-8"?>
 
<config>
    <include path="functions.xml"/>
    
    <var-def name="tags" overwrite="false">art</var-def>
    <var-def name="num" overwrite="false">1</var-def>
    
    <loop index="i" item="url">
        <list>
            <var-def name="imagelinks">    
                <call name="download-multipage-list">
                    <call-param name="pageUrl">
                          <template>http://www.flickr.com/search/?q=${tags}&amp;m=tags</template>
                    </call-param>
                    <call-param name="nextXPath">//a[contains(., 'Next')]/@href</call-param>
                    <call-param name="itemXPath">//img[@class='pc_img']/@src</call-param>
                    <call-param name="maxloops"><template>${num}</template></call-param>
                </call>
            </var-def>
        </list>
        <body>
            <empty>
                <file action="write" path='flickr/${tags.toString().replaceAll(" ", "")}/${i}.jpg' type="binary">
                    <http url='${url.toString().replaceFirst("_m.jpg", ".jpg?v=0")}'/>
                </file>
            </empty>
        </body>
    </loop>
    
</config>
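
The two ${...} expressions in the config are ordinary Java string calls, so what they compute can be checked in isolation. The sketch below is only an illustration: the class name FlickrPaths, the method names, and the sample URL are assumptions and are not part of Web-Harvest.

```java
class FlickrPaths {
    // Mirrors path='flickr/${tags.toString().replaceAll(" ", "")}/${i}.jpg':
    // spaces are stripped from the tag list to form the directory name.
    static String targetPath(String tags, int index) {
        return "flickr/" + tags.replaceAll(" ", "") + "/" + index + ".jpg";
    }

    // Mirrors url.toString().replaceFirst("_m.jpg", ".jpg?v=0"):
    // turns a thumbnail URL ending in _m.jpg into the larger image URL.
    static String fullSizeUrl(String thumbnailUrl) {
        return thumbnailUrl.replaceFirst("_m.jpg", ".jpg?v=0");
    }

    public static void main(String[] args) {
        // Hypothetical inputs, chosen only to show the transformation.
        System.out.println(targetPath("black white", 3));
        // prints flickr/blackwhite/3.jpg
        System.out.println(fullSizeUrl("http://example.com/photos/123_m.jpg"));
        // prints http://example.com/photos/123.jpg?v=0
    }
}
```

Note that the first argument of replaceFirst is a regular expression, so the dot in "_m.jpg" matches any character; wrapping the literal in Pattern.quote would make the match exact.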

ophiuhus, 2012-05-11
@ophiuhus

I would like an elegant solution, without ent scripts and huge xml

After that, you did not need to add
"I'm not very familiar with web java."

Sergey, 2012-05-11
@butteff

I always parse with Simple HTML DOM Parser.
Pros: the desired element is selected with jQuery-style selectors.
Cons: it is PHP. But since that is what I know best, it is a plus for me.
I don't know if it will help you, but I hope so.

tsegorah, 2012-05-12
@tsegorah

You can parse the HTML by hand with something like SAX, or you can try one of the libraries at the link here.
As written above, any JPA implementation, such as Hibernate, is suitable for working with the database.
For a REST interface for third-party applications, try the Jersey library.
If you need something small, I recommend looking first at the standard, widely supported tools, and only then at individual frameworks.
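
Parsing by hand with SAX could look roughly like the sketch below, which uses only the parser bundled with the JDK. It assumes the page is well-formed XHTML, because SAX rejects the broken markup that many real sites serve. The class name SaxLinkCollector and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxLinkCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    // Called by the parser for every opening tag; keep the href of each <a>.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("a".equalsIgnoreCase(qName)) {
            String href = attributes.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    // Parses well-formed XHTML and returns the link targets in document order.
    static List<String> collect(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxLinkCollector handler = new SaxLinkCollector();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.hrefs;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/post/1\">One</a>"
                + "<p><a href=\"/post/2\">Two</a></p></body></html>";
        System.out.println(collect(page)); // prints [/post/1, /post/2]
    }
}
```

The handler sees one event per tag and keeps no tree in memory, which helps on large pages but means any notion of context (for example, "links inside the article body") has to be tracked by hand.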

Snowindy, 2012-05-12
@Snowwindy

Before creating the site:
1. Create a database structure.
2. Parse the content of the source site using a dirty-HTML tag cleaner (required if the source site's markup is invalid) and write it to the database.
Website creation:
1. Use the Grails framework (rather simple, no XML configs and so on, but powerful) to generate views and read data from the database.
2. Deploy the site to the hosting and transfer the database there.

FanKiLL, 2012-05-12
@FanKiLL

Judging by the tag, you want to parse Habr; leeching is bad :)
You can use jsoup.org to parse the site. It is a very convenient library; for example, you can select elements by CSS class, as in jQuery.
You can simply build a parser.jar that parses the site on a cron schedule and writes the data into the database.
For an API without any XML configs, I would suggest jersey.java.net, which will take data from the database and return it as JSON/XML. Everything is very simple: for example, a class Post with the method getID(int id) can be mapped onto domen.com/post/getid/1, and you can return either XML or JSON, depending on which format the consumer of your API requests in the Accept header.
Good luck. If you write, I will help you in any way I can.
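
To make the /post/getid/1 mapping concrete without pulling in Jersey, here is a minimal sketch built on the HTTP server that ships with the JDK (com.sun.net.httpserver). It is a stand-in for the idea, not Jersey's API. The class PostApi, the titleFor lookup that replaces the database, and the hand-built JSON and XML strings are all assumptions. The format is chosen from the request's Accept header, which is the header HTTP clients normally use to say what they want back.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class PostApi {
    // Hypothetical stand-in for a row loaded from the database.
    static String titleFor(int id) {
        return "Post #" + id;
    }

    // Extracts the trailing id from a path such as /post/getid/1.
    static int parseId(String path) {
        return Integer.parseInt(path.substring(path.lastIndexOf('/') + 1));
    }

    // XML only when the client asks for it; JSON is the default.
    static boolean wantsXml(String accept) {
        return accept != null && accept.contains("application/xml");
    }

    // Builds the response body by hand; a real service would use a serializer.
    static String render(int id, String accept) {
        String title = titleFor(id);
        return wantsXml(accept)
                ? "<post><id>" + id + "</id><title>" + title + "</title></post>"
                : "{\"id\":" + id + ",\"title\":\"" + title + "\"}";
    }

    static void handle(HttpExchange exchange) throws IOException {
        String accept = exchange.getRequestHeaders().getFirst("Accept");
        int id = parseId(exchange.getRequestURI().getPath());
        byte[] body = render(id, accept).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type",
                wantsXml(accept) ? "application/xml" : "application/json");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    // Starts the server; not called from main so the example exits at once.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/post/getid/", PostApi::handle);
        server.start();
        return server;
    }

    public static void main(String[] args) {
        System.out.println(render(parseId("/post/getid/1"), "application/json"));
        // prints {"id":1,"title":"Post #1"}
        System.out.println(render(parseId("/post/getid/1"), "application/xml"));
        // prints <post><id>1</id><title>Post #1</title></post>
    }
}
```

Calling PostApi.start(8080) would expose the same logic over HTTP. With Jersey, the routing and the format selection would be declared with annotations instead of written by hand.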
