How to preserve the original HTML source when crawling with Apache Nutch?

Seldon, 2015-01-26 18:08:11

Crawling

Good day. I use Nutch (version 2.3) to collect pages, and I now need to save the original HTML and index it. Can anyone suggest an elegant solution?
I found a plugin that saves the original HTML, but it is written for Nutch 1.x. When I try to port it to 2.x, I cannot work out how to reach the data that gets written to the database.
The plugin is only about three lines of code; the question is how to rewrite this line against the new API:

Metadata metadata = parseResult.get(content.getUrl()).getData().getParseMeta();

ParseResult no longer exists in 2.x, and the new ParseFilter interface offers no obvious place to get a Metadata object from.
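
One possible direction, offered as a sketch and not as the confirmed Nutch 2.3 API: as far as I recall, the 2.x ParseFilter receives a WebPage object, and WebPage exposes the fetched bytes as a ByteBuffer (getContent()) and its metadata as a Map<CharSequence, ByteBuffer> (getMetadata()). Both of those are assumptions to check against the 2.3 javadoc. The code below uses only the JDK and models just that copy step; the class name RawHtmlSketch and the key "rawHtml" are hypothetical, not names defined by Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Stand-alone sketch: copies raw page bytes into a metadata map.
// In a real plugin the two arguments would (assumption) come from
// page.getContent() and page.getMetadata() inside the filter method.
final class RawHtmlSketch {

    // Hypothetical metadata key chosen for this example.
    static final String RAW_HTML_KEY = "rawHtml";

    // Copy the content bytes into the map without moving the
    // position of the caller's buffer.
    static void storeRawHtml(ByteBuffer content,
                             Map<CharSequence, ByteBuffer> metadata) {
        if (content == null) {
            return; // nothing was fetched, nothing to store
        }
        ByteBuffer view = content.duplicate(); // independent position/limit
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        metadata.put(RAW_HTML_KEY, ByteBuffer.wrap(bytes));
    }

    // Read the stored HTML back as text, assuming UTF-8 for illustration.
    static String readRawHtml(Map<CharSequence, ByteBuffer> metadata) {
        ByteBuffer stored = metadata.get(RAW_HTML_KEY);
        if (stored == null) {
            return null;
        }
        return StandardCharsets.UTF_8.decode(stored.duplicate()).toString();
    }
}
```

To make the stored value searchable, an indexing filter would still have to read the same key and add it to the document; that part is not shown here.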

1 answer(s)

Denis Holub, 2017-08-12
@denman1985

As I understand it, each region_id outside the required set should appear exactly once (given that orders exist for it), with every field NULL except product_id and region_id.
Given that condition, I see the following options:
1. Leave the query as it is and append three UNION branches at the end: select 100, NULL, NULL, NULL, 5 (and likewise for region_id 6 and 7)
2. Change the query to:

select distinct
    po.product_id,
    (case when po.region_id in (1,2,3,4) then po.order_id else null end) as order_id,
    (case when po.region_id in (1,2,3,4) then p.product_name else null end) as product_name,
    (case when po.region_id in (1,2,3,4) then po.order_date else null end) as order_date,
    po.region_id
from products_orders po
left join products p
    on p.id = po.product_id
where
    po.product_id = 100