Get
In short: a user visits the portal and enters a URL.
On the server side, I need to fetch that URL and extract the value of its title tag.
The server runs Tomcat (so Java in general), and everything runs on *nix.
Question: What is the best way to implement this?
Should I write this in Java or use some *nix commands? (lynx looks like it could help.)
Or is there perhaps a third-party service for this, like the ones that provide page screenshots or send email alerts?
There are two concerns here:
First, encoding. Everything is stored in the database in UTF-8, but pages come in all sorts of encodings: windows-1251, ISO-8859-1, even GB2312. The page's encoding still has to be determined somehow. It can be declared in the HTTP header and/or in a meta tag. Or it may not be declared at all, which also happens.
Second, speed.
I had to do this in PHP. I implemented it simply:
- request the page by URL (in the request headers, state a preference for a UTF-8 response, for those web servers that serve whichever encoding you ask for)
- check the response status (the URL may not exist at all)
- look for the encoding in the response headers (regular expressions)
- look for the encoding in the meta tags (regular expressions)
- find the title and convert its value to the target encoding.
P.S. I'll look up the regular expressions for all of this ...
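Since the question is about a Java server, here is a rough Java sketch of the parsing steps listed above (charset from the header, charset from the meta tag, then the title). The class name `TitleParser`, its method names, and the regular expressions are my own assumptions, not code from the PHP implementation, and regex matching is only an approximation of real HTML parsing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and patterns are illustrative assumptions.
final class TitleParser {
    // Matches charset=... in a Content-Type header value or inside a <meta> tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META =
        Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the charset named in the text, or null if absent or unknown to the JVM. */
    static Charset charsetIn(String text) {
        if (text == null) return null;
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) return null;
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    /** Scans every <meta ...> tag and returns the first charset it declares, or null. */
    static Charset charsetFromMeta(String html) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            Charset cs = charsetIn(m.group());
            if (cs != null) return cs;
        }
        return null;
    }

    /** Header first, then meta, then the caller's fallback; returns null if no title is found. */
    static String titleOf(byte[] body, String contentType, Charset fallback) {
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup
        // can be scanned before the real encoding is known.
        String probe = new String(body, StandardCharsets.ISO_8859_1);
        Charset cs = charsetIn(contentType);
        if (cs == null) cs = charsetFromMeta(probe);
        if (cs == null) cs = fallback;
        Matcher m = TITLE.matcher(new String(body, cs));
        return m.find() ? m.group(1).trim() : null;
    }
}
```

The result is an ordinary Java String, so writing it to a UTF-8 database is then just a matter of encoding it on output. Pages in encodings that are not ASCII-compatible (UTF-16, for example) would need extra handling that this sketch does not attempt.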
For Java, there is an excellent library for loading remote pages: Apache HttpClient.
hc.apache.org/httpcomponents-client-ga/examples.html
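If pulling in a library is not an option, roughly the same can be done with the HTTP client built into the JDK since Java 11 (`java.net.http`). Note that this is not Apache HttpClient but a swapped-in alternative. The class name `PageFetcher` and the timeout values are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper built on the JDK client, not on Apache HttpClient.
final class PageFetcher {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(5))        // assumed value
        .followRedirects(HttpClient.Redirect.NORMAL)
        .build();

    /** Builds a GET request that states a preference for UTF-8 and has a time limit. */
    static HttpRequest requestFor(String url) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(Duration.ofSeconds(10))          // assumed value
            .header("Accept-Charset", "utf-8")
            .GET()
            .build();
    }

    /** Fetches the page as raw bytes; anything other than 200 is treated as a failure. */
    static HttpResponse<byte[]> fetch(String url) throws IOException, InterruptedException {
        HttpResponse<byte[]> response =
            CLIENT.send(requestFor(url), HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response;
    }
}
```

The body stays a byte[] on purpose, and the declared encoding, if any, can be read from `response.headers().firstValue("Content-Type")`. The timeouts address the speed concern only in the sense that a slow site cannot hold a request thread indefinitely.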
It's best to use Curl. Download only part of the page, together with the headers, and limit the amount via CURLOPT_WRITEFUNCTION. Check the server's response by its headers. Curl example: goo.gl/0EOFQ , parser example: goo.gl/sFP8t , and a simple function for determining the encoding: pastebin.com/51p9NUAX
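The same idea of reading only the beginning of the response carries over to Java. Below is a minimal sketch that stops after a fixed number of bytes. The name `LimitedReader` is made up, and the right limit is an assumption you would have to tune, since the title normally sits near the top of the page but nothing guarantees it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads a bounded prefix of a stream.
final class LimitedReader {
    /** Reads at most limit bytes from the stream and returns them. */
    static byte[] readPrefix(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (out.size() < limit) {
            // Never ask for more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, limit - out.size()));
            if (n < 0) break; // end of stream before the limit was reached
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

The caller is still responsible for closing the stream, which is also what ends the download early.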
Apache HttpClient + cpdetector (to determine the encoding).
The only catch is that it is quite heavy and makes mistakes :)
The algorithm for determining the encoding is as follows:
1. From the server headers (Httpclient), if not, then
: :
3. cpdetector, if not, then no idea :)
In general, the task is not entirely trivial. And keep in mind that when you get a byte[] array from HttpClient, don't convert it to a String until you know the encoding, otherwise you'll mangle the text :)
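To illustrate why the raw bytes should be kept, here is a tiny sketch that checks whether a decode with a guessed charset can be undone. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo of why a premature String conversion loses data.
final class BytesFirst {
    /** True if decoding with the guess and encoding back reproduces the original bytes. */
    static boolean survivesRoundTrip(byte[] raw, Charset guess) {
        byte[] back = new String(raw, guess).getBytes(guess);
        return Arrays.equals(raw, back);
    }
}
```

For windows-1251 bytes decoded as UTF-8 this returns false, because the invalid sequences have already been replaced with U+FFFD and the original bytes are gone. A true result does not prove the guess was right (ISO-8859-1 round-trips any input), it only shows that nothing was destroyed.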
mb_convert_encoding (PHP) can determine the source encoding automatically. I only tested it on cp1251/utf8/koi8-r, and it worked fine. The first parameter is the string itself. The second parameter is the target encoding. The third parameter is optional: the source encoding, or a list of candidate encodings from which the source is guessed.
php.net/mb_convert_encoding
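Java has no direct counterpart to this function that I know of, but guessing from a list of candidates can be approximated with strict decoders from `java.nio.charset`. This is a sketch of that idea, not a port of the PHP algorithm, and `CandidateDecoder` is a made-up name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: tries candidate charsets in order, rejecting invalid input.
final class CandidateDecoder {
    /** Returns the text decoded with the first candidate that accepts every byte, or null. */
    static String decodeFirstValid(byte[] raw, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
            } catch (CharacterCodingException e) {
                // This candidate cannot represent the bytes; try the next one.
            }
        }
        return null;
    }
}
```

The order of candidates matters. UTF-8 rejects most non-UTF-8 input, so it belongs first, while single-byte encodings such as windows-1251 or KOI8-R accept almost any byte sequence and cannot be told apart this way.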