Alexander Petrov, 2018-11-29 22:55:19
Ruby on Rails

How do I get the HTML of a page I want to parse?

I need to parse a page: https://www.controller.com
I'm using Nokogiri and make this request:

require 'open-uri' # stdlib; provides URI.open for HTTP(S) URLs
page = Nokogiri::HTML(URI.open('https://www.controller.com'))

The response I get is: 416 Requested Range Not Satisfiable
I dug into this question and suspect that some kind of no-cors parameter is required, but I don't really understand what it is or how to use it in Rails.
I did manage to make a request from JavaScript using fetch:
fetch('https://www.controller.com/listings/aircraft/for-sale/list/category/6/piston-single-aircraft', { method: 'GET', mode: 'no-cors' })
  .then(function (response) {
    // With mode: 'no-cors' the response is opaque: its body cannot be read from script.
    console.log(response);
    return response;
  })
  .then(function (response) {
    console.log('Request successful', response);
  })
  .catch(function (error) {
    console.error('Request failed', error);
  });

But the response does not contain the page's HTML. At this point I'm completely confused about what to do.
How can I parse this site? What is so special about it that a plain request to it doesn't work?
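For what it's worth, open-uri raises an exception on a non-2xx status, so the body that came with the 416 is easy to miss. Below is a minimal sketch, using only Ruby's standard library, of a request that returns the status and body for any response so you can read what the server actually sent. The names `BROWSER_LIKE_HEADERS`, `build_request` and `fetch_status_and_body` are made up for this example, and the header values are assumptions, not something the site is known to accept.

```ruby
require 'net/http'
require 'uri'

# Hypothetical header set for illustration only; the values are assumptions.
BROWSER_LIKE_HEADERS = {
  'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0 Safari/537.36',
  'Accept' => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.9'
}.freeze

# Builds a GET request with the given headers without sending it.
def build_request(url, headers = BROWSER_LIKE_HEADERS)
  request = Net::HTTP::Get.new(URI(url))
  headers.each { |name, value| request[name] = value }
  request
end

# Sends the request and returns [status_code, body] for any status,
# so an error page can be inspected instead of surfacing as an exception.
def fetch_status_and_body(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url))
  end
  [response.code.to_i, response.body]
end
```

Calling `fetch_status_and_body('https://www.controller.com')` and printing the body should show whether the server returned real content or an error page. Changing headers alone may well not be enough to get past the error.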

1 answer(s)
oh_shi, 2018-11-30
@Mirkom63

If this is a freelance job that pays about as much as a pack of Doshirak instant noodles, it's better to just turn it down. The site owners really do not want to be parsed.

spoiler
Pardon Our Interruption...
As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.
To request an unblock, please fill out the form below and we will review it as soon as possible.
First Name:
Last Name:
E-mail:
You reached this page when attempting to access https://www.controller.com/info/site-map from 127.0.0.1 on 2018-11-30 12:58:16 UTC.
Trace: ead57087-e556-473f-880f-707c3bfa87c1 via 449bb29d-9aa5-44ea-a964-418570a62186

Even at first glance you can see that they check IPs against popular VPN services, use several kinds of captcha, track the cursor, and set a dozen cookies for validation. They also track where you came from: UserReferrer=https://toster.ru/q/583813.
If you still want to try to beat all of that, I can say for sure that obtaining a valid cookie once and attaching it to your requests will not work. You need a headless browser, for example Capybara + Poltergeist.
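A rough sketch of what the headless-browser route might look like. It assumes the capybara and poltergeist gems and a PhantomJS binary are installed. The names `BLOCK_PAGE_MARKER`, `block_page?` and `fetch_rendered_html` are made up for this example, and the five-second wait is an arbitrary guess. There is no guarantee this gets past the protection; the check against the "Pardon Our Interruption" title from the spoiler above is only there so you can tell when it does not.

```ruby
# Title text of the block page quoted above, used as a simple marker.
BLOCK_PAGE_MARKER = 'Pardon Our Interruption'

# True when the HTML looks like the anti-bot block page rather than real content.
def block_page?(html)
  html.to_s.include?(BLOCK_PAGE_MARKER)
end

# Renders the page in a headless browser and returns the resulting HTML.
# Assumption: the capybara and poltergeist gems and PhantomJS are available.
def fetch_rendered_html(url, settle_seconds: 5)
  require 'capybara/poltergeist' # loaded lazily so block_page? works without the gems

  Capybara.register_driver(:poltergeist) do |app|
    Capybara::Poltergeist::Driver.new(app, js_errors: false)
  end

  session = Capybara::Session.new(:poltergeist)
  session.visit(url)
  sleep(settle_seconds) # crude wait so the page's JavaScript has time to run
  session.html
ensure
  # Shut down the PhantomJS process if a session was created.
  session&.driver&.quit
end
```

Pass the result of `fetch_rendered_html('https://www.controller.com')` to `block_page?` before handing it to Nokogiri. If it returns true, you got the block page again and not the listings.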
