Python
VuztreeCalan, 2019-08-30 23:15:50

Why does the server return 403 even after I specify Cookie and User-Agent?

I tried to write a parser to download pictures for myself from artstation.com. I took a random profile; almost all the content there is loaded via JSON. I found the GET request, and it opens normally in the browser, but through requests.get it returns 403. Everyone on Google advises specifying the User-Agent and Cookie headers, so I used requests.Session and set a User-Agent, but the picture is still the same. What gives?

import requests

url = 'https://www.artstation.com/users/kuvshinov_ilya'
json_url = 'https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1'
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',}

session = requests.Session()
r = session.get(url, headers=header)  # load the profile page first, hoping to pick up cookies
json_r = session.get(json_url, headers=header)
print(json_r)
# output: <Response [403]>


3 answers
AWEme, 2019-08-31
@VuztreeCalan

The 403 is Cloudflare's doing. cfscrape helped me bypass it:

import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'Host': 'www.artstation.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'}
    return cfscrape.create_scraper(sess=session)

session = get_session()  # from here on, use it like a regular requests.Session

Some code for pulling direct links to the hi-res pictures:
import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'Host':'www.artstation.com',
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language':'ru,en-US;q=0.5',
        'Accept-Encoding':'gzip, deflate, br',
        'DNT':'1',
        'Connection':'keep-alive',
        'Upgrade-Insecure-Requests':'1',
        'Pragma':'no-cache',
        'Cache-Control':'no-cache'}
    return cfscrape.create_scraper(sess=session)

def artstation():
    url = 'https://www.artstation.com/kyuyongeom'
    page_url = 'https://www.artstation.com/users/kyuyongeom/projects.json'
    post_pattern = 'https://www.artstation.com/projects/{}.json'
    session = get_session()
    absolute_links = []

    # the projects endpoint returns 50 items per page, so derive the page count
    response = session.get(page_url, params={'page': 1}).json()
    pages, modulo = divmod(response['total_count'], 50)
    if modulo:
        pages += 1

    for page in range(1, pages + 1):
        if page != 1:  # page 1 was already fetched above
            response = session.get(page_url, params={'page': page}).json()
        for post in response['data']:
            # each post's permalink ends with its shortcode
            shortcode = post['permalink'].split('/')[-1]
            inner_resp = session.get(post_pattern.format(shortcode)).json()
            for img in inner_resp['assets']:
                if img['asset_type'] == 'image':
                    absolute_links.append(img['image_url'])

    with open('links.txt', 'w') as file:
        file.write('\n'.join(absolute_links))

if __name__ == '__main__':
    artstation()

Grigory Boev, 2019-08-31
@ProgrammerForever

It makes sense to send all the header fields your browser sends, not just the User-Agent.
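
A minimal sketch of that idea, assuming the header values are copied from the request shown in the browser's dev tools (the values below are illustrative and not guaranteed to get past Cloudflare):

import requests

# a full browser-like header set; copy the real values from the Network tab
# of your browser's dev tools - these are just an example
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'ru,en-US;q=0.5',
    'Referer': 'https://www.artstation.com/users/kuvshinov_ilya',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

session = requests.Session()
session.headers.update(browser_headers)
r = session.get('https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1')
print(r.status_code)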

Dimonchik, 2019-08-31
@dimonchik2013

Because, bro, an AJAX request carries different headers, and the server notices. You're trying to fetch an AJAX endpoint with a non-AJAX client; that's your 403.
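
For example, a browser's XHR to that endpoint also sends headers like these (a sketch; whether this alone satisfies the server, let alone Cloudflare, is not guaranteed):

import requests

# headers a browser's XMLHttpRequest/fetch typically adds on top of the
# usual ones; illustrative only - Cloudflare may still block the request
ajax_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
    'Accept': 'application/json, text/plain, */*',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.artstation.com/users/kuvshinov_ilya',
}

r = requests.get(
    'https://www.artstation.com/users/kuvshinov_ilya/projects.json',
    params={'page': 1},
    headers=ajax_headers,
)
print(r.status_code)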
