Python Selenium: how to check whether a file already exists before downloading it to a specified directory?

Pavel, 2021-02-25 22:50:20

Python


Good evening, dear experts.
I have a parser that collects data and also downloads files into a folder on my laptop, using Selenium.
The catch is that the same files keep coming up (different products share the same documentation), they are not small, and the parser downloads them again every time. I would like to add a check for whether the file already exists.
How can I make the parser check, before downloading, whether such a file is already in the folder? I can't get the file name, because the download link is generated on click.
Here is the product link; you need to log in to see the file: https://stomshop.pro/hlw-31-45b#tab-documentation

Here is the part of my code that downloads the files:

import time

from selenium import webdriver

options = webdriver.ChromeOptions()

# options.add_argument(f"user-agent={user_agent.random}")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--headless")
options.add_experimental_option('prefs', {
    "download.default_directory": path_registration_documents,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,
}
)

driver = webdriver.Chrome(
    executable_path=f"{base_path}/chromedriver",
    options=options
)

# open the documentation tab
driver.find_element_by_id("tab-documentation-li").click()
time.sleep(0.5)

documents = driver.find_elements_by_class_name("docext-container")

# clicking a container starts the download of that document
for document in documents:
    document.click()
    time.sleep(1)

Can you help, or point me in the right direction?

Thank you in advance!


1 answer(s)


soremix, 2021-02-25
@velllum

I'm sure Selenium has plenty of dedicated handlers and the like for getting information about the file being downloaded, but until someone points them out, I suggest a workaround: build the download request manually and, without downloading the response body, read the file name from the response headers.

import requests
import re
import os

#...

headers = {'Content-Type': 'application/x-www-form-urlencoded'}
documents = driver.find_elements_by_class_name("docext-container")

for document in documents:
    # find the parent element; it holds the ID we need
    document_id = document.find_element_by_xpath('..').get_attribute('data-documentation-id')
    # fill the payload with the form fields and insert our ID
    payload = 'cr_documentation_action=download&documentation_id={}&email='.format(document_id)
    # the request URL is the current page
    # stream=True is required so the file is not downloaded right away
    r = requests.post(driver.current_url, headers=headers, data=payload, stream=True)
    # the file name is always in the response headers (r.headers), under Content-Disposition,
    # e.g. "attachment; filename*=UTF-8''hlw-shiptsy-ortodonticheskie-reg.pdf"
    # so just pull it out with a regex
    document_name = re.search(r'\'\'(.+?\.pdf)', r.headers['Content-Disposition']).group(1)
    # close the response without reading the body
    r.close()

    # next, check whether the file is already in the folder
    # as I understand it, the download folder path is in path_registration_documents, so:
    if document_name in os.listdir(path_registration_documents):
        print('Not new')
    else:
        print('New document')
        document.click()
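
The regex above only matches names ending in .pdf and leaves percent-encoded characters as they are. Here is a rough, slightly more general sketch of the same idea. It assumes the header uses either the filename*=charset''value form or the plain filename="value" form. The helper names filename_from_content_disposition and already_downloaded are made up for illustration; they are not part of requests or Selenium.

```python
import os
import re
from urllib.parse import unquote


def filename_from_content_disposition(header):
    """Return the file name found in a Content-Disposition header, or None."""
    # Extended form: filename*=UTF-8''percent-encoded-name
    match = re.search(r"filename\*\s*=\s*([\w-]*)'[^']*'([^;]+)", header, flags=re.IGNORECASE)
    if match:
        charset = match.group(1) or "utf-8"
        return unquote(match.group(2).strip(), encoding=charset)
    # Plain form: filename="name.pdf" or filename=name.pdf
    match = re.search(r'filename\s*=\s*"?([^";]+)"?', header, flags=re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return None


def already_downloaded(file_name, directory):
    """Check whether a file with this name already exists in the directory."""
    # basename() drops any path components that might come from the header
    return os.path.isfile(os.path.join(directory, os.path.basename(file_name)))
```

In the loop, document_name = filename_from_content_disposition(r.headers.get('Content-Disposition', '')) could replace the re.search line, and already_downloaded(document_name, path_registration_documents) could replace the os.listdir check, provided a None name is handled first.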

I did not add any extra headers to the request; that one was enough. Authorization is not needed for this either, but you never know what will change over time, so you may have to add it later.
And make sure you pass your actual path to os.listdir(), in case I guessed it wrong. Anyway, the idea should be clear; that is as far as I can help.
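
On the authorization point: if the site ever starts requiring a login for this request, one possible approach is to reuse the cookies from the already logged-in Selenium session. This is only a sketch, assuming the site authenticates by cookies; copy_selenium_cookies is a hypothetical helper name, while driver.get_cookies() and session.cookies.set() are the existing Selenium and requests calls.

```python
def copy_selenium_cookies(driver, session):
    """Copy cookies from a Selenium driver into a requests session."""
    for cookie in driver.get_cookies():
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


# Possible usage inside the parser (assumes `driver` is already logged in):
# session = copy_selenium_cookies(driver, requests.Session())
# r = session.post(driver.current_url, headers=headers, data=payload, stream=True)
```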