I
I
Ivan Danchuk2020-06-02 13:15:59
Python
Ivan Danchuk, 2020-06-02 13:15:59

How to find downloaded files?

Hello!
There is a site from which you need to extract financial reporting data, for example , here is the link
Reporting files in Excel format are downloaded from this page by clicking on the button, which is a link to the corresponding part of the file archive. Screenshot of the page and element code: 5ed61553047ae749713901.png
Wrote a code that generates the necessary download links and accesses the site using them

import requests
from bs4 import BeautifulSoup
import re
from time import sleep
import urllib.request, urllib.parse, urllib.error

 # парсинг на примере Agilent Technologies, Inc.

url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=A&type=10-K&dateb=&owner=exclude&count=40'
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 YaBrowser/20.4.3.257 Yowser/2.5 Yptp/1.23 Safari/537.36'}

res = requests.get(url, headers=user_agent).text
soup = BeautifulSoup(res, 'lxml')
CIK_data = soup.find('span', class_='companyName').find('a').get_text()
CIK_num = re.findall(r'\d+',CIK_data)
CIK = CIK_num[0].lstrip('0')
print('CIK: ',CIK)
Acc_no_data = soup.find_all('tr')[3:]
# print(Acc_no_data)
for elem in Acc_no_data:
    sleep(40)
    Acc_no = re.findall(r'\d+', elem.find('td', class_='small').text.replace('-',''))[3]
    # print(Acc_no)
    date = elem.find_all('td')[3].text
    get_file = requests.get(f'https://www.sec.gov/Archives/edgar/data/{CIK}/{Acc_no}/Financial_Report.xlsx', headers=user_agent)


And after that, I can't find the files anywhere, which, according to the idea, should be downloaded. I checked - the links are formed correctly, by clicking on the link the files are downloaded to the "Downloads" folder.
Smart people, tell me - how to make the files downloaded (and preferably in a predefined directory)? The OS is Windows 10.

Answer the question

In order to leave comments, you need to log in

1 answer(s)
S
soremix, 2020-06-02
@IvanIvanichNN

It doesn't work that way a little. Make a request to the desired URL, you get the information in bytes. Then you just need to open a record to a file by selecting the WB (Write-Bytes) mode, and simply write the information received from the link, which is available in r.content

import requests

url = 'https://www.sec.gov/Archives/edgar/data/1090872/000104746909010861/Financial_Report.xls'
r = requests.get(url)

with open('report.xls', 'wb') as f:
    f.write(r.content)

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question