How do I read file paths from Excel, download the files, and sort them into folders?

Mixa, 2021-11-23 17:40:38

Python

The idea is this: there is an Excel file in which every row has a field with a unique ID, followed by fields that each contain the path to an external file.

The task is to go through all the rows, download the files, and save them in a folder that has to be created for each row and named after the unique ID in that row.

I am an amateur at programming, but I suspect that a task like this can be solved with the help of some Python libraries. I would be glad to receive recommendations. Or maybe there are already semi-finished solutions for such tasks?

2 answer(s)

Sergey Karbivnichy, 2021-11-23
@hottabxp

You can convert the Excel file to CSV, which will make it much easier to work with. This task can be solved in 2 minutes.
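To illustrate the CSV route, here is a minimal sketch that reads such an export with the standard csv module and groups the links by ID. The function name read_links_by_id and the default column header "id" are assumptions for illustration; the real header comes from your own sheet.

```python
import csv
from collections import defaultdict


def read_links_by_id(csv_path, id_column="id"):
    """Map each row's ID to the non-empty values of its other columns."""
    links = defaultdict(list)
    # "utf-8-sig" reads plain UTF-8 and also strips the BOM that Excel may add.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            row_id = (row.get(id_column) or "").strip()
            if not row_id:
                continue  # skip rows that have no ID
            for column, value in row.items():
                # Short rows give None, overlong rows give a list: keep strings only.
                if column == id_column or not isinstance(value, str):
                    continue
                if value.strip():
                    links[row_id].append(value.strip())
    return dict(links)
```

Rows whose link cells are all empty do not appear in the result, so every key can be used directly as the name of a folder that will receive at least one file.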

"Or maybe there are already semi-finished solutions for such tasks?"

This is a simple, two-penny task, so there are hardly any ready-made solutions. The requests library will help you.
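As a rough sketch of the download step with requests, the helper below saves one file into a folder named after the row's ID. The names file_name_from_url and save_to_id_folder, the "Download" base folder, the 30-second timeout, and the "download.bin" fallback name are all assumptions for illustration. The session argument is expected to be a requests.Session (or anything with a compatible get method).

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def file_name_from_url(url, fallback="download.bin"):
    """Use the last path segment of the URL as the local file name."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    # Fall back when the URL has no usable last segment.
    return fallback if name in ("", ".", "..") else name


def save_to_id_folder(session, row_id, url, base_dir="Download"):
    """Download url into base_dir/<row_id>/ and return the saved path."""
    folder = Path(base_dir) / str(row_id)
    folder.mkdir(parents=True, exist_ok=True)  # create the ID folder if needed
    response = session.get(url, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    target = folder / file_name_from_url(url)
    target.write_bytes(response.content)
    return target
```

With requests installed, a call could look like save_to_id_folder(requests.Session(), "A1", "http://file_url"). Reusing one session for all rows keeps connections open between downloads.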

MrSpirit, 2021-11-24
@MrSpirit

You can read your Excel file with openpyxl (or, of course, with pandas, where it is trivial), and then download the files in a loop with requests. If there are a lot of files, try writing an asynchronous downloader, which can download roughly 10 times faster (aiohttp and aiofiles will help). To create the folders, you will need os.
This is how you can pick up an entire column from Excel:

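import pandas as pd
from glob import glob

# Take the first .xlsx file found in the current directory
file = glob('*.xlsx')[0]
table = pd.read_excel(file)
# 'Column name' is a placeholder for the header of the column that holds the links
urls_list = table['Column name'].to_list()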

And here is an example of a downloader that I wrote (I don't claim it is perfect, but it does its job well):

import asyncio
import os
from os.path import join as pth_join

import aiofiles
import aiohttp

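DWNLD_FLDR = "Download"
# HTTP headers sent with every request. The original snippet used HEADERS
# without defining it, so an empty dict is assumed here.
HEADERS = {}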


async def download_file(session: aiohttp.ClientSession, link: str, file_name: str):
    async with session.get(link) as resp:
        if resp.status == 200:
            # The context manager closes the file even if the write fails
            async with aiofiles.open(pth_join(DWNLD_FLDR, file_name), "wb") as f:
                await f.write(await resp.read())
        else:
            print(f"Error: {resp.status}")


async def gather_files(files_urls: list[dict]):
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        tasks = []
        for item in files_urls:
            # Skip files that were already downloaded (present and non-empty)
            try:
                if os.stat(pth_join(DWNLD_FLDR, item["file_name"])).st_size:
                    continue
            except FileNotFoundError:
                pass
            task = asyncio.create_task(
                download_file(session, item["file_link"], item["file_name"])
            )
            tasks.append(task)
        await asyncio.gather(*tasks)


def main(file_list):
    os.makedirs(DWNLD_FLDR, exist_ok=True)
    asyncio.run(gather_files(file_list))

if __name__ == "__main__":
    main([{'file_name': 'test.txt',  'file_link': 'http://file_url'}, ])
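
The downloader above takes a list of dicts with 'file_name' and 'file_link' keys, while the question asks for one folder per ID. A possible bridge is sketched below, assuming the rows arrive as dicts such as those returned by table.to_dict("records"). The function name build_file_list, its parameters, and the "download.bin" fallback name are hypothetical.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def build_file_list(rows, id_column, download_dir="Download"):
    """Turn row dicts into downloader items, with one subfolder per ID."""
    file_list = []
    for row in rows:
        row_id = str(row[id_column]).strip()
        # Create the ID subfolder up front so the downloader can write into it.
        os.makedirs(os.path.join(download_dir, row_id), exist_ok=True)
        for column, value in row.items():
            # pandas returns empty Excel cells as NaN (a float): keep strings only.
            if column == id_column or not isinstance(value, str):
                continue
            link = value.strip()
            if not link:
                continue
            # File name is the last URL path segment, with a fallback if it is empty.
            name = posixpath.basename(unquote(urlsplit(link).path)) or "download.bin"
            file_list.append(
                {"file_name": os.path.join(row_id, name), "file_link": link}
            )
    return file_list
```

The result could then be passed on as main(build_file_list(table.to_dict("records"), "Column name", DWNLD_FLDR)), where "Column name" again stands for the header of the ID column. Because each file_name already contains the ID subfolder, the downloader itself does not need to change.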