Fokinura_3000, 2022-01-18 21:53:07
Python

How to parse multiple URLs from an Excel file?

There is an Excel file that contains links in one column.
How do I correctly extract the names of the applications that these links point to?

A sample of the file is shown in the attached screenshot.
I will duplicate here some of the links from the list:

https://play.google.com/store/apps/details?id=com.vkontakte.android
https://play.google.com/store/apps/details?id=ru.ok.android
https://play.google.com/store/apps/details?id=com.outfit7.talkingtomgoldrun
https://play.google.com/store/apps/details?id=com.tapclap.piratetreasures2
https://play.google.com/store/apps/details?id=com.openmygame.games.android.wordpizza
https://play.google.com/store/apps/details?id=com.outfit7.mytalkingtomfriends
https://play.google.com/store/apps/details?id=com.hornet.android

Sample of my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.read_excel('ids.xlsx')
url = df

for urlibs in url:
    response = requests.get(urlibs)
    soup = BeautifulSoup(response.text, 'lxml')
    quotes = soup.find_all('h1', class_='AHFaub')
for quote in quotes:
    print(quote.text)

With my code I only get one line of output; how can I get the names for all of the links?


3 answer(s)
xotkot, 2022-01-18
@Fokinura_3000

Save the document as CSV and process it with any convenient tool.
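A minimal sketch of that approach using only the standard library's csv module (the CSV content here is a hypothetical stand-in for an exported ids.csv, which you would produce from ids.xlsx via "Save As" in Excel):

```python
import csv
import io

# Hypothetical contents of ids.csv: one URL per row, first column.
csv_data = """https://play.google.com/store/apps/details?id=com.vkontakte.android
https://play.google.com/store/apps/details?id=ru.ok.android
"""

# With a real file you would use: open("ids.csv", newline="") instead of StringIO.
urls = [row[0] for row in csv.reader(io.StringIO(csv_data)) if row]
print(urls)
```

Once you have a plain list of URLs, any of the request loops below will work on it.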

webster_r, 2022-01-19
@webster_r

# -*- coding: utf-8 -*-
import pandas as pd
import requests
from bs4 import BeautifulSoup

filename = 'ids.xlsx'
df = pd.read_excel(filename)
url = df.iloc[:, 0].tolist()  # Convert the zeroth column to a list

for urlibs in url:
    response = requests.get(urlibs)
    soup = BeautifulSoup(response.text, 'lxml')
    appname = soup.find('h1', class_='AHFaub').text
    print(appname)

Fast parsing option (multithreading)
# -*- coding: utf-8 -*-
import pandas as pd
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

filename = 'ids.xlsx'
df = pd.read_excel(filename)
urls = df.iloc[:, 0].tolist()


def get_app_name(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    appname = soup.find('h1', class_='AHFaub').text
    print(appname)

# The number of workers can be adjusted as you see fit
with ThreadPoolExecutor(max_workers=16) as executor:
    executor.map(get_app_name, urls)
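One detail worth knowing about executor.map: if the worker function returns a value instead of printing, the results come back in the same order as the input URLs, regardless of which thread finishes first. A sketch with a hypothetical fake_app_name stand-in (no network calls, unlike the real get_app_name above):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for get_app_name: derives a "name" from the URL
# itself so the example runs without network access.
def fake_app_name(url):
    return url.rsplit("=", 1)[-1]

urls = [
    "https://play.google.com/store/apps/details?id=com.vkontakte.android",
    "https://play.google.com/store/apps/details?id=ru.ok.android",
]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, even with concurrent execution
    names = list(executor.map(fake_app_name, urls))

print(names)
```

This makes it easy to write the names back into a new column of the original DataFrame.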

kapp1, 2022-01-18
@kapp1

You can't simply iterate over the DataFrame itself in a loop; you need to extract the cell values and pass each one to the request.
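A minimal illustration of this point, using an in-memory DataFrame as a hypothetical stand-in for the real ids.xlsx: iterating a DataFrame directly yields its column labels, not the rows, which is why the original code never requested the actual links.

```python
import pandas as pd

# Hypothetical data standing in for the contents of ids.xlsx
df = pd.DataFrame({"url": [
    "https://play.google.com/store/apps/details?id=com.vkontakte.android",
    "https://play.google.com/store/apps/details?id=ru.ok.android",
]})

# Iterating the DataFrame itself yields column labels:
print(list(df))          # the column names, not the links

# Extract the cell values first; these are what requests.get() needs:
urls = df["url"].tolist()
print(urls)
```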
