Fokinura_3000, 2022-01-18 21:53:07
Python

How to parse multiple URLs from an Excel file?

There is an Excel file that contains links in one column.
How do I correctly extract the names of the applications that these links point to?

A sample of the file is shown in the attached screenshot.
I will duplicate here some of the links from the list:

https://play.google.com/store/apps/details?id=com.vkontakte.android
https://play.google.com/store/apps/details?id=ru.ok.android
https://play.google.com/store/apps/details?id=com.outfit7.talkingtomgoldrun
https://play.google.com/store/apps/details?id=com.tapclap.piratetreasures2
https://play.google.com/store/apps/details?id=com.openmygame.games.android.wordpizza
https://play.google.com/store/apps/details?id=com.outfit7.mytalkingtomfriends
https://play.google.com/store/apps/details?id=com.hornet.android

Sample of my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.read_excel('ids.xlsx')
url = df

for urlibs in url:
    response = requests.get(urlibs)
    soup = BeautifulSoup(response.text, 'lxml')
    quotes = soup.find_all('h1', class_='AHFaub')
for quote in quotes:
    print(quote.text)

With my code I only get one line of output; how can I get the names for all of the links?


3 answer(s)
xotkot, 2022-01-18
@Fokinura_3000

Save the document as CSV and process it with any convenient tool.
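A minimal sketch of that approach using only the standard library's csv module (the CSV content here is a hypothetical stand-in for an exported ids.csv, which you would produce from ids.xlsx via "Save As" in Excel):

```python
import csv
import io

# Hypothetical contents of ids.csv: one URL per row, first column.
csv_data = """https://play.google.com/store/apps/details?id=com.vkontakte.android
https://play.google.com/store/apps/details?id=ru.ok.android
"""

# With a real file you would use: open("ids.csv", newline="") instead of StringIO.
urls = [row[0] for row in csv.reader(io.StringIO(csv_data)) if row]
print(urls)
```

Once you have a plain list of URLs, any of the request loops below will work on it.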

webster_r, 2022-01-19
@webster_r

# -*- coding: utf-8 -*-
import pandas as pd
import requests
from bs4 import BeautifulSoup

filename = 'ids.xlsx'
df = pd.read_excel(filename)
url = df.iloc[:, 0].tolist()  # Convert the zeroth column to a list

for urlibs in url:
    response = requests.get(urlibs)
    soup = BeautifulSoup(response.text, 'lxml')
    appname = soup.find('h1', class_='AHFaub').text
    print(appname)

Fast parsing option (multithreading)
# -*- coding: utf-8 -*-
import pandas as pd
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

filename = 'ids.xlsx'
df = pd.read_excel(filename)
urls = df.iloc[:, 0].tolist()


def get_app_name(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    appname = soup.find('h1', class_='AHFaub').text
    print(appname)

# The number of workers can be adjusted as you see fit
with ThreadPoolExecutor(max_workers=16) as executor:
    executor.map(get_app_name, urls)
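One detail worth knowing about executor.map: if the worker function returns a value instead of printing, the results come back in the same order as the input URLs, regardless of which thread finishes first. A sketch with a hypothetical fake_app_name stand-in (no network calls, unlike the real get_app_name above):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for get_app_name: derives a "name" from the URL
# itself so the example runs without network access.
def fake_app_name(url):
    return url.rsplit("=", 1)[-1]

urls = [
    "https://play.google.com/store/apps/details?id=com.vkontakte.android",
    "https://play.google.com/store/apps/details?id=ru.ok.android",
]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, even with concurrent execution
    names = list(executor.map(fake_app_name, urls))

print(names)
```

This makes it easy to write the names back into a new column of the original DataFrame.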

kapp1, 2022-01-18
@kapp1

You can't simply iterate over the DataFrame itself in a loop; you need to extract the cell values and pass each one to the request.
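A minimal illustration of this point, using an in-memory DataFrame as a hypothetical stand-in for the real ids.xlsx: iterating a DataFrame directly yields its column labels, not the rows, which is why the original code never requested the actual links.

```python
import pandas as pd

# Hypothetical data standing in for the contents of ids.xlsx
df = pd.DataFrame({"url": [
    "https://play.google.com/store/apps/details?id=com.vkontakte.android",
    "https://play.google.com/store/apps/details?id=ru.ok.android",
]})

# Iterating the DataFrame itself yields column labels:
print(list(df))          # the column names, not the links

# Extract the cell values first; these are what requests.get() needs:
urls = df["url"].tolist()
print(urls)
```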
