Parsing
ID-ZONE, 2020-08-30 16:31:47

Why are only 200 items displayed?

I wrote a parser in Python using the bs4 library; here is the code:

from bs4 import BeautifulSoup
import requests

from selenium import webdriver
import sqlite3


def url_generator():
    # Listing pages 2 through 25 of the electronics category
    home_links = []
    for ino in range(2, 26):
        home_links.append(f"https://www.olx.ua/elektronika/?search%5Bad_homepage_to%3Afrom%5D=2020-08-25&page={ino}")
    return home_links


# slower: drives a real browser via Selenium
def get_links_by_selenium():
    list_links = []
    driver = webdriver.Chrome(
           r'C:\Users\tester\Documents\mypythonproject\myparrsers\chromedriver.exe')
    driver.get(
           "https://www.olx.ua/elektronika/?search%5Bad_homepage_to%3Afrom%5D=2020-08-25&page=2")
    driver.find_element_by_xpath(
           '//button[@class="cookie-close abs cookiesBarClose"]').click()


    for url in url_generator():
        driver.get(url)

        for link in driver.find_elements_by_xpath('//a[@class="marginright5 link linkWithHash detailsLink"]'):
            list_links.append(link.get_attribute('href'))

    return list_links

# faster: plain HTTP requests + BeautifulSoup
def get_links_by_beautifulsoup():
    list_links = []

    for url in url_generator():
        html = requests.get(url).text
        soup = BeautifulSoup(html, 'lxml')

        for link in soup.find_all('a', {'class': 'marginright5 link linkWithHash detailsLink'}):
            list_links.append(link['href'])
            print(len(list_links))

    return list_links


def get_content_from_page():
    nam = 0

    for link in get_links_by_beautifulsoup():
        page_html = requests.get(link).text
        page_soup = BeautifulSoup(page_html, 'lxml')

        try:
            # .find() returns None when the tag is missing, so .text
            # raises AttributeError and the ad is skipped silently
            price = page_soup.find('strong', {'class': 'pricelabel__value arranged'}).text
            name = page_soup.find('h1').text.strip()
        except AttributeError:
            pass
        else:
            nam += 1
            print(f'{nam}.) {name} | {price} | {link}')
            yield name, price, link

def save_content_in_db():
    db = sqlite3.connect('links.db')
    sql = db.cursor()
    for n, p, l in get_content_from_page():
        sql.execute('INSERT into commodity (name, prise, link) values (?, ?, ?)', (n, p, l))
        db.commit()
    db.close()

def main():
    save_content_in_db()
    print('finished')

if __name__ == '__main__':
    main()

There is just one problem: the code collects about 900 links, but only roughly 200 of them end up parsed.
I tried to figure out why, but couldn't.

sorry for the code without comments :3
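One thing worth ruling out first (a hypothetical check, not specific to olx.ua's markup): paginated listings often repeat promoted ads across pages, so the ~900 collected hrefs may contain duplicates. Deduplicating while preserving order shows how many distinct ads are actually being fetched:

```python
def unique_links(links):
    """Drop duplicate hrefs while keeping the original order."""
    # dict.fromkeys keeps the first occurrence of each key, in order
    return list(dict.fromkeys(links))


# Example: ads repeated on several listing pages collapse to one entry
links = ["/ad/1", "/ad/2", "/ad/1", "/ad/3", "/ad/2"]
print(unique_links(links))  # ['/ad/1', '/ad/2', '/ad/3']
```

Calling `unique_links(get_links_by_beautifulsoup())` before parsing would tell you whether the gap between 900 and 200 is partly just repeats.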



1 answer
Ruslan., 2020-08-30
@ID-ZONE

It is possible that the rest of the links fail with an AttributeError: `soup.find()` returns None when a tag is missing, so accessing `.text` on it raises, and your `except AttributeError: pass` swallows those ads silently.
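A quick way to test that hypothesis is to count the failures instead of `pass`-ing over them. A minimal stdlib-only sketch of the same try/except/else pattern (the dicts stand in for parsed pages; the hypothetical `find_tag` mimics `soup.find`, which returns None for a missing tag, so chaining `.strip()` raises AttributeError):

```python
def find_tag(ad, key):
    """Stand-in for soup.find(): returns the value, or None when missing."""
    return ad.get(key)


def parse(ads):
    parsed, failed = [], []
    for ad in ads:
        try:
            # None.strip() raises AttributeError, just like
            # page_soup.find(...).text does when the tag is absent
            price = find_tag(ad, "price").strip()
            name = find_tag(ad, "name").strip()
        except AttributeError:
            failed.append(ad)  # log it instead of `pass`
        else:
            parsed.append((name, price))
    return parsed, failed


ads = [
    {"name": "phone", "price": "100 uah"},
    {"name": "tv"},                        # no price -> skipped
    {"name": "laptop", "price": "900 uah"},
]
parsed, failed = parse(ads)
print(len(parsed), len(failed))  # 2 1
```

In the real parser, appending `link` to a `failed` list in the `except` branch and printing `len(failed)` at the end would show exactly how many of the 900 links are being dropped, and which ones to inspect by hand.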
