Python
Dane4kaa, 2018-12-19 18:03:27

Page parsing task: how do I structure it as a loop/recursion with try/except?

Please help a newbie figure this out. The task:
Given a page (Wikipedia in our case), parse it and extract all the links, then follow the collected links and extract all the links from those pages as well. I was advised to use recursion, with a recursion depth of 3. Finally, from all the collected links, select those ending in '.png' and write them to a file.
I only managed to collect and filter everything from the first page; it doesn't work with either recursion or a loop. I keep getting either a ConnectionError or a MemoryError. I understand that I need to add try/except, but at this point I'm completely confused.
Thanks in advance!

from bs4 import BeautifulSoup, SoupStrainer
import requests

class Links:
    def get_urls(self, level: int) -> []:
        urls = []
        try:
            links_1 = []
            start_link = "https://ru.wikipedia.org/"
            links_1.append(start_link)
            for i in links_1:
                response = requests.get(i)
                soup = BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer(['a', 'img']))
                full_list = [link['href'] for link in soup if link.get('href')] + [img['src'] for img in soup if img.get('src')]
                full_list = list(set(full_list))
                for url in full_list:
                    if not url.startswith('https:/'):
                        if url.startswith('/'):
                            if url.find('.org') == -1:
                                url = start_link + url[1:]
                                full_list.append(url)
                            elif url.find('.org'):
                                url = 'https:' + url
                                full_list.append(url)
                        elif url.startswith('//'):
                            url = start_link + url[2:]
                            full_list.append(url)
                        else:
                            pass
                    elif url.startswith('https:/'):
                        full_list.append(url)
                        urls.append(full_list)
                self.get_urls(level - 1)
                links_1 = full_list
                links_1 = list(set(links_1))
                return links_1
        except MemoryError as e:
            print(e)

        return urls


links = Links()
list_links = links.get_urls(level=3)
#with open('text.txt', 'w') as f:
#    for x in list_links:
#        if x.endswith('.png'):
#            f.write('%s\n' % x)

2 answers
Dmitry Shitskov, 2018-12-19
@Zarom

  1. Nowhere in the code is there a check that level has reached 0, so the recursion is infinite (which is why you run into a MemoryError).
  2. self.get_urls(level - 1) is called, but the list of found URLs it returns is never used (see the sketch below).
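Roughly, those two fixes might look like this (a sketch only; extract_links here is an illustrative helper I've introduced, not part of the original code):

import requests
from bs4 import BeautifulSoup


def extract_links(url: str) -> list:
    """Hypothetical helper: fetch one page and return the absolute URLs it links to."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    found = []
    for tag in soup.find_all(['a', 'img']):
        link = tag.get('href') or tag.get('src')
        if link and link.startswith('http'):   # keep only absolute URLs for simplicity
            found.append(link)
    return found


def get_urls(url: str, level: int) -> list:
    if level == 0:                             # fix 1: stop when the depth budget is spent
        return []
    try:
        links = extract_links(url)
    except requests.RequestException as e:     # network errors shouldn't kill the whole crawl
        print(e)
        return []
    collected = list(links)
    for link in links:
        collected += get_urls(link, level - 1)  # fix 2: actually use the returned list
    return collected

Calling get_urls("https://ru.wikipedia.org/", level=3) then returns the flat list that the commented-out file-writing code at the end of the question expects.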

Roman Kitaev, 2018-12-20
@deliro

"I was advised to use recursion, with a recursion depth of 3"

In human terms, this means that from the original page you go at most three levels deep.
What you need to do:
1. Get rid of duplicates. Keep the URLs you have already collected links from in an ordinary set(), so you don't waste time revisiting them.
2. Change the Links class. It should take one URL (where to look for links) and the current level. The current level is needed to stop the process at level 2 (counting from zero).
3. An instance of Links(url="something", level=0) will spawn other Links(url="something-else", level=1) instances and get back a list of links from each. Accordingly, if self.level == 2, we do not parse the found links but simply return them (a sketch of this structure follows below).
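A rough sketch of that structure, under the stated assumptions; the collect() method, the shared visited set, and MAX_LEVEL are names I've made up for illustration, not from any library:

import requests
from bs4 import BeautifulSoup

MAX_LEVEL = 2          # levels 0, 1, 2 -> a depth of three
visited = set()        # shared across instances to avoid re-crawling the same page


class Links:
    def __init__(self, url: str, level: int):
        self.url = url
        self.level = level

    def collect(self) -> list:
        """Return the links found on self.url and, recursively, on their pages."""
        if self.url in visited:
            return []
        visited.add(self.url)
        try:
            response = requests.get(self.url, timeout=10)
        except requests.RequestException as e:
            print(e)
            return []
        soup = BeautifulSoup(response.content, "html.parser")
        found = []
        for tag in soup.find_all(['a', 'img']):
            link = tag.get('href') or tag.get('src')
            if link and link.startswith('http'):
                found.append(link)
        if self.level == MAX_LEVEL:      # deepest level: hand the links back without parsing them
            return found
        collected = list(found)
        for link in found:
            collected += Links(url=link, level=self.level + 1).collect()
        return collected


if __name__ == '__main__':
    all_links = Links(url="https://ru.wikipedia.org/", level=0).collect()
    with open('text.txt', 'w') as f:
        for link in set(all_links):
            if link.endswith('.png'):
                f.write(link + '\n')

Keeping visited at module level is just one simple way to share state between instances; passing the set into each Links explicitly would work just as well.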
