Why does the parser with a regular expression let through links that contain a colon?

Python

Web__Nikita03, 2019-08-20 19:28:45

I am parsing Wikipedia and extracting all links to other pages that start with /wiki/. A link must not contain the : character. Here is my code:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re

site = urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon')
soup = bs(site, features='html.parser')

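# Raw string so that \/ and \w are passed to the regex engine unchanged (the pattern itself is as posted)
for i in soup.find('div', {'id': 'bodyContent'}).findAll('a', {'href': re.compile(r'\/wiki\/(?!:)[\w\/()%]+')}):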
    if 'href' in i.attrs:
        print(i.attrs['href'])

In theory, my regex shouldn't let through links like /wiki/Category:Wikipedia_articles_with_NKC_identifiers ( https://www.regextester.com/?fam=110956 ), but here is my output: https://pastebin.com/Nzad6ZUG . Those links are in it. Can you explain why?

1 answer(s)

dodo512, 2019-08-20
@Web__Nikita03

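\/wiki\/(?!:) rejects only /wiki/: itself, because the lookahead checks just the one character right after /wiki/.
\/wiki\/(?!.*:) also rejects /wiki/Category:, because the lookahead looks for a colon anywhere after /wiki/.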
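
The difference between the two lookaheads can be checked without downloading the page. The snippet below is only a sketch: the hrefs list holds illustrative sample values (not the real page output), and the helper name matching is made up for this example. It uses pattern.search(), which, as far as I know, is also how BeautifulSoup applies a compiled regex to an attribute value.

```python
import re

# Pattern from the question: the lookahead only checks the single
# character that comes right after "/wiki/".
first_char_only = re.compile(r'/wiki/(?!:)[\w/()%]+')

# Pattern from the answer: the lookahead scans the rest of the string
# and fails if a colon appears anywhere after "/wiki/".
no_colon_anywhere = re.compile(r'/wiki/(?!.*:)[\w/()%]+')

# Illustrative sample hrefs (assumed values, not taken from the pastebin output).
hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:Wikipedia_articles_with_NKC_identifiers',
    '/wiki/:',
]


def matching(pattern, links):
    """Return the links in which the pattern is found (re.search semantics)."""
    return [link for link in links if pattern.search(link)]


question_result = matching(first_char_only, hrefs)
answer_result = matching(no_colon_anywhere, hrefs)

print(question_result)  # the Category: link is still included
print(answer_result)    # only the link without a colon remains
```

With the first pattern the Category: link still matches, because search() is satisfied by the prefix /wiki/Category and never has to consume the colon. Swapping the second pattern into re.compile(...) in the original loop should filter those links out.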