Python
David It, 2022-03-02 10:04:39

How to get all links on a website page?

How can I get all the links on a web page using the command line? Or does it have to be done in Python, with no other options? If that's the case (which would not be ideal), how can this be organized in Python?

1 answer
Vladimir Kuts, 2022-03-02
@David138

From the command line:

curl -s https://yandex.ru | grep -oE 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//' | sort -u


# //yandex.ru/opensearch.xml
# //yastatic.net/jquery/2.1.4/jquery.min.js
# https://afisha.yandex.ru/rostov-na-donu/cinema/cyrano-2022?utm_source=yamain&utm_medium=yamain_afisha_kp
# https://afisha.yandex.ru/rostov-na-donu/cinema/dog-2021?utm_source=yamain&utm_medium=yamain_afisha_kp
# https://afisha.yandex.ru/rostov-na-donu/cinema/kroletsyp-i-khomiak-tmy?utm_source=yamain&utm_medium=yamain_afisha_kp
...
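
Note that this regex-based approach is only a quick approximation: it picks up href attributes anywhere in the document (including scripts and inline templates) and misses single-quoted or unquoted attribute values, which HTML allows. For anything beyond a one-off check, a real HTML parser is more reliable, as in the Python version below.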

In Python:
import io

import requests
from lxml import etree

# Download the page and parse it with lxml's tolerant HTML parser.
data = requests.get('https://yandex.ru').text
parser = etree.HTMLParser()
tree = etree.parse(io.StringIO(data), parser)

# Select only anchors that actually carry an href attribute,
# so we don't print None for bare <a> tags.
for link in tree.xpath('//a[@href]'):
    print(link.get('href'))
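
The output of both approaches mixes absolute URLs with relative and protocol-relative ones (e.g. //yastatic.net/...). If you need absolute links, here is a minimal sketch of one way to resolve them, building on the same lxml approach with urllib.parse.urljoin:

import io
from urllib.parse import urljoin

import requests
from lxml import etree

base = 'https://yandex.ru'
data = requests.get(base).text
tree = etree.parse(io.StringIO(data), etree.HTMLParser())

# urljoin resolves relative and protocol-relative hrefs against the
# page URL; the set removes duplicates, much like sort -u above.
links = sorted({urljoin(base, href) for href in tree.xpath('//a/@href')})
for link in links:
    print(link)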
