Python
David It, 2022-03-02 10:04:39

How to get all links on a website page?

How can I get all the links on a web page using the command line? Or does it have to be done in Python, with no other options? If that's the case (which would not be ideal), how can this be organized in Python?

1 answer
Vladimir Kuts, 2022-03-02
@David138

From the command line:

curl -s https://yandex.ru | grep -oE 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//' | sort -u


# //yandex.ru/opensearch.xml
# //yastatic.net/jquery/2.1.4/jquery.min.js
# https://afisha.yandex.ru/rostov-na-donu/cinema/cyrano-2022?utm_source=yamain&utm_medium=yamain_afisha_kp
# https://afisha.yandex.ru/rostov-na-donu/cinema/dog-2021?utm_source=yamain&utm_medium=yamain_afisha_kp
# https://afisha.yandex.ru/rostov-na-donu/cinema/kroletsyp-i-khomiak-tmy?utm_source=yamain&utm_medium=yamain_afisha_kp
...
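
Note that this regex-based approach is only a quick approximation: it picks up href attributes anywhere in the document (including scripts and inline templates) and misses single-quoted or unquoted attribute values, which HTML allows. For anything beyond a one-off check, a real HTML parser is more reliable, as in the Python version below.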

In Python:
import io

import requests
from lxml import etree

# Download the page and parse it with lxml's tolerant HTML parser.
data = requests.get('https://yandex.ru').text
parser = etree.HTMLParser()
tree = etree.parse(io.StringIO(data), parser)

# Select only anchors that actually carry an href attribute,
# so we don't print None for bare <a> tags.
for link in tree.xpath('//a[@href]'):
    print(link.get('href'))
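
The output of both approaches mixes absolute URLs with relative and protocol-relative ones (e.g. //yastatic.net/...). If you need absolute links, here is a minimal sketch of one way to resolve them, building on the same lxml approach with urllib.parse.urljoin:

import io
from urllib.parse import urljoin

import requests
from lxml import etree

base = 'https://yandex.ru'
data = requests.get(base).text
tree = etree.parse(io.StringIO(data), etree.HTMLParser())

# urljoin resolves relative and protocol-relative hrefs against the
# page URL; the set removes duplicates, much like sort -u above.
links = sorted({urljoin(base, href) for href in tree.xpath('//a/@href')})
for link in links:
    print(link)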
