How do I match across a newline using the re library?

Andrey P, 2017-05-01 16:51:20

Python

Good evening!
I am writing a simple parser to display some fields from the page of an online store. For example, like this:

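import re

rx_image = r'class="jshop_img (.*)" src="(.*)" alt='
image = re.compile(rx_image)
# page is an iterable over the lines of the downloaded HTML
for line in page:
    img_obj = image.search(line)
    if img_obj:
        img_item = img_obj.group(2)  # group 2 is the src attribute
        print("Picture:    ", img_item)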

The problem is that in some places in the HTML code there is a newline and re does not find a match.
That is, if the code is like this:

<img class="jshop_img second-image" src="/components/com_jshopping/files/img_products/thumb_________________________1_.jpg" alt="">

Then everything works as it should. But if the code looks like this:

<div class="name">
            <a href="/component/jshopping/product/view/97/334?Itemid=101">Коктейль молочный малый</a>
                      </div>

Then it doesn't find the string. If I search only for the part after the line break, for example:

rx_name = r'<a href="(.*)">(.*)</a>'

then the match works, but I also get unwanted lines. How can I get around this limitation? I tried writing:

image = re.compile(rx_image, re.DOTALL)

but the result did not change.

3 answer(s)

Andrey P, 2017-05-02
@PanchAS

In the end, I decided not to bother and just piled on workarounds.

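import re

rx_name_f = r'<div class="name">'
rx_name = r'<a href=.*>(.*)</a>'

name = re.compile(rx_name)
name_f = re.compile(rx_name_f)
i = False  # True when the previous line opened <div class="name">
# page is an iterable over the lines of the downloaded HTML
for line in page:
    name_obj = name.search(line)
    namef_obj = name_f.search(line)

    if i and name_obj:
        # The link directly follows the name div, so it is a product name
        name_item = name_obj.group(1)
        print("Name:", name_item)
        i = False
    else:
        i = False
    if namef_obj:
        i = True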

Thanks everyone for the replies!

qlkvg, 2017-05-02
@qlkvg

Classic
Don't hammer nails with a microscope. Spend a day learning how to parse HTML with lxml or BeautifulSoup and you'll be glad you did.
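To give a rough idea of what the parser approach looks like, here is a minimal sketch. It uses the standard library's html.parser as a stand-in for lxml or BeautifulSoup, which have to be installed separately and offer a much more convenient API. The sample HTML, the shortened image path, the product name and the ShopParser class are assumptions made up for illustration, not taken from the real shop page.

```python
from html.parser import HTMLParser

# Hypothetical sample standing in for the downloaded shop page.
page_text = '''<img class="jshop_img second-image" src="/files/img_products/thumb_1.jpg" alt="">
<div class="name">
            <a href="/component/jshopping/product/view/97/334?Itemid=101">Small milkshake</a>
                      </div>
'''


class ShopParser(HTMLParser):
    """Collects image sources and product names; line breaks do not matter."""

    def __init__(self):
        super().__init__()
        self.images = []  # src of every <img> whose class list has jshop_img
        self.names = []   # text of each <a> found inside <div class="name">
        self._in_name_div = False
        self._in_name_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "img" and "jshop_img" in classes:
            self.images.append(attrs.get("src"))
        elif tag == "div" and "name" in classes:
            self._in_name_div = True
        elif tag == "a" and self._in_name_div:
            self._in_name_link = True
            self.names.append("")

    def handle_endtag(self, tag):
        # Simplification: any closing </div> ends the name block,
        # so nested divs inside it are not handled.
        if tag == "a":
            self._in_name_link = False
        elif tag == "div":
            self._in_name_div = False

    def handle_data(self, data):
        if self._in_name_link:
            self.names[-1] += data


parser = ShopParser()
parser.feed(page_text)

for src in parser.images:
    print("Picture:", src)
for product_name in parser.names:
    print("Name:", product_name)
```

Because the parser works on tags rather than lines, the newline between the div and the link stops being a problem at all.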

lega, 2017-05-02
@lega

You can preprocess the HTML first (remove newlines and so on). Instead of splitting the page into lines, you can use re.finditer on the whole text. You can also get all the img tags first and then filter them by class manually. I once successfully used a homegrown tool for this, and something similar would work for you.

qlkvg, the point is that full parsing is not always needed; sometimes you just need to pull a couple of words out of the HTML. In my case, regex worked fine and was 100-1000 times faster than lxml (and its equivalents), since only 1% of the document had to be processed rather than parsing the whole thing.
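To make the re.finditer suggestion concrete, here is a minimal sketch. It assumes the whole page is read into one string instead of being iterated line by line: when each line is searched separately, the pattern never sees the text after the newline, so adding re.DOTALL changes nothing. The sample HTML, the shortened image path, the product name and the variable names page_text, names and images are assumptions for illustration.

```python
import re

# Hypothetical sample standing in for the downloaded shop page,
# held as ONE string rather than a list of lines.
page_text = '''<img class="jshop_img second-image" src="/files/img_products/thumb_1.jpg" alt="">
<div class="name">
            <a href="/component/jshopping/product/view/97/334?Itemid=101">Small milkshake</a>
                      </div>
'''

# \s* also matches newlines, so the gap between <div> and <a> is covered.
# [^"]* and the non-greedy (.*?) keep each match inside a single tag.
rx_name = re.compile(
    r'<div class="name">\s*<a href="([^"]*)">(.*?)</a>', re.DOTALL
)
rx_image = re.compile(r'class="jshop_img[^"]*"\s+src="([^"]*)"')

# finditer scans the whole string and yields every non-overlapping match.
names = [(m.group(1), m.group(2)) for m in rx_name.finditer(page_text)]
images = [m.group(1) for m in rx_image.finditer(page_text)]

for href, product_name in names:
    print("Name:", product_name, "->", href)
for src in images:
    print("Picture:", src)
```

Anchoring the name pattern on the div means unrelated links elsewhere on the page are not picked up, which was the problem with searching for the a tag alone.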