Python
Ilya Vegner, 2018-03-19 23:00:28

How to remove unwanted characters in a string?

I wrote a parser using Selenium and Tesseract.
When I try to display the image_link variable, I get unwanted characters along with the numbers. Is it possible to remove them and display only the numbers?

from selenium import webdriver
from time import sleep
from PIL import Image
from pytesseract import image_to_string

class Bot_dzen:
  def __init__(self):
    self.driver = webdriver.Firefox(executable_path='C:\\Users\\ilya_pc\\Documents\\gecko\\geckodriver.exe')
    self.navigate()

  def views_recon(self):
    image = Image.open('views.gif')
    image_link = image_to_string(image).split('@ ')
    views_dzen = int(image_link[0])
    views_dzen_2 = int(image_link[1])
    views_dzen_3 = int(image_link[2])

  def crop(self, location, size):
    image = Image.open('dzen_pars.png')

    x = location['x']
    y = location['y']
    width = size['width']
    height = size['height']

    image.crop((x, y, x+width, y+height)).save('views.gif')

    self.views_recon()

  def take_screen(self):
    self.driver.save_screenshot('dzen_pars.png')

  def navigate(self):
    self.driver.get('https://zen.yandex.ru/media/id/5a9d345c1aa80c262cd25c42/3-ujasnye-oshibki-v-otjimaniiah-meshaiuscie-rostu-grudi-5aa7c0739b403cd7a6cc68f4')
    views = self.driver.find_element_by_xpath('/html/body/article/div/div[2]/div')

    sleep(3)

    self.take_screen()

    location = views.location
    size = views.size

    self.crop(location, size)

def main():
  b = Bot_dzen()

if __name__ == '__main__':
  main()


3 answer(s)
val_vp, 2018-03-20
@val_vp

Ilya, good evening.
One question: is the purpose of the script to get the numbers from the image views.gif?
If not, you can get the desired numbers directly from the site, and then there will be no problem with "broken" characters.
If you still need to parse the image, there are a couple of options:
1) Does the code fail on the call views_dzen = int(image_link[0])? If so, try cropping a wider area horizontally in the crop method.
2) Regex: after image_link = image_to_string(image), try selecting the groups of digits (\d+) from image_link.
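The regex option can be sketched roughly as follows. This is only an illustration: the function name extract_numbers and the sample string are made up here, and the real OCR output of image_to_string may look different.

```python
import re

def extract_numbers(ocr_text):
    """Return every run of digits found in the OCR output as an int."""
    return [int(group) for group in re.findall(r"\d+", ocr_text)]

# Hypothetical OCR output with stray symbols between the counters.
sample = "12 @ 345 © 6789"
print(extract_numbers(sample))  # [12, 345, 6789]
```

In views_recon this would replace the split('@ ') call, for example views = extract_numbers(image_to_string(image)). Note that a number recognized with a space inside it, such as "1 234", would come back as two separate values.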

Ilya Vegner, 2018-03-20
@jKEeY

The views there are loaded by JS, if I'm not mistaken, but I don't know how to work with Selenium and JS together. Could you tell me how?

Mikhail Sisin, 2018-03-21
@JabbaHotep

Why make it so difficult and so hard on both yourself and Yandex? It is cheaper to take the data from here:
https://zen.yandex.ru/media-api/publication-view-s...
without Selenium, using urllib2 for example.
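A minimal sketch of that approach, using urllib.request (the Python 3 counterpart of urllib2). The helper name fetch_json is hypothetical, and it assumes the endpoint returns JSON; the address above is truncated, so the full URL has to be supplied by the caller and the structure of the response is not known here.

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url, timeout=10):
    """Download `url` and decode the response body as JSON."""
    # Some servers reject requests that carry no User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_json with the full media-api address should return the decoded response, from which the view counters can be read once its actual field names are known.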
