Python
hey_umbrella, 2020-09-16 14:18:09

How to parse a .doc file in Python?

I need to parse a .doc file that contains the school schedule, for a Telegram bot. The question is: how do I get the link to the file from the page?


1 answer(s)
PavelMos, 2020-09-16

You can use regular expressions. The expression below matches a path that starts with /diff/, followed by a numeric combination such as 16-09, up to the first "doc":
https://regex101.com/r/XLJ1t4/1

import re
import urllib.request  # "import urllib" alone does not load the request submodule

# matches paths such as /diff/16-09.doc
regexp1 = r'(/diff/\d{1,2}-\d{1,2}\.doc)'
f = urllib.request.urlopen('http://1311.ru/info/info.php')  # opens the URL, returns an HTTP response object (not text)
b = f.read()  # reads the response body as bytes
text = b.decode()  # converts bytes to text; UTF-8 is the default encoding, so the argument can be omitted
out = re.findall(regexp1, text)
# then, knowing the site address, build the full links
for i in out:
    print("http://1311.ru" + i)

Output:
http://1311.ru/diff/16-09.doc
http://1311.ru/diff/17-09.doc

But you probably need the newest schedule here, so the links have to be sorted by date (splitting out the day and the month), or you have to check the date of the file on the server in some other way.
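Both options could look roughly like the sketch below. It assumes the file names are in DD-MM form (16-09 read as 16 September) and that all links belong to the same year, which the page itself does not confirm. The helper names `link_date`, `newest_link` and `server_last_modified` are made up for illustration, and the third sample link is hypothetical; it is there only to show why a plain string sort would not be enough.

```python
import re
import urllib.request
from datetime import date
from email.utils import parsedate_to_datetime

LINK_RE = re.compile(r'/diff/(\d{1,2})-(\d{1,2})\.doc')


def link_date(path, year):
    """Return the date encoded in a path like /diff/16-09.doc (assumed DD-MM)."""
    m = LINK_RE.search(path)
    if m is None:
        raise ValueError(f"unexpected link format: {path!r}")
    day, month = int(m.group(1)), int(m.group(2))
    return date(year, month, day)


def newest_link(paths, year):
    """Pick the path with the latest date; assumes all links are from one year."""
    return max(paths, key=lambda p: link_date(p, year))


def server_last_modified(url):
    """Alternative: ask the server for the Last-Modified header via a HEAD request.

    Returns a datetime, or None if the server does not send that header.
    """
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req) as resp:
        value = resp.headers.get('Last-Modified')
    return parsedate_to_datetime(value) if value else None


# The first two paths come from the output above; the third is a hypothetical sample.
links = ['/diff/16-09.doc', '/diff/17-09.doc', '/diff/2-10.doc']
print("http://1311.ru" + newest_link(links, 2020))  # http://1311.ru/diff/2-10.doc
```

Sorting by the file name breaks around New Year (01-01 would sort before 31-12 if both are given the same year), so for that case the Last-Modified check may be the more reliable option, provided the server actually sends the header.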
