Python
tropin, 2022-03-07 08:45:08

Scraping information from PDFs with PyPDF4 / BeautifulSoup?

I'm just starting to learn Python in order to work with data. I came across an article where PDFs are downloaded with the help of BeautifulSoup and text is extracted using PyPDF4.



import read as read
import requests
from bs4 import BeautifulSoup
import io
from PyPDF4 import PdfFileReader
from urllib3 import response

url = "https://..."
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")

list_of_pdf = set()
l = soup.find ('p')
p = l.find_all('a')

for link in (p):
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)

def into(pdf_path):
    pdf_link = requests.get(pdf_path)

    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInto()
        number_of_pages = pdf.getNumPages()
        txt = (f"\n"
               f"Info: {pdf_path}\n"
               f"Author: {information.author}\n"
               f"Number of pages: {number_of_pages}\n")
        print(txt)
return information

for i in list_of_pdf:
    info(i)


PyCharm complains about line 34:

return information


return information
^
SyntaxError: 'return' outside function


What's wrong with the code?
Thanks


1 answer(s)
AVKor, 2022-03-07
@AVKor

What's wrong with the code?

You have already been told what is wrong:
SyntaxError: 'return' outside function

Your "return information" statement is outside the function.
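
To see why the indentation matters, here is a small sketch that uses only the built-in compile(). The two source strings are made-up stand-ins for the function in the question, not the original code: in the first one the return sits at column 0, in the second it is indented into the function body.

```python
# Two tiny source strings, compiled with the built-in compile() so that
# the SyntaxError can be caught instead of stopping the script.
bad_source = (
    "def info(pdf_path):\n"
    "    information = pdf_path\n"
    "return information\n"      # column 0: no longer part of info()
)
good_source = (
    "def info(pdf_path):\n"
    "    information = pdf_path\n"
    "    return information\n"  # indented: belongs to info()
)


def check(source):
    """Return 'ok' if the source compiles, else the SyntaxError message."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as error:
        return error.msg
    return "ok"


bad_result = check(bad_source)
good_result = check(good_source)
print(bad_result)   # the same message PyCharm reports
print(good_result)
```

Python decides what belongs to a function purely by indentation, so the first line that returns to column 0 ends the function body.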
And there are a bunch of other bugs.
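
For reference, one possible corrected version of the function might look like the sketch below. It is not a definitive fix: it assumes PyPDF4 exposes PdfFileReader, getDocumentInfo() and getNumPages() as used in the question, and format_summary is a hypothetical helper name introduced here only to keep the text formatting separate from the download.

```python
import io


def format_summary(pdf_path, author, number_of_pages):
    """Build the text block that the question prints for each PDF."""
    return (f"\n"
            f"Info: {pdf_path}\n"
            f"Author: {author}\n"
            f"Number of pages: {number_of_pages}\n")


def info(pdf_path):
    """Download one PDF, print a short summary and return its metadata."""
    # Third-party imports are kept local in this sketch so that
    # format_summary() can be tried without these packages installed.
    import requests
    from PyPDF4 import PdfFileReader

    # Store the response under the same name that is used two lines below
    # (the question assigned to pdf_link but then read response.content).
    response = requests.get(pdf_path)
    response.raise_for_status()

    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()   # "Info", not "Into"
        number_of_pages = pdf.getNumPages()
        # Some PDFs carry no metadata at all, so guard against None.
        author = information.author if information is not None else None
        print(format_summary(pdf_path, author, number_of_pages))

    # Indented to the function level, so it is inside info().
    return information
```

With this in place, the loop from the question (for i in list_of_pdf: info(i)) calls a function that actually exists under that name. The lines "import read as read" and "from urllib3 import response" do not appear to be needed and can probably be removed.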
