Scraping information from a PDF with PyPDF4 / BeautifulSoup?

Python

tropin, 2022-03-07 08:45:08

I'm just starting to learn Python for working with data. I came across an article where a PDF is downloaded using BeautifulSoup and the text is extracted using PyPDF4:

import read as read
import requests
from bs4 import BeautifulSoup
import io
from PyPDF4 import PdfFileReader
from urllib3 import response

url = "https://..."
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")

list_of_pdf = set()
l = soup.find ('p')
p = l.find_all('a')

for link in (p):
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)

def into(pdf_path): pdf_link = requests.get(pdf_path)

    with io.BytesIO(response.content) as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInto()
    number_of_pages = pdf.getNumPages()
    txt = (f"\n"
       f"Info: {pdf_path}\n"
       f"Author: {information.author}\n"
       f"Number of pages: {number_of_pages}\n")
    print(txt)
return information

for i in list_of_pdf:
    info(i)

PyCharm reports an error at line 34:

return information
^
SyntaxError: 'return' outside function

What's wrong with the code?
Thanks

1 answer(s)

AVKor, 2022-03-07
@AVKor

What's wrong with the code?

You have already been told what is wrong:

SyntaxError: 'return' outside function

Your return information statement is outside the function: it is not indented, so Python treats it as top-level code.
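
To illustrate the indentation rule, here is a minimal sketch that uses only the standard library. The two-line function body is hypothetical and has nothing to do with PDFs; it only shows that a return at column 0 is rejected at compile time, while the same line indented under def is accepted.

```python
# Hypothetical source snippets, used only to show the indentation rule.
broken_source = (
    "def info(path):\n"
    "    name = path.upper()\n"
    "return name\n"        # not indented, so it sits outside info()
)

fixed_source = (
    "def info(path):\n"
    "    name = path.upper()\n"
    "    return name\n"    # indented, so it belongs to info()
)


def check(source):
    """Compile the source and return the SyntaxError message, or "ok"."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as error:
        return error.msg
    return "ok"


print(check(broken_source))
print(check(fixed_source))
```

The first call reports the same message PyCharm shows in the question; the second compiles cleanly.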
There are a bunch of other bugs as well.
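
Judging from the code in the question, the other problems seem to be these: the function is defined as into but called as info; the download result is stored in pdf_link while the code reads response.content; getDocumentInto should be getDocumentInfo; and the lines import read as read and from urllib3 import response do not appear to be needed. Below is a sketch of a corrected version, not a definitive rewrite. The helper names to_pdf_link and collect_pdf_links are made up for this example, and the rule of cutting five characters from each href is copied from the question on the assumption that every link ends in a suffix such as ".html".

```python
import io


def to_pdf_link(href):
    """Apply the rule from the question: drop the last 5 characters, add '.pdf'."""
    # Assumption: every href ends in a 5-character suffix such as ".html".
    return href[:-5] + ".pdf"


def collect_pdf_links(page_url):
    """Return the set of PDF links found in the first <p> of the page."""
    # Third-party imports are local so to_pdf_link() works without them.
    import requests
    from bs4 import BeautifulSoup

    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, "html.parser")
    first_paragraph = soup.find("p")
    if first_paragraph is None:
        return set()
    # href=True skips <a> tags that have no href attribute.
    return {to_pdf_link(a["href"]) for a in first_paragraph.find_all("a", href=True)}


def info(pdf_path):
    """Download one PDF, print its author and page count, return its metadata."""
    import requests
    from PyPDF4 import PdfFileReader

    response = requests.get(pdf_path)  # the question stored this in pdf_link
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()  # was misspelled getDocumentInto
        number_of_pages = pdf.getNumPages()
        # getDocumentInfo() can return None when the PDF has no metadata.
        author = information.author if information else None
        txt = (f"\n"
               f"Info: {pdf_path}\n"
               f"Author: {author}\n"
               f"Number of pages: {number_of_pages}\n")
        print(txt)
    return information  # indented, so it is inside info()
```

With requests, beautifulsoup4 and PyPDF4 installed, the loop from the question becomes: for link in collect_pdf_links(url): info(link), where url is the page address used in the question.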