Scraping information from PDF with PyPDF4 / BeautifulSoup?
I'm just starting to learn Python so that I can work with data. I came across an article where PDFs are downloaded using BeautifulSoup and the text is extracted using PyPDF4. This is my code:
import read as read
import requests
from bs4 import BeautifulSoup
import io
from PyPDF4 import PdfFileReader
from urllib3 import response

url = "https://..."
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")
list_of_pdf = set()
l = soup.find('p')
p = l.find_all('a')
for link in (p):
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)

def info(pdf_path): pdf_link = requests.get(pdf_path)
with io.BytesIO(response.content) as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()
    txt = (f"\n"
           f"Info: {pdf_path}\n"
           f"Author: {information.author}\n"
           f"Number of pages: {number_of_pages}\n")
    print(txt)
    return information

for i in list_of_pdf:
    info(i)
    return information
    ^
SyntaxError: 'return' outside function
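
Python raises this error when a `return` statement is not part of a function body. In the code as it appears in the question, the first statement sits on the same line as `def ...:`, so the function ends on that line and the `with` block that follows runs at module level. That is one likely cause, though the original indentation cannot be confirmed from the post. The sketch below reproduces the error with two hypothetical, simplified snippets that differ only in layout. It uses the built-in `compile()`, which checks syntax without running anything, so nothing is downloaded. The names `BROKEN`, `FIXED` and `syntax_error_message` are made up for this illustration.

```python
# Two hypothetical snippets that differ only in layout.
# In BROKEN the function body ends on the def line, so the
# later `return` is at module level.
BROKEN = '''
def info(pdf_path): pdf_link = pdf_path
information = pdf_link
return information
'''

# In FIXED every statement, including `return`, is indented under the def.
FIXED = '''
def info(pdf_path):
    pdf_link = pdf_path
    information = pdf_link
    return information
'''


def syntax_error_message(source):
    """Return the SyntaxError message for source, or None if it compiles."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc.msg
    return None


print(syntax_error_message(BROKEN))  # the 'return' outside function message
print(syntax_error_message(FIXED))   # None, the snippet compiles
```

Applied to the question, this means putting `pdf_link = requests.get(pdf_path)` on its own indented line under the `def`, and indenting the `with` block and the `return` to the same level. Once the syntax error is gone, the next line likely to fail is `io.BytesIO(response.content)`: `response` there is the name imported from `urllib3`, not the downloaded result stored in `pdf_link`.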