
Advanced RAG techniques: PDF processing, the best tools for extracting text, tables, and images

nanyue 2024-10-16 11:01:43 Technical articles

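A great deal of information lives in text documents such as PDFs. Processing PDFs can be particularly challenging, especially when they contain tables and images.

If you use a unimodal language model, you probably already know that it cannot directly interpret or "read" a document. It can handle only one type of input, for example text only or images only. If you need to analyze images or infographics in a PDF for a downstream task such as question answering, you will typically use a dedicated package to parse the document. These tools convert the document, along with its images and tables, into a text format the model can understand and analyze.

There are several good tools for parsing PDF documents for downstream tasks. In this article we look at some of them: PyPDF, the Adobe PDF Services API, Camelot, and Tabula.

First, we install the relevant libraries: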

!pip install pdfservices-sdk
!pip install openpyxl
!pip install camelot-py
!pip install opencv-python
!pip install tabula-py==2.9.0
!pip install jpype1
!pip install langchain
!pip install langchain-core==0.1.40
!pip install pypdf

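Extracting text, tables, and images with PyPDF

PyPDF is a general-purpose library for parsing PDF documents. It can parse a document, including its tables, into plain text. In most cases the layout of the tables is also reasonably well preserved when parsing with PyPDF.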


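Parsing text and tables

LangChain's document_loaders module contains many loaders for reading various file formats, including one based on PyPDF. The following script processes documents with PyPDF and saves the result as a DataFrame: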

import os
from datetime import datetime
from pathlib import Path

import pandas as pd
from langchain_community.document_loaders import PyPDFLoader

def extract_text_from_file(df, file_path):
    file_name = file_path.split("/")[-1]
    file_type = file_name.split(".")[-1]
    if file_type == "pdf":
        loader = PyPDFLoader(file_path)
    else:
        return df

    text = ""
    pages = loader.load_and_split()
    for page in pages:
        text += page.page_content
    # Create a new df and concatenate
    new_row = pd.DataFrame({"file": [file_name], "text": [text]})
    df = pd.concat([df, new_row], ignore_index=True)
    
    return df

#Apply the function:
folder_path = '../data/raw'
pathlist = Path(folder_path).glob('*.pdf')
filenames = []
for file_path in pathlist:
    filename = os.path.basename(file_path)
    filenames.append(filename)


df = pd.DataFrame()
for filename in filenames:
    file_path = folder_path + "/" + filename
    file_name = os.path.basename(file_path)
    print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} process {file_name}")
    # Initialize an empty df
    df_file = pd.DataFrame(columns=["file", "text"])
    print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} extract text")
    try:
        df_file = extract_text_from_file(df_file, file_path)
    except Exception as e:
        print("----Error: cannot extract text")
        print(f"----error: {e}")
    df = pd.concat([df, df_file])
    
df

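You can also process each page separately, for example if you want to run a downstream question-answering task on each chunk or page. In that case, modify the script as follows: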

from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader

def extract_text_from_file(df, file_path):
    file_name = file_path.split("/")[-1]
    file_type = file_name.split(".")[-1]
    if file_type == "pdf":
        loader = PyPDFLoader(file_path)
    elif file_type == "docx":
        loader = Docx2txtLoader(file_path)
    else:
        return df

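    # load() returns one Document per PDF page; load_and_split() would also
    # re-chunk the text, so the counter below would no longer be a page number
    pages = loader.load()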
    for page_number, page in enumerate(pages, start=1):
        # Each page's text is added as a new row in the DataFrame
        new_row = pd.DataFrame({
            "file": [file_name],
            "page_number": [page_number],
            "text": [page.page_content]
        })
        df = pd.concat([df, new_row], ignore_index=True)
    
    return df

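Extracting images

Each page of a PDF document can contain any number of images. Did you know that you can use PyPDF to extract all the images from a document?

The following code block extracts all images from a PDF file and creates a new folder to store them: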

from pypdf import PdfReader
import os

output_directory = '../data/processed/images/image_pypdf'
# makedirs also creates missing parent folders and tolerates an existing one
os.makedirs(output_directory, exist_ok=True)

reader = PdfReader("../data/raw/GPTsareGPTs.pdf")
for page in reader.pages:
    for image in page.images:
        # image.name is the file name, image.data holds the raw bytes
        with open(os.path.join(output_directory, image.name), "wb") as fp:
            fp.write(image.data)

In that folder you will find all the images from the PDF:

[Figure: list of all images extracted from the PDF by PyPDF. Image by the author.]
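
One thing to watch for when writing images to a single folder: image names may repeat from page to page, in which case later files would overwrite earlier ones. The helper below is a small sketch of one way to avoid that by prefixing the page number. The function name and the naming scheme are assumptions, not part of PyPDF.

```python
import os

def unique_image_path(output_directory, page_index, image_name, used_paths):
    """Build an output path that encodes the page number and never repeats."""
    stem, extension = os.path.splitext(image_name)
    prefix = f"page{page_index + 1:03d}_{stem}"
    candidate = os.path.join(output_directory, prefix + extension)
    counter = 1
    # Add a numeric suffix if the same name shows up twice on one page
    while candidate in used_paths:
        candidate = os.path.join(output_directory, f"{prefix}_{counter}{extension}")
        counter += 1
    used_paths.add(candidate)
    return candidate

used = set()
first = unique_image_path("images", 0, "Im0.png", used)
second = unique_image_path("images", 1, "Im0.png", used)
third = unique_image_path("images", 1, "Im0.png", used)
print(first, second, third)
```

In the extraction loop, this would mean iterating with enumerate(reader.pages) and opening unique_image_path(output_directory, page_index, image.name, used) instead of joining the folder and image.name directly.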

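Extracting text, tables, and images with the Adobe PDF Services API

The PDF Extract API (included with the PDF Services API) provides cloud-based capabilities for automatically extracting content from PDFs.

Source: PDF Extract API | Adobe PDF Services

The PDF Services API requires an access_token to authorize requests. To use an access token, you first need to create one. Once you have received the developer credentials, a JSON file containing client_id and client_secret, you can use them to process your PDFs. We start by importing the relevant libraries: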

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import \
    ExtractRenditionsElementType
import os.path
import zipfile
import json
import pandas as pd
import re
import openpyxl
from datetime import datetime
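
The client_id and client_secret from the developer credentials file have to reach the script somehow. The sketch below reads them from the JSON file. The function name is hypothetical, and the layout of the file (values either nested under a "client_credentials" key or stored at the top level) is an assumption that should be checked against the file you actually downloaded.

```python
import json

def load_adobe_credentials(credentials_path):
    """Return (client_id, client_secret) read from a credentials JSON file."""
    with open(credentials_path, "r", encoding="utf-8") as fp:
        payload = json.load(fp)
    # Assumption: the values are nested under "client_credentials",
    # otherwise they are looked up at the top level of the document
    section = payload.get("client_credentials", payload)
    try:
        return section["client_id"], section["client_secret"]
    except KeyError as missing:
        raise ValueError(f"credentials file lacks {missing}") from None

# Example usage (the path is a placeholder):
# client_id, client_secret = load_adobe_credentials("pdfservices-api-credentials.json")
```

Keeping the credentials in a file or in environment variables, rather than in the notebook itself, avoids committing secrets by accident.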

The following script sets up the Adobe PDF Services API with the necessary credentials, processes a PDF file, and saves the result in a zip file:

def adobeLoader(input_pdf, output_zip_path,client_id, client_secret):
    """
    Function to run adobe API and create output zip file
    """
    # Initial setup, create credentials instance.
    credentials = Credentials.service_principal_credentials_builder() \
        .with_client_id(client_id) \
        .with_client_secret(client_secret) \
        .build()

    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()

    # Set operation input from a source file.
    source = FileRef.create_from_local_file(input_pdf)
    extract_pdf_operation.set_input(source)

    # Build ExtractPDF options and set them into the operation
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
        .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES,
                                                ExtractRenditionsElementType.FIGURES]) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)

    # Save result to output path
    if os.path.exists(output_zip_path):
        os.remove(output_zip_path)
    result.save_as(output_zip_path)

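The output of adobeLoader is an sdk.zip package that contains the following:

structuredData.json file: the text is stored in this JSON file and extracted in contextual blocks such as paragraphs, headings, lists, and footnotes.
"tables" folder: table data is delivered in the resulting JSON and also output as CSV and XLSX files. Tables are additionally rendered as PNG images so that the table data can be verified visually.
"figures" folder: objects identified as figures or images are extracted as PNG files.

[Figure: structure of the output folder]

Now you can apply the function to the document: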

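# Adobe output zip file path
input_pdf = 'data/raw/GPTsareGPTs.pdf'
output_zip_path = 'data/processed/adobe_result/sdk.zip'
output_zipextract_folder = 'data/processed/adobe_result/'
# Credentials from the developer credentials JSON. Reading them from these
# environment variables is an assumption; load them however you prefer.
client_id = os.environ["PDF_SERVICES_CLIENT_ID"]
client_secret = os.environ["PDF_SERVICES_CLIENT_SECRET"]
# Run adobe API
adobeLoader(input_pdf, output_zip_path, client_id, client_secret)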

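You can see that the "figures" folder returns all the images in the PDF document in .png format. The tables folder returns Excel sheets of the tables, which preserves them with high fidelity and accuracy, along with .png images for visual comparison.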
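
To inspect those folders, the zip file first has to be unpacked. The sketch below extracts the archive with the standard zipfile module and groups the file names by their parent folder. The function name is hypothetical, and the folder and file names in the usage comment are only what the description above suggests.

```python
import os
import zipfile

def extract_and_group(zip_path, extract_folder):
    """Unzip the archive and group member file names by parent folder."""
    os.makedirs(extract_folder, exist_ok=True)
    grouped = {}
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(extract_folder)
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip pure directory entries
            folder, _, file_name = member.rpartition("/")
            # Files at the root of the archive are grouped under "."
            grouped.setdefault(folder or ".", []).append(file_name)
    return grouped

# Example usage with the paths defined earlier:
# grouped = extract_and_group(output_zip_path, output_zipextract_folder)
# for folder, files in grouped.items():
#     print(folder, len(files))
```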

You can also process the structured JSON file, structuredData.json, further to collect the text and tables and organize the data into a pandas DataFrame for downstream tasks.
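
As a rough sketch of that post-processing step, the function below walks the parsed JSON and separates text blocks from references to table and figure files. The key names ("elements", "Path", "Text", "Page", "filePaths") and the sample data are assumptions about the file's layout; verify them against your own structuredData.json before relying on this.

```python
import json

def collect_elements(structured):
    """Split parsed JSON into text rows and table/figure file references."""
    text_rows, file_rows = [], []
    for element in structured.get("elements", []):
        path = element.get("Path", "")
        page = element.get("Page")
        if "Text" in element:
            text_rows.append({"page": page, "path": path, "text": element["Text"].strip()})
        # Tables and figures are assumed to point at exported files
        for file_path in element.get("filePaths", []):
            file_rows.append({"page": page, "path": path, "file": file_path})
    return text_rows, file_rows

# Hypothetical sample in the assumed layout
sample = json.loads("""
{"elements": [
  {"Path": "//Document/H1", "Page": 0, "Text": "Title "},
  {"Path": "//Document/P", "Page": 0, "Text": "First paragraph."},
  {"Path": "//Document/Table", "Page": 1, "filePaths": ["tables/fileoutpart0.csv"]}
]}
""")
text_rows, file_rows = collect_elements(sample)
print(text_rows)
print(file_rows)
```

Both lists are plain records, so pd.DataFrame(text_rows) and pd.DataFrame(file_rows) would turn them into DataFrames.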

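Extracting tables with Camelot and Tabula

Tabula and Camelot are two Python libraries designed specifically for extracting tables from PDFs.

The following script processes a PDF document with either Tabula or Camelot and converts each table in the document to JSON, capturing both the actual table data and metadata such as the table number and page number: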

import camelot
import tabula

def extract_tables(file_path, pages="all", package="tabula"):
    if package == "camelot":
        # Extract tables with camelot
        # flavor could be 'stream' or 'lattice', for documents where tables do not have clear borders, the stream flavor is generally more appropriate.
        tables = camelot.read_pdf(file_path, pages=pages, flavor="stream")
    else:
        tables = tabula.read_pdf(file_path, pages=pages, stream=True, silent=True)

    # Convert tables to JSON
    tables_json = []
    for idx, table in enumerate(tables):

        if package == "camelot":
            page_number = table.parsing_report["page"]
            data = table.df.to_json(orient="records")
        else:
            page_number = ""
            data = table.to_json(orient="records")

        data = {
            "table_number": idx,
            "page_number": page_number,
            "data": data,
        }
        tables_json.append(data)
    return tables_json

Good! Now that we have a script for processing tables, we can apply the function to the same PDF document:

file_path = '../data/raw/GPTsareGPTs.pdf'
file_name = os.path.basename(file_path)
df_file = pd.DataFrame()
print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} process {file_name}")

print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} extract table")

all_tables = []
for package in ["camelot", "tabula"]:
    print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} extract table with {package}")
    try:
        tables_from_package = extract_tables(file_path, pages="all", package=package) # list of json
        for table in tables_from_package:
            all_tables.append({"table": table, "source": package})
    except Exception as e:
        print("----Error: cannot extract table")
        print(f"----error: {e}")

# Now you can access each table along with its source
for entry in all_tables:
    print(f"Source: {entry['source']}, Table: {entry['table']}")
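
Each entry stores the table under its "data" key as a JSON string, because extract_tables calls to_json(orient="records"). The sketch below decodes that string back into a list of row dictionaries. The function name and the sample entry are made up for illustration.

```python
import json

def table_entry_to_rows(entry):
    """Decode the JSON string in entry["table"]["data"] into row dicts."""
    table = entry["table"]
    return {
        "source": entry["source"],
        "table_number": table["table_number"],
        "page_number": table["page_number"],
        "rows": json.loads(table["data"]),
    }

# Hypothetical entry in the same shape as the items of all_tables
entry = {
    "source": "tabula",
    "table": {
        "table_number": 0,
        "page_number": "",
        "data": '[{"Group": "A", "Score": 1.5}, {"Group": "B", "Score": 2.0}]',
    },
}
decoded = table_entry_to_rows(entry)
print(decoded["rows"])
```

The decoded rows can be passed to pd.DataFrame to get the table back in tabular form.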

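Here is an example of a table in the source PDF document:

The output of Camelot or Tabula is actually a string representation of the table, as shown in the JSON object below:

[Figure: table output]

When you slice the ['data'] key of the JSON object, VS Code seems to recognize it as tabular and displays the string as a table, which looks exactly like the source table in the PDF file. Tabula appears to have detected the table correctly. Great!

[Figure: table output]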

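Now let's look at Camelot's output. Below is the JSON object for the same table.

[Figure: Camelot output]

And the tabular representation of the string:

[Figure: Camelot output rendered as a table, with nearby text included]

In this example, both Tabula and Camelot were able to detect the table, but Tabula's output is clean and mirrors the original table in the PDF. Camelot, meanwhile, does not seem to detect the table's boundary: when text sits too close to the table, it gets included.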

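In another example, however, where there are multiple tables on the page and no clear table borders, as shown below:

Camelot successfully detected both tables, while Tabula failed to detect either of them:

[Figure: Camelot output]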
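
Since neither library wins in both examples, one simple option is to run both and fall back to whichever one returned tables. The sketch below shows that idea on the all_tables list built earlier. The function names and the default preference order (Tabula first, because its output was cleaner in the first example) are assumptions, not a recommendation from either library.

```python
from collections import defaultdict

def count_by_source(all_tables):
    """Count how many tables each package returned."""
    counts = defaultdict(int)
    for entry in all_tables:
        counts[entry["source"]] += 1
    return dict(counts)

def pick_source(all_tables, preferred=("tabula", "camelot")):
    """Return the first preferred package that produced at least one table."""
    counts = count_by_source(all_tables)
    for package in preferred:
        if counts.get(package, 0) > 0:
            return package
    return None

# Hypothetical result where only Camelot found tables on the page
example_tables = [
    {"table": {"table_number": 0}, "source": "camelot"},
    {"table": {"table_number": 1}, "source": "camelot"},
]
chosen = pick_source(example_tables)
print(count_by_source(example_tables), chosen)
```

A table count says nothing about table quality, so this only covers the case where one tool finds nothing; checking the extracted rows by eye is still worthwhile.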

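Comparison of the tools

When deciding how to parse PDF documents, PyPDF is a good fit for basic extraction needs where table structure may not be a priority. It is completely free, which suits users on a tight budget who need a simple and accurate solution for extracting text and images. In my experience, most of the time it is acceptable to keep tables in text form.

Camelot and Tabula are dedicated to table extraction and are best suited to scenarios where you need to extract tabular data. They are also completely free and, in my opinion, good enough if you can accept occasional inaccuracies.

The Adobe PDF Services API offers a very robust solution for enterprises or applications where highly accurate extraction of text, tables, and images is critical. However, there is currently no clear information on API pricing: Adobe says you need to contact sales for a quote, and according to one discussion thread the Adobe Extract API appears to be quite expensive. Still, I would be willing to pay for actual usage, because the quality of the extracted output is very good!

Conclusion

In this article we covered four different tools for parsing PDF documents and extracting text, tables, and image data from PDF files: PyPDF, Camelot, Tabula, and the Adobe PDF Services API.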

Original article: https://levelup.gitconnected.com/working-with-pdfs-the-best-tools-for-extracting-text-tables-and-images-2e07a540c5cc
