Exporting PDF Tables to CSV with Python: Spire, pdfplumber, and Other Approaches in Practice

脚本专家 · Posted 2026-6-10 09:25:48

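PDF is widely used for document exchange because its layout is fixed, but structured data inside it, such as tables, is hard to extract directly. For tables with clearly drawn borders, a dedicated table extractor gives more accurate results. This post walks through a complete approach to exporting PDF tables to CSV with the free library Spire.PDF.Free, plus the alternatives pdfplumber, camelot, and tabula. It covers the core classes, what the parameters mean, the text-cleaning logic, and a technique for merging tables that span pages.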

1. Environment setup

The core dependency is Spire.PDF.Free. Install it with:

pip install Spire.Pdf.Free

The CSV export itself uses only Python's built-in csv and os standard-library modules, so no other dependencies are needed.

2. Core implementation with Spire.PDF.Free

Complete code example:

from spire.pdf import PdfDocument, PdfTableExtractor
import csv
import os

# Load the PDF and bind a table extractor to it
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")
extractor = PdfTableExtractor(pdf)

output_root = "output/Tables"
os.makedirs(output_root, exist_ok=True)

for page_index in range(pdf.Pages.Count):
    # Guard against pages with no tables (the result may be empty or None)
    tables = extractor.ExtractTable(page_index) or []
    for table_index, table in enumerate(tables):
        table_data = []
        row_total = table.GetRowCount()
        col_total = table.GetColumnCount()
        for row in range(row_total):
            row_data = []
            for col in range(col_total):
                # Drop embedded line breaks and surrounding whitespace
                cell_text = table.GetText(row, col).replace("\n", "").strip()
                row_data.append(cell_text)
            table_data.append(row_data)

        csv_name = f"Page{page_index + 1}-Table{table_index + 1}.csv"
        csv_path = os.path.join(output_root, csv_name)
        with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(table_data)
        print(f"Exported: {csv_path}")

pdf.Dispose()
print("All tables exported")

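Code walkthrough:
- PdfDocument: the main working class. LoadFromFile() loads a local PDF, Pages.Count returns the page count, and Dispose() releases resources.
- PdfTableExtractor: the dedicated table extractor, bound to an already loaded PdfDocument. ExtractTable(page_index) returns the collection of table objects found on that page.
- Reading data: GetRowCount() returns the number of rows, GetColumnCount() the number of columns, and GetText(row, col) the raw text of a cell.
- Text cleaning: replace("\n", "") removes line breaks inside a cell and strip() trims leading and trailing whitespace, so each record stays on a single line in the CSV.
- Key CSV write settings: encoding="utf-8" prevents garbled Chinese text; newline="" avoids extra blank lines between rows; os.makedirs(..., exist_ok=True) creates the output directory if it does not exist.
- File naming: Page<page number>-Table<table index>.csv, which makes each file easy to trace back to its source.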
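
The cleaning and writing steps can be tried without any PDF library at all. Below is a minimal sketch using only the standard library. The helper names clean_cell and rows_to_csv_text are made up for this illustration, and collapsing whitespace runs into a single space is a swapped-in choice: the code above deletes line breaks outright, which can glue two words together.

```python
import csv
import io


def clean_cell(text):
    """Normalize one cell: None becomes '', whitespace runs become one space."""
    if text is None:
        return ""
    # str.split() with no arguments splits on any whitespace, including "\n"
    return " ".join(str(text).split())


def rows_to_csv_text(rows):
    """Clean every cell, serialize with csv.writer, and return the CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows([clean_cell(cell) for cell in row] for row in rows)
    return buffer.getvalue()


sample = [["Name", "Unit\nprice"], ["  Widget ", None], ["Bolt, M6", "0.12"]]
print(rows_to_csv_text(sample))
```

For a real file the same rows would go through open(path, "w", newline="", encoding="utf-8"). If the CSV is meant to be opened in Excel, encoding="utf-8-sig" is a commonly used variant, because the byte-order mark helps Excel recognize the file as UTF-8.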

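3. Alternative: pdfplumber (the most general-purpose option)

When a table has no clearly drawn borders, Spire may fail to detect it, and pdfplumber is the recommended choice. Install: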

pip install pdfplumber

Core code example (detects tables automatically and exports them):

import pdfplumber
import csv
import os


def parse_pages(page_spec):
    """Parse a spec such as "1,3-5" into a set of 1-based page numbers."""
    pages = set()
    for part in page_spec.split(','):
        if '-' in part:
            start, end = map(int, part.split('-'))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return pages


def pdf_table_to_csv(pdf_path, csv_path, pages='all', table_settings=None):
    base, ext = os.path.splitext(csv_path)
    table_count = 0
    with pdfplumber.open(pdf_path) as pdf:
        if pages == 'all':
            pages_to_process = pdf.pages
        else:
            wanted = parse_pages(pages)
            # pdf.pages is a list, so iterate over its length, not the list itself
            pages_to_process = [pdf.pages[i] for i in range(len(pdf.pages)) if i + 1 in wanted]
        for page in pages_to_process:
            page_num = page.page_number  # real 1-based page number, even for a subset
            tables = page.extract_tables(table_settings or {})
            if not tables:
                print(f"No table found on page {page_num}")
                continue
            for table in tables:
                if not table:
                    continue
                table_count += 1
                # Use csv_path as-is only for a single table on a single page
                single = len(tables) == 1 and len(pages_to_process) == 1
                out_path = csv_path if single else f"{base}_{table_count}{ext}"
                with open(out_path, 'w', newline='', encoding='utf-8') as f:
                    writer = csv.writer(f)
                    for row in table:
                        # pdfplumber returns None for empty cells
                        cleaned_row = [cell if cell is not None else '' for cell in row]
                        # Trim trailing empty cells, then skip fully empty rows
                        while cleaned_row and cleaned_row[-1] == '':
                            cleaned_row.pop()
                        if cleaned_row:
                            writer.writerow(cleaned_row)
                print(f"Saved table {table_count} -> {out_path}")
    print(f"Done! Exported {table_count} table(s).")

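The page-spec parsing above trusts its input. A slightly more defensive variant might look like the sketch below. The name parse_page_spec and the page_count argument are my own additions for illustration, not part of pdfplumber; the idea is simply to reject ranges that run backwards or point past the end of the document before any page is opened.

```python
def parse_page_spec(page_spec, page_count):
    """Turn a spec like "1, 3-5" into a sorted list of valid 1-based page numbers."""
    pages = set()
    for part in page_spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Reject anything that does not exist in a document of page_count pages
    out_of_range = sorted(p for p in pages if p < 1 or p > page_count)
    if out_of_range:
        raise ValueError(f"pages outside 1..{page_count}: {out_of_range}")
    return sorted(pages)


print(parse_page_spec("1, 3-5", page_count=10))
```

With pdfplumber, page_count would presumably come from len(pdf.pages), and the returned numbers index into pdf.pages after subtracting 1.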

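Adjusting the detection strategy (for complex tables):
- Ruled tables (complete borders): vertical_strategy="lines", horizontal_strategy="lines"
- Unruled tables (columns implied by text alignment): vertical_strategy="text", horizontal_strategy="text"
- Mixed mode: vertical_strategy="lines", horizontal_strategy="text"
- Manually specified vertical lines: vertical_strategy="explicit", explicit_vertical_lines=[100, 200, 300]

To check the detection visually, call page.to_image().debug_tablefinder(table_settings), which draws the detected lines and cells on a rendered image of the page.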
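
To keep these combinations in one place, the settings dictionaries can be produced by a small helper. This is only a sketch: the function name build_table_settings and the mode labels ("ruled", "unruled", "mixed", "explicit") are invented for this example, while the dictionary keys are the pdfplumber setting names listed above.

```python
def build_table_settings(mode, explicit_vertical_lines=None):
    """Return a table_settings dict for one of the modes described above."""
    presets = {
        "ruled": ("lines", "lines"),    # complete borders
        "unruled": ("text", "text"),    # columns implied by text alignment
        "mixed": ("lines", "text"),     # vertical rules, text-aligned rows
    }
    if mode == "explicit":
        if not explicit_vertical_lines:
            raise ValueError("explicit mode needs at least one x coordinate")
        return {
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": list(explicit_vertical_lines),
        }
    if mode not in presets:
        raise ValueError(f"unknown mode: {mode!r}")
    vertical, horizontal = presets[mode]
    return {"vertical_strategy": vertical, "horizontal_strategy": horizontal}


print(build_table_settings("mixed"))
print(build_table_settings("explicit", [100, 200, 300]))
```

The returned dictionary is meant to be passed as the table_settings argument of page.extract_tables().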

4. Alternative: camelot-py (high accuracy)

Best suited to tables with clear borders. It requires OpenCV and Ghostscript.

pip install "camelot-py[cv]"

Usage example:

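import os
import camelot


def pdf_table_to_csv_camelot(pdf_path, csv_path, flavor='lattice', pages='all'):
    # flavor='lattice' targets ruled tables; 'stream' targets whitespace-separated ones
    tables = camelot.read_pdf(pdf_path, flavor=flavor, pages=pages)
    if len(tables) == 0:
        print("No tables found")
        return
    base, ext = os.path.splitext(csv_path)
    for i, table in enumerate(tables):
        # The first table keeps csv_path; later ones get a numeric suffix
        out_path = csv_path if i == 0 else f"{base}_{i + 1}{ext}"
        table.to_csv(out_path)
        print(f"Saved table {i + 1} -> {out_path}")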

5. Alternative: tabula-py (requires Java)

Install with pip install tabula-py, and make sure Java 8 or later is installed on the system.

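import os
import tabula


def pdf_table_to_csv_tabula(pdf_path, csv_path, pages='all', area=None):
    # multiple_tables=True returns a list of DataFrames, one per detected table
    dfs = tabula.read_pdf(pdf_path, pages=pages, area=area, multiple_tables=True)
    if not dfs:
        print("No tables found")
        return
    base, ext = os.path.splitext(csv_path)
    for i, df in enumerate(dfs):
        # The first table keeps csv_path; later ones get a numeric suffix
        out_path = csv_path if i == 0 else f"{base}_{i + 1}{ext}"
        df.to_csv(out_path, index=False)
        print(f"Saved table {i + 1} -> {out_path}")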

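The area parameter takes [top, left, bottom, right] and restricts extraction to that region of the page. The values are in PDF points (1/72 inch), not millimetres, unless relative_area=True is passed, in which case they are percentages of the page size.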
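
As far as I know, tabula-py reads the area values as PDF points (1/72 inch) unless relative_area=True is set, so a region measured in millimetres on a printout has to be converted first. Here is a minimal sketch of that arithmetic. The helper names mm_to_points and area_mm_to_points are hypothetical and not part of tabula-py.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_points(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def area_mm_to_points(top, left, bottom, right):
    """Convert a (top, left, bottom, right) box from millimetres to points."""
    if bottom <= top or right <= left:
        raise ValueError("expected top < bottom and left < right")
    return [round(mm_to_points(value), 2) for value in (top, left, bottom, right)]


# A box starting 25.4 mm from the top and 12.7 mm from the left edge
print(area_mm_to_points(25.4, 12.7, 254.0, 190.5))
```

The resulting list would then be passed as area= to tabula.read_pdf().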

6. Merging tables that span pages

A table that continues across pages has to be merged manually. An example using pdfplumber:

import pdfplumber


def merge_multipage_tables(pdf_path, pages, table_settings=None):
    """Concatenate the first table on each listed page (1-based page numbers)."""
    all_rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num in pages:
            page = pdf.pages[page_num - 1]
            tables = page.extract_tables(table_settings or {})
            if tables:
                table = tables[0]
                if all_rows:
                    # Continuation page: skip its first row, assumed to repeat the header
                    all_rows.extend(table[1:])
                else:
                    all_rows.extend(table)
    return all_rows

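The function above always drops the first row of every continuation page, which loses a data row whenever the header is not repeated. One possible refinement, sketched below on plain lists so it runs without a PDF, is to drop that row only when it matches the header from the first page. The names merge_page_tables and _normalize_row are invented for this example.

```python
def _normalize_row(row):
    """Compare rows loosely: None becomes '' and surrounding whitespace is ignored."""
    return [("" if cell is None else str(cell).strip()) for cell in row]


def merge_page_tables(page_tables):
    """Merge per-page tables (lists of rows, in page order) into one table.

    The first row of a continuation page is dropped only when it repeats
    the header found on the first non-empty page.
    """
    merged = []
    header = None
    for table in page_tables:
        if not table:
            continue  # page without a usable table
        if header is None:
            header = _normalize_row(table[0])
            merged.extend(table)
        elif _normalize_row(table[0]) == header:
            merged.extend(table[1:])  # repeated header, skip it
        else:
            merged.extend(table)      # no repeated header, keep every row
    return merged


page1 = [["ID", "Name"], ["1", "A"]]
page2 = [["ID ", "Name"], ["2", "B"]]   # header repeated, with stray whitespace
page3 = [["3", "C"]]                     # header not repeated
print(merge_page_tables([page1, page2, page3]))
```

The input here would be something like [page.extract_tables(...)[0] for each page], so the merge is still manual; it just guesses a little less.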

7. Choosing the best strategy automatically

A helper function can try several strategies in turn:

import pdfplumber


def auto_extract_tables(pdf_path, pages='all'):
    # Note: `pages` is accepted for symmetry with the other helpers but is not used here
    strategies = [
        {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
        {"vertical_strategy": "text", "horizontal_strategy": "text"},
        {"vertical_strategy": "lines", "horizontal_strategy": "text"},
        {"vertical_strategy": "text", "horizontal_strategy": "lines"},
    ]
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for settings in strategies:
                tables = page.extract_tables(settings)
                # Return the tables of the first page where some strategy
                # yields a first table with more than one row
                if tables and len(tables[0]) > 1:
                    return tables
    return None

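Taking the first strategy that returns more than one row can pick a poor result, since the "text" strategies in particular tend to return something for almost any page. An alternative is to score every candidate and keep the best one. The sketch below works on plain lists of rows; the scoring formula (share of rows with the most common width, multiplied by the share of non-empty cells) is a heuristic assumed for this example, and score_table and pick_best are invented names.

```python
def score_table(table):
    """Return a score in [0, 1]; higher means more regular and better filled."""
    if not table or len(table) < 2:
        return 0.0
    widths = [len(row) for row in table]
    common_width = max(set(widths), key=widths.count)
    if common_width < 2:
        return 0.0  # a single column is rarely a real table
    regularity = widths.count(common_width) / len(widths)
    cells = [cell for row in table for cell in row]
    filled = sum(1 for cell in cells if cell is not None and str(cell).strip())
    return regularity * filled / len(cells)


def pick_best(candidates):
    """candidates maps a strategy label to the list of tables it produced."""
    best_label, best_score = None, 0.0
    for label, tables in candidates.items():
        score = max((score_table(table) for table in tables), default=0.0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


good = [["a", "b"], ["1", "2"]]
sparse = [["a", None], [None, ""]]
print(pick_best({"lines": [sparse], "text": [good]}))
```

In practice the candidates dictionary would be filled by calling page.extract_tables(settings) once per strategy for the same page.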

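8. Further applications

- Batch processing: iterate over a folder with os.listdir() and call the functions above in a loop.
- Post-processing: after exporting to CSV, use pandas to filter, deduplicate, and aggregate.
- Separate header handling: treat the first row as the header and write it separately.
- Filtering empty tables: add an emptiness check and skip tables with no content.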
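
Batch processing and empty-table filtering combine naturally. The sketch below assumes the extraction itself is supplied as a callable (extract_tables, a parameter name chosen for this example) that takes a PDF path and returns a list of tables, so any of the libraries above could sit behind it. It uses pathlib's glob rather than os.listdir(), and export_folder and is_empty_table are likewise invented names.

```python
import csv
import tempfile
from pathlib import Path


def is_empty_table(table):
    """True when the table has no rows or every cell is None or blank."""
    return not any(
        cell is not None and str(cell).strip()
        for row in table
        for cell in row
    )


def export_folder(input_dir, output_dir, extract_tables):
    """Write one CSV per non-empty table found in each PDF under input_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        for index, table in enumerate(extract_tables(pdf_path), start=1):
            if is_empty_table(table):
                continue  # skip tables with no content
            out_path = output_dir / f"{pdf_path.stem}-Table{index}.csv"
            with open(out_path, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(table)
            written.append(out_path)
    return written


# Demo with a stand-in extractor, so no PDF library is needed to run it
def fake_extract(pdf_path):
    return [[["ID", "Name"], ["1", "A"]], [[None, ""]]]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.pdf").touch()
    (Path(tmp) / "notes.txt").touch()
    written = export_folder(tmp, Path(tmp) / "out", fake_extract)
    print([path.name for path in written])
```

Passing the extractor in as an argument keeps the folder loop independent of the library choice, which also makes it easy to test.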

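9. Summary

Spire.PDF.Free makes it quick to extract bordered tables from PDFs, and alternatives such as pdfplumber cover more complex table types. When integrating this in practice, keep in mind the dependence on borders, page limits, and resource release, and adapt the text-cleaning rules to the documents at hand. The approach needs no intermediate format, the code is clearly structured, and it fits well into automated data-processing pipelines.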

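热心网友6 · Posted 2026-6-10 09:35:00

Thanks for the detailed write-up! The code is clearly structured and well commented, and the comparison between the Spire.PDF.Free and pdfplumber approaches is especially useful. I have also struggled with detecting borderless tables. You mention a technique for merging tables across pages; could you say more about the merge logic? For example, when a table spans two pages, do you concatenate the rows by hand or does the library handle it automatically? Also, how well does pdfplumber's table_settings parameter cope with Chinese-language tables? Looking forward to more hands-on posts!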

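热心网友2 · Posted 2026-6-22 20:10:01

A very detailed post. The Spire.PDF.Free code in particular is clearly structured and beginner-friendly. Until now I have mostly used pdfplumber for borderless tables, but merging across pages by hand has always been a headache. One question: does the Spire approach offer a ready-made way to merge tables that span pages? For example, when a table is split over two pages, do you concatenate it manually, or is there a parameter that merges it automatically?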

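热心网友6 · Posted 2026-6-22 20:20:01

Thanks for sharing this practical approach to PDF table extraction! The Spire.PDF.Free code is clearly structured, it produces stable output for tables with clearly drawn borders, and the file naming scheme makes later management easier. pdfplumber, as the alternative, is indeed more flexible with borderless tables, so the two approaches complement each other well. Removing line breaks and leading and trailing whitespace during text cleaning is also important, since it keeps the CSV from getting garbled. The cross-page merge technique is handy for multi-page tables too. Overall the approach is well worth borrowing.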
