python - AWS Textract - UnsupportedDocumentException - PDF

Tags: python amazon-web-services boto3 amazon-textract

I am using boto3 (the AWS SDK for Python) to analyze a document (PDF) and extract the form key:value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for AnalyzeDocument, but running the function raises this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

Best answer

AnalyzeDocument is a synchronous API and supports only PNG or JPG images.

Since you are processing a PDF file, you need to use the Amazon Textract asynchronous APIs instead, such as StartDocumentAnalysis or StartDocumentTextDetection.
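As a rough illustration of the asynchronous route, the sketch below starts a FORMS analysis with `start_document_analysis`, polls `get_document_analysis` until the job finishes, and follows `NextToken` to collect every page of results. The helper names `analyze_pdf_forms` and `run_example`, the polling interval, and the retry limit are assumptions made for this example, not part of the original answer. Simple polling is used only to keep the sketch short; an SNS notification channel is the other common way to learn that a job has finished.

```python
import time


def analyze_pdf_forms(client, bucket, document, poll_seconds=5, max_polls=60):
    """Run an asynchronous Textract FORMS analysis and return all result blocks.

    `client` is a boto3 Textract client. The polling interval and retry
    limit are illustrative defaults, not recommended values.
    """
    # Asynchronous calls take DocumentLocation (an S3 object), not Document.
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": document}},
        FeatureTypes=["FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS.
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        message = result.get("StatusMessage", "no status message")
        raise RuntimeError(f"Textract job {job_id} ended as {status}: {message}")

    # Results are paginated: keep following NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    while result.get("NextToken"):
        result = client.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


def run_example():
    """Hypothetical entry point using the bucket and key from the question."""
    import boto3  # imported here so the helper above has no hard dependency

    client = boto3.client("textract")
    blocks = analyze_pdf_forms(client, "francismorgan-01", "709 Privado M SURESTE.pdf")
    # Form keys are KEY_VALUE_SET blocks whose EntityTypes include "KEY".
    keys = [
        b for b in blocks
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", [])
    ]
    print(f"{len(blocks)} blocks, {len(keys)} form keys")
```

Calling `run_example()` requires valid AWS credentials and read access to the bucket. Passing the client in as a parameter keeps the polling logic easy to exercise with a stub client.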

Source: the original question "python - AWS Textract - UnsupportedDocumentException - PDF" on Stack Overflow: https://stackoverflow.com/questions/60501332/
