Capstone Project: Intelligent Document Analysis System

2025 iThome Ironman Contest

Day 27

Build on AWS

The AWS AI Road from Scratch: 30 Days of Building Smart Applications with Bedrock and SageMaker, Part 27 of the series

Tags: 17th Ironman Contest, aws, bedrock

MichaelHo

2025-10-11 17:58:24

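Introduction

Over the previous 26 days we covered the core techniques of AWS Bedrock and SageMaker. Today we bring those pieces together to build a practical intelligent document analysis system. It combines text recognition, content understanding, and summary generation to show how the AWS AI services fit together in one complete application.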

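Features to implement

Document upload and storage: store all kinds of documents in S3
Text extraction: support multiple formats such as PDF and images
Content analysis: semantic understanding with Bedrock
Smart summaries: generate document summaries automatically
Key information extraction: identify important entities and keywords
Interactive Q&A: question answering grounded in the document content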

Technology choices

Frontend: Streamlit (rapid prototyping)
Backend: Python + Boto3
AI services:
  - Amazon Bedrock (Claude 3.5 Sonnet)
  - Amazon Textract (text recognition)
  - Amazon Comprehend (entity recognition)
Storage: Amazon S3

You can manage the dependencies with uv or with a requirements.txt file.

If you use requirements.txt:

# requirements.txt
boto3>=1.34.0
streamlit>=1.31.0
PyPDF2>=3.0.0
Pillow>=10.0.0
pandas>=2.0.0
python-dotenv>=1.0.0

With uv, you can import them with uv add -r requirements.txt

Access control

IAM settings

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "s3:PutObject",
                "s3:GetObject",
                "textract:DetectDocumentText",
                "textract:AnalyzeDocument",
                "comprehend:DetectEntities"
            ],
            "Resource": "*"
        }
    ]
}
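
The policy above allows the listed actions on every resource ("Resource": "*"). As a minimal sketch of a tighter variant, the helper below builds a similar policy as a Python dict, but limits the S3 actions to one bucket and the Bedrock action to one foundation model. The function name build_policy and the way the statements are split are assumptions for illustration. The Textract and Comprehend actions are left on "*" on the assumption that they do not accept resource-level restrictions, so check the current IAM documentation before relying on this.

```python
import json


def build_policy(bucket_name: str, model_id: str) -> dict:
    """Return an IAM policy dict scoped to one bucket and one model (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object reads and writes only inside the given bucket
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Invocation of a single foundation model, in any region
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": f"arn:aws:bedrock:*::foundation-model/{model_id}",
            },
            {
                # Assumed not to support resource-level scoping
                "Effect": "Allow",
                "Action": [
                    "textract:DetectDocumentText",
                    "textract:AnalyzeDocument",
                    "comprehend:DetectEntities",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    policy = build_policy(
        "my-document-analysis-bucket",
        "anthropic.claude-3-5-sonnet-20241022-v2:0",
    )
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console in place of the broader policy.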

Implementation

Document processing

import boto3
import PyPDF2

class DocumentProcessor:
    def __init__(self):
        self.s3_client = boto3.client('s3')
        self.textract_client = boto3.client('textract')

    def upload_to_s3(self, file_content, bucket_name, file_name):
        """Upload a file to S3 and return its s3:// URI."""
        try:
            self.s3_client.put_object(
                Bucket=bucket_name,
                Key=file_name,
                Body=file_content
            )
            return f"s3://{bucket_name}/{file_name}"
        except Exception as e:
            raise Exception(f"Upload failed: {e}") from e

    def extract_text_from_pdf(self, pdf_file):
        """Extract text from a PDF that has a text layer."""
        try:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ""
            for page in pdf_reader.pages:
                # Scanned pages may yield no text at all
                text += (page.extract_text() or "") + "\n"
            return text
        except Exception as e:
            raise Exception(f"PDF parsing failed: {e}") from e

    def extract_text_from_image(self, image_bytes, bucket_name, file_name):
        """Extract text from an image with Textract."""
        try:
            # Upload to S3 first
            self.s3_client.put_object(
                Bucket=bucket_name,
                Key=file_name,
                Body=image_bytes
            )

            # Analyze the S3 object with Textract
            response = self.textract_client.detect_document_text(
                Document={
                    'S3Object': {
                        'Bucket': bucket_name,
                        'Name': file_name
                    }
                }
            )

            # Join the detected lines into one string
            text = ""
            for item in response['Blocks']:
                if item['BlockType'] == 'LINE':
                    text += item['Text'] + "\n"

            return text
        except Exception as e:
            raise Exception(f"Image text recognition failed: {e}") from e
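
The technology list and the IAM policy both mention Amazon Comprehend for entity recognition, but the code in this article never calls it. The following is a minimal sketch of how that step might look, not the author's implementation. The helper names truncate_utf8 and detect_entities, the max_bytes cap, and the min_score threshold are assumptions for illustration. The client is passed in as a parameter so the function can be exercised with a stub. Check the Comprehend documentation for the current size limit on Text.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial character left by the byte slice
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def detect_entities(comprehend_client, text, language_code="zh-TW",
                    max_bytes=5000, min_score=0.8):
    """Return (type, text) pairs for entities scored at or above min_score."""
    if not text.strip():
        return []  # nothing to send to Comprehend

    response = comprehend_client.detect_entities(
        Text=truncate_utf8(text, max_bytes),
        LanguageCode=language_code,
    )
    return [
        (entity["Type"], entity["Text"])
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]
```

In the app this could be called as detect_entities(boto3.client('comprehend'), text) after the text has been extracted.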

Bedrock analysis

import json

import boto3

class BedrockAnalyzer:
    def __init__(self):
        self.bedrock_client = boto3.client('bedrock-runtime')
        self.model_id = "anthropic.claude-3-5-sonnet-20241022-v2:0"

    def _invoke(self, prompt, max_tokens, temperature, error_label):
        """Send one user message to the model and return the reply text."""
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "temperature": temperature
        })

        try:
            response = self.bedrock_client.invoke_model(
                modelId=self.model_id,
                body=body
            )

            response_body = json.loads(response['body'].read())
            return response_body['content'][0]['text']
        except Exception as e:
            raise Exception(f"{error_label}: {e}") from e

    def generate_summary(self, text, language="zh-TW"):
        """Generate a document summary."""
        # Cap the excerpt so the prompt stays within the token limit.
        # (A comment placed inside the f-string would be sent to the model.)
        excerpt = text[:4000]
        prompt = f"""Write a concise summary of the following document. Answer in Traditional Chinese.

Document:
{excerpt}

Please provide:
1. Core topic (2-3 sentences)
2. Main points (3-5 key points)
3. Key conclusions (1-2 sentences)

Present the summary in a clear format."""

        return self._invoke(prompt, 2000, 0.5, "Summary generation failed")

    def extract_key_information(self, text):
        """Extract key information."""
        prompt = f"""Analyze the following document and extract the key information. Answer in Traditional Chinese.

Document:
{text[:4000]}

Please extract:
1. Important dates and times
2. Names of people and organizations
3. Numbers and statistics
4. Important terms and concepts
5. Action items or to-dos

List this information in a structured format."""

        return self._invoke(prompt, 1500, 0.3, "Information extraction failed")

    def answer_question(self, text, question):
        """Answer a question based on the document content."""
        prompt = f"""Answer the user's question based on the following document. If the document does not contain the relevant information, say so honestly.

Document:
{text[:4000]}

User question: {question}

Give a detailed and accurate answer in Traditional Chinese."""

        return self._invoke(prompt, 1000, 0.4, "Question answering failed")
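
Each method above sends only the first 4,000 characters of the document (text[:4000]), so anything after that point never reaches the model. One common workaround is to split the text into overlapping chunks, summarize each chunk, and then summarize the partial summaries. The sketch below shows only the splitting step. The name chunk_text and the chunk_size and overlap defaults are hypothetical and not part of the original code.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk could then be passed to generate_summary, followed by one more call that summarizes the combined partial summaries. This multiplies the number of Bedrock calls, so it is a trade-off between coverage and cost.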

Building the interface with Streamlit

import boto3
import streamlit as st
from datetime import datetime

# Initialize session state
if 'document_text' not in st.session_state:
    st.session_state.document_text = None
if 'analysis_results' not in st.session_state:
    st.session_state.analysis_results = {}

def main():
    st.set_page_config(
        page_title="Intelligent Document Analysis System",
        page_icon="📄",
        layout="wide"
    )

    st.title("📄 Intelligent Document Analysis System")
    st.markdown("### A smart analysis tool built with AWS Bedrock and Textract")

    # Sidebar configuration
    with st.sidebar:
        st.header("⚙️ System settings")

        # S3 configuration
        bucket_name = st.text_input(
            "S3 bucket name",
            value="my-document-analysis-bucket"
        )

        # AWS Region
        aws_region = st.selectbox(
            "AWS Region",
            ["us-east-1", "us-west-2", "ap-northeast-1"]
        )

        st.divider()
        st.markdown("### 📊 Analysis options")

        enable_summary = st.checkbox("Generate summary", value=True)
        enable_key_info = st.checkbox("Extract key information", value=True)
        enable_qa = st.checkbox("Enable Q&A", value=True)

    # Use the region selected above for every boto3 client created below;
    # without this the selectbox value is never applied
    boto3.setup_default_session(region_name=aws_region)

    # Main content area
    tab1, tab2, tab3, tab4 = st.tabs(
        ["📤 Upload", "📝 Summary", "🔍 Key information", "💬 Q&A"]
    )
    
    # Tab 1: document upload
    with tab1:
        st.header("Upload a document")

        uploaded_file = st.file_uploader(
            "Choose a document",
            type=['pdf', 'png', 'jpg', 'jpeg'],
            help="PDF and image formats are supported"
        )

        if uploaded_file:
            st.success(f"Uploaded: {uploaded_file.name}")

            col1, col2 = st.columns([3, 1])

            with col2:
                if st.button("🚀 Start analysis", type="primary"):
                    with st.spinner("Processing..."):
                        try:
                            # Initialize the processors
                            processor = DocumentProcessor()
                            analyzer = BedrockAnalyzer()

                            # Extract text according to the file type
                            file_extension = uploaded_file.name.split('.')[-1].lower()

                            if file_extension == 'pdf':
                                text = processor.extract_text_from_pdf(uploaded_file)
                            else:
                                file_name = f"temp_{datetime.now().strftime('%Y%m%d_%H%M%S')}.{file_extension}"
                                text = processor.extract_text_from_image(
                                    uploaded_file.read(),
                                    bucket_name,
                                    file_name
                                )

                            st.session_state.document_text = text
                            # Drop results left over from a previous document
                            st.session_state.analysis_results = {}

                            # Run the analysis
                            if enable_summary:
                                st.session_state.analysis_results['summary'] = \
                                    analyzer.generate_summary(text)

                            if enable_key_info:
                                st.session_state.analysis_results['key_info'] = \
                                    analyzer.extract_key_information(text)

                            st.success("✅ Analysis complete! Switch to the other tabs to see the results.")

                        except Exception as e:
                            st.error(f"❌ Error: {str(e)}")

            # Preview of the extracted text
            if st.session_state.document_text:
                with st.expander("📄 View the extracted text"):
                    st.text_area(
                        "Document content",
                        st.session_state.document_text,
                        height=300
                    )
    
    # Tab 2: summary
    with tab2:
        st.header("Document summary")

        if 'summary' in st.session_state.analysis_results:
            st.markdown(st.session_state.analysis_results['summary'])

            # Download option
            st.download_button(
                label="📥 Download summary",
                data=st.session_state.analysis_results['summary'],
                file_name=f"summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt",
                mime="text/plain"
            )
        else:
            st.info("Please upload and analyze a document first")

    # Tab 3: key information
    with tab3:
        st.header("Key information extraction")

        if 'key_info' in st.session_state.analysis_results:
            st.markdown(st.session_state.analysis_results['key_info'])

            st.download_button(
                label="📥 Download key information",
                data=st.session_state.analysis_results['key_info'],
                file_name=f"key_info_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt",
                mime="text/plain"
            )
        else:
            st.info("Please upload and analyze a document first")
    
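    # Tab 4: Q&A
    with tab4:
        st.header("Document-based Q&A")

        if st.session_state.document_text and enable_qa:
            question = st.text_input(
                "Enter your question",
                placeholder="For example: What is the main conclusion of this document?"
            )

            if question and st.button("🔍 Get answer"):
                with st.spinner("Thinking..."):
                    try:
                        analyzer = BedrockAnalyzer()
                        answer = analyzer.answer_question(
                            st.session_state.document_text,
                            question
                        )

                        st.markdown("### 💡 Answer")
                        st.markdown(answer)

                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")

            # Sample questions
            with st.expander("💭 Sample questions"):
                st.markdown("""
                - What is this document mainly about?
                - Which dates or points in time are important?
                - Who are the key people mentioned in the document?
                - What important figures or statistics does it contain?
                - What are the document's conclusions or recommendations?
                """)
        else:
            st.info("Please upload and analyze a document first, and make sure Q&A is enabled")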
    
    # Footer
    st.divider()
    st.markdown("""
    <div style='text-align: center; color: gray;'>
        <p>🚀 Build on AWS - iThome Ironman Contest 2025 | Built with AWS Bedrock & Textract</p>
    </div>
    """, unsafe_allow_html=True)

if __name__ == "__main__":
    main()

Run it with: streamlit run app.py
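
One detail worth noting in the upload tab: temporary objects are named with a timestamp at one-second resolution, so two image uploads within the same second would write to the same S3 key. A minimal sketch of one way to avoid that is shown here. The helper make_object_key and the temp prefix are assumptions for illustration and are not part of the original app.

```python
import uuid
from datetime import datetime, timezone
from pathlib import PurePosixPath


def make_object_key(original_name: str, prefix: str = "temp") -> str:
    """Build an S3 key that keeps the file extension and is unique per call."""
    # Lower-case the extension so "Scan.PNG" and "scan.png" are treated alike
    extension = PurePosixPath(original_name).suffix.lower()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    # The random UUID keeps keys distinct even within the same second
    return f"{prefix}/{timestamp}_{uuid.uuid4().hex}{extension}"
```

The result could replace the temp_... file name passed to extract_text_from_image.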