day19 Building a Diagram as Code textbook for GenAI (1): the textbook and labeling scraped data

16th Ironman Contest rag python beautifulsoup diagram as code

jay0810

The intern on the team who now and then hides behind code comments

2024-09-18 02:34:31

Introduction

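On day18 we laid out our plan, which has two main parts: the textbook itself and the supplementary data. Today we build the textbook content: we organize the content of the Guide pages and save it as txt files.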

Main text

Labeling

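To make the material easier for the LLM to look through, we label each block with a marker such as "Paragraph:" or "Code Block:". These markers help the language model recognize what kind of content it is reading, so it can generate more relevant responses.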
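As a rough illustration of why the markers are useful, here is a minimal sketch of how labeled text could be split back into typed blocks later on. The function name `split_labeled` and the sample text are assumptions made for this example, not part of the original pipeline, and the parser assumes that no code line itself begins with one of the two markers.

```python
import re

# Assumed label set: the two markers named in the text above.
LABEL_RE = re.compile(r"^(Paragraph|Code Block):\s?(.*)$")


def split_labeled(text):
    """Split labeled text into (label, body) tuples (hypothetical helper)."""
    blocks = []
    for line in text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            # A marker starts a new block; the rest of the line is its first text.
            blocks.append([match.group(1), match.group(2)])
        elif blocks:
            # Any other line continues the most recent block.
            blocks[-1][1] += "\n" + line
    return [(label, body.strip()) for label, body in blocks]


sample = """Paragraph: Diagram represents a global diagram context.
Code Block:
from diagrams import Diagram

with Diagram("Simple Diagram"):
    pass
Paragraph: Default is png."""

for label, body in split_labeled(sample):
    print(f"[{label}] {len(body)} characters")
```

Because every block carries an explicit type, a downstream step can treat prose and code differently, for example by keeping code blocks whole when chunking.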

Scraper code

The url can be changed to match the guide category you want to scrape.

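import requests
from bs4 import BeautifulSoup

# Pick one guide page to scrape; swap the active line to change category.
# url = 'https://diagrams.mingrammer.com/docs/guides/diagram'
# url = 'https://diagrams.mingrammer.com/docs/guides/node'
# url = 'https://diagrams.mingrammer.com/docs/guides/cluster'
url = 'https://diagrams.mingrammer.com/docs/guides/edge'

# Time out rather than hang if the site does not respond.
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect prose (<p>) and code (<pre>) elements in document order.
    content_blocks = soup.find_all(['p', 'pre'])
    for block in content_blocks:
        # Prefix each block with a label so the LLM can tell its type.
        if block.name == 'p':
            print("Paragraph:", block.get_text())
        elif block.name == 'pre':
            # print's default separator puts one space before the first code line.
            print("Code Block:\n", block.get_text())
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")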
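The script above prints to the console, while the plan is to keep each guide as a txt file. The following is one possible variant, not the version used to produce the data in this post: it writes the labeled text to a file, skips elements with no text (the scraped data in this post contains several bare "Paragraph:" lines, presumably from `<p>` tags that only wrap an image), and puts the newline directly after "Code Block:" so the first code line is not shifted by one space. The names `label_blocks` and `scrape_to_txt` and the output file name are assumptions for this sketch.

```python
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def label_blocks(html):
    """Return labeled text for every non-empty <p> and <pre> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for block in soup.find_all(["p", "pre"]):
        text = block.get_text()
        if not text.strip():
            continue  # skip elements without text, e.g. a <p> that only wraps an image
        if block.name == "p":
            lines.append(f"Paragraph: {text}")
        else:
            # Newline right after the label, so the first code line keeps its column.
            lines.append(f"Code Block:\n{text}")
    return "\n".join(lines) + "\n"


def scrape_to_txt(url, out_path):
    """Fetch one guide page and write its labeled text to out_path."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of writing an error page
    Path(out_path).write_text(label_blocks(response.text), encoding="utf-8")
```

A call such as `scrape_to_txt('https://diagrams.mingrammer.com/docs/guides/edge', 'edge.txt')` would then produce one textbook file per guide page; the file name here is only an example.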

Diagrams

https://diagrams.mingrammer.com/docs/guides/diagram

Scraped data

Paragraph: Diagram is a primary object representing a diagram.
Paragraph: Diagram represents a global diagram context.
Paragraph: You can create a diagram context with Diagram class. The first parameter of Diagram constructor will be used for output filename.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram"):
    EC2("web")

Paragraph: And if you run the above script with below command,
Code Block:
 $ python diagram.py

Paragraph: It will generate an image file with single EC2 node drawn as simple_diagram.png on your working directory, and open that created image file immediately.
Paragraph: Diagrams can be also rendered directly inside the notebook as like this:
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram") as diag:
    EC2("web")
diag

Paragraph: You can specify the output file format with outformat parameter. Default is png.
Paragraph: (png, jpg, svg, pdf and dot) are allowed.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram", outformat="jpg"):
    EC2("web")

Paragraph: The outformat parameter also support list to output all the defined output in one call.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram Multi Output", outformat=["jpg", "png", "dot"]):
    EC2("web")

Paragraph: You can specify the output filename with filename parameter. The extension shouldn't be included, it's determined by the outformat parameter.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram", filename="my_diagram"):
    EC2("web")

Paragraph: You can also disable the automatic file opening by setting the show parameter as false. Default is true.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram", show=False):
    EC2("web")

Paragraph: It allows custom Graphviz dot attributes options.
Paragraph: graph_attr, node_attr and edge_attr are supported. Here is a reference link.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

graph_attr = {
    "fontsize": "45",
    "bgcolor": "transparent"
}

with Diagram("Simple Diagram", show=False, graph_attr=graph_attr):
    EC2("web")

Nodes

https://diagrams.mingrammer.com/docs/guides/node

Scraped data

Paragraph: Node is a second object representing a node or system component.
Paragraph: Node is an abstract concept that represents a single system component object.
Paragraph: A node object consists of three parts: provider, resource type and name. You may already have seen each part in the previous example.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2

with Diagram("Simple Diagram"):
    EC2("web")

Paragraph: In above example, the EC2 is a node of compute resource type which provided by aws provider.
Paragraph: You can use other node objects in a similar manner like:
Code Block:
 # aws resources
from diagrams.aws.compute import ECS, Lambda
from diagrams.aws.database import RDS, ElastiCache
from diagrams.aws.network import ELB, Route53, VPC
...

# azure resources
from diagrams.azure.compute import FunctionApps
from diagrams.azure.storage import BlobStorage
...

# alibaba cloud resources
from diagrams.alibabacloud.compute import ECS
from diagrams.alibabacloud.storage import ObjectTableStore
...

# gcp resources
from diagrams.gcp.compute import AppEngine, GKE
from diagrams.gcp.ml import AutoML
...

# k8s resources
from diagrams.k8s.compute import Pod, StatefulSet
from diagrams.k8s.network import Service
from diagrams.k8s.storage import PV, PVC, StorageClass
...

# oracle resources
from diagrams.oci.compute import VirtualMachine, Container
from diagrams.oci.network import Firewall
from diagrams.oci.storage import FileStorage, StorageGateway

Paragraph: You can find all available nodes list in Here.
Paragraph: You can represent data flow by connecting the nodes with these operators: >>, << and -.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB
from diagrams.aws.storage import S3

with Diagram("Web Services", show=False):
    ELB("lb") >> EC2("web") >> RDS("userdb") >> S3("store")
    ELB("lb") >> EC2("web") >> RDS("userdb") << EC2("stat")
    (ELB("lb") >> EC2("web")) - EC2("web") >> RDS("userdb")

Paragraph: Be careful when using the - and any shift operators together, which could cause unexpected results due to operator precedence.
Paragraph:
Paragraph: The order of rendered diagrams is the reverse of the declaration order.
Paragraph: You can change the data flow direction with direction parameter. Default is LR.
Paragraph: (TB, BT, LR and RL) are allowed.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Workers", show=False, direction="TB"):
    lb = ELB("lb")
    db = RDS("events")
    lb >> EC2("worker1") >> db
    lb >> EC2("worker2") >> db
    lb >> EC2("worker3") >> db
    lb >> EC2("worker4") >> db
    lb >> EC2("worker5") >> db

Paragraph:
Paragraph: Above worker example has too many redundant flows. In this case, you can group nodes into a list so that all nodes are connected to other nodes at once.
Code Block:
 from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Grouped Workers", show=False, direction="TB"):
    ELB("lb") >> [EC2("worker1"),
                  EC2("worker2"),
                  EC2("worker3"),
                  EC2("worker4"),
                  EC2("worker5")] >> RDS("events")

Paragraph:
Paragraph: You can't connect two lists directly because shift/arithmetic operations between lists are not allowed in Python.
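The two warnings in the scraped Nodes guide (mixing `-` with the shift operators, and connecting two lists directly) both follow from plain Python rules, so they can be illustrated without installing diagrams or Graphviz. The `Node` class below is a stand-in written for this sketch; it only records which operator was called and is not the library's real implementation.

```python
class Node:
    """Stand-in for a diagram node; records operator calls instead of drawing."""

    log = []

    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        Node.log.append(f"{self.name} >> {other.name}")
        return other

    def __sub__(self, other):
        Node.log.append(f"{self.name} - {other.name}")
        return other


a, b, c = Node("a"), Node("b"), Node("c")

# `-` binds tighter than `>>`, so this is evaluated as a >> (b - c).
a >> b - c
print(Node.log)  # ['b - c', 'a >> c']

# Parentheses force the left-to-right reading.
Node.log.clear()
(a >> b) - c
print(Node.log)  # ['a >> b', 'b - c']

# Plain lists define no shift operators, so list >> list fails in Python itself.
try:
    [a, b] >> [c]
except TypeError as exc:
    print("TypeError:", exc)
```

In the first case no edge from a to b is ever recorded, which is the kind of unexpected result the guide warns about.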

Clusters

https://diagrams.mingrammer.com/docs/guides/cluster

Scraped data

Paragraph: Cluster allows you group (or clustering) the nodes in an isolated group.
Paragraph: Cluster represents a local cluster context.
Paragraph: You can create a cluster context with Cluster class. And you can also connect the nodes in a cluster to other nodes outside a cluster.
Code Block:
 from diagrams import Cluster, Diagram
from diagrams.aws.compute import ECS
from diagrams.aws.database import RDS
from diagrams.aws.network import Route53

with Diagram("Simple Web Service with DB Cluster", show=False):
    dns = Route53("dns")
    web = ECS("service")

    with Cluster("DB Cluster"):
        db_primary = RDS("primary")
        db_primary - [RDS("replica1"),
                     RDS("replica2")]

    dns >> web >> db_primary

Paragraph:
Paragraph: Nested clustering is also possible.
Code Block:
 from diagrams import Cluster, Diagram
from diagrams.aws.compute import ECS, EKS, Lambda
from diagrams.aws.database import Redshift
from diagrams.aws.integration import SQS
from diagrams.aws.storage import S3

with Diagram("Event Processing", show=False):
    source = EKS("k8s source")

    with Cluster("Event Flows"):
        with Cluster("Event Workers"):
            workers = [ECS("worker1"),
                       ECS("worker2"),
                       ECS("worker3")]

        queue = SQS("event queue")

        with Cluster("Processing"):
            handlers = [Lambda("proc1"),
                        Lambda("proc2"),
                        Lambda("proc3")]

    store = S3("events store")
    dw = Redshift("analytics")

    source >> workers >> queue >> handlers
    handlers >> store
    handlers >> dw

Paragraph:
Paragraph: There is no depth limit of nesting. Feel free to create nested clusters as deep as you want.

Edges

https://diagrams.mingrammer.com/docs/guides/edge

Scraped data

Paragraph: Edge is representing an edge between Nodes.
Paragraph: Edge is an object representing a connection between Nodes with some additional properties.
Paragraph: An edge object contains three attributes: label, color and style which mirror corresponding graphviz edge attributes.
Code Block:
 from diagrams import Cluster, Diagram, Edge
from diagrams.onprem.analytics import Spark
from diagrams.onprem.compute import Server
from diagrams.onprem.database import PostgreSQL
from diagrams.onprem.inmemory import Redis
from diagrams.onprem.aggregator import Fluentd
from diagrams.onprem.monitoring import Grafana, Prometheus
from diagrams.onprem.network import Nginx
from diagrams.onprem.queue import Kafka

with Diagram(name="Advanced Web Service with On-Premise (colored)", show=False):
    ingress = Nginx("ingress")

    metrics = Prometheus("metric")
    metrics << Edge(color="firebrick", style="dashed") << Grafana("monitoring")

    with Cluster("Service Cluster"):
        grpcsvc = [
            Server("grpc1"),
            Server("grpc2"),
            Server("grpc3")]

    with Cluster("Sessions HA"):
        primary = Redis("session")
        primary \
            - Edge(color="brown", style="dashed") \
            - Redis("replica") \
            << Edge(label="collect") \
            << metrics
        grpcsvc >> Edge(color="brown") >> primary

    with Cluster("Database HA"):
        primary = PostgreSQL("users")
        primary \
            - Edge(color="brown", style="dotted") \
            - PostgreSQL("replica") \
            << Edge(label="collect") \
            << metrics
        grpcsvc >> Edge(color="black") >> primary

    aggregator = Fluentd("logging")
    aggregator \
        >> Edge(label="parse") \
        >> Kafka("stream") \
        >> Edge(color="black", style="bold") \
        >> Spark("analytics")

    ingress \
        >> Edge(color="darkgreen") \
        << grpcsvc \
        >> Edge(color="darkorange") \
        >> aggregator

Paragraph: