2022 iThome Ironman Contest

DAY 10

DevOps

A Brief Look at DevOps and Observability, Part 10

A Brief Look at the OpenTelemetry Specification: Trace

14th Ironman Contest opentelemetry

雷N

Team E04

2022-09-10 02:10:02


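What is a Trace?

In the past, "trace" mostly meant dtrace on Linux or ETW on Windows.
In OpenTelemetry, Trace usually refers to distributed tracing, also called distributed request tracing.
It is most commonly used in microservice architectures.

In a monolithic application, all the code runs on a single host, so troubleshooting can be as simple as following the stack trace in a typical exception message.
Once a program is split up and runs across many hosts, a single stack trace is no longer enough to track down a problem.
Instead, we need something that represents the whole request as one transaction, viewed both across services and across functional components, so that we can see the full picture and troubleshoot effectively.

A Trace lets us see what happened to a request in each service: the time cost of every operation, some logs, and the errors that occurred.
Tools can then help us understand the relationships and interactions between services.

Figure: the Node Graph in Grafana Tempo.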

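Distributed tracing: Trace and Span

For background, see my article from two years ago, "Distributed Tracing & OpenTelemetry介紹" (an introduction to distributed tracing and OpenTelemetry).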

OTel Trace API

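The main job of the Trace API is to create Spans and to associate each Span with a TraceId that is unique to its trace.

Its building blocks are very similar to those of the Metric API:

TracerProvider : the entry point of the API; configuration is registered here
Tracer : the type responsible for creating Spans
Span : tracks an operation

The API has one more role, which is to interact with Context:

Extract the Span from a Context.
Combine an existing Span with a Context to produce a new Context.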
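This is where SpanContext comes in.

SpanContext

SpanContext comes from OpenTracing.
It is the state that crosses process boundaries and is passed on to downstream Spans. In other words, it is the Span's context object, so a SpanContext is part of a Span.
It can be serialized and propagated along with the context (Propagation).
A SpanContext is immutable once created; you can only read from it, or supply new information to produce a brand-new SpanContext.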
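As a rough sketch of what "immutable, but able to produce a new one" looks like in Go, the hypothetical `spanContext` type below keeps its fields unexported and offers a `withTraceState` method that returns a modified copy. otel-go's `SpanContext` exposes a `WithTraceState` method in the same spirit; everything else here is an assumption made for illustration.

```go
package main

import "fmt"

// spanContext is a hypothetical, simplified SpanContext. Its fields are
// unexported, so code outside this package cannot change them.
type spanContext struct {
	traceID    string
	spanID     string
	traceState string
}

// withTraceState returns a modified copy. The receiver is passed by value,
// so the SpanContext it was called on stays unchanged.
func (sc spanContext) withTraceState(ts string) spanContext {
	sc.traceState = ts
	return sc
}

func main() {
	original := spanContext{
		traceID: "0af7651916cd43dd8448eb211c80319c",
		spanID:  "b7ad6b7169203331",
	}
	updated := original.withTraceState("congo=t61rcWkgMzE")

	fmt.Printf("original tracestate: %q\n", original.traceState)
	fmt.Printf("updated tracestate:  %q\n", updated.traceState)
}
```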
OTel's SpanContext conforms to the W3C Trace Context standard.
Chapter 2 of the standard says:

At a minimum they MUST propagate the traceparent and tracestate headers and guarantee traces are not broken. This behavior is also referred to as forwarding a trace.

traceparent contains:

The traceparent HTTP header field identifies the incoming request in a tracing system. It has four fields:

version
trace-id
parent-id
trace-flags

tracestate is described as:

The tracestate header includes the parent in a potentially vendor-specific format:
tracestate: congo=t61rcWkgMzE
For example, say a client and server in a system use different tracing vendors: Congo and Rojo. A client traced in the Congo system adds the following headers to an outbound HTTP request.
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
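To make the four traceparent fields concrete, here is a minimal parsing sketch using only the standard library. It assumes version `00` of the header (2, 32, 16, and 2 hex characters separated by `-`) and deliberately skips some checks the specification also requires, such as rejecting uppercase hex and all-zero IDs. The names `traceParent` and `parseTraceParent` are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the four fields of a version-00 traceparent header.
type traceParent struct {
	Version    string
	TraceID    string
	ParentID   string
	TraceFlags string
}

// parseTraceParent splits the header on "-" and checks that every field has
// the expected length and is valid hex.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return traceParent{}, errors.New("traceparent: expected 4 fields")
	}
	wantLen := []int{2, 32, 16, 2} // hex characters per field
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return traceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil {
			return traceParent{}, fmt.Errorf("traceparent: field %d is not hex: %w", i, err)
		}
	}
	return traceParent{
		Version:    parts[0],
		TraceID:    parts[1],
		ParentID:   parts[2],
		TraceFlags: parts[3],
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("version:    ", tp.Version)
	fmt.Println("trace-id:   ", tp.TraceID)
	fmt.Println("parent-id:  ", tp.ParentID)
	fmt.Println("trace-flags:", tp.TraceFlags)
}
```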

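The Go code below shows clearly that OTel is designed to follow this standard.

A Span can have a causal relationship with one or more SpanContexts.

As in the earlier figure, each circle represents a process boundary.
The source side takes information out of its context and injects it, forming a new SpanContext.
The other side extracts the Span and combines it with its own context to form a new SpanContext.
The two spans are nevertheless related to each other.

A single span can also be connected across several process boundaries at the same time.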

// TraceID is a unique identity of a trace.
// nolint:revive // revive complains about stutter of `trace.TraceID`.
type TraceID [16]byte

// SpanID is a unique identity of a span in a trace.
type SpanID [8]byte

// SpanContext contains identifying trace information about a Span.
type SpanContext struct {
	traceID    TraceID
	spanID     SpanID
	// TraceFlags contain details about the trace; unlike TraceState values,
	// they are present in all traces.
	traceFlags TraceFlags
	traceState TraceState
	remote     bool
}

// TraceState provides additional vendor-specific trace identification.
type TraceState struct {
	list []member
}

type member struct {
	Key   string
	Value string
}

Here is the Go code for Inject and Extract:

// Inject set tracecontext from the Context into the carrier.
func (tc TraceContext) Inject(ctx context.Context, carrier TextMapCarrier) {
	// Read the SpanContext of the current Span out of ctx.
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return
	}

	if ts := sc.TraceState().String(); ts != "" {
		carrier.Set(tracestateHeader, ts)
	}

	// Clear all flags other than the trace-context supported sampling bit.
	flags := sc.TraceFlags() & trace.FlagsSampled

	h := fmt.Sprintf("%.2x-%s-%s-%s",
		supportedVersion,
		sc.TraceID(),
		sc.SpanID(),
		flags)
	carrier.Set(traceparentHeader, h)
}

// Extract reads tracecontext from the carrier into a returned Context.
//
// The returned Context will be a copy of ctx and contain the extracted
// tracecontext as the remote SpanContext. If the extracted tracecontext is
// invalid, the passed ctx will be returned directly instead.
func (tc TraceContext) Extract(ctx context.Context, carrier TextMapCarrier) context.Context {
	sc := tc.extract(carrier)
	if !sc.IsValid() {
		return ctx
	}
	return trace.ContextWithRemoteSpanContext(ctx, sc)
}

Span

The Span interface also exposes an accessor for the SpanContext:

// Warning: methods may be added to this interface in minor releases.
type Span interface {
	End(options ...SpanEndOption)

	AddEvent(name string, options ...EventOption)

	IsRecording() bool
	RecordError(err error, options ...EventOption)

	// SpanContext returns the SpanContext of the Span. The returned SpanContext
	// is usable even after the End method has been called for the Span.
	SpanContext() SpanContext
	SetStatus(code codes.Code, description string)
	SetName(name string)
	SetAttributes(kv ...attribute.KeyValue)
	TracerProvider() TracerProvider
}

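How a TraceID is generated

A Trace is really a tree of related Spans.
A span that has no parent span is called a root span.
When a root span is created, a new TraceID is generated along with it.
A child span that has a parent shares its parent's TraceID.
A child span also inherits the entire TraceState of its parent.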
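The rules above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's IDGenerator: the `spanContext` type and `startSpan` function are hypothetical, IDs come from `crypto/rand`, and the sketch does not retry in the (extremely unlikely) case that a generated ID is all zeros, which the specification treats as invalid.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// spanContext is a hypothetical, simplified SpanContext.
type spanContext struct {
	traceID    [16]byte
	spanID     [8]byte
	traceState string
}

// startSpan returns the SpanContext for a new span.
// A nil parent means the new span is a root span.
func startSpan(parent *spanContext) (spanContext, error) {
	var sc spanContext
	if parent == nil {
		// Root span: generate a fresh TraceID.
		if _, err := rand.Read(sc.traceID[:]); err != nil {
			return sc, err
		}
	} else {
		// Child span: share the parent's TraceID and inherit its TraceState.
		sc.traceID = parent.traceID
		sc.traceState = parent.traceState
	}
	// Every span, root or child, gets its own SpanID.
	if _, err := rand.Read(sc.spanID[:]); err != nil {
		return sc, err
	}
	return sc, nil
}

func main() {
	root, err := startSpan(nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	child, err := startSpan(&root)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("root  trace-id:", hex.EncodeToString(root.traceID[:]))
	fmt.Println("child trace-id:", hex.EncodeToString(child.traceID[:]))
	fmt.Println("same trace:", root.traceID == child.traceID)
}
```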

// WithNewRoot specifies that the Span should be treated as a root Span. Any
// existing parent span context will be ignored when defining the Span's trace
// identifiers.
func WithNewRoot() SpanStartOption {
	return spanOptionFunc(func(cfg SpanConfig) SpanConfig {
		cfg.newRoot = true
		return cfg
	})
}

// Tracer is the creator of Spans.
//
// Warning: methods may be added to this interface in minor releases.
type Tracer interface {
	// Start creates a span and a context.Context containing the newly-created span.
	//
	// If the context.Context provided in `ctx` contains a Span then the newly-created
	// Span will be a child of that span, otherwise it will be a root span. This behavior
	// can be overridden by providing `WithNewRoot()` as a SpanOption, causing the
	// newly-created Span to be a root span even if `ctx` contains a Span.
	//
	// When creating a Span it is recommended to provide all known span attributes using
	// the `WithAttributes()` SpanOption as samplers will only have access to the
	// attributes provided when a Span is created.
	//
	// Any Span that is created MUST also be ended. This is the responsibility of the user.
	// Implementations of this API may leak memory or other resources if Spans are not ended.
	Start(ctx context.Context, spanName string, opts ...SpanStartOption) (context.Context, Span)
}

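The Tracer interface also mentions the root span.

The implementation that generates the TraceID lives in the SDK; see the IDGenerator for reference.

A Span reflects the state of an operation, and that operation may succeed or fail.
This state is reflected in the Span's StatusCode,
which has three values:

Unset : the default state
Ok : the operation has been validated as completed successfully
Error : the operation encountered an error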
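The three states also come with precedence rules in the OTel specification: an attempt to set Unset is ignored, Ok is considered final, and the description only applies to Error. The sketch below is one way to express those rules; the `span` type and `setStatus` method are hypothetical, and the constant order (Unset, Error, Ok) is chosen to match otel-go's `codes` package.

```go
package main

import "fmt"

type statusCode int

const (
	Unset statusCode = iota // default state
	Error                   // the operation failed
	Ok                      // the operation was explicitly marked successful
)

// span is a hypothetical Span that only tracks its status.
type span struct {
	code        statusCode
	description string
}

// setStatus applies the precedence rules: Ok is final, and Unset never
// overrides a status that has already been set.
func (s *span) setStatus(code statusCode, description string) {
	if s.code == Ok || code == Unset {
		return
	}
	s.code = code
	s.description = ""
	if code == Error {
		s.description = description // a description is only kept for Error
	}
}

func main() {
	s := &span{}
	s.setStatus(Error, "db timeout")
	fmt.Println("is Error:", s.code == Error, "description:", s.description)

	s.setStatus(Ok, "")
	s.setStatus(Error, "late failure") // ignored, because Ok is final
	fmt.Println("is Ok:", s.code == Ok)
}
```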

Here is the relevant code:

s := &tracepb.Span{
	TraceId:                tid[:],
	SpanId:                 sid[:],
	TraceState:             sd.SpanContext().TraceState().String(),
	Status:                 status(sd.Status().Code, sd.Status().Description),
	StartTimeUnixNano:      uint64(sd.StartTime().UnixNano()),
	EndTimeUnixNano:        uint64(sd.EndTime().UnixNano()),
	Links:                  links(sd.Links()),
	Kind:                   spanKind(sd.SpanKind()),
	Name:                   sd.Name(),
	Attributes:             KeyValues(sd.Attributes()),
	Events:                 spanEvents(sd.Events()),
	DroppedAttributesCount: uint32(sd.DroppedAttributes()),
	DroppedEventsCount:     uint32(sd.DroppedEvents()),
	DroppedLinksCount:      uint32(sd.DroppedLinks()),
}

// status transform a span code and message into an OTLP span status.
func status(status codes.Code, message string) *tracepb.Status {
	var c tracepb.Status_StatusCode
	switch status {
	case codes.Ok:
		c = tracepb.Status_STATUS_CODE_OK
	case codes.Error:
		c = tracepb.Status_STATUS_CODE_ERROR
	default:
		c = tracepb.Status_STATUS_CODE_UNSET
	}
	return &tracepb.Status{
		Code:    c,
		Message: message,
	}
}

The trace list shows 1 Error.

Clicking into that Trace, you can see that its error tag is true.

On the Jaeger Monitor page, the error rate is reflected as well.

Post-contest addendum

A friend sent me this picture today and asked what each of the attributes on it means.

These are all fields in the SpanContext:
Span Context

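I assume anyone who has read this far already knows SpanId, TraceId, and ParentId.

TraceState and TraceFlags are both related to sampling.
For probability sampling, TraceState carries two values: the r-value and the p-value.

TraceFlags is just an 8-bit field, in which only the sampled flag is specified.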
TraceState: Probability Sampling
W3C - Sampled Flag
I will add more once I understand these better.
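Since TraceFlags is a bit field, checking or keeping the sampled flag is a single bitwise AND. The sketch below assumes the W3C Trace Context definition, in which sampled is the least significant bit (`0x01`); the helper names are hypothetical. `propagatedFlags` mirrors the `sc.TraceFlags() & trace.FlagsSampled` line in the Inject code shown earlier.

```go
package main

import "fmt"

// flagsSampled is the sampled bit of trace-flags in W3C Trace Context.
const flagsSampled byte = 0x01

// isSampled reports whether the sampled bit is set.
func isSampled(flags byte) bool {
	return flags&flagsSampled == flagsSampled
}

// propagatedFlags clears every bit except sampled before the flags are
// written into an outgoing traceparent header.
func propagatedFlags(flags byte) byte {
	return flags & flagsSampled
}

func main() {
	for _, f := range []byte{0x00, 0x01, 0xff} {
		fmt.Printf("flags=%02x sampled=%v propagated=%02x\n", f, isSampled(f), propagatedFlags(f))
	}
}
```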

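Tags are K:V pairs that annotate and supplement a Span.
We can add K-V pairs for known scenarios, and you can even use those pairs to help locate logs and metrics (PromQL labels and values).
Known scenarios are usually business-related data.

Baggage is also a set of K:V pairs carried with the SpanContext (in OpenTracing it lives inside the SpanContext, while in OpenTelemetry it is a separate API carried in the Context), but this information is propagated to every Span along the request chain.
Its purpose is similar to Tags: it annotates and supplements the whole context, which helps when looking for logs and metrics.
Baggage is usually used for comparatively non-business data, such as UserID, AccountID ...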

Baggage API
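In OpenTelemetry, Baggage travels between services in its own `baggage` header (W3C Baggage), written as comma-separated `key=value` pairs with percent-encoded values. The sketch below shows the idea with the standard library only. It is an assumption-laden simplification: `encodeBaggage` and `decodeBaggage` are hypothetical helpers, keys are assumed to need no escaping, member properties (`;`) are not supported, and keys are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// encodeBaggage renders members as a baggage header value: k1=v1,k2=v2.
func encodeBaggage(members map[string]string) string {
	keys := make([]string, 0, len(members))
	for k := range members {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+url.PathEscape(members[k]))
	}
	return strings.Join(pairs, ",")
}

// decodeBaggage parses a header value produced by encodeBaggage.
func decodeBaggage(header string) (map[string]string, error) {
	out := map[string]string{}
	if header == "" {
		return out, nil
	}
	for _, pair := range strings.Split(header, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			return nil, fmt.Errorf("baggage: malformed member %q", pair)
		}
		val, err := url.PathUnescape(v)
		if err != nil {
			return nil, fmt.Errorf("baggage: bad value for %q: %w", k, err)
		}
		out[k] = val
	}
	return out, nil
}

func main() {
	header := encodeBaggage(map[string]string{
		"userId":    "alice",
		"accountId": "42",
		"plan":      "gold tier",
	})
	fmt.Println("baggage:", header)

	members, err := decodeBaggage(header)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("plan:", members["plan"])
}
```

Every service on the request path that forwards this header makes the same key-value pairs available downstream, which is what distinguishes Baggage from Tags set on a single Span.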