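A few days ago we introduced the link mechanism between processes.
We also saw that when a child process exits or dies, the failure can be isolated
so that it does not bring down its parent process or any process further up.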
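As a quick reminder of how this isolation works, here is a minimal sketch. The module name link_demo is hypothetical. run/0 traps exits, so the crash of a linked child reaches it as an ordinary {'EXIT', Pid, Reason} message. run_untrapped/0 starts a middle process that does not trap exits, so that process dies together with its crashing child.

```erlang
-module(link_demo).
-export([run/0, run_untrapped/0]).

%% Trapping exits: a linked child's crash becomes a message.
run() ->
    OldFlag = process_flag(trap_exit, true),
    Child = spawn_link(fun() -> exit(oops) end),
    Result =
        receive
            {'EXIT', Child, Reason} -> {child_died, Reason}
        after 1000 ->
            timeout
        end,
    %% restore the caller's previous trap_exit setting
    process_flag(trap_exit, OldFlag),
    Result.

%% Not trapping exits: the exit signal propagates over the link.
%% We watch the middle process with a monitor (not a link),
%% so the crash does not propagate to us.
run_untrapped() ->
    {Middle, MRef} =
        spawn_monitor(fun() ->
                              spawn_link(fun() -> exit(oops) end),
                              receive after infinity -> ok end
                      end),
    receive
        {'DOWN', MRef, process, Middle, Reason} -> {middle_died, Reason}
    after 1000 ->
        timeout
    end.
```

Calling link_demo:run() should return {child_died, oops}, and link_demo:run_untrapped() should return {middle_died, oops}.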
Building on this, we can design a fault-tolerance scheme along these lines:
                [Supervisor]
               /      |      \
    [Supervisor]   [Worker]   [Worker]
       |      \
   [Worker]  [Worker]
A Supervisor process monitors the Worker processes. If a Worker process dies,
the Supervisor can start a new Worker process to replace it.
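Before introducing the Supervisor code, let's first build a Worker.
Unlike the three simple functions a, b and c from Day 17, the Worker is a
typical server-style program: it has start/0, a loop/0 that calls itself
recursively to keep the service running, and an interface function, request/1,
through which the shell or other processes communicate with it.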
-module(times_ten).
-export([start/0, request/1, loop/0]).

start() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(times_ten, loop, []),
    register(ten, Pid),
    {ok, Pid}.
%
loop() ->
    % this process is also registered under the name ten
    receive
        {atom_request, Pid, Msg} ->
            % send the calculated result back to the interface function
            Pid ! {atom_result, Msg * 10}
    end,
    loop().
%
request(Int) ->
    % I am the interface:
    % send atom_request to loop (registered as ten)
    % and wait for its response in the receive block
    ten ! {atom_request, self(), Int},
    receive
        {atom_result, Result} ->
            Result;
        % ten may crash; if we are linked to it and trapping exits
        % (start/0 sets trap_exit), the crash arrives here as a message
        {'EXIT', _Pid, Reason} ->
            {atom_error, Reason}
    after 1000 ->
        atom_timeout
    end.
Test run:
1> c(times_ten).
{ok,times_ten}
2> times_ten:start().
{ok,<0.40.0>}
3> times_ten:request(10).
100
4> times_ten:request(ten).
{atom_error,{badarith,[{times_ten,loop,0,
[{file,"times_ten.erl"},{line,15}]}]}}
5>
=ERROR REPORT==== 19-Oct-2014::16:07:43 ===
Error in process <0.40.0> with exit value: {badarith,[{times_ten,loop,0,[{file,"times_ten.erl"},{line,15}]}]}
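It works as expected. Notice that the tuple returned at prompt 4 starts with atom_error. I prefixed the atom with atom_ on purpose;
this departs from normal Erlang convention and is only meant to make it obvious that this is an atom.
In real code, though, you should switch back to the conventional atom error, because code written by other people may expect
to receive {error, Reason}. The same applies to {atom_result, Result}.
Erlang conventions are terse, and you will often see patterns like {result, Result}. If you are used to other programming languages,
it is easy to read result as a variable, but here it is an atom: in Erlang, names that start with a lowercase letter are atoms,
while names that start with an uppercase letter (such as Result) are variables.
This is one of the reasons people often find Erlang code hard to read.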
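For comparison, here is a sketch of the same worker written with the conventional atoms. The module name times_ten_conv and the registered name ten_conv are hypothetical (chosen so it does not clash with ten), and returning {ok, Result} on success, to pair with {error, Reason}, is my own choice rather than something the original code does.

```erlang
-module(times_ten_conv).
-export([start/0, request/1, loop/0]).

%% Same structure as times_ten, but with conventional atoms.
start() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(?MODULE, loop, []),
    register(ten_conv, Pid),
    {ok, Pid}.

loop() ->
    receive
        {request, From, Msg} ->
            %% result is an atom; the computed value is the second element
            From ! {result, Msg * 10}
    end,
    loop().

request(Int) ->
    ten_conv ! {request, self(), Int},
    receive
        {result, Result} ->
            {ok, Result};
        %% only delivered if we are linked to ten_conv and trapping exits
        {'EXIT', _Pid, Reason} ->
            {error, Reason}
    after 1000 ->
        {error, timeout}
    end.
```

In the same shell that called times_ten_conv:start(), request(10) should return {ok, 100}, and request(ten) should return {error, {badarith, ...}}, which callers can match with the usual {error, Reason} pattern.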
A supervisor example:
-module(my_supervisor).
-export([start_link/2, stop/1, init/1]).

start_link(Name, ChildSpecList) ->
    register(Name, spawn_link(my_supervisor, init, [ChildSpecList])),
    ok.
%
stop(Name) ->
    % stop the supervisor via the name registered in start_link
    Name ! {stop, self()},
    receive {reply, Reply} ->
        Reply
    end.
%
init(ChildSpecList) ->
    process_flag(trap_exit, true),
    loop(start_children(ChildSpecList)).
%
start_children([]) -> [];
start_children([{M, F, A} | ChildSpecList]) ->
    % M: Module, F: Function, A: Args
    % take the head {M,F,A} of the list and invoke it;
    % children that fail to start are skipped
    case (catch apply(M,F,A)) of
        {ok, Pid} ->
            [{Pid, {M,F,A}} | start_children(ChildSpecList)];
        _ ->
            start_children(ChildSpecList)
    end.
%
restart_child(Pid, ChildList) ->
    % look up how the dead child was started, start it again,
    % and replace its entry in the child list
    {Pid, {M,F,A}} = lists:keyfind(Pid, 1, ChildList),
    {ok, NewPid} = apply(M,F,A),
    [{NewPid, {M,F,A}} | lists:keydelete(Pid, 1, ChildList)].
%
loop(ChildList) ->
    receive
        {'EXIT', Pid, _Reason} ->
            NewChildList = restart_child(Pid, ChildList),
            loop(NewChildList);
        {stop, From} ->
            From ! {reply, terminate(ChildList)}
    end.
%
terminate([{Pid, _} | ChildList]) ->
    % send a kill signal to every child in the list with exit/2
    exit(Pid, kill),
    terminate(ChildList);
terminate(_ChildList) -> ok.
Test run:
1> my_supervisor:start_link(miku, [{times_ten, start, []}]).
ok
2> times_ten:request(10).
100
3> whereis(ten).
<0.36.0>
4> exit(whereis(ten), kill).
true
5> times_ten:request(5).
50
6> whereis(ten).
<0.40.0>
7> my_supervisor:stop(miku).
ok
8> whereis(ten).
undefined
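Run this together with the times_ten module above. Note that times_ten registers itself as ten.
Here we use the BIF whereis/1, which looks up a registered name and returns the corresponding Pid
(or undefined if no process is registered under that name).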
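How it works:
1> Start a supervisor named miku; it also starts times_ten and puts it in its ChildList.
2> Call request; times_ten works normally.
3> Look up the Pid of times_ten.
4> Use exit/2 to send kill to times_ten, forcing it to exit.
5> Call request again; times_ten still accepts and answers requests. The supervisor, which traps exits,
   received the 'EXIT' message and called restart_child, which runs times_ten:start() again and re-registers ten.
6> Look up the current Pid of times_ten: it differs from the previous one.
7> Tell the supervisor miku to stop; before miku exits, it kills its children as well.
8> Looking up ten now returns undefined, which means the child has exited too.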
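This is the most complex code so far, and it is still only a prototype.
You may be thinking that implementing supervision by hand like this is complicated.
Don't worry: Erlang comes with OTP (Open Telecom Platform).
When we need to build a server, using OTP is more convenient and more robust
than building everything ourselves.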
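As a rough preview, here is how a one-child setup like the miku example might look with OTP's supervisor behaviour. This is a sketch that assumes OTP 18 or later (map-based supervisor flags and child specs). The module ten_sup, its worker functions and the registered name ten_otp are hypothetical; the worker lives in the same module only so the example is self-contained.

```erlang
-module(ten_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0, worker_loop/0]).

%% Start the supervisor, registered locally as ten_sup.
start_link() ->
    supervisor:start_link({local, ten_sup}, ?MODULE, []).

%% supervisor callback: restart strategy plus one child spec.
init([]) ->
    SupFlags = #{strategy => one_for_one, % restart only the child that died
                 intensity => 5,          % allow at most 5 restarts...
                 period => 10},           % ...within 10 seconds
    Child = #{id => ten_worker,
              start => {?MODULE, start_worker, []},
              restart => permanent,       % always restart it
              shutdown => brutal_kill,    % stop it with exit(Pid, kill)
              type => worker,
              modules => [?MODULE]},
    {ok, {SupFlags, [Child]}}.

%% Hypothetical worker in the style of times_ten: the start function
%% must spawn_link the child and return {ok, Pid}.
start_worker() ->
    Pid = spawn_link(?MODULE, worker_loop, []),
    register(ten_otp, Pid),
    {ok, Pid}.

worker_loop() ->
    receive
        {request, From, N} -> From ! {result, N * 10}
    end,
    worker_loop().
```

Since times_ten:start/0 also spawn_links its process and returns {ok, Pid}, pointing start at {times_ten, start, []} should work the same way. The intensity and period settings cap how often children may be restarted, something our my_supervisor prototype does not do: it keeps restarting forever.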