Describe the bug
Due to a design flaw in the iworker model, under rare conditions eCapture might hang. This bug was discovered through code review; to prove it reliably, I designed a PoC (Proof of Concept).
To Reproduce
This is my poc commit ruitianzhong@a2e4111.
It essentially does two things:
Firstly, add a unit test that writes an event (forcing the EventProcessor.Serve() goroutine to create a worker) and then sleeps for two seconds (waiting for the worker to be deleted). Finally, the unit test writes a large number of identical events (a sketch of such a test appears below).
Secondly, inject some delay into the following code before ew.Close() (in pkg/event_processor/iworker.go):
if ew.tickerCount > MaxTickerCount {
	// ew.processor.GetLogger().Printf("eventWorker TickerCount > %d, event closed.", MaxTickerCount)
	time.Sleep(time.Second * 15) // inject some delay here (just to show the possible interleaving)
	ew.Close()
	return
}
That is reasonable, because we should assume that a goroutine might be scheduled out at any time.
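For illustration, a rough sketch of what such a test could look like is shown below. The names NewEventProcessor, Write, and fakeEvent are assumptions made for the sake of the example rather than the exact API; the real test is in the poc commit above.

package event_processor

import (
	"fmt"
	"os"
	"testing"
	"time"
)

// Sketch of the PoC test described above; NewEventProcessor, Write and
// fakeEvent are assumed names, not necessarily eCapture's real API.
func TestHang(t *testing.T) {
	ep := NewEventProcessor(os.Stdout, true) // assumed constructor
	go ep.Serve()

	e := &fakeEvent{uuid: "same-uuid"} // assumed test event with a fixed UUID

	// Write one event so Serve() creates a worker for this UUID.
	ep.Write(e)

	// Sleep so the worker reaches its ticker limit and enters the window
	// just before ew.Close() (stretched by the injected 15s delay).
	time.Sleep(2 * time.Second)

	// Flood the processor with events carrying the same UUID. They are
	// routed to the retiring worker, whose 16-slot channel fills up,
	// after which Serve() blocks forever on the send.
	for i := 0; i < 10000; i++ {
		fmt.Println("begin Write: if you never see 'end Write', the Serve routine has hung")
		ep.Write(e)
		fmt.Println("end Write")
	}
}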
On the poc commit, run
go test -v ./pkg/event_processor/ -run TestHang
The output:
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
Obviously, eCapture hangs.
Expected behavior
eCapture might be slowed down, but should not hang.
At the same time, multiple events from the same process might be generated. They are passed first to the EventProcessor.Serve routine through the ep.incoming channel (buffer size 1024), and then sent to the Worker routine through the ew.incoming channel (buffer size 16). Events with the same UUID are allocated to the same worker, i.e. the one that has reached the code shown above but has not yet executed ew.Close(). Because that worker is no longer receiving from its channel at this moment, each event is placed in the channel's buffer, so the send does not block for now. However, these events will be lost, because the worker will never handle the events in its buffer. Moreover, if many events are generated, the ew.incoming buffer (size 16) fills up and the EventProcessor.Serve routine is the first to hang, with no way to recover.
Again, that is because the worker is about to be removed and will never receive events from that channel.
With the EventProcessor.Serve routine hung, the main routine that generates events will ultimately hang as well, because the ep.incoming channel (buffer size 1024) will fill up sooner or later.
From top to bottom, the event flow is:

events generated by the main routine
        |
        |  ep.incoming channel (buffer 1024)
        v
EventProcessor.Serve routine
        |
        |  ew.incoming channel (buffer 16)
        v
Worker routine
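The blocking behaviour itself is plain Go channel semantics. The following self-contained sketch, deliberately independent of the eCapture code base, shows how a sender hangs once a buffered channel's receiver has stopped reading and the buffer is full:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Keep an unrelated goroutine alive so the Go runtime's deadlock
	// detector does not fire; in a real program such as eCapture there
	// are always other live goroutines, so the process simply hangs.
	go func() {
		for {
			time.Sleep(time.Minute)
		}
	}()

	// A worker with a small buffered inbox, analogous to ew.incoming (16).
	inbox := make(chan int, 16)

	// The worker reads a single event and then stops receiving,
	// analogous to a worker that is about to call ew.Close() and
	// will never read from its channel again.
	go func() {
		<-inbox
	}()

	// The "Serve" side keeps sending events for the same UUID to that
	// worker. Early sends land in the buffer; once the buffer is full
	// (around the 18th send), the send blocks forever.
	for i := 0; i < 32; i++ {
		fmt.Println("sending", i)
		inbox <- i
		fmt.Println("sent", i)
	}

	fmt.Println("done") // never reached
}

Replace inbox with ew.incoming and the sending loop with EventProcessor.Serve, and this is exactly the interleaving described above.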
Root Cause
The hang is due to the fact that the worker returned by getWorkerByUUID() may already be retired (i.e., it will never read from its channel again). It should be fixed.
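One possible direction, sketched here only as an illustration and not as the project's actual fix, is to make the send from Serve to a worker non-blocking, so the caller can detect a retiring worker and drop the event or allocate a fresh worker instead of blocking forever. The types below are hypothetical stand-ins for eCapture's real ones:

package event_processor

// Hypothetical stand-ins for eCapture's real types; only the
// select/default pattern below is the point of this sketch.
type IEventStruct interface{}

type eventWorker struct {
	incoming chan IEventStruct
}

// trySend attempts a non-blocking send to the worker's channel. If the
// buffer is full (for example because the worker is being retired and no
// longer reads), it returns false instead of blocking the Serve goroutine
// forever; the caller can then drop the event or allocate a fresh worker.
func trySend(ew *eventWorker, e IEventStruct) bool {
	select {
	case ew.incoming <- e:
		return true
	default:
		return false
	}
}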