Describe the bug
Due to a design flaw in the iworker model, under rare conditions eCapture might hang. This bug was discovered through code review; to prove it reliably, I designed a PoC (Proof of Concept).
To Reproduce
This is my poc commit ruitianzhong@a2e4111.
It essentially does two things:
Firstly, add a unit test that writes an event (forcing the EventProcessor.Serve() goroutine to create a worker) and then sleeps for two seconds (waiting for the worker to be deleted). Finally, the unit test writes a large number of identical events (a sketch of such a test appears below).
Secondly, inject some delay into the following code before ew.Close() (in pkg/event_processor/iworker.go):
if ew.tickerCount > MaxTickerCount {
	// ew.processor.GetLogger().Printf("eventWorker TickerCount > %d, event closed.", MaxTickerCount)
	time.Sleep(time.Second * 15) // inject some delay here (just to show the possible interleaving)
	ew.Close()
	return
}
That is reasonable, because we should assume that a goroutine might be scheduled out at any time.
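For illustration, a rough sketch of what such a test could look like is shown below. The names NewEventProcessor, Write, and fakeEvent are assumptions made for the sake of the example rather than the exact API; the real test is in the poc commit above.

package event_processor

import (
	"fmt"
	"os"
	"testing"
	"time"
)

// Sketch of the PoC test described above; NewEventProcessor, Write and
// fakeEvent are assumed names, not necessarily eCapture's real API.
func TestHang(t *testing.T) {
	ep := NewEventProcessor(os.Stdout, true) // assumed constructor
	go ep.Serve()

	e := &fakeEvent{uuid: "same-uuid"} // assumed test event with a fixed UUID

	// Write one event so Serve() creates a worker for this UUID.
	ep.Write(e)

	// Sleep so the worker reaches its ticker limit and enters the window
	// just before ew.Close() (stretched by the injected 15s delay).
	time.Sleep(2 * time.Second)

	// Flood the processor with events carrying the same UUID. They are
	// routed to the retiring worker, whose 16-slot channel fills up,
	// after which Serve() blocks forever on the send.
	for i := 0; i < 10000; i++ {
		fmt.Println("begin Write: if you never see 'end Write', the Serve routine has hung")
		ep.Write(e)
		fmt.Println("end Write")
	}
}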
On the poc commit, run
go test -v ./pkg/event_processor/ -run TestHang
The output:
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
end Write
begin Write if you can not see end later, it mean eventProcessor.Serve routine might hang
Obviously, eCapture hangs.
Expected behavior
eCapture might be slowed down, but should not hang.
At the same time, multiple events from the same process might be generated. They are passed first to the EventProcessor.Serve routine through the ep.incoming channel (buffer size 1024), and then sent to the Worker routine through the ew.incoming channel (buffer size 16). Events with the same UUID are allocated to the same worker, i.e. the one that has reached the code shown above but has not yet executed ew.Close(). Because that worker is no longer receiving from its channel at this moment, each event is placed in the channel's buffer, so the send does not block for now. However, these events will be lost, because the worker will never handle the events in its buffer. Moreover, if many events are generated, the ew.incoming buffer (size 16) fills up and the EventProcessor.Serve routine is the first to hang, with no way to recover.
Again, that is because the worker is about to be removed and will never receive events from that channel.
With the EventProcessor.Serve routine hung, the main routine that generates events will ultimately hang as well, because the ep.incoming channel (buffer size 1024) will fill up sooner or later.
From top to bottom, the event flow is:

events generated by the main routine
        |
        |  ep.incoming channel (buffer 1024)
        v
EventProcessor.Serve routine
        |
        |  ew.incoming channel (buffer 16)
        v
Worker routine
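The blocking behaviour itself is plain Go channel semantics. The following self-contained sketch, deliberately independent of the eCapture code base, shows how a sender hangs once a buffered channel's receiver has stopped reading and the buffer is full:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Keep an unrelated goroutine alive so the Go runtime's deadlock
	// detector does not fire; in a real program such as eCapture there
	// are always other live goroutines, so the process simply hangs.
	go func() {
		for {
			time.Sleep(time.Minute)
		}
	}()

	// A worker with a small buffered inbox, analogous to ew.incoming (16).
	inbox := make(chan int, 16)

	// The worker reads a single event and then stops receiving,
	// analogous to a worker that is about to call ew.Close() and
	// will never read from its channel again.
	go func() {
		<-inbox
	}()

	// The "Serve" side keeps sending events for the same UUID to that
	// worker. Early sends land in the buffer; once the buffer is full
	// (around the 18th send), the send blocks forever.
	for i := 0; i < 32; i++ {
		fmt.Println("sending", i)
		inbox <- i
		fmt.Println("sent", i)
	}

	fmt.Println("done") // never reached
}

Replace inbox with ew.incoming and the sending loop with EventProcessor.Serve, and this is exactly the interleaving described above.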
Root Cause
The hang is due to the fact that the worker returned by getWorkerByUUID() may already be retired (i.e., it will never read from its channel again). It should be fixed.
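One possible direction, sketched here only as an illustration and not as the project's actual fix, is to make the send from Serve to a worker non-blocking, so the caller can detect a retiring worker and drop the event or allocate a fresh worker instead of blocking forever. The types below are hypothetical stand-ins for eCapture's real ones:

package event_processor

// Hypothetical stand-ins for eCapture's real types; only the
// select/default pattern below is the point of this sketch.
type IEventStruct interface{}

type eventWorker struct {
	incoming chan IEventStruct
}

// trySend attempts a non-blocking send to the worker's channel. If the
// buffer is full (for example because the worker is being retired and no
// longer reads), it returns false instead of blocking the Serve goroutine
// forever; the caller can then drop the event or allocate a fresh worker.
func trySend(ew *eventWorker, e IEventStruct) bool {
	select {
	case ew.incoming <- e:
		return true
	default:
		return false
	}
}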