|
robert Wanderer
Joined: 25 Feb 2004 Posts: 98
|
Posted: Wed Oct 31, 2007 3:40 am
[2.09] 0% CPU locks still |
First, I am using Vista 64-bit Home Premium, running CMUD Pro 2.09. I have MANY #GAG trigs, a few hundred, of which up to 40-50ish can be executed every 2 seconds. Over a few hours this will cause a 0% CPU lock which is solved by hitting escape when I notice the issue. I know 0% thread issues have been reported before, but I am just trying to make sure it is known that this is still an issue. FYI, having to hit escape isn't a good solution since my triggers all stop firing which can result in death for me fairly quickly.
Thanks for you hard work and great attitude Zugg, I can't wait to see the future improvements like the Mapper changes this year! |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed Oct 31, 2007 6:12 pm |
Yep, I actually finally got this to happen to me again last night. For me it usually happens on one of my many #NOMAP triggers, but last night it happened on one of my #capture/#gag triggers. So I think it actually happens on *any* trigger, but it's just a small random chance of happening. So, the more triggers you have, the higher the chance. It still looks like some sort of thread synchronization issue, although I'm still having a hard time determining exactly how it's happening. It's annoying that it takes so many hours to get it to happen. I've been leaving CMUD running all night every night to try and track it down.
|
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Oct 31, 2007 8:10 pm |
This did it for me quite rapidly. In a blank session.
Code: |
#LOOP 1000 {#EXEC {%concat("#TRIGGER ",%char(34),%i,%char(34)," {test} {#ADD a 1} {} {}")}} |
Then using #SHOW test. I ended up with it hanging and the result of a was 1999. |
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Oct 31, 2007 8:23 pm |
That is wierd, the first time I did it I got a lock up quite rapidly, and now I can't get one at all but I did find 2 other bugs the complain about in the process. Maybe I will still think of something refine that with.
|
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Arde Enchanter
Joined: 09 Sep 2007 Posts: 605
|
Posted: Wed Oct 31, 2007 8:41 pm |
Using Vijilante' example and assuming you have "Autoclear cmd line" option on, send #SHOW test until lockup occurs. After this point send "#SHOW test" several times (for ex., 10 times) using UpArrow + Enter key sequence and, finally, release CMUD with Esc. There is high probability that you 'll receive several stopped threads (use #THREAD to see). I always have 2 stopped threads, no matter how many #SHOW test commands I send.
Update: well, you may try #LOOP 5000... instead of all that staff with Up+Enter almost with the same result, but having so many firing triggers will make CMUD far less responsible to user and you'll have tough time trying understand either it is CMUD hang or it just busy calculating a+1
Update 2: heh, I take my words back about only 2 stopped threads... |
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Oct 31, 2007 9:03 pm |
After making the triggers with the first one I used this for testing.
Code: |
#LOOP 1000 {#VAR $t1 {%eval(%time("n*60000+s*1000+z"))};#SHOW test;#VAR $t2 {%eval(%time("n*60000+s*1000+z"))};#SHOW %i time %eval($t2-$t1)} |
|
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Arde Enchanter
Joined: 09 Sep 2007 Posts: 605
|
Posted: Wed Oct 31, 2007 9:09 pm |
Hm, is it normal that interface thread change it's # from 1?
5147 [u] untitled running #THREAD |
|
_________________ My personal bug|wish list:
-Wrong Priority when copy-paste setting
-1 prompt trigger for Mapper, Session and General Options, not 3 different!
-#SECTION can terminate threads
-Buttons can't start threads |
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Oct 31, 2007 9:14 pm |
This ultimately ended very badly, after a few lockups CMud become a rampant memory eating machine. I also did some other tests during that particular launch, which is why the time indicators are on there. So I will have to try to seperate out what was going on, but so far it is about the best way I have found to replicate the lockup.
|
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed Oct 31, 2007 9:19 pm |
Excellent! Using both of Vijilante's scripts, I was also able to reproduce it. The first time it didn't happen until loop interation 83!
I was able to figure out the cause. I am using a variation of the Delphi TThread component for my threads from the Indy Internet components called IdThread. It adds a few features to the normal TThread. In any case, the Start event for the thread calls the Windows Resume event as expected. But Windows counts the Suspends and Resumes for a thread. For example, if you Suspend twice, and then call Resume once, the thread is still suspended.
There seems to be some cases when reusing a thread in my thread pool where Suspend has gotten called more than once. I really have no idea how this can happen. But I changed the Start procedure so that it continues to call Resume until the thread is actually started. So now Start will execute the thread no matter how many times it was suspended.
This seems to have fixed the problem. What was happening was that since the old Start didn't actually resume the thread, the event to signal the end of the thread never occurred, so the main thread was waiting on it forever.
Anyway, with this fix Vijilante's second script executed all 1000 iterations without any hang. So I think it's finally fixed now.
Thanks very much for posting this procedure. It's similar to something else I was trying, but I just wasn't giving it enough triggers and enough iterations. And it's definitely something that would depend upon your computer's speed. |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed Oct 31, 2007 9:22 pm |
Quote: |
Hm, is it normal that interface thread change it's # from 1? |
That means that it's not able to reuse an existing thread, so it creates another. When CMUD needs a thread, it checks it's pool to see if the last thread has completed. If it was completed, then it gets reused, otherwise it creates a new thread.
Quote: |
assuming you have "Autoclear cmd line" option on, send #SHOW test until lockup occurs. After this point send "#SHOW test" several times (for ex., 10 times) using UpArrow + Enter key sequence |
Of course, it's even easier if you turn the "Autoclean" option OFF and then type #SHOW test and then just keep pressing Enter to keep sending the same command over and over again. |
|
|
|
Arde Enchanter
Joined: 09 Sep 2007 Posts: 605
|
Posted: Wed Oct 31, 2007 9:29 pm |
Zugg wrote: |
Quote: |
assuming you have "Autoclear cmd line" option on, send #SHOW test until lockup occurs. After this point send "#SHOW test" several times (for ex., 10 times) using UpArrow + Enter key sequence |
Of course, it's even easier if you turn the "Autoclean" option OFF and then type #SHOW test and then just keep pressing Enter to keep sending the same command over and over again. |
Sure thing. Personally I don't like to change global settings for tests if there is another way. |
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Oct 31, 2007 9:40 pm |
When I was testing it with that I sometimes did manage to get through all 1000 on the second part of the test, although ussually it didn't make it.
One of the more interesting things that happened when I got on of the lock ups I hit ESC and after a while CMud returned to my control. I checked #THREAD and the counter was up somewhere around 10k with over a 2 screenfuls of stopped threads, but CMud only had 14 threads according to the task manager. I then entered just #SHOW test on the command line and the previous loop started again, but its counter went in revserse 1000 down to 1.
I am thinking there may have been some other bugs hiding in there, because that list of results is just too weird to be explained. |
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed Oct 31, 2007 10:32 pm |
#THREAD is showing the internal pool of thread objects...it doesn't correspond to the Task Manager. If #THREAD shows that a thread is stopped, then that means the object still exists in my pool, but isn't a real thread as far as windows is concerned.
In other words, my thread classes are "super classes" that contain the Windows Thread object as a property. If that Windows Thread object is destroyed, then you can still have one of my super classes in my thread pool list with just a nil value for the Windows thread.
Once you get one of the lockups, things in the thread pool are probably going to be messed up. Also, pressing ESC with the script that you are running above also causes the #LOOP to abort. There might be a separate bug regarding that.
I'll give it an even larger loop value and run it overnight just to be sure ;)
Quote: |
Personally I don't like to change global settings for tests |
The AutoClear option should be Disabled by default in the default settings. |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed Oct 31, 2007 10:43 pm |
Has anyone dared to run these two scripts in zMUD to see how much slower it is?
|
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed Oct 31, 2007 10:50 pm |
Btw, the thread number that is shown in the first column of the #THREAD display is used to keep track of how many threads CMUD has created in the pool. It is normal for this value to be greater than 1.
There are many thread timing issues that can cause the previous thread to be unavailable for re-use. When CMUD tries to reuse the previous thread, it checks to make sure the previous thread is stopped/suspended. The Windows Event signal that is used to tell CMUD that a thread is finished is set before the thread itself is suspended. Thus, you can get a case where the previous thread "finished" (signaled it's Windows event), but hasn't actually been suspended by Windows yet. In this case, CMUD will not reuse the thread. CMUD just marks the previous thread to free itself when it is terminated, and then marks the thread as terminated. When Windows gets around to suspending the thread, it sees that it is marked as terminated and gets rid of it.
In the meantime, CMUD creates a new thread since it couldn't reuse the previous one. That causes the thread number counter to get incremented.
Probably more information that you wanted, but I just wanted to explain a bit more so you'd understand how this works and why it's normal to see a value > 1 in the #THREAD display.
A bit more information on #THREAD: Each window has two thread pools...one for the command line, and one for triggers (and other settings, such as macros). So, it can be normal to see these additional threads in the #THREAD display. I've modified this in v2.10 so that the Windows Task Manager will show a name for each thread based upon the window that created it and whether it's a ComLine thread or not. |
|
|
|
Iceclaw Apprentice
Joined: 11 Sep 2005 Posts: 124
|
Posted: Thu Nov 01, 2007 12:33 am |
While I'm still trying to wrap my brain arround the thread understandings, I think if you can fix this bug I'd be ready to call CMUD safe enough for my general mud scripting purposes. There's nothing like having it lock up in the middle of a fight with no sign it's locked up for quite a while :)
|
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Thu Nov 01, 2007 3:04 am |
OK, this is kind of strange. I have run the second script with #LOOP 10000 instead of #LOOP 1000 twice now, and both times it stopped at about 3300 or so (Maybe 3323? it's already scrolled off the screen). It wasn't hung at that point, and #THREAD only showed a single thread. Then, I got the fun idea of just typing:
#SHOW test
to see what it would do. And then it started showing 10000 and is now counting down, with the "time" value at each step showing 0 or 16 (the minimum amount of time you can detect in a Windows tick).
So, somehow all of the messages were queue up in the internal line queue (the lines received from the MUD, or from the #SHOW command), and now all of those queued messages are spitting out.
Since the #SHOW command inserts it's text at the front of the queue, this would explain why it is counting backwards. The output for when the counter reached 10000 is at the head of the queue.
Not sure how the message queue got "stuck", but it looks like a complete different problem from the thread hang. Too bad it takes so long to get it to happen, but at least this procedure seems to cause it. Hopefully it's just a case of some access to the internal line buffer not being thread protected and it will be easy to find...crossing my fingers. |
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Thu Nov 01, 2007 4:26 am |
I am glad you bumped into the bugs I thought were hiding. It is nice to see that the last of the threading issues is going to be squashed.
|
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Thu Nov 01, 2007 8:58 pm |
Code: |
%time("n*60000+s*1000+z") |
I didn't even know that you could do that kind of math within the %time function. Even I learn something new everyday ;)
For timing, I usually use %secs to get a millisecond count. In fact, in v2.10 the %secs function uses the high precision timer value from the Windows QueryPerformanceCount API, so it will actually be accurate to one millisecond.
Btw, you don't need the %eval in your final #SHOW statement. Here is my current test script:
Code: |
#LOOP 1000 {#EXEC {%concat("#TRIGGER ",%char(34),%i,%char(34)," {test} {#ADD a 1} {} {}")}}
#LOOP 10000 {#VAR $t1 %secs;#SHOW test;#VAR $t2 %secs;#SHOW %i time ($t2-$t1)} |
Simple and easy. I'm still waiting for it to fail. |
|
|
|
robert Wanderer
Joined: 25 Feb 2004 Posts: 98
|
Posted: Thu Nov 01, 2007 9:35 pm |
Wow, this thread got busy! Great work guys, can't wait to try out these fixes in 2.10. I will try to follow Vijilante's example in the future and try to come up with a good way of reproducing the issue which was quite impressive.
|
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Thu Nov 01, 2007 9:44 pm |
OK, this bug is getting really weird. I got it to fail again. The script is still running...it's not a thread lock problem. But somehow the Windows message queue is getting messed up.
When #SHOW is called, it adds the line to the output queue, then posts a message to process this queue:
PostMessage( Handle, wm_NextLine, 0, 0)
and then #SHOW immediately calls my routine "ProcessNewLineNow" which does this:
if PeekMessage( Msg, Handle, wm_NextLine, wm_NextLine, PM_REMOVE) then
...send line in queue to screen
However, when I debug it after the message queue gets screwed up, I can see it calling PostMessage ok, but then when it gets to the PeekMessage, it returns FALSE, claiming that the message isn't in the queue! I'm not sure how this is possible. I guess I need to grab the return result from PostMessage to see if it is failing for some reason.
And yes, unlike previous problems with PeekMessage, it *is* calling Peek from the main thread. |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Thu Nov 01, 2007 10:32 pm |
OK, I got it to fail again and this time captured the error message. Calling PostMessage is returning the error:
"Not enough quota is available to process this command"
So, it looks like some sort of resource limit. CMUD is only taking about 140MB at this point, so there is plenty of memory on the system.
My best guess is that the message queue is full. Since we are in a long #LOOP command, CMUD isn't processing any Windows messages while the loop is executing (because it needs to prevent any text from the MUD from getting processed while a trigger is busy).
If that is the case, then this isn't something people are going to encounter during normal CMUD operation...it's an artifact of the huge processing loop that we are testing with here. I might be able to improve it by allowing the application to process *some* messages, such as WM_PAINT, and maybe even mouse-move events. I'll play with this a bit more and see what I can do. |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Thu Nov 01, 2007 11:44 pm |
Well, I guess I've reached the limits of this particular test.
I don't know what "quota" limit is being reached. According to my research, the default message queue size is "no limit". While a long loop is being executed like this, I can't allow most Windows events to be processed. Even allowing WM_PAINT and WM_MOUSEMOVE didn't fix the problem.
Part of the problem is that when a thread is completed, it gets marked as finished and a Windows message is posted to clean up the thread and remove it from the pool. Since messages are not being processed, this message doesn't get executed. So, during the loop, more and more internal thread objects (not actual Windows threads) are being created and not freed. So this leads to the memory usage of CMUD increasing during the loop. Once the loop is finished, then all of those messages get processed.
It's sort of like the #MESS loop posted in another thread. CMUD just can't handle really long loops like this which lock the user interface. In practice, calculations that require a long time to run should be placed into the background using a #WAIT command at the beginning of the script (or by pressing Shift-Esc on the command line)
I'm going to repeat this test tonight and press Shift-Esc to allow the loop to run in the background. This should be a better test to see if the original thread hang problem is really fixed, and to see if there are any other issues hiding.
But just keep in mind when you are creating test cases that running a really long loop on the command line and causing CMUD to stop processing Windows messages is a bad idea and isn't going to represent "normal" CMUD usage.
P.S. I can already tell with this loop running in the background that memory usage isn't going up like it was before, so maybe this will get past the 3400 mark. |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Fri Nov 02, 2007 1:12 am |
Woohoo! Got to 10,000 with no hang, and no memory leaks. Memory is still at the same 49Mb that it was when I started the test. So, I'm calling the original 0% CPU hang issue fixed in this case. While it's always possible there are some command or function specific threadsafe issues, at least the main hang that required the ESC is fixed. I'll let it run overnight just to be sure, but it's much better than it was when I started with 2.09.
|
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Fri Nov 02, 2007 1:25 am |
Btw, I *did* get the courage to run this in zMUD:
1) The creation of the 1000 triggers was faster in zMUD, as expected (it doesn't need to create the database records, which is slow)
zMUD 7.21: 800 ms
CMUD 2.09: 2600 ms
CMUD 2.10: 2050 ms
2) The actual loop comparison, each step:
zMUD 7.21: 800-900ms
CMUD 2.09: 650-700ms
CMUD 2.10: 150-200ms
So you can also see that my fixes have also improved the performance even over 2.09. But zMUD actually did pretty well with this test. I was a bit surprised. Of course, zMUD scales much worse than CMUD, so if you make it 2000 triggers instead of 1000, you'll see less that just half the speed. |
|
|
|
|
|