[warped-users] Warped v1.02 simulation aborts

Dale E. Martin dmartin at cliftonlabs.com
Tue Nov 9 11:24:54 EST 2004


> I'm working on the development of a simulation tool built on top of Warped
> 1.02.

It's been a _long_ time since I've worked on parallel simulation on top of
warped 1.02...
 
> I'm trying to use the Time Warp kernel, but am running into some
> technical difficulties. (Please let me know if you need more technical
> information about this. Although I'd be glad to give you more
> information, I don't want to provide unnecessary details and make this
> message too long)
 
OK.

> In particular, the problem arises when a straggler message is received.

Has the application successfully processed non-stragglers prior to this
error?

> After detecting it, the program aborts (almost immediately, I believe).
> This situation happens regardless of the number of LPs and simulation
> objects used.
> 
> Before aborting, I get the following output:
> 
> (LTSFInputQueue print called)
> List[0] = 0x84399c0 sTime: 00:00:00:000 rTime: 00:00:00:000 sendID: 453
> dest: 236 Processed: 1 sign: + eventId: 0
> List[1] = 0x8457ce0 sTime: 00:00:00:000 rTime: 00:00:00:000 sendID: 453
> dest: 236 Processed: 1 sign: + eventId: 85
> List[2] = 0x846a5d0 sTime: 00:00:00:000 rTime: 00:00:00:000 sendID: 453
> dest: 236 Processed: 1 sign: + eventId: 172
> ...
> List[2396] = 0x85b4e28 sTime: 00:00:00:100 rTime: 00:00:00:100 sendID: 315
> dest: 453 Processed: 0 sign: + eventId: 6
> currentPos : 0x85c60b0 sTime: 00:00:00:100 rTime: 00:00:00:100 sendID: 453
> dest: 293 Processed: 0 sign: + eventId: 1928
> insertPos : 0x85c66b0 sTime: 00:00:00:000 rTime: 00:00:00:000 sendID: 448
> dest: 449 Processed: 0 sign: + eventId: 65
> (the list has more than 2000 elements)

(So I guess it must have processed some events successfully for the list to
hold all of these elements.)
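
For what it's worth, the dump shows insertPos with rTime 00:00:00:000 and
currentPos with rTime 00:00:00:100, so the incoming event sorts ahead of the
current position.  That is what I would expect to see when a straggler
arrives, although whether it really is a straggler for object 449 depends on
that object's own local time.  A minimal sketch of the comparison in Python
(not warped code; the function names and the reading of the stamp as four
colon-separated integer fields, most significant first, are assumptions):

```python
def parse_time(stamp):
    """Split a stamp such as '00:00:00:100' into a tuple of ints.

    Assumption: the four colon-separated fields run from most to least
    significant, so comparing the tuples gives time order.
    """
    return tuple(int(field) for field in stamp.split(":"))


def sorts_before_current(insert_rtime, current_rtime):
    """True if the event being inserted has an earlier receive time than
    the event at the queue's current position."""
    return parse_time(insert_rtime) < parse_time(current_rtime)


# Receive times copied from the insertPos and currentPos lines of the dump.
print(sorts_before_current("00:00:00:000", "00:00:00:100"))
```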

>     p4_error: latest msg from perror: Bad file descriptor
> p3_10699: (26.439217) net_recv failed for fd = 140

I assume this is printed after the abort?  I think that's just MPI reporting
that the process on the other end of the connection aborted.

Have you tried loading your application into gdb on multiple nodes and
seeing what is causing the abort by walking back up the stack?  Or doing
"gdb <application-exe> core" and "where" to get a stack dump?
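
If you would rather collect the stack dump without an interactive session,
something like the following Python sketch builds the equivalent gdb command
line.  It assumes a gdb new enough to support the -batch and -ex options; the
helper name and the "./my_sim" path are placeholders, not anything from
warped.

```python
import shlex


def gdb_backtrace_command(executable, core_file="core"):
    """Build the argv for a non-interactive gdb run that prints the stack.

    -batch makes gdb exit once its commands have run; -ex supplies the
    command to execute ("where" is an alias for "backtrace").
    """
    return ["gdb", "-batch", "-ex", "where", executable, core_file]


# "./my_sim" stands in for the real application executable.
argv = gdb_backtrace_command("./my_sim")
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be handed to subprocess.run on each node that left a
core file behind.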

Thanks,
	Dale
-- 
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
dmartin at cliftonlabs.com
http://www.cliftonlabs.com
pgp key available



