Too much exclusive CPU time counted at swapcontext function

807578, Jun 25 2009 (edited Jul 29 2009)
Hi,

I'm using the Sun Studio Express March 2009 build, specifically the Performance Analyzer, and I have observed some hard-to-explain CPU time attributed to the swapcontext function in the libc library.

Here are my machine specs:
Two Intel Xeon E5320 processors with 16GB memory
SUSE Linux Enterprise Server 10 SP1 (x86_64)

My program consists of 8 threads (pthreads), and around a hundred user contexts (coroutines, or fibers) run on each thread. I'm using makecontext/swapcontext to create and schedule the user contexts on the threads; a minimal sketch of the pattern follows.
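This is the idea in miniature (a standalone sketch, not code from my real application): one extra user context on a single thread, ping-ponging with the main context.

/* Minimal makecontext/swapcontext example: main and one user context
 * ping-pong three times. */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task(void)
{
    int i;
    for (i = 0; i < 3; ++i) {
        printf("task iteration %d\n", i);
        swapcontext(&task_ctx, &main_ctx);  /* yield back to main */
    }
}

int main(void)
{
    char *stack = malloc(64 * 1024);
    int i;

    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = 64 * 1024;
    task_ctx.uc_link = &main_ctx;           /* where to go if task() returns */
    makecontext(&task_ctx, task, 0);

    for (i = 0; i < 3; ++i)
        swapcontext(&main_ctx, &task_ctx);  /* resume the task */

    free(stack);
    return 0;
}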
I'm using both Sun Studio Performance Analyzer and Intel VTune Performance Analyzer.
My problem is that the Performance Analyzer reported about 20% of the total CPU time as exclusive CPU time of the swapcontext function, while in VTune I couldn't find anywhere near that many HW event samples attributed to swapcontext.

To narrow down the problem, I made a simple test program that reproduces it. The generator for the test program, a bash script, is attached at the end of this message.
I generated the test program with the following command:
% bash code_gen.sh 8 128 100000 1000
This produces a test program consisting of 8 threads, with 128 user contexts on each thread and 100000 context switches per user context.
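For reference, I profile the generated binary with the Analyzer's command-line collector and read the result with er_print, roughly like this (default clock profiling; the experiment name test.1.er is just the default and may differ on your system):

% collect ./mytest_8_128_100000_1000
% er_print -functions test.1.er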

On my system, the Sun Studio Performance Analyzer reported 328 seconds of exclusive CPU time for swapcontext out of 463 seconds of total CPU time (the CPU Time of <Total>); in short, swapcontext consumed about 70% of the total CPU time.
However, according to VTune's sampling, libc-2.4.so and vmlinux-2.6.16.46-0.12-smp together consume only 8% of total clockticks.
That is a very large gap between Sun Studio and VTune.

Have you seen this kind of problem before? Do you know why this mismatch happens?
Or how can I estimate the actual cost of swapcontext myself? One idea is sketched below.
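A micro-benchmark that only ping-pongs between two contexts and divides wall time by the number of swapcontext calls would give a rough per-call figure independent of both profilers. A sketch, with error handling omitted:

/* Estimate the per-call cost of swapcontext: N round trips between two
 * contexts, each round trip being two swapcontext calls. */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>
#include <sys/time.h>

#define N 1000000

static ucontext_t a, b;

static void pong(void)
{
    for (;;)
        swapcontext(&b, &a);    /* immediately switch back to main */
}

int main(void)
{
    struct timeval begin, end;
    long i, usec;
    char *stack = malloc(64 * 1024);

    getcontext(&b);
    b.uc_stack.ss_sp = stack;
    b.uc_stack.ss_size = 64 * 1024;
    b.uc_link = &a;
    makecontext(&b, pong, 0);

    gettimeofday(&begin, NULL);
    for (i = 0; i < N; ++i)
        swapcontext(&a, &b);    /* switch to pong, which switches back */
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - begin.tv_sec) * 1000000L
         + (end.tv_usec - begin.tv_usec);
    printf("%.1f ns per swapcontext call\n", usec * 1000.0 / (2.0 * N));
    free(stack);
    return 0;
}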

Thank you for reading my post; I'm looking forward to any hints about this problem.
Colin

---- code_gen.sh ----
#!/bin/bash

usage()
{
    echo "code_gen.sh <num_threads> <num_tasks> <num_loop> <func_body_size>"
}


# Named gen_test to avoid shadowing the shell builtin "test".
gen_test()
{
    local num_threads=$1
    local num_tasks=$2
    local func_loop=$3
    local func_body_size=$4

    local file_name="mytest_${num_threads}_${num_tasks}_${func_loop}_${func_body_size}"

    # num_threads is passed explicitly so main_func_gen does not rely on
    # the caller's locals being visible through dynamic scoping.
    main_func_gen $num_tasks $func_loop $func_body_size $num_threads > $file_name.c
    gcc -O2 $file_name.c -o $file_name -lpthread
}

sub_func_gen()
{
    local func_id=$1
    local func_loop=$2
    local func_body_size=$3
    local num_tasks=$4

    cat <<!
static void f$func_id(int threadId)
{
    volatile int c = 0;
    int i = 0;

    for (i = 0; i < $func_loop; ++i)
    {
!

    # Emit the busy-work body of the loop.
    for i in $(seq 1 $func_body_size); do
        echo "        c += 1;"
    done

    cat <<!

        /* Switch to the next user context on this thread (round robin). */
        swapcontext(&ctx[threadId][$func_id], &ctx[threadId][($func_id+1)%$num_tasks]);
    }
}
!
}

main_func_gen()
{
    local num_tasks=$1
    local func_loop=$2
    local func_body_size=$3
    local num_threads=$4

    cat <<!
#include <stdio.h>
#include <ucontext.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>

static ucontext_t **ctx;     /* [thread][task] user contexts */
static ucontext_t *mctx;     /* per-thread scheduler ("main") context */
static pthread_t *pThreads;

!


    for i in $(seq 0 $((num_tasks-1))); do
        sub_func_gen $i $func_loop $func_body_size $num_tasks
    done

    cat <<!

static void *
threadMain(void *arg)
{
    char **st = NULL;
    int threadId = *(int *)arg;

    printf("$num_tasks tasks on thread %d\n", threadId);

    st = (char **)malloc(sizeof(char *) * $num_tasks);
    ctx[threadId] = (ucontext_t *)malloc(sizeof(ucontext_t) * $num_tasks);

!

    # Emit per-context setup: allocate a stack and bind the function to it.
    for i in $(seq 0 $((num_tasks-1))); do

        cat <<!
    st[$i] = (char *)malloc(sizeof(char) * 8192);
    getcontext(&ctx[threadId][$i]);
    ctx[threadId][$i].uc_stack.ss_sp = st[$i];
    ctx[threadId][$i].uc_stack.ss_size = 8192;
    ctx[threadId][$i].uc_link = &mctx[threadId];
    makecontext(&ctx[threadId][$i], (void (*)(void))f$i, 1, threadId);
!
    done

    cat <<!

    /* Start the first user context; control returns here via uc_link
       once f0 runs to completion. */
    swapcontext(&mctx[threadId], &ctx[threadId][0]);

    return NULL;
}

int
main(int argc, char *argv[])
{
    int num_threads = $num_threads;
    int rc;
    pthread_attr_t attr;
    void *status;
    struct timeval begin, end;
    int *threadId;
    int i;

    printf("%d threads\n", num_threads);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    pThreads = (pthread_t *)malloc(sizeof(pthread_t) * num_threads);
    mctx = (ucontext_t *)malloc(sizeof(ucontext_t) * num_threads);
    ctx = (ucontext_t **)malloc(sizeof(ucontext_t *) * num_threads);
    threadId = (int *)malloc(sizeof(int) * num_threads);

    // begin time measurement
    gettimeofday(&begin, NULL);

    for (i = 0; i < num_threads; ++i)
    {
        threadId[i] = i;
        /* Note &pThreads[i]: each thread needs its own pthread_t slot. */
        rc = pthread_create(&pThreads[i], &attr, threadMain, (void *)&threadId[i]);
        if (rc)
        {
            printf("ERROR; return code from pthread_create is %d\n", rc);
            exit(-1);
        }
    }

    pthread_attr_destroy(&attr);
    for (i = 0; i < num_threads; ++i)
    {
        rc = pthread_join(pThreads[i], &status);
        if (rc)
        {
            printf("ERROR; return code from pthread_join is %d\n", rc);
            exit(-1);
        }
    }

    // end time measurement
    gettimeofday(&end, NULL);
    printf("finished. Elapsed time=%ldms\n",
           ((end.tv_sec - begin.tv_sec) * 1000000L + (end.tv_usec - begin.tv_usec)) / 1000);

    pthread_exit(NULL);
}


!

}

if [[ $# -ne 4 ]]; then
    usage
    exit 1
fi

gen_test "$1" "$2" "$3" "$4"