Subscribe to:
Post Comments (Atom)
This is an accompanying blog for the www.1024cores.net site about lockfree/waitfree synchronization algorithms and data structures, scalability-oriented architecture, multicore design patterns, high-performance computing, threading technologies and libraries, message-passing systems and related topics.
FYI http://www.kernel.org/doc/man-pages/online/pages/man2/getcpu.2.html
ReplyDeleteVERSIONS
getcpu() was added in kernel 2.6.19 for x86_64 and i386.
And the code to use it since the glibc doesn't export it is:
ReplyDelete/* This code is in the public domain */
#define _GNU_SOURCE
#include
#include
#include
struct getcpu_cache {
unsigned long blob[128 / sizeof(long)];
};
static long (*vgetcpu)(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache);
static int init_vgetcpu(void)
{
void *vdso;
dlerror();
vdso = dlopen("linux-vdso.so.1", RTLD_LAZY);
if (vdso == NULL)
return -1;
vgetcpu = dlsym(vdso, "__vdso_getcpu");
dlclose(vdso);
return vgetcpu == NULL ? -1 : 0;
}
int main(void)
{
unsigned cpu, node;
struct getcpu_cache cache;
if (init_vgetcpu() < 0) {
fprintf(stderr, "Unable to locate vgetcpu: %s", dlerror());
return -1;
}
if (vgetcpu(&cpu, &node, &cache) < 0) {
perror("vgetcpu");
return -1;
}
printf("cpu:%d node:%d\n", cpu, node);
return 0;
}
and using it gives:
$ repeat 10 ./vgetcpu
cpu:0 node:0
cpu:1 node:0
cpu:0 node:0
cpu:1 node:0
cpu:0 node:0
cpu:1 node:0
cpu:1 node:0
cpu:0 node:0
cpu:1 node:0
cpu:1 node:0
(this is a core2 duo, I suppose that node is != 0 when this is a hardware thread but I don't have access to HT machines right now so somebody could probably confirm that)
Which seems to only work on X64, I must reckon I've no idea how vdso works on i386 (but who cares anyway ;p)
ReplyDeleteand I'm wrong, glibc returns it through sched_getcpu... "woopsie" ;)
ReplyDeletegetcpu is still rather heavy-weight compared to a direct call to rdtscp, vsyscall is definitely cheap in some cases but it isn't free. Per-processor distribution is nice, but sometimes you may simply want cache line distribution in which case you can use a simple subscribe/unsubscribe interface (which will also at least help alleviate some hotspot pains).
ReplyDeleteFor example,
pid = subscribe(data_structure);
operation(pid, data_structure);
...
...
unsubscribe(data_structure, pid);
where pid may be a map to a unique cache line. Additionally, if CPU migration is common and you're working with relatively small data sets then locality benefits may not exceed the cost of cache line invalidation.
@MadCoder Thanks for your suggestions. sched_getcpu() is not quite vgetcpu(). It seems that sched_getcpu() initially was a slow syscall, but I see that there is a constant progress going on:
ReplyDeletehttp://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3387414
I gave sched_getcpu() one more try, and it seems that now it's quite fast (potentially implemented with the same SIDT or LSL).
I will update the article with this and other aspects.
@sbahra
ReplyDeleteIt seems that sched_getcpu() is rather fast nowadays:
http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3387414
Indeed, there is not reason why it can't be reimplemented with SIDT/LSL (which a cheaper that RDTSCP because there are not serializing instructions).
> but sometimes you may simply want cache line distribution
Well, actually I implied that it's a baseline, and what a lot of people are able to do. So it's vice versa :)
Per-thread subscription/unsubscription/handoff is usually leads to quite complex implementations.
> if CPU migration is common
I think then nothing will help you :)
I've updated the article with sched_getcpu(), LSL, and added clear indication that "remember in user-space it's no more than an optimization - a thread can be rescheduled to another CPU straight after it has obtained the number".
ReplyDeleteMy point was regarding overheads associated with VDSO (and DSO in general) on large multi-processor machines (where topology is not a simple 1-hop network) and not so much the specific implementation of getcpu. vgetcpu has been in Linux since 2.6.24 and sched_getcpu is implemented in terms of vgetcpu. However even on smaller multi-processor systems, getcpu is simply is not an option since sched_getcpu may still be implemented in terms of a full system call (RHEL5 is still a standard).
ReplyDeleteSubscribe/unsubscribe can provide a guarantee of cache line ownership everywhere while processor ID cannot do this in user-space if hard processor affinity is not used. You can collide very easily. In kernel-space, as you point out the ownership can be guaranteed. For memory-intensive user-space applications if the actual topology is an important factor then it is possible to set hard affinity (on Linux, see sched_setaffinity, simple interface example for RR distribution in http://codepad.org/IKLC2YkE) but this isn't suitable for applications that include a fair bit of thread deactivation and/or thread migration.
The additional complexity is totally negligible considering the attractive guarantees it can provide (in an operating system agnostic manner) if subscribe/unsubscribe frequency is low.
Note, also, you can also use TLS to implement a similar technique but abstract away the ID from the interface and all these may be coupled with the nice guarantee on some operating systems with first-touch allocation policies for NUMA (providing essentially the same benefits as using getcpu, but again, in a portable manner).
ReplyDelete@sbahra What you are saying makes perfect sense.
ReplyDeleteHowever, engineering is always about choice and trade-offs. So I just want to show alternatives and describe their trade-offs.
By the way, here is a draft of a new article "Distributed Reader-Writer Mutex" (it's not yet officially published, but almost done):
http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex
It shows how to build a very simple rw mutex, which still shows very good scalability.
I've done the same with per-thread approach, and I can say that the implementation is *significantly* more complicated. The problem is with synchronization between arriving/departing threads and writers, and between departing threads and mutex destruction. And by the way, it's not that easy to catch thread completion event under Windows.
vgetcpu is a vsyscall on recent linuces, and its implementation is fast and uses RDTSCP or a per-cpu variable:
ReplyDeletehttp://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/arch/x86/vdso/vgetcpu.c
Thanks for the info! Nice to hear!
ReplyDeleteNow I'm crossing my fingers for SYS_membarrier.
Hello Dmitry,
ReplyDeleteThank you for the ideas in the paper! I gave those three approaches (RDTSCP, CPUID, SIDT) a try in my microbenchmark (parallel_for over a team of 16 threads with 1 mill of tasks of (1) get_cpu_num(); (2) read a buffer of a predefined size from a memory pool and compute a check sum for the buffer (xor ints). Turns out reading from 0x7FFE0000 does not work on Windows 7. Using NtQuerySystemTime works just as good, but the return value is in 100 ns intervals, so simple != comparison in amortization function needs to be modified too.
Performance of the different approaches were as you'd expect (or as you predicted), which is SIDT is the fastest, then amortized RDTSCP, and then CPUID is the slowest one. Curious thing however is that caching latest proc_num (and apic or idt alongside of it) in TLS makes the code slower really. In my case search among 16 values (registered apics or idts for all the logical CPUs I got) worked 5% faster than the cached version of the same code.
Another finding - if you use GetSystemTimeAsFileTime instead of NtQuerySystemTime on Windows 7, it does the following under the covers:
ReplyDeletemov edx,dword ptr [7FFE0018h]
mov r8d,dword ptr [7FFE0014h]
mov eax,dword ptr [7FFE001Ch]
so, I guess "accessing the address" functionality is still there, only the address has changed a little bit.
Hi Anton,
ReplyDeleteThanks for the info!
>Turns out reading from 0x7FFE0000 does not work on...
Of course it is a dirty hack, and production code should include a dynamic check for known OS versions with a fall-back to fair GetTickCount() call.
I guess that you are working on a relatively modern version of Windows, so it would be interesting to see numbers for GetCurrentProcessorNumber().
>Curious thing however is that caching latest proc_num (and apic or idt alongside of it) in TLS makes the code slower...
Hummm... That is *not* as I would expect... Do you implement TLS with __declspec(thread)? Do you prevent several accesses to TLS?