Thursday, January 27, 2011

Per-processor Data

A new article Per-processor Data is available under the Lockfree/Tips&Tricks section. It introduces an interesting technique of per-processor data, as well as some implementation aspects.

16 comments:

  1. FYI http://www.kernel.org/doc/man-pages/online/pages/man2/getcpu.2.html

    VERSIONS
    getcpu() was added in kernel 2.6.19 for x86_64 and i386.

    ReplyDelete
  2. And since glibc doesn't export it, the code to use it is:


    /* This code is in the public domain */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <dlfcn.h>

    struct getcpu_cache {
        unsigned long blob[128 / sizeof(long)];
    };

    static long (*vgetcpu)(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache);

    static int init_vgetcpu(void)
    {
        void *vdso;

        dlerror();
        vdso = dlopen("linux-vdso.so.1", RTLD_LAZY);
        if (vdso == NULL)
            return -1;
        vgetcpu = dlsym(vdso, "__vdso_getcpu");
        dlclose(vdso);
        return vgetcpu == NULL ? -1 : 0;
    }

    int main(void)
    {
        unsigned cpu, node;
        struct getcpu_cache cache;

        if (init_vgetcpu() < 0) {
            fprintf(stderr, "Unable to locate vgetcpu: %s", dlerror());
            return -1;
        }

        if (vgetcpu(&cpu, &node, &cache) < 0) {
            perror("vgetcpu");
            return -1;
        }
        printf("cpu:%u node:%u\n", cpu, node);
        return 0;
    }


    and using it gives:
    $ repeat 10 ./vgetcpu
    cpu:0 node:0
    cpu:1 node:0
    cpu:0 node:0
    cpu:1 node:0
    cpu:0 node:0
    cpu:1 node:0
    cpu:1 node:0
    cpu:0 node:0
    cpu:1 node:0
    cpu:1 node:0

    (this is a core2 duo, I suppose that node is != 0 when this is a hardware thread but I don't have access to HT machines right now so somebody could probably confirm that)

    ReplyDelete
  3. Which seems to only work on x86-64; I must admit I've no idea how the vDSO works on i386 (but who cares anyway ;p)

    ReplyDelete
  4. and I'm wrong, glibc returns it through sched_getcpu... "woopsie" ;)

    ReplyDelete
  5. getcpu is still rather heavy-weight compared to a direct call to rdtscp; a vsyscall is definitely cheap in some cases, but it isn't free. Per-processor distribution is nice, but sometimes you may simply want cache-line distribution, in which case you can use a simple subscribe/unsubscribe interface (which will also at least help alleviate some hotspot pains).

    For example,
    pid = subscribe(data_structure);
    operation(pid, data_structure);
    ...
    ...
    unsubscribe(data_structure, pid);

    where pid may be a map to a unique cache line. Additionally, if CPU migration is common and you're working with relatively small data sets then locality benefits may not exceed the cost of cache line invalidation.
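    A minimal sketch of what such an interface could look like (the names, slot layout, and fixed slot count are my own illustration, not from any real library):

```c
/* Sketch of subscribe/unsubscribe with cache-line distribution: each
   subscriber claims a private slot, and each slot sits on its own cache
   line, so operations on different slots never contend. */
#include <stdatomic.h>

#define MAX_SLOTS  16
#define CACHE_LINE 64

struct slot {
    _Alignas(CACHE_LINE) atomic_int owned; /* 0 = free, 1 = taken */
    long counter;                          /* per-slot private data */
};

struct distributed_counter {
    struct slot slots[MAX_SLOTS];
};

/* Claim a free slot; returns its id ("pid"), or -1 if all are taken. */
static int subscribe(struct distributed_counter *dc)
{
    for (int i = 0; i < MAX_SLOTS; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&dc->slots[i].owned, &expected, 1))
            return i;
    }
    return -1;
}

/* No sharing here: the slot is exclusively ours while subscribed. */
static void operation(int pid, struct distributed_counter *dc)
{
    dc->slots[pid].counter++;
}

static void unsubscribe(struct distributed_counter *dc, int pid)
{
    atomic_store(&dc->slots[pid].owned, 0);
}
```

    Unlike a processor id, the slot id stays valid across thread migrations, at the cost of the explicit subscribe/unsubscribe calls.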

    ReplyDelete
  6. @MadCoder Thanks for your suggestions. sched_getcpu() is not quite vgetcpu(). It seems that sched_getcpu() initially was a slow syscall, but I see that there is constant progress going on:
    http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3387414
    I gave sched_getcpu() one more try, and it seems that now it's quite fast (potentially implemented with the same SIDT or LSL).
    I will update the article with this and other aspects.

    ReplyDelete
  7. @sbahra
    It seems that sched_getcpu() is rather fast nowadays:
    http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3387414
    Indeed, there is no reason why it can't be reimplemented with SIDT/LSL (which are cheaper than RDTSCP because they are not serializing instructions).

    > but sometimes you may simply want cache line distribution

    Well, actually I implied that it's a baseline, and something a lot of people are able to do. So it's vice versa :)
    Per-thread subscription/unsubscription/handoff usually leads to quite complex implementations.

    > if CPU migration is common

    I think then nothing will help you :)

    ReplyDelete
  8. I've updated the article with sched_getcpu(), LSL, and added clear indication that "remember in user-space it's no more than an optimization - a thread can be rescheduled to another CPU straight after it has obtained the number".

    ReplyDelete
  9. My point was regarding overheads associated with VDSO (and DSO in general) on large multi-processor machines (where the topology is not a simple 1-hop network) and not so much the specific implementation of getcpu. vgetcpu has been in Linux since 2.6.24, and sched_getcpu is implemented in terms of vgetcpu. However, even on smaller multi-processor systems, getcpu simply is not an option, since sched_getcpu may still be implemented in terms of a full system call (RHEL5 is still a standard).

    Subscribe/unsubscribe can provide a guarantee of cache line ownership everywhere while processor ID cannot do this in user-space if hard processor affinity is not used. You can collide very easily. In kernel-space, as you point out the ownership can be guaranteed. For memory-intensive user-space applications if the actual topology is an important factor then it is possible to set hard affinity (on Linux, see sched_setaffinity, simple interface example for RR distribution in http://codepad.org/IKLC2YkE) but this isn't suitable for applications that include a fair bit of thread deactivation and/or thread migration.
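    The round-robin hard-affinity idea could be sketched roughly like this (this is my own hedged sketch under those assumptions, not the codepad example; pin_round_robin is an illustrative name):

```c
/* Pin the calling thread to one CPU, chosen round-robin across the
   online CPUs, so that a per-processor index stays valid afterwards. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int next_cpu;

/* Returns the CPU this thread was pinned to, or -1 on failure. */
static int pin_round_robin(void)
{
    int ncpu = (int)sysconf(_SC_NPROCESSORS_ONLN);
    int cpu = atomic_fetch_add(&next_cpu, 1) % ncpu;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return cpu;
}
```

    As noted above, this buys a stable processor id only for threads that stay pinned; it does not suit workloads with frequent thread deactivation or migration.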

    The additional complexity is totally negligible considering the attractive guarantees it can provide (in an operating system agnostic manner) if subscribe/unsubscribe frequency is low.

    ReplyDelete
  10. Note also that you can use TLS to implement a similar technique while abstracting the ID away from the interface; all of these may be coupled with the nice guarantee that some operating systems provide via first-touch allocation policies for NUMA (providing essentially the same benefits as using getcpu, but again, in a portable manner).
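    A minimal sketch of the TLS variant (identifiers are illustrative, not from any real library):

```c
/* Each thread lazily claims a unique slot id on first use and caches it
   in a thread-local, so the data-structure interface never sees an
   explicit subscribe step. */
#include <stdatomic.h>

static atomic_int next_id;
static _Thread_local int my_id = -1;

static int thread_id(void)
{
    if (my_id < 0)
        my_id = atomic_fetch_add(&next_id, 1); /* claim the next slot */
    return my_id;
}
```

    Under a first-touch allocation policy, the memory a thread initializes through its slot ends up on that thread's NUMA node, which is where the getcpu-like locality benefit comes from.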

    ReplyDelete
  11. @sbahra What you are saying makes perfect sense.
    However, engineering is always about choice and trade-offs. So I just want to show alternatives and describe their trade-offs.
    By the way, here is a draft of a new article "Distributed Reader-Writer Mutex" (it's not yet officially published, but almost done):
    http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex
    It shows how to build a very simple rw mutex, which still shows very good scalability.
    I've done the same with the per-thread approach, and I can say that the implementation is *significantly* more complicated. The problem is with synchronization between arriving/departing threads and writers, and between departing threads and mutex destruction. And by the way, it's not that easy to catch a thread completion event under Windows.

    ReplyDelete
  12. vgetcpu is a vsyscall on recent linuces, and its implementation is fast and uses RDTSCP or a per-cpu variable:

    http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/arch/x86/vdso/vgetcpu.c

    ReplyDelete
  13. Thanks for the info! Nice to hear!
    Now I'm crossing my fingers for SYS_membarrier.

    ReplyDelete
  14. Hello Dmitry,

    Thank you for the ideas in the paper! I gave those three approaches (RDTSCP, CPUID, SIDT) a try in my microbenchmark: a parallel_for over a team of 16 threads with 1 million tasks, where each task (1) calls get_cpu_num(); and (2) reads a buffer of a predefined size from a memory pool and computes a checksum over the buffer (xor of ints). Turns out reading from 0x7FFE0000 does not work on Windows 7. Using NtQuerySystemTime works just as well, but the return value is in 100 ns intervals, so the simple != comparison in the amortization function needs to be modified too.
    Performance of the different approaches was as you'd expect (or as you predicted): SIDT is the fastest, then amortized RDTSCP, and CPUID is the slowest. The curious thing, however, is that caching the latest proc_num (and the apic or idt alongside it) in TLS actually makes the code slower. In my case a search among 16 values (the registered apics or idts for all the logical CPUs I have) worked 5% faster than the cached version of the same code.

    ReplyDelete
  15. Another finding - if you use GetSystemTimeAsFileTime instead of NtQuerySystemTime on Windows 7, it does the following under the covers:
    mov edx,dword ptr [7FFE0018h]
    mov r8d,dword ptr [7FFE0014h]
    mov eax,dword ptr [7FFE001Ch]
    so, I guess "accessing the address" functionality is still there, only the address has changed a little bit.

    ReplyDelete
  16. Hi Anton,

    Thanks for the info!

    >Turns out reading from 0x7FFE0000 does not work on...

    Of course it is a dirty hack, and production code should include a dynamic check for known OS versions with a fall-back to a proper GetTickCount() call.

    I guess that you are working on a relatively modern version of Windows, so it would be interesting to see numbers for GetCurrentProcessorNumber().

    >Curious thing however is that caching latest proc_num (and apic or idt alongside of it) in TLS makes the code slower...

    Hummm... That is *not* what I would expect... Do you implement TLS with __declspec(thread)? Do you avoid multiple accesses to TLS?

    ReplyDelete