1024cores

Go, data races and combining

2013-02-22T07:23:00.001-08:00

Hi,

I continue my work on speeding up the Go language. The next major release will feature a faster parallel garbage collector, a goroutine blocking profiler (enabled with the -blockprofile flag) and a lot of other yummy features. Last but not least -- a built-in data race detector. Yay! Just add the -race flag.

The race detector is based on our ThreadSanitizer technology, which was initially developed for C/C++. Now that both the C and C++ standards have multithreading support, data races are officially banned as "undefined behavior". You can read my essay on data races here.

If you are subscribed to this blog, you are probably interested in synchronization and concurrency stuff. I haven't been writing a lot lately, but here is at least something new -- the Combiner/Aggregator Synchronization Primitive.

Best

AddressSanitizer, etc

2011-11-27T06:52:00.000-08:00

Hi,

While Go becomes faster and more scalable (if you haven't tried it yet, don't miss An Online Tour of Go), our team has been working on AddressSanitizer (ASan) - a fast memory error detector for C/C++. The tool is really fast: a typical slowdown is only about 2x (while with, for example, Valgrind/Memcheck we observe slowdowns of 20-50x). And the speed is a real game changer here - we build Chromium with ASan and then do interactive browsing, or execute thousands of randomly generated tests and then do dichotomy reduction of crashing tests. It's interesting to observe how the tool spreads basically without our efforts: for example, here somebody writes how to build Firefox with ASan, and here is an interesting story of "ASanified" Perl.

Here are some concurrency-related links:
Concurrency Kit provides a plethora of concurrency primitives, safe memory reclamation mechanisms and lock-less and lock-free data structures designed to aid in the design and implementation of high performance concurrent systems.
Chasing state of the art: Java synchronization algorithms.

Intel has updated and extended (AVX, NUMA) its Guide for Developing Multithreaded Applications.

Relacy Race Detector 2.4 and more

2011-08-05T05:20:00.000-07:00

Hi,

It's been a long time since I last blogged. Since then I have joined Google and have been busy working on the ThreadSanitizer project, which is one of the best race detectors out there (not to mention that it's free and open source). I am also actively contributing to Go's concurrency and scalability, and it is already starting to pay off. The language has built-in support for concurrency and is very nice and simple, to say the least. But each of those things requires a separate blog post.

OK, I've just uploaded Relacy Race Detector version 2.4, you may download it here.
Features:
+ Support for futex(FUTEX_WAIT/FUTEX_WAKE)
+ Linux/Darwin performance improved (2.5x for Linux, 7x for Darwin)
Fixes:
+ Fixed a bunch of issues with WaitForMultipleObjects()/SignalObjectAndWait()
+ Fixed rare spurious memory leak reports related to test progress reporting

The credit for the release goes to Charles Bloom - check out the blog, you may find a lot of interesting concurrency related stuff there.

The performance improvement is a result of an interesting optimization related to fiber/ucontext switching, and it deserves a separate post:
http://www.1024cores.net/home/lock-free-algorithms/tricks/fibers

Also check out the SPAA11 "Location-Based Memory Fences" paper that I co-authored.

P.S. Google office in Moscow is awesome, people and possibilities are even better. Do not hesitate to submit your CV now... and don't forget to indicate me as a referee :)

Cache-oblivious Algorithms

2011-02-02T03:44:00.000-08:00

Hi!

This time it's about Parallel Computations, and the first real article in the section is titled Cache-oblivious Algorithms. It introduces the problem of memory bandwidth and the concept of cache hierarchy in modern computers, then describes the underlying idea of cache-oblivious algorithms, and proceeds with an example of how to design a cache-oblivious algorithm for a simple problem. The benchmark results presented show unconditional superiority of the approach.

Distributed Reader-Writer Mutex

2011-01-31T10:16:00.000-08:00

A new article Distributed Reader-Writer Mutex describes the design and implementation of a simple yet scalable rw mutex. The mutex uses an interesting technique of per-processor data. Results of a benchmark against plain pthread_rwlock_t are presented.

By the way, I've changed the code license on the site: all code is now covered by the Simplified BSD License.

Per-processor Data

2011-01-27T12:14:00.000-08:00

A new article Per-processor Data is available under the Lockfree/Tips&Tricks section. It introduces an interesting technique of per-processor data, as well as some implementation aspects.

Case Study: MultiLane - a concurrent blocking multiset

2011-01-27T06:24:00.000-08:00

I've published a new article Case Study: MultiLane - a concurrent blocking multiset. It's about the paper of the same name by Dave Dice et al. The paper describes the MultiLane technique, which can be used to improve scalability of producer-consumer systems. I also describe some alternative architectural approaches to the problem.

Lockfree Tips&Tricks

2011-01-24T11:29:00.000-08:00

Hi!

I've kicked off a new subsection Lockfree/Tips&Tricks, in which I collect various separate topics related to design and implementation of synchronization algorithms. There are 2 articles for warm-up: Spinning (active/passive/hybrid, pros&cons, implementation) and Pointer Packing.

Lock-Free Link Pack

2011-01-22T11:21:00.000-08:00

Check out the updated Lock-Free Links page. There you will find links to books, blogs, websites, research groups, separate articles and some sources to study. Perhaps more than one ever needs to know about synchronization algorithms :)
Have I read all that? Well, mostly, but not all :)

Case Study: FastFlow Queue

2011-01-21T06:40:00.000-08:00

Hi!

I've posted a new article Case Study: FastFlow Queue, in which I examine the design and implementation of a nonblocking single-producer/single-consumer queue, which is used as a base building block in the FastFlow multicore programming library. In the second part of the article I design and implement "a better" queue. Benchmark results confirm better performance and scalability.

By the way, thanks to all who have participated in the poll on the ~~left~~ right pane.

Alternative web addresses

2011-01-08T00:12:00.000-08:00

It seems that Google hosting periodically has outages. Just in case you have problems accessing the main URL, you can use the following one:
http://sites.google.com/site/1024cores

Differential Reference Counting

2011-01-07T08:56:00.000-08:00

Just published an article on Differential Reference Counting - a reference counting algorithm that provides strong thread safety and permits useful use cases that plain atomic reference counting (as in boost::shared_ptr) does not.

Lazy Concurrent Initialization

2011-01-05T12:46:00.000-08:00

I've added an article on Lazy Concurrent Initialization. It describes 2 types of initialization - blocking and non-blocking, covers the available implementations in the Windows API, POSIX threads and C1x/C++0x, and provides guidelines for efficient implementation based on atomic operations and fine-grained memory fences.

What new materials would you like to appear on 1024cores?

2011-01-04T08:41:00.000-08:00

Hi!

I would like you to share your preferences regarding what kinds of information you would like to appear on 1024cores in the near future. You know, I need to kind of prioritize everything I may potentially share there, and your input is important to me.

I've started a poll on the right pane >>>

I have a lot of thoughts and ideas on lock-free algorithms: about queues and the reader-writer problem, about object life-time management, lazy initialization, caching, etc.

I may share more thoughts on scalable architecture: message-passing and pipelining, overloads, patterns, etc.

I may share things related to parallel computations: some principles, caveats, problems and solutions, patterns and anti-patterns, etc.

Concurrency Myth Busters is going to be a section on common misconceptions around multicore, concurrency and parallelism.

I may highlight some hardware aspects like caches, NUMA and HyperThreading.

There are also Links, Libraries and Tools.

And if you check Other, please, drop a comment here regarding what exactly you mean.

Also don't hesitate to share other thoughts, comments and suggestions on the site here.

Thank you.

Parallel Disk IO

2011-01-03T05:48:00.000-08:00

I've added an article on Parallel Disk IO Subsystem:
http://www.1024cores.net/home/scalable-architecture/parallel-disk-io
It's based on my Wide Finder 2 entry, but it's quite general and describes common aspects, patterns and anti-patterns. It shows how to neither under-subscribe nor over-subscribe CPUs and disks, how to keep data hot in caches and memory, how to automatically adapt to various external conditions, etc.

Here is also a brief write-up on the Wide Finder 2 entry itself:
http://www.1024cores.net/home/scalable-architecture/wide-finder-2

Scalable Architecture

2010-12-30T08:43:00.000-08:00

Hi!

I've just kicked off the "Scalable Architecture" section on the site. For starters there is an Introduction which describes the fundamental concepts of Distribution and Independence, as well as some ways to reason about architecture in the context of scalability.
There is also a General "Bird's Eye View" Recipe for scalable architecture.
And a free-standing introduction to task scheduling strategies.

About Me

2010-12-30T08:28:00.000-08:00

You know, I've got some rightful comments ~~Who the #$%^ are you?~~ that there are too many undergrads writing naive things about lock-free algorithms all over the Internet, so I kind of need to prove my "authority" on the matter :)
So I've published the About Me page:
http://www.1024cores.net/home/about-me

So what is a memory model? And how to cook it?

2010-12-26T09:32:00.000-08:00

Just posted a new article on the fundamentals of memory models in the context of multi-threading. It covers 3 basic properties: Atomicity, Visibility and Ordering, along with some compiler-related and high-level language aspects:
http://www.1024cores.net/home/lock-free-algorithms/so-what-is-a-memory-model-and-how-to-cook-it

1024cores officially goes into public beta

2010-12-25T03:20:00.000-08:00

http://www.1024cores.net - a site devoted to concurrency, synchronization algorithms, parallel computations (HPC), multi-threading, multi-core, scalability - officially goes into public beta.

So what is already there? Some materials on synchronization (lock-free) algorithms, some articles on parallel computations (my write-ups for Intel Threading Challenge), and some initial sections on libraries, tools and external references (books, blogs, articles).
Much more is coming (including sections on scalable architecture and hardware aspects), so subscribe to the RSS feed or follow http://blog.1024cores.net.

Also don't hesitate to tweet/buzz/share/blog about the site. I appreciate it. Thank you in advance.
Stay tuned!

Welcome!

2010-12-20T04:41:00.000-08:00

Hi!
My name is Dmitry Vyukov (aka Dmitiy V'jukov, dvyukov and remark), and this is the accompanying blog for the 1024cores site. The site is dedicated to lock-free, wait-free, obstruction-free, atomic-free synchronization algorithms and data structures, scalability-oriented architecture, multi-core/multi-processor design patterns, high-performance computing, threading technologies and libraries, message-passing systems and related topics.
In this blog I am going to post announcements of new content available on the site, so subscribe to it right now! ;)