Synchronization doesn't necessarily mean locking
Tuesday, 15. May 2007, 21:46:54
One of the biggest problems we deal today is correct thread synchronization. Thread synchronization itself is a maze. I would like to discuss my new proposed solution to a problem; reference counting smart pointer based lock-free implementation.
We all know "how things are supposed to happen", right? Imagine, somewhere in Matrix, there is a shared resource. If it's not mutable, then no synchronization is required, anybody can read it at any given point in time, through any CPU. If variable is mutable but not shared, it will have one copy for each thread and thread can read or write it without any concern. But if a variable is both "shared" and "mutable", that is, multiple threads can *not just read*, but also alter the variable, problems start arising.
Reminder #0 The order of execution of multiple threads is not defined (in most cases). You may not know which piece of code is executed at any given point in time, though you may "guess" in most cases. So you must write the code in such a way that; a) order of execution should not matter, or, b) you should serialize tasks, in a non-preemtive manner, so that you somewhat order the flow, which IMHO implies that you must enforce blocking on some tasks.
Locking
Everybody knows locks, right? They may vary from one operating system to another, some operating systems may provide some (partially) user mode locking mechanisms (on contrary, though, because in contention case, it makes system calls; I mean Windows critical sections). But we all know what is a lock.
// Shared mutable variable
extern int g_nSomeCounter;
// shared non-mutable variable
const int g_cnTooManyItems = 1000;
// Thread A
...
AcquireLock();
g_nSomeCounter += GetMyCounter();
ReleaseLock();
...
// Thread B
...
AcquireLock();
if (g_nSomeCounter == g_cnTooManyItems)
PopupMessage("Too many items.");
ReleaseLock();
...
Looks simple? Not at all! For example, what happens, if something goes wrong when you call PopupMessage? Would you actually need something like a "smart locks", apply RAII and go along with this solution? What if RAII is not applicable in your case (imagine, PopupMessage is a C function, in a third party library and doesn't understand C++ exceptions but does understand SIGBUS)?
- Locks are not composable. I didn't know this until last year. For instance; imagine, if you will, we can prove that a piece of code, say A, implemented locking perfectly correct. Similarly, another piece of code, say B, implemented locking perfectly correct. We cannot guarantee the result will be a perfect locking implementation, if we use these two pieces of code together. Thread A takes a lock, calls thread B. It takes another lock and calls something that calls a code where it is guarded by a lock that thread A has acquired and there we have a deadlock. So, never call unknown code while holding a lock. (See, or listen, Herb Sutter's talks and papers).
- It's usually expensive to acquire and release locks.
- Threads just sit and wait until owner of the lock releases it. This could also cause priority inversion.
- If all threads block at during acquisition of a lock, usually no progress could be made.
- One must find the best locations in the code where a lock must be acquired and released so that module can achieve highest performance and safety possible.
- It's easy to result a dead lock and they sometimes are difficult to detect until product is shipped.
Lock-free techniques
This is a very complicated synchronization technique, where we rely on atomic integer operations, and no locking is involved. It's also the basis of our discussion and lock based approach itself (locks use these operations internally). Most modern CPUs, even those embedded ones, support some set of atomic integer operations. One of the most famous one is CAS - Compare And Swap or Compare And Exchange, Test and Set (anything and everything resembles these). It's implemented on Windows as InterlockedCompareExchange. Basically (pseudocode):
long InterlockedCompareExchange(volatile long* pdest,
unsigned long exchg,
unsigned long comperand)
{
long lret;
lret = *pdest;
if (*pdest == comperand)
*pdest = exchg;
return lret;
}
In English: if value of variable (Note: machine word aligned and sized) pointed by pdest equals to comperand, then set new value of variable pointed by pdest to exchg. Return initial value of variable pointed by pdest. This entire operation is atomic.
In the quiet words of the Virgin Mary... come again.
quote from Alan Ford, Snatch
When it's translated to (32-bit, where we assume sizeof(machine word) is 4 bytes) assembly:
mov ecx, DWORD PTR [esp + 0x4] ; pdest mov edx, DWORD PTR [esp + 0x8] ; exchg mov eax, DWORD PTR [esp + 0xC] ; comperand lock cmpxchg DWORD PTR[ecx], edx ret 0xC
It doesn't necessarily cause CPU to assert LOCK' on bus, but it's mostly a cache lock on modern CPUs (according to Intel documentation). AFAIK, on uniprocessor systems, LOCK is ommited. A lock-free implementation doesn't necessarily have to be wait-free, but a wait-free implementation has to be lock-free. Discussion was about a wait-alike case where main thread wants to cancel some asynchronous operation, it sets some flags and worker thread realizes that flags have changed and aborts accessing memory location. The topic of discussion was; "there is no synchronization here".
Aside: As far as I know, on uniprocessor systems; assignment of variables at machine-word aligned addresses are guaranteed to be atomic. Likewise, reading variables at machine-word aligned addresses is atomic. Please correct me, if this is wrong. Could this also imply that one can mutate a machine word aligned and sized integral type and read it safely, without any synchronization, on a uniprocessor (ARM based) system, through multiple threads?
// main thread invokes a worker thread.
// caller stores the returned handle
SomeInternalHandle InvokeWorkerThread()
{
SomeObject *pobj = new SomeObject;
PrepareObj(pobj);
return CreateWorkerThread(pobj);
}
// this handle contains:
// * a state variable that threads set to
// different values indicating different
// phases of execution or requests termination, etc.
// * destination buffer
// * Some other values required to complete the task
// States:
// CanCopy: Worker thread is/can run normally
// DontStopNow: Worker thread is writing to shared buffer,
// so main thread should wait to free it.
// TerminateWorker: Main thread wants worker to abort.
// Worker thread is ... working and it needs to make a copy
// operation to a shared buffer which could be freed
// by main thread at any time.
DWORD CALLBACK WorkerThread(LPVOID lpv)
{
SomeInternalHandle h = static_cast<SomeInternalHandle>(lpv);
// do the job, means prepare the "source"
void* pdest = ExtractDestinationFromHandle(h);
// try to set state to DontStopNow, so that main thread doesn't
// try to free the memory while we copy something.
if (InterlockedCompareExchange(pflag, DontStopNow, CanCopy) == CanCopy)
CopySomething(pdest, SomeSource, SomeSourceSize);
// reset state back to normal
InterlockedExchange(pflag, CanCopy);
long* pflag = ExtractFlagPtrFromHandle(h);
// again, set state to DontStopNow to prevent main thread
// to free the buffer.
if (InterlockedCompareExchange(pflag, DontStopNow, CanCopy) == CanCopy)
memcpy(pdest, Source, SourceSize);
// reset it back to normal state
InterlockedExchage(pflag, CanCopy);
return 0;
}
// Main thread cancellation function
void TerminateWorkerThread(SomeInternalHandle h)
{
long* pflag = ExtractFlagPtrFromHandle(h);
while (1)
{
if (InterlockedCompareExchange(pflag,
TerminateWorker,
CanCopy) == CanCopy)
{
CloseInternalHandleAndFreeBuffer(h);
break;
}
Sleep(0); // or do something else
}
}
I have written above code, and today I find it weak and insufficient, because it's over-complex and it's utterly unnecessary. IMHO, altering state twice in worker thread is not a good practice. I think SomeInternalHandle can contain a pointer-to-pointer-to-actual-variable instead of just a pointer to it, and therefore with no worries about undesired freeing memory. Not to mention using a reference counter would be fantastic! So, imagine, if you will:
// new! use a typedef'd type, forget about the rest
typedef char* MyBaseBufferType;
typedef MyBaseBufferType* MyPtrType;
// accessed somehow with SomeInternalHandle type
struct InternalDataStructure
{
MyPtrType ppdest; // was: void* pdest
... // other stuff
~InternalDataStructure()
{
if (NULL != ppdest)
FreeBuffer_ThreadSafe(*ppdest);
FreeBuffer_ThreadSafe(ppdest);
}
};
struct RefCounterTraits
{
typedef long CounterType;
};
template<typename T, typename TTraits = RefCounterTraits>
class MyRefCounter
{
T* m_pt;
typename RefCounterTraits::CounterType m_cref;
public:
typedef T MyType;
typedef T* PtrType;
typedef const PtrType ConstPtrType;
typedef T& RefType;
typedef const T& ConstRefType;
typedef typename RefCounterTraits::CounterType CounterType;
// implement a basic strict ownership based or
// a non-rebindable ref counter.
// this will be shared across threads and
// will act like T*, so override operators.
CounterType AddRef() { return InterlockedIncrement(&m_cref); }
CounterType DecRef()
{
CounterType cref = InterlockedDecrement(&m_cref);
if (0 == cref)
delete this;
return cref;
}
explicit MyRefCounter(T* pt)
: m_pt(pt)
, m_cref(1)
{ }
~MyRefCounter()
{
delete m_pt;
}
// override operators
PtrType operator->() const throw()
{
return m_pt;
}
// other operators...
};
// T is a MyRefCounter (or another type
// interface compatible with MyRefCounter).
template<typename T>
class AutoRefCounter
{
T* m_pt;
public:
typedef typename T::PtrType PtrType;
// rest of types
explicit AutoRefCounter(T* pt = NULL, bool fAddRef = false)
: m_pt(pt)
{
if (fAddRef && m_pt)
m_pt->AddRef();
}
~AutoRefCounter()
{
assert(m_pt);
if (!m_pt)
return;
m_pt->DecRef();
}
// override operators, so that accessing
// this->m_pt->m_pt becomes simpler.
PtrType operator->() const throw()
{
return m_pt->operator->();
}
// other operators...
};
typedef MyRefCounter<InternalDataStructure> ThreadSharedData;
typedef AutoRefCounter<ThreadSharedData> AutoRef;
// assumption: mainthread sets *ppdest to correct buffer.
// assumption: RefCounter passed to worker thread will not
// be destroyed before thread starts. This implies that
// unlike previous code, main thread should pass a
// ThreadSharedData* to worker thread.
DWORD CALLBACK WorkerThread(LPVOID lpv)
{
// bind smart pointer to object
AutoRef ptsd((ThreadSharedData*)lpv);
MyPtrType ppdest = ptsd->ppdest;
void* pdest;
// do the job, means prepare the "source"
DoTheJob();
// allocate a new buffer instead of directly accessing
// destination.
pdest = AllocateCompletelyNewBuffer_ThreadSafe();
// perform copying
memcpy(pdest, Source, SourceSize);
// just copy it, we don't really mind
InterlockedExchange(ppdest, pdest);
if (!ptsd->TerminateWorker)
// send notification that we have completed.
return 0;
} // smart pointer should decrease reference count by 1
// Main thread cancellation function becomes
void TerminateWorkerThread(SomeInternalHandle h)
{
ThreadSharedData* pdata = DataFromHandle(h);
AutoRef ptsd(pdata);
InterlockedExchange(&pdata->TaskState, TerminateWorker);
} // smart pointer should decrease reference count by 1
Note: Above code is completely imaginary and not even compiled. It just looked sensible to me. Also, as it could be seen, there is almost no error or invariant check.
If you don't know how InterlockedExchange works; it is doing InterlockedCompareExchange(pdest, exchg, *pdest) in a loop until destination value is set to exchange, according to Jeffrey Richter's article on MSDN.
I usually believe in this rule; "each thread should allocate its own buffer, but main thread is main supplier of shared buffers that can be shared across threads". If worker thread needs some memory, it can allocate, use and free it in its lifetime. However, worker thread(s) better not expose their buffers to other threads. Some reasons I can think of right now:
- Each thread may use a different allocator
- Default allocator may not be thread-safe
- Bookkeeping of allocations and deallocations could be way more expensive than performance gain we aim
- Various other reasons
By discipline, I prefer not to mix allocations and deallocations made from within different threads with each other. Above code, particularly memory management, could be enhanced further and would better take a simpler form, but it gives about what I think.
Imaginary, again, operator new and delete (and their overloads) could be overridden for these classes which forces a thread-safe allocation and deallocation.
Above code relies on RAII pattern, presence of thread safe allocator and proper memory management at main thread (i.e. calling a cleanup function to free allocated buffers) as well.
Basically, what I mean is:
- Critical sections are fast, if there is no contention. It makes a system call, if there is contention. Some critical section functions may throw exceptions, which could be caught with a "catch-all" (e.g. "try { foo(); } catch(...) {}") or SEH (Structured Exception Handling). But catch-all is more C++'ish way and is probably follows C++ object semantics, which means any automatic variable constructed inside try block will be destroyed before catch is executed.
- Lock based programming is not bad, but is not that simple either. Major problem is dead-locks and it's sometimes difficult to find them until they happen during debugging session.
- Lock based synchronization is based on lock-free operations.
- Lock-free techniques are bit more complicated, primarily because there is less room for complex operations (e.g. insertion or deletion operation on doubly linked list, with single CAS, is an academic research subject).
- Lock-free programming doesn't necessarily mean wait-free programming, but wait free programming means code is lock-free. There might be cases where caller must wait, but can also perform an idle task while waiting result.
- Sometimes writing more code to automate some operations is better than performing those steps manually (e.g. refcounter and auto-refcounter are good programming practices).
- Always consider threads, they are real and we will use them more frequently day by day.
- If you believe you can do an arbitrary operation in multi-threaded environment with lock-free synchronization, go for it.








