Implementing a fast multi-threaded counter.

Today I'll write a bit about implementing a simple thread safe counter and improving its speed.

Implementing a basic mutli-threaded counter is a fairly easy task. Using pthreads, you just need to wrap the counter increment in a lock.

The code (github link) looks like this:

#include "mythreads.h"

struct counter_t {
  int value;
  pthread_mutex_t lock;
};

static struct counter_t counter;

void init(struct counter_t *c) {
  c->value = 0;
  pthread_mutex_init(&c->lock, NULL);
}

void increment_by(struct counter_t *c, int by) {
  Pthread_mutex_lock(&c->lock);
  c->value += by;
  Pthread_mutex_unlock(&c->lock);
}

void increment(struct counter_t *c) {
  increment_by(c, 1);
}

int get(struct counter_t *c) {
  Pthread_mutex_lock(&c->lock);
  int rc = c->value;
  Pthread_mutex_unlock(&c->lock);
  return rc;
}

Nothing complicated. Let's say that we want to increment this counter 1,000,000 times. And let's do this with an increasing amount of threads, each thread incrementing the counter 1,000,000 times. I get the following timings for this exercise.

1 thread:  0.064s real time.
2 threads: 9.930s
4 threads: 23.971s

This counter is really slow. Also, it doesn't scale well, since at the moment more than one thread starts to use it, it becomes so much slower. Why is this? Well, it's a bit difficult to tell without knowing how mutexes are implemented, but since we're using a single mutex that has to switch between two threads, it looks like there's a lot of overhead in this. The core operation - the increment, is also not parallel, since it can be done by only one thread at a time. But judging from the single threaded timing, this operation by itself is not the bottleneck here. So the synchronization must be.

How can we improve this? We can use what's called a sloppy counter. The sloppy counter is also fairly easy to understand. Each thread has its own counter and when that counter becomes bigger than a certain value, its current value is transferred into a global counter. Here's how the code for that looks like:

#include <string.h>
#include <stdlib.h>

#define min(a, b) ((a) < (b) ? (a) : (b))

const uint64_t SLOTS_COUNT = 101;

struct sloppy_counter_t {
  int value;

  struct counter_t gcounter;

  // A hash table for per-thread counters. Since we're unlikely to run too many threads at the same time,
  // chances for collision are low. If that's not the case, we can always use a per-counter mutex.
  int lcounters[SLOTS_COUNT];
};

static struct sloppy_counter_t sloppy_counter;

void sloppy_init(struct sloppy_counter_t *c) {
  for (int i = 0; i < SLOTS_COUNT; i++) {
    c->lcounters[i] = 0;
  }

  init(&c->gcounter);
}

int slot_id(pthread_t thread_id) {
  uint64_t ptid = 0;
  memcpy(&ptid, &thread_id, min(sizeof(thread_id), sizeof(ptid)));

  int sid = ptid % SLOTS_COUNT;
  return sid;
}

void sloppy_increment(struct sloppy_counter_t *c, pthread_t thread_id) {
  int sid = slot_id(thread_id);

  c->lcounters[sid]++;
  if (c->lcounters[sid] > 128) {
    increment_by(&c->gcounter, c->lcounters[sid]);
    c->lcounters[sid] = 0;
  }
}

void sloppy_flush(struct sloppy_counter_t *c, pthread_t thread_id) {
  int sid = slot_id(thread_id);
  if (c->lcounters[sid] > 0) {
    increment_by(&c->gcounter, c->lcounters[sid]);
  }
}


int sloppy_get(struct sloppy_counter_t *counter) {
  return get(&counter->gcounter);
}

Now let's time this counter.

1 thread:  0.026s
2 threads: 0.050s
4 threads: 0.164s

Much better! This counter, just like the first one, is thread safe. It's not as accurate, but the inaccuracy is small (at most 128 * number of threads) and we can use the flush function if we want accurate counts.

social