Scalable Statistics Counters

Dipankar Sarma <dipankar@in.ibm.com>
Ravikiran G. Thirumalai <kiran@in.ibm.com>
Paul E. McKenney <paul.mckenney@us.ibm.com>

Overview:

In-kernel counters are used to collect statistics of various events and report them on request (e.g., total packets transmitted from a network interface, context switches, disk usage statistics, etc.). These counters are usually not serialised by any mutex, since they only need to be reasonably accurate. Depending on their purpose and the load conditions, these counters may be updated very frequently. This document proposes a new set of interfaces that should improve the performance of such counters in multiprocessor environments.

The basic idea behind the approach is to keep per-CPU versions of each counter and have each CPU update only its own version. The per-CPU versions can be consolidated when the counter is read. This improves performance significantly in multiprocessor environments by reducing the bouncing of the counter cache lines among different CPUs. At the same time, when compiled without the CONFIG_SMP option set, this implementation reduces to code that carries none of the SMP overhead.

Interfaces:

Counters for which exactness is not important (like statistics counters) can be declared as type statctr_t. This is an opaque type and users of the counters need not access its fields. For an SMP kernel (CONFIG_SMP=y), it is a structure holding the information needed to quickly locate the appropriate per-CPU counter. For a uniprocessor kernel, it is just an unsigned long and thus carries no overhead.

	#ifdef CONFIG_SMP
	typedef struct statctr {
		pcpu_ctr_t *ctr;
	} statctr_t;
	#else
	typedef unsigned long statctr_t;
	#endif

  1. int statctr_init(statctr_t *stctr, unsigned long value)
  2. int statctr_ninit(statctr_t *stctr, int count, unsigned long value)

     All statctrs must be initialized using one of these interfaces. A statctr_t cannot simply be declared statically at compile time and used, because the counter refers to per-CPU versions in separate cache lines, which are allocated during statctr_init(). Counters initialized using these functions must be freed using statctr_cleanup() or statctr_ncleanup() respectively. For uniprocessor kernels, statctr_init()/statctr_ninit() reduce to simply initializing the unsigned long counter (or an array of them); there is no separate allocation involved. These functions return 0 on success and a non-zero value on failure. (A usage sketch follows this list.)

  3. void statctr_cleanup(statctr_t *stctr)
  4. void statctr_ncleanup(statctr_t *stctr, int count)

     These interfaces must be used to free statistics counters initialized using statctr_init() or statctr_ninit() respectively. They free the per-cpu counter(s) and internal data structure(s) allocated during statctr_init() or statctr_ninit(). They are no-ops for uniprocessor kernels.

  5. void statctr_inc(statctr_t *stctr)

     Increments the counter by one. Internally, only the executing CPU's version of the counter is incremented. For uniprocessor kernels, it simply increments the unsigned long statistics counter non-atomically.

  6. void statctr_dec(statctr_t *stctr)

     Decrements the counter by one. Internally, only the executing CPU's version of the counter is decremented. For uniprocessor kernels, it non-atomically decrements the counter.

  7. void statctr_add(statctr_t *stctr, unsigned long addend)

     Adds the addend value to the counter. On SMP kernels, only the executing CPU's version of the counter is updated.

  8. void statctr_sub(statctr_t *stctr, unsigned long subtrahend)

     Subtracts the subtrahend value from the counter. On SMP kernels, only the executing CPU's version of the counter is updated.

  9. unsigned long statctr_read(statctr_t *stctr)

     Returns the counter value. This function reads all of the per-CPU versions of the counter, adds them up and returns the sum to the caller. For uniprocessor kernels, it just returns the value of the counter.

  10. void statctr_set(statctr_t *stctr, unsigned long i)

     Sets the statctr to the value i. If statctr_read() is invoked on the counter after statctr_set(), the return value reflects the value set.
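
As a usage illustration (not part of the proposed interfaces), a hypothetical driver statistics structure might use the counters as follows; the mydev_* names and error values are invented for this sketch:

	struct mydev_stats {
		statctr_t	tx_packets;
		statctr_t	tx_bytes;
	};

	static int mydev_stats_init(struct mydev_stats *st)
	{
		/* Allocates the per-CPU versions on SMP kernels. */
		if (statctr_init(&st->tx_packets, 0))
			return -ENOMEM;
		if (statctr_init(&st->tx_bytes, 0)) {
			statctr_cleanup(&st->tx_packets);
			return -ENOMEM;
		}
		return 0;
	}

	/* Fast path: updates only the executing CPU's versions. */
	static void mydev_count_xmit(struct mydev_stats *st, unsigned long len)
	{
		statctr_inc(&st->tx_packets);
		statctr_add(&st->tx_bytes, len);
	}

	/* Slow path (e.g. a /proc read): sums up the per-CPU versions. */
	static unsigned long mydev_tx_packets(struct mydev_stats *st)
	{
		return statctr_read(&st->tx_packets);
	}

	static void mydev_stats_free(struct mydev_stats *st)
	{
		statctr_cleanup(&st->tx_packets);
		statctr_cleanup(&st->tx_bytes);
	}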

Statistics Counter Maintenance:

In order to avoid sharing of cache lines between CPUs, the per-CPU versions of a statctr_t counter are placed in different cache lines. However, since a counter is much smaller than a typical cache line, different statistics counters have their per-CPU versions interlaced within the same cache lines, which saves space.
             CPU #0                  CPU #1
       -----------------        -----------------       ^
       |               |        |               |       |
       |   counter #0  |        |   counter #0  |       |
       |               |        |               |       |
       -----------------        -----------------       |
       |               |        |               |       |
       |   counter #1  |        |   counter #1  |       |
       |               |        |               |       |
       -----------------        -----------------       |
       |               |        |               |       |
       |   counter #2  |        |   counter #2  |  SMP_CACHE_BYTES
       |               |        |               |       |
       -----------------        -----------------       |
       |               |        |               |       |
       |   counter #3  |        |   counter #3  |       |
       |               |        |               |       |
       -----------------        -----------------       |
       |               |        |               |       |
       |               |        |               |       |
       |               |        |               |       |
       -----------------        -----------------       |
               .                        .               |
               .                        .               |
               .                        .               |
               .                        .               v

We use a separate allocator to allocate and maintain interlaced per-cpu counters. Currently there are two implementations of this allocator - Fixed and Flexible. They differ in the flexibility provided, especially with respect to handling addition or removal of CPUs in the system. The statctr implementation uses a common set of APIs to access the per-cpu counters irrespective of the underlying allocator. A per-cpu counter is represented by the opaque type pcpu_ctr_t. The APIs are listed below (a sketch of how the statctr layer can be built on top of them follows the list):

  1. pcpu_ctr_t *pcpu_ctr_alloc(int flag)

     Allocates a per-CPU counter, including all the data structures necessary in the underlying implementation. Returns NULL on failure.

  2. void pcpu_ctr_free(pcpu_ctr_t *ctr)

     Frees the counter.

  3. unsigned long *PCPU_CTR(pcpu_ctr_t *ctr, int cpu)

     Given a CPU, returns a pointer to that CPU's version of the counter.
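
For illustration, the SMP statctr operations can be layered on these APIs roughly as sketched below. This is not the exact code from the patches; in particular, passing GFP_KERNEL as the allocation flag and the -ENOMEM return value are assumptions made here.

	static __inline__ int statctr_init(statctr_t *stctr, unsigned long value)
	{
		int cpu;

		stctr->ctr = pcpu_ctr_alloc(GFP_KERNEL);
		if (!stctr->ctr)
			return -ENOMEM;

		/* Zero every CPU's version and credit the initial value to
		 * CPU #0 so that statctr_read() returns 'value'. */
		for (cpu = 0; cpu < smp_num_cpus; cpu++)
			*PCPU_CTR(stctr->ctr, cpu) = 0;
		*PCPU_CTR(stctr->ctr, 0) = value;
		return 0;
	}

	static __inline__ unsigned long statctr_read(statctr_t *stctr)
	{
		unsigned long sum = 0;
		int cpu;

		/* The counter value is the sum of all per-CPU versions. */
		for (cpu = 0; cpu < smp_num_cpus; cpu++)
			sum += *PCPU_CTR(stctr->ctr, cpu);
		return sum;
	}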

Fixed counters:


  my_struct
  ------------
  |          |
  | statctr0 |------+
  |          |      |
  ------------      |    -----------------    ^
  |          |      +--->|               |    |
  | statctr1 |---+       |   counter #0  |    |
  |          |   |       |               |    |
  ------------   +------>-----------------    |
                         |               | SMP_CACHE_BYTES
                         |   counter #1  | (for CPU #0)
                         |               |    |
                         -----------------    |
                         |       .       |    |
                         |       .       |    v
                         -----------------    ^
                         |               |    |
                         |   counter #0  |    |
                         |               |    |
                         -----------------    |
                         |               | SMP_CACHE_BYTES
                         |   counter #1  | (for CPU #1)
                         |               |    |
                         -----------------    |
                                 .
                                 .
                                 .


Here, a statistics counter declared and initialized using statctr_init() is allocated from a smp_num_cpus * SMP_CACHE_BYTES sized block, with each per-CPU version residing in a different cache-line-aligned chunk of SMP_CACHE_BYTES size. The next counter initialized shares the same block, its per-CPU versions interlaced alongside the first counter's. When a block is exhausted, another one is allocated.

The statctr_t opaque type contains a pointer to the CPU #0 version of the statistics counter, which resides in cache line 0 of the block (of size smp_num_cpus * SMP_CACHE_BYTES). The address of any other CPU's version of the counter is computed from this pointer using pointer arithmetic.

typedef void *pcpu_ctr_t;

#define PCPU_CTR(ctr, cpu) \
       ((unsigned long *)((char *)(ctr) + SMP_CACHE_BYTES * (cpu)))

static __inline__ void statctr_inc(statctr_t *stctr)
{
	(*PCPU_CTR(stctr->ctr, smp_processor_id()))++;
}

Because SMP_CACHE_BYTES is a power of 2, a reasonable compiler will turn the multiplication in this address computation into a shift rather than generating costlier instructions.
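
For illustration only, the allocation side of the fixed scheme might look roughly like the sketch below. The struct fixed_ctr_blk bookkeeping and the function name are invented here and are not the names used in the actual patches.

	/* Hypothetical bookkeeping for one smp_num_cpus * SMP_CACHE_BYTES block. */
	struct fixed_ctr_blk {
		struct fixed_ctr_blk	*next;	/* list of all such blocks        */
		void			*base;	/* cacheline-aligned block memory */
		int			used;	/* counter slots handed out       */
	};

	#define CTRS_PER_LINE	(SMP_CACHE_BYTES / sizeof(unsigned long))

	static pcpu_ctr_t *fixed_ctr_alloc(struct fixed_ctr_blk *blkp)
	{
		unsigned long *cpu0_ctr;

		if (blkp->used == CTRS_PER_LINE)
			return NULL;	/* block full; caller allocates a new block */

		/* Hand out the next slot in CPU #0's cache line.  The other
		 * CPUs' versions live at the same offset, SMP_CACHE_BYTES * cpu
		 * further on, which is exactly what PCPU_CTR() computes. */
		cpu0_ctr = (unsigned long *)((char *)blkp->base +
					     blkp->used * sizeof(unsigned long));
		blkp->used++;
		return (pcpu_ctr_t *)cpu0_ctr;
	}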

Flexible counters:

In order to make statctr work with the CPU hotplug feature, we need a different approach to maintaining the per-cpu counters. For CPU online/offline to work with the Fixed counters, we would have to allocate cacheline-sized counter chunks (of SMP_CACHE_BYTES) for all NR_CPUS possible CPUs, which is likely to be a waste on many SMP machines. Instead, we use a flexible, array-based scheme that allows dynamic expansion of the per-cpu chunks allocated for counters.
  1. Each statctr is represented by an array of size NR_CPUS that contains the pointers to the per-cpu versions of the counter. Initially we fill in the array entries for the currently online CPUs. Onlining/offlining only requires updating the affected CPU's entry in this array. The data structures that contain the statctr do not need any modification and remain available to the other CPUs for increment/decrement/read operations.
  2. The resulting counter layout looks like this:
                    ---------------          ---------------
                    |             |          |             |
                    |    next     |--------->|    next     |--------->
                    |             |          |             |
                    ---------------          ---------------
                    |             |          |             |
                    |             |          |             |
                    |             |          |             |
                    ---------------          ---------------
                           |                        |
       my_struct           |                        |
      -----------          |
      |         |          v
      | statctr |---->---------------
      |         |  ^  |             |
      -----------  |  |   CPU #0    |--------->-----------------   ^
      |         |  |  |             |          |               |   |
      |         |  |  ---------------          |               |   |
      -----------  |  |             |          |               |   |
                   |  |   CPU #1    |----+     |               |   |
             NR_CPUS  |             |    |     |               | SMP_CACHE_BYTES
                   |  ---------------    |     |               |   |
                   |  |             |    |     |               |   |
                   |  |   CPU #2    |    |     |               |   |
                   |  |             |    |     |               |   v
                   |  ---------------    +---->-----------------
                   |  |             |          |               |
                   |         .                        .
                   |         .                        .
                   |         .
                   v         .
    
  3. The allocator maintains the counters in counter blocks. Each block contains an array of size NR_CPUS in which each element corresponding to an online CPU points to a cacheline-sized, cacheline-aligned counter chunk (of SMP_CACHE_BYTES). These chunks are allocated using the existing kmem_cache_alloc() slab allocator. The counter block also has a free list used to allocate individual counters within it.
  4. The fast path code (increment/decrement) involves an extra load to fetch the given CPU's version of the counter, but this is unlikely to be significant compared to the savings, since the extra load does not result in cache line bouncing.
    
    typedef struct pcpu_ctr_s {
           void *arr[NR_CPUS];     /* Pcpu counter array */
           void *blkp;             /* Pointer to block from which ctr was
                                      allocated from (for use with free code) */
    } pcpu_ctr_t;
    
    #define PCPU_CTR(ctr, cpuid) ((unsigned long *)(ctr)->arr[cpuid])
    
    static __inline__ void statctr_inc(statctr_t *statctr)
    {
    	(*PCPU_CTR(statctr->ctr, smp_processor_id()))++;
    }
    
  5. We may maintain a list of all such counters allocated (a linked list of the counter pointer-arrays). When a CPU is onlined or offlined, this list is traversed and the corresponding pointer in each array is updated. No locking is necessary for the counter fast paths during this update, since only the affected CPU's pointer changes and the other CPUs continue to use their own entries. Of course, locking is needed when creating new counters or modifying the list itself. (A rough sketch follows this list.)
  6. If the number of CPUs in the system decreases, we can live with leaving the previously allocated cache lines around unused. If the number of CPUs increases, we need to allocate an additional cache line for each additional CPU.
  7. This scheme will also handle sparse CPU numbers.
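
As a rough sketch of items 5 and 6 above, the CPU-online update might look like the code below. All of the names here (pcpu_ctr_blk, ctr_blk_list, ctr_chunk_cache, pcpu_ctr_cpu_online) are invented for illustration and are not taken from the patches.

	/* Hypothetical per-block bookkeeping (see item 3 above). */
	struct pcpu_ctr_blk {
		struct pcpu_ctr_blk	*next;		 /* list of all counter blocks */
		void			*chunk[NR_CPUS]; /* per-CPU counter chunks     */
		/* ... free list of counters within this block ... */
	};

	static struct pcpu_ctr_blk *ctr_blk_list;
	static spinlock_t ctr_blk_list_lock = SPIN_LOCK_UNLOCKED;
	static kmem_cache_t *ctr_chunk_cache;	/* SMP_CACHE_BYTES sized, aligned objects */

	/* Give every existing counter block a chunk for a newly onlined CPU.
	 * Entries for the other CPUs are never touched, so the
	 * increment/decrement fast paths need no locking. */
	static void pcpu_ctr_cpu_online(int cpu)
	{
		struct pcpu_ctr_blk *blkp;

		spin_lock(&ctr_blk_list_lock);
		for (blkp = ctr_blk_list; blkp != NULL; blkp = blkp->next) {
			if (blkp->chunk[cpu])
				continue;	/* already populated */

			/* SLAB_ATOMIC because we hold a spinlock; a real
			 * implementation might preallocate instead.  Error
			 * handling is omitted in this sketch. */
			blkp->chunk[cpu] = kmem_cache_alloc(ctr_chunk_cache,
							    SLAB_ATOMIC);

			/* Each counter allocated from this block would now get
			 * its arr[cpu] entry pointed at its slot within the
			 * new chunk. */
		}
		spin_unlock(&ctr_blk_list_lock);
	}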

Results:

All measurements were done on an 8-way 700MHz Pentium III Xeon (1MB L2 cache) machine with 4GB RAM.

Microbenchmarking:

Microbenchmarking, while not indicative of any real workload, is still useful for verifying whether an implementation has inherent bottlenecks. Here are the results of measuring the cost of a global counter versus a statctr, averaged over 200 million operations.

Observations:

Tbench over loopback:

The loopback driver uses 4 counters to maintain Tx/Rx statistics. This measurement used tbench over lo to compare the relative advantages and disadvantages of using statctrs for these statistics. The number of cycles needed to increment the 4 counters, as well as the profile count for loopback_xmit(), were measured.

Observations:

Download:

The statctr patches are available from the Scalable Counters Package on the LSE (Linux Scalability Effort) site.
