Description: Makes trimming work consistently across arenas.
Author: Mel Gorman <mgorman@suse.de>
Origin: git://sourceware.org/git/glibc.git
Bug-RHEL: N/A
Bug-Fedora: N/A
Bug-Upstream: #17195
Upstream status: committed

Part of commit 8a35c3fe122d49ba76dff815b3537affb5a50b45 is also included
to allow the use of ALIGN_UP within malloc/arena.c.

commit c26efef9798914e208329c0e8c3c73bb1135d9e3
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Apr 2 12:14:14 2015 +0530

    malloc: Consistently apply trim_threshold to all heaps [BZ #17195]

    Trimming heaps is a balance between saving memory and the system overhead
    required to update page tables and discard allocated pages. The malloc
    option M_TRIM_THRESHOLD is a tunable that lets users decide where this
    balance point is, but it is only applied to the main arena.
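
    For reference, the threshold can be set at runtime with mallopt. A
    minimal sketch (the 1 MiB value is only an illustration, not a
    recommendation):

    /* gcc set-trim.c -o set-trim */
    #include <malloc.h>   /* mallopt, M_TRIM_THRESHOLD */

    int main (void)
    {
      /* Trim only when at least 1 MiB is unused at the top of a heap.
         Setting this disables glibc's dynamic threshold adjustment.
         mallopt returns 1 on success.  */
      if (mallopt (M_TRIM_THRESHOLD, 1024 * 1024) != 1)
        return 1;
      return 0;
    }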

    For scalability reasons, glibc malloc has per-thread heaps, but these are
    shrunk with madvise() whenever as little as one page is free at the top
    of the heap. In some circumstances this can lead to high system overhead
    if a thread has a control flow like

        while (data_to_process) {
            buf = malloc(large_size);
            do_stuff();
            free(buf);
        }

    For a large size, the free() will call madvise (page table teardown, page
    freeing and TLB flush) every time, followed immediately by a malloc (fault,
    kernel page allocation, zeroing and charge accounting). The kernel overhead
    can dominate such a workload.

    This patch lets the user tune when madvise gets called by applying the
    trim threshold to the per-thread heaps, using logic similar to the main
    arena's when deciding whether to shrink. Alternatively, if the dynamic
    brk/mmap threshold gets adjusted, the new values will be obeyed by the
    per-thread heaps.
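
    For context, the dynamic adjustment happens when glibc frees a chunk
    that was served by mmap; a rough sketch of that logic (field and macro
    names as in glibc's malloc.c):

    if (!mp_.no_dyn_threshold
        && chunksize (p) > mp_.mmap_threshold
        && chunksize (p) <= DEFAULT_MMAP_THRESHOLD_MAX)
      {
        /* Grow the mmap threshold to this chunk's size and keep the trim
           threshold at twice that, so comparable future allocations stay
           on the heap and are not immediately trimmed.  */
        mp_.mmap_threshold = chunksize (p);
        mp_.trim_threshold = 2 * mp_.mmap_threshold;
      }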

    Bug 17195 was a test case motivated by a problem encountered in scientific
    applications written in Python that performed badly due to high page fault
    overhead. The basic operation of such a program was posted by Julian Taylor:
    https://sourceware.org/ml/libc-alpha/2015-02/msg00373.html

    With this patch applied, the overhead is eliminated. All numbers in this
    report are in seconds and were recorded by running Julian's program 30
    times.

    pyarray
                                     glibc               madvise
                                      2.21                    v2
    System  min             1.81 (  0.00%)        0.00 (100.00%)
    System  mean            1.93 (  0.00%)        0.02 ( 99.20%)
    System  stddev          0.06 (  0.00%)        0.01 ( 88.99%)
    System  max             2.06 (  0.00%)        0.03 ( 98.54%)
    Elapsed min             3.26 (  0.00%)        2.37 ( 27.30%)
    Elapsed mean            3.39 (  0.00%)        2.41 ( 28.84%)
    Elapsed stddev          0.14 (  0.00%)        0.02 ( 82.73%)
    Elapsed max             4.05 (  0.00%)        2.47 ( 39.01%)

                   glibc     madvise
                    2.21          v2
    User          141.86      142.28
    System         57.94        0.60
    Elapsed       102.02       72.66

    Note that almost a minute's worth of system time is eliminated and the
    program completes 28% faster on average.

    To illustrate the problem without Python, this is a basic test case for
    the worst-case scenario, where every free is a madvise followed by an
    alloc:

    /* gcc bench-free.c -lpthread -o bench-free */
    #include <pthread.h>
    #include <stdlib.h>

    static int num = 1024;

    void __attribute__((noinline,noclone)) dostuff (void *p)
    {
    }

    void *worker (void *data)
    {
      int i;

      for (i = num; i--;)
        {
          /* Allocate a buffer large enough that freeing it can trigger
             a heap trim.  */
          void *m = malloc (48*4096);
          dostuff (m);
          free (m);
        }

      return NULL;
    }

    int main()
    {
      pthread_t t;
      void *ret;
      if (pthread_create (&t, NULL, worker, NULL))
        exit (2);
      if (pthread_join (t, &ret))
        exit (3);
      return 0;
    }

    Before the patch, this resulted in 1024 calls to madvise. With the patch applied,
    madvise is called twice because the default trim threshold is high enough to avoid
    this.

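    One way to count the madvise calls for this test (assuming strace is
    available):

        strace -c -f -e trace=madvise ./bench-free
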
    This is a more complex case where there is a mix of frees. It's simply a
    different worker function for the test case above:

    void *worker (void *data)
    {
      int i;
      int j = 0;
      void *free_index[num];

      for (i = num; i--;)
        {
          void *m = malloc ((i % 58) * 4096);
          dostuff (m);
          if (i % 2 == 0) {
            free (m);
          } else {
            free_index[j++] = m;
          }
        }
      /* j now counts the stored pointers, so step back to the last valid
         index before freeing.  */
      for (j--; j >= 0; j--)
        {
          free (free_index[j]);
        }

      return NULL;
    }

    glibc 2.21 calls madvise 90305 times but with the patch applied, it's
    called 13438 times. Increasing the trim threshold will decrease the
    number of times it's called, with the option of eliminating the overhead.

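    The threshold can also be raised without recompiling via glibc's
    MALLOC_TRIM_THRESHOLD_ environment variable; the 32 MiB value here is
    only an illustration:

        MALLOC_TRIM_THRESHOLD_=33554432 ./bench-free
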
    ebizzy is meant to generate a workload resembling common web application
    server workloads. It is threaded with a large working set that at its core
    has an allocation, do_stuff, free loop that also hits this case. The primary
    metric of the benchmark is records processed per second. This is running on
    my desktop, which is a single-socket machine with an i7-4770 (4 cores,
    8 threads). Each thread count was run for 30 seconds. It was only run once
    as the performance difference is so high that the variation is insignificant.

                    glibc 2.21              patch
    threads 1            10230              44114
    threads 2            19153              84925
    threads 4            34295             134569
    threads 8            51007             183387

    Note that the saving happens to be a coincidence, as the size allocated
    by ebizzy was less than the default threshold. If a different number of
    chunks were specified, then it may also be necessary to tune the
    threshold to compensate.

    This roughly quadruples the performance of this benchmark. The difference
    in system CPU usage illustrates why.

    ebizzy running 1 thread with glibc 2.21
    10230 records/s 306904
    real 30.00 s
    user  7.47 s
    sys  22.49 s

    22.49 seconds was spent in the kernel for a workload running 30 seconds.
    With the patch applied:

    ebizzy running 1 thread with patch applied
    44126 records/s 1323792
    real 30.00 s
    user 29.97 s
    sys   0.00 s

    System CPU usage was zero with the patch applied. strace shows that glibc
    running this workload calls madvise approximately 9000 times a second. With
    the patch applied, madvise was called twice during the workload (or 0.06
    times per second).

    2015-02-10  Mel Gorman  <mgorman@suse.de>

      [BZ #17195]
      * malloc/arena.c (heap_trim): Apply trim threshold to per-thread heaps
        as well as the main arena.

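ALIGN_DOWN (used in the arena.c hunk below) and its companion ALIGN_UP come
from include/libc-internal.h, where they are defined roughly as:

  #define ALIGN_DOWN(base, size) ((base) & -((__typeof__ (base)) (size)))
  #define ALIGN_UP(base, size)   ALIGN_DOWN ((base) + (size) - 1, (size))
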
Index: glibc-2.17-c758a686/malloc/arena.c
===================================================================
--- glibc-2.17-c758a686.orig/malloc/arena.c
+++ glibc-2.17-c758a686/malloc/arena.c
@@ -661,7 +661,7 @@ heap_trim(heap_info *heap, size_t pad)
   unsigned long pagesz = GLRO(dl_pagesize);
   mchunkptr top_chunk = top(ar_ptr), p, bck, fwd;
   heap_info *prev_heap;
-  long new_size, top_size, extra, prev_size, misalign;
+  long new_size, top_size, top_area, extra, prev_size, misalign;
 
   /* Can this heap go away completely? */
   while(top_chunk == chunk_at_offset(heap, sizeof(*heap))) {
@@ -695,9 +695,16 @@ heap_trim(heap_info *heap, size_t pad)
     set_head(top_chunk, new_size | PREV_INUSE);
     /*check_chunk(ar_ptr, top_chunk);*/
   }
+
+  /* Uses similar logic for per-thread arenas as the main arena with systrim
+     by preserving the top pad and at least a page.  */
   top_size = chunksize(top_chunk);
-  extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
-  if(extra < (long)pagesz)
+  top_area = top_size - MINSIZE - 1;
+  if (top_area <= pad)
+    return 0;
+
+  extra = ALIGN_DOWN(top_area - pad, pagesz);
+  if ((unsigned long) extra < mp_.trim_threshold)
     return 0;
   /* Try to shrink. */
   if(shrink_heap(heap, extra) != 0)
Index: glibc-2.17-c758a686/malloc/malloc.c
===================================================================
--- glibc-2.17-c758a686.orig/malloc/malloc.c
+++ glibc-2.17-c758a686/malloc/malloc.c
@@ -236,6 +236,8 @@
 /* For va_arg, va_start, va_end.  */
 #include <stdarg.h>
 
+/* For ALIGN_UP.  */
+#include <libc-internal.h>
 
 /*
   Debugging: