What happens when you run out of memory in Linux?

I've always been curious about what happens when you run out of memory in Linux, and recently an experiment I was running helped me find out.

I was trying out a DHBox deployment on an EC2 machine. DHBox lets you start a virtual environment in which you can try out various data science tools. These virtual environments run in Docker containers, and the containers take a lot of memory, so you can't run too many on a single machine. So far so good. But let's see what actually happens.

Before we start, here is a great document from the Netflix engineering blog that describes the tools you can use to debug a slow Linux box.

I deployed DHBox on an Ubuntu image in EC2 with 1GB of memory and no swap file. Yes, no swap file. Why? Apparently because a swap file can generate a lot of EBS IO, which drives up the cost. Still, this makes for an interesting debugging scenario, so let's continue.
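
You can quickly confirm that no swap is configured at all:

$ swapon -s
$ cat /proc/swaps

If all you see is the header line (or nothing at all), there's no swap device or file in use.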

Here's what our free memory situation looks like in the beginning:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            990          78         518           4         392         867
Swap:             0           0           0

Notice how we have 392MB in "buff/cache". What is this? Here is a good explanation of it. The buffer cache holds, in RAM, copies of data that lives on disk. For example, the "ls" binary or the glibc library are used all the time, so they're good candidates for caching.
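
If you want to see the cache in action, here's a small experiment (just a sketch, not part of the DHBox setup): read a bunch of files from disk and watch buff/cache grow.

$ free -m
$ cat /usr/bin/* > /dev/null 2>&1
$ free -m

The second free should show a bigger buff/cache and a smaller free column, while available stays roughly the same, because the kernel can drop that cache the moment a process actually needs the memory.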

The first thing I do is start the Python web app for DHBox. It's a simple Flask app running on gunicorn. After I start it, this is what free shows:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            990         127         453           4         410         818
Swap:             0           0           0

Reasonable. We are using a bit more memory, and we have cached a bit more data from disk.
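
For reference, starting a Flask app under gunicorn looks roughly like this (the module name, bind address and worker count are made up here; DHBox's actual command is probably different):

$ gunicorn --workers 2 --bind 0.0.0.0:8080 app:app

Each worker is a separate process with its own copy of the app loaded, which is presumably where most of the extra used memory comes from.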

Now I'll start vmstat and run four virtual labs. That seems to be enough to use up the 1GB of memory and pretty much bring down the machine. vmstat is a great tool; it outputs a lot of useful information. Let's see what it looks like before we start running the Docker containers:

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 463880  51328 369012    0    0   635    32   94  231  2  1 95  1  1
 0  0      0 463356  51336 369044    0    0     0     2   36   68  0  0 100  0  0
 0  0      0 462884  51372 369116    0    0    14     2   46   69  1  0 99  0  0
 0  0      0 462380  51372 369116    0    0     0     0   43  101  0  0 100  0  0

The telling part here is the id column. That's the percentage of time the CPU spends idle. As you can see, the CPU is pretty idle. Not much to do.

Now let's start the first Docker container. Things spike for a while. Let's observe vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 462380  51372 369116    0    0     0     0   43  101  0  0 100  0  0
# Here's when we start the Docker container.
 2  0      0 185252  56072 497736    0    0  9159  8016 2126 7829 22 14 43 21  0
 1  0      0  95416  56576 525740    0    0  1924  1036  396 1467 51 13 30  7  0
 1  0      0 147764  56616 527408    0    0   336   630  117  307  1  1 96  1  0
 0  0      0  65604  56664 537452    0    0     0    82   81  494  3  3 94  0  0
 0  0      0  65556  56676 537500    0    0     0    59   59  181  0  0 99  1  0
 0  0      0  65524  56684 537504    0    0     0    58   56  179  0  0 99  1  0
 0  0      0  65400  56692 537552    0    0     0  4928  138  231  0  0 98  2  0
 0  0      0  65400  56700 537552    0    0     0  3187  107  201  0  0 98  2  0

Interesting. As you can see, the id column value decreases. Less idleness, because we're starting things. After a while the Docker container has started and things are quiet again.

Let's see what free is saying:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            990         350          59          10         580         569
Swap:             0           0           0

So, used went up from 127MB to 350MB. The buffer cache also went up. Less available memory.

Let's start the second Docker container and look at vmstat again.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  61304  56740 537872    0    0     0    56   83  227  0  0 99  1  0
# Starting second container here.
 1  1      0 126516  59292 341252    0    0  7109  8100  823 3255 22 15 56  6  0
 1  0      0  47572  59904 337152    0    0  9186  7515  882 2316 51 13 17 19  0
 0  1      0 153212  59940 307720    0    0  1094  2122  214  557  2  3 92  3  0
 0  0      0  92848  59996 303368    0    0   556   149  147  690  2  3 93  1  0
 0  0      0  92848  60004 303368    0    0     0   108   91  311  0  0 99  0  0
 0  0      0  92724  60012 303376    0    0     0   102   94  326  0  0 98  1  0
 0  0      0  88296  60028 305048    0    0   320   102  147  424  2  1 96  2  0
 0  0      0  77228  60364 314520    0    0  1876   144  306  868 28  4 64  4  0
 0  0      0  86800  60372 304968    0    0    26   387  111  372  1  1 97  1  0
 0  0      0  86032  60780 305120    0    0   110   119  150  399  0  0 98  2  0

Similar situation. A spike in IO and a spike in non-idle CPU, followed by quiet. Let's look at free:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            990         549          84          17         356         371
Swap:             0           0           0

More used memory. But notice that this time the buff/cache number went down. Why? Our running processes are using more memory, and that memory has to come from somewhere, so the kernel frees memory from the buffer cache and gives it to the processes. Again, reasonable behavior.
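
If you want a more detailed view of what the kernel considers reclaimable, /proc/meminfo breaks it down:

$ grep -E '^(MemFree|MemAvailable|Buffers|Cached):' /proc/meminfo

MemAvailable is the kernel's estimate of how much memory it could hand to new processes without swapping, which is roughly the free column plus the part of the cache it's willing to drop.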

Let's start the third and fourth Docker containers and see how vmstat looks after that.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  46052  11188 160564    0    0     0   151  153  532  0  1 97  2  0
 0  0      0  39828  11336 166664    0    0  1246   141  235  634  1  1 95  4  0
 0  0      0  39704  11344 166664    0    0     0   149  158  542  0  1 97  2  0
# After we start the fourth container.
 6  3      0  18840  11668 179484    0    0 18678  7726 2711 7721  5  8 46 41  0
 1  4      0   8964   2796 103108    0    0 38137  2508 4198 4872 37 21  0 41  1
 0 11      0   9576    588 101548    0    0 68988   485 4097 6006  1 10  0 88  0
 0  8      0   9352    152 100752    0    0 67797   250 2463 4497  0  8  0 90  1

....


 1 18      0   9508    248 100560    0    0 63568   138 1700 3935  0  8  0 91  1
 0 14      0   8656    400 100436    0    0 63154    68 1580 3766  0  8  0 91  1
 0 19      0   8852    288 100552    0    0 63711   128 1606 3705  0  8  0 92  1
 1 47      0  10232    372  98616    0    0 63410    94 1720 4316  0  7  0 91  1
 2 50      0   9648    420  99604    0    0 63126    77 1696 4425  0  8  0 91  1
 0 40      0  10028    272  99432    0    0 63692    74 1681 4506  0  7  0 91  1
 0 75      0   9944    160  98220    0    0 63342    59 1724 5150  0 10  0 89  1
 3 34      0   8952    296 100400    0    0 63575   110 1958 4663  0  8  0 91  1
 1  5      0   9004   1716 117640    0    0 62704   123 2674 5329  3  9  0 88  1

This is where things start to get bad! The box becomes very unresponsive. Typing a simple command like ls takes seconds to execute. Let's take a peek at free and continue our analysis after that:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            990         851          20          24         119          54
Swap:             0           0           0

As you can see, we barely have any free memory and the buffer cache has shrunk even more. But notice what's going on in vmstat: the bi column (blocks received from a block device) is stuck at a constant high value. This means we're doing a lot of disk reads. Why? Are our processes doing a lot of disk operations? No. Our many processes are executables that live on disk, and when the kernel runs them it has to read their instructions. If we had enough buffer cache, those instructions would stay in memory and wouldn't have to be read again. But our buffer cache is small now, so every time one process runs, the kernel pulls its pages from disk into the cache, evicting the pages of the previous process, which then have to be read back from disk the next time that process runs.
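
One way to check this theory is to look at major page faults, i.e. memory accesses that had to go all the way to disk. I didn't capture this at the time, but something like the following would show it (sar needs the sysstat package installed):

$ ps -eo pid,comm,maj_flt --sort=-maj_flt | head
$ sar -B 5

During the thrashing you'd expect the maj_flt counts in ps and the majflt/s column in sar to be climbing fast.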

Are we done? Not yet. While this whole sad mess is happening, the kernel runs something called the OOM killer. How do we know that's what's going on? We can use dmesg to view the kernel messages:

$ dmesg | grep "Out of memory"
[ 3635.537538] Out of memory: Kill process 21580 (jupyter-noteboo) score 37 or sacrifice child
[ 3636.822607] Out of memory: Kill process 22714 (jupyter-noteboo) score 37 or sacrifice child
[ 3643.006328] Out of memory: Kill process 24976 (jupyter-noteboo) score 37 or sacrifice child
[ 3654.916468] Out of memory: Kill process 26118 (jupyter-noteboo) score 37 or sacrifice child
[ 3658.712286] Out of memory: Kill process 28364 (jupyter-noteboo) score 35 or sacrifice child
[ 3666.654763] Out of memory: Kill process 30603 (jupyter-noteboo) score 36 or sacrifice child
[ 3685.390829] Out of memory: Kill process 30620 (jupyter-noteboo) score 36 or sacrifice child
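
The score in those messages comes from each process's oom_score, which the kernel computes mostly from how much memory the process uses. You can look at it for any process:

$ cat /proc/$$/oom_score      # $$ is the current shell; substitute any PID
$ cat /proc/$$/oom_score_adj

oom_score_adj lets you make a process more or less attractive to the OOM killer; setting it to -1000 exempts the process entirely.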

The strange thing here is that it keeps killing these Jupyter processes over a long stretch of time. My guess is that something restarts them after they're killed.
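
If Docker itself is the thing restarting them, the containers' restart policy would show it (just one possible check; the restarts could also come from a process supervisor inside the container):

$ docker inspect --format '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}' $(docker ps -q)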

The best thing to do here is to simply kill all the running Docker containers so that the box is usable again:

docker kill $(docker ps -q)
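
A way to avoid digging this hole in the first place is to cap how much memory each container may use when it's started. As far as I know this isn't what DHBox does out of the box; it's just an option Docker gives you (the image name is a placeholder):

$ docker run --memory=256m --memory-swap=256m <image>

With a hard limit, a runaway container gets OOM-killed on its own instead of dragging the whole box down with it.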

Things would look different if there were a swap partition on the disk; I might try such a setup as well. If you think about it, swap lets some of the memory used by processes be moved out to disk. That's a big win, because some parts of memory are rarely, if ever, accessed. Without swap, the kernel doesn't have that option.
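
For the record, adding a swap file is straightforward; something like this would do it (the size and path are just an example):

$ sudo fallocate -l 1G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
$ free -m

After that, free shows a non-zero Swap line and the kernel can push rarely used pages out to disk instead of thrashing the buffer cache. Of course, on EC2 that disk is EBS, so we're back to the IO cost trade-off mentioned at the start.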
