What happens when you run out of memory in Linux?
I've always been curious about what happens when you run out of memory in Linux, and recently I was experimenting with something that helped me find out.
I was trying out a DHBox deployment on an EC2 machine. DHBox lets you start a virtual environment in which you can try out various data science tools. These virtual environments run as Docker containers. The containers take a lot of memory, so you can't run too many on a single machine. So far so good. But let's see what actually happens.
Before we start, here is a great document from the Netflix Eng blog that describes the tools that can be used to debug a slow Linux box.
I deployed DHBox on an Ubuntu image in EC2 with 1GB of memory and no swap file. Yes, no swap file. Why? It seems that's the default because a swap file would incur a lot of EBS I/O, which drives up the cost. But still, this makes for an interesting debugging scenario, so let's continue.
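By the way, if you want to confirm that a machine really has no swap configured, you can ask swapon to list the active swap areas; empty output means there are none:
$ swapon --show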
Here's what our free memory situation looks like in the beginning:
$ free -m
total used free shared buff/cache available
Mem: 990 78 518 4 392 867
Swap: 0 0 0
Notice how we have 392MB in "buff/cache". What is this? Here is a good explanation of it. The buffer cache keeps, in RAM, data that lives on disk. For example, the "ls" command or the glibc library are used all the time, which makes them good candidates for caching.
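If you want to see this caching at work, time reading the same file twice; the second read is served from RAM and is much faster (the path below is just an example, any large-ish file will do):
$ time cat /usr/bin/docker > /dev/null    # first read hits the disk
$ time cat /usr/bin/docker > /dev/null    # second read comes from the page cache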
The first thing I do is start the Python web app for DHBox. It's a simple Flask app running on gunicorn. After I start it, this is what free is showing:
$ free -m
total used free shared buff/cache available
Mem: 990 127 453 4 410 818
Swap: 0 0 0
Reasonable. We're using a bit more memory, and we've cached a bit more data from disk.
Now I'll start running vmstat and launch four virtual labs. This seems to be enough to use up the 1GB of memory and pretty much bring down the machine. vmstat is great, it outputs a lot of useful information. Let's see what it looks like before we start running the Docker containers:
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 463880 51328 369012 0 0 635 32 94 231 2 1 95 1 1
0 0 0 463356 51336 369044 0 0 0 2 36 68 0 0 100 0 0
0 0 0 462884 51372 369116 0 0 14 2 46 69 1 0 99 0 0
0 0 0 462380 51372 369116 0 0 0 0 43 101 0 0 100 0 0
The telling part here is the id column. That's the percentage of time the CPU spends idle. As you can see, the CPU is pretty idle. Not much to do.
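If you want a second opinion on CPU idleness, mpstat from the sysstat package breaks it down per CPU (assuming sysstat is installed; it's not part of this setup):
$ mpstat -P ALL 5 1
# %idle here should roughly match the id column from vmstat.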
We're then starting the first Docker container. Things spike up for a while. Let's observe vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 462380 51372 369116 0 0 0 0 43 101 0 0 100 0 0
# Here's when we start the Docker container.
2 0 0 185252 56072 497736 0 0 9159 8016 2126 7829 22 14 43 21 0
1 0 0 95416 56576 525740 0 0 1924 1036 396 1467 51 13 30 7 0
1 0 0 147764 56616 527408 0 0 336 630 117 307 1 1 96 1 0
0 0 0 65604 56664 537452 0 0 0 82 81 494 3 3 94 0 0
0 0 0 65556 56676 537500 0 0 0 59 59 181 0 0 99 1 0
0 0 0 65524 56684 537504 0 0 0 58 56 179 0 0 99 1 0
0 0 0 65400 56692 537552 0 0 0 4928 138 231 0 0 98 2 0
0 0 0 65400 56700 537552 0 0 0 3187 107 201 0 0 98 2 0
Interesting. As you can see, the id column value decreases. Less idleness, since we're starting things. But after a while, the Docker container has started and we're back to things being quiet.
Let's see what free is saying:
$ free -m
total used free shared buff/cache available
Mem: 990 350 59 10 580 569
Swap: 0 0 0
So, used went up from 127MB to 350MB. The buffer cache also went up. Less available memory.
Let's start the second Docker container. We're looking at vmstat again.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 61304 56740 537872 0 0 0 56 83 227 0 0 99 1 0
# Starting second container here.
1 1 0 126516 59292 341252 0 0 7109 8100 823 3255 22 15 56 6 0
1 0 0 47572 59904 337152 0 0 9186 7515 882 2316 51 13 17 19 0
0 1 0 153212 59940 307720 0 0 1094 2122 214 557 2 3 92 3 0
0 0 0 92848 59996 303368 0 0 556 149 147 690 2 3 93 1 0
0 0 0 92848 60004 303368 0 0 0 108 91 311 0 0 99 0 0
0 0 0 92724 60012 303376 0 0 0 102 94 326 0 0 98 1 0
0 0 0 88296 60028 305048 0 0 320 102 147 424 2 1 96 2 0
0 0 0 77228 60364 314520 0 0 1876 144 306 868 28 4 64 4 0
0 0 0 86800 60372 304968 0 0 26 387 111 372 1 1 97 1 0
0 0 0 86032 60780 305120 0 0 110 119 150 399 0 0 98 2 0
Similar situation. A spike in I/O, a spike in non-idle CPU percentage, followed by quiet. Let's look at free:
$ free -m
total used free shared buff/cache available
Mem: 990 549 84 17 356 371
Swap: 0 0 0
More used memory. But notice how this time the buff/cache number went down. Why? Because our running processes are using more memory, and that memory has to come from somewhere. The kernel frees memory from the buffer cache and gives it to the processes. Again, reasonable behavior.
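The numbers free reports come from /proc/meminfo, so you can watch this reclaim happen directly (a quick check, your exact values will differ):
$ grep -E 'MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo
# Buffers and Cached shrink as the kernel reclaims cache memory for processes.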
Let's start the third and fourth Docker containers and see how vmstat looks after that.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 46052 11188 160564 0 0 0 151 153 532 0 1 97 2 0
0 0 0 39828 11336 166664 0 0 1246 141 235 634 1 1 95 4 0
0 0 0 39704 11344 166664 0 0 0 149 158 542 0 1 97 2 0
# After we start the fourth container.
6 3 0 18840 11668 179484 0 0 18678 7726 2711 7721 5 8 46 41 0
1 4 0 8964 2796 103108 0 0 38137 2508 4198 4872 37 21 0 41 1
0 11 0 9576 588 101548 0 0 68988 485 4097 6006 1 10 0 88 0
0 8 0 9352 152 100752 0 0 67797 250 2463 4497 0 8 0 90 1
....
1 18 0 9508 248 100560 0 0 63568 138 1700 3935 0 8 0 91 1
0 14 0 8656 400 100436 0 0 63154 68 1580 3766 0 8 0 91 1
0 19 0 8852 288 100552 0 0 63711 128 1606 3705 0 8 0 92 1
1 47 0 10232 372 98616 0 0 63410 94 1720 4316 0 7 0 91 1
2 50 0 9648 420 99604 0 0 63126 77 1696 4425 0 8 0 91 1
0 40 0 10028 272 99432 0 0 63692 74 1681 4506 0 7 0 91 1
0 75 0 9944 160 98220 0 0 63342 59 1724 5150 0 10 0 89 1
3 34 0 8952 296 100400 0 0 63575 110 1958 4663 0 8 0 91 1
1 5 0 9004 1716 117640 0 0 62704 123 2674 5329 3 9 0 88 1
This is where things start to get bad! The box becomes very unresponsive. Typing a simple command like ls takes seconds to execute. Let's take a peek at free and we'll continue our analysis after that:
$ free -m
total used free shared buff/cache available
Mem: 990 851 20 24 119 54
Swap: 0 0 0
As you can see, we barely have any free memory, and the buffer cache has shrunk even more. But notice what's going on in vmstat. The bi column (blocks received from a block device) is stuck at a constant high value. This means we're doing a lot of disk reads. Why is that? Are our processes doing a lot of disk operations? No. Our many processes are executables that live on disk. When the kernel executes them, it has to read their instructions. If we have enough buffer cache, those instructions can stay in memory and don't have to be read again. However, our buffer cache is tiny now. So when a process runs, the kernel has to pull its instructions from disk. It probably stores them in the buffer cache, but when the next process runs, its pages get cached too and evict what was cached for the previous process.
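One way to confirm this kind of thrashing (a suggestion; I didn't capture this output during the experiment) is to look at major page faults, which count how many times a process had to wait for a page to be read back from disk:
$ ps -eo pid,maj_flt,comm --sort=-maj_flt | head
# maj_flt climbing across many processes means executable pages keep
# getting evicted and re-read from disk.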
Are we done? Not yet. While this whole sad mess is happening, the kernel runs something called the OOM killer. How do we know that's what's going on? We can use dmesg and view the kernel messages:
$ dmesg | grep "Out of memory"
[ 3635.537538] Out of memory: Kill process 21580 (jupyter-noteboo) score 37 or sacrifice child
[ 3636.822607] Out of memory: Kill process 22714 (jupyter-noteboo) score 37 or sacrifice child
[ 3643.006328] Out of memory: Kill process 24976 (jupyter-noteboo) score 37 or sacrifice child
[ 3654.916468] Out of memory: Kill process 26118 (jupyter-noteboo) score 37 or sacrifice child
[ 3658.712286] Out of memory: Kill process 28364 (jupyter-noteboo) score 35 or sacrifice child
[ 3666.654763] Out of memory: Kill process 30603 (jupyter-noteboo) score 36 or sacrifice child
[ 3685.390829] Out of memory: Kill process 30620 (jupyter-noteboo) score 36 or sacrifice child
The strange thing here is that it keeps killing these Jupyter processes over and over. My guess is that something restarts them after they're killed.
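The kernel picks its victim based on a per-process score, which you can inspect yourself (a generic example; $PID is whatever process you're curious about):
$ cat /proc/$PID/oom_score      # higher score means more likely to be killed
$ cat /proc/$PID/oom_score_adj  # tunable from -1000 (never kill) to 1000 (kill first)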
The best thing to do here is to simply kill all the running Docker containers so that the box is usable again:
docker kill $(docker ps -q)
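A better long-term fix (a suggestion, not something I did in this experiment) is to cap how much memory each container can use, so one lab can't take down the whole box; the 256m limit below is just an example value:
$ docker run --memory=256m --memory-swap=256m <image>
# Setting --memory-swap equal to --memory disallows swap use, so a runaway
# container gets OOM-killed on its own instead of starving the host.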
Things would look different if there were a swap partition on the disk. I might try such a setup as well. If you think about it, swap allows some of the memory used by processes to be moved to disk. That's a great win, because some pages are rarely, if ever, accessed. Without swap, there's no such option.
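For reference, adding a swap file takes only a few commands; this is a generic sketch (the 1GB size and /swapfile path are just example choices), and keep in mind the EBS I/O cost mentioned earlier:
$ sudo fallocate -l 1G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
$ free -m   # the Swap line should now show roughly 1024 total
# To make it permanent, add "/swapfile none swap sw 0 0" to /etc/fstab.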