14 June 2016

Sysadmin tales: Analyzing slow database server

Update:
  • June 14th, 2016 6:01 PM UTC+7: using direct I/O or raw device access also bypasses filesystem caching, which avoids double caching. I dare to guess Oracle does its own caching, so if everything were cached we would have: device caching, the page cache, and Oracle's own cache. Direct I/O eliminates the page cache layer.


Dear readers

It's been a while since my last post, so I guess I'll try to share something that might be useful for performance monitoring. This is something I focus on these days, and I found it tightly related to OS knowledge. So please read on.

One day, I was urgently assigned by upper management to help identify a certain (unknown) problem in a Linux server. I quickly visited the site and did various checks.

The reported issue was that this server (a RHEL/CentOS one) was terribly slow at certain times. The server runs Oracle DB plus a certain CRM software, as a VM on top of VMware. So slow that you could barely do anything in an ssh session or interactive shell. And the admins reported the load average shooting to a whopping 300-400 ("What the f**k?" Yeah I know, I was shocked too. A load average of 300 for a few seconds would be fine, but for minutes?). The machine itself has 30 cores, by the way, so you can imagine this load is way too much.

My quick guess was that it was something related to I/O, to be specific disk read/write. Why? Because the I/O code path mostly works in an uninterruptible way. So once the calling process does a read, for example, it jumps into kernel space and keeps working until either it finishes or its scheduling interval runs out and it gets scheduled out.
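
As an illustration (this is just a generic sketch, not output from that server), tasks stuck in uninterruptible sleep show up with state "D" in ps, and they are counted in the load average even though they burn no CPU:

ps -eo state,pid,wchan:32,cmd | grep '^D'   # tasks in uninterruptible (D) sleep, plus what they are waiting on
cat /proc/loadavg                           # D-state tasks inflate these numbers too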

Note: If the I/O is done in asynchronous style, the calling process can quickly return and continue executing the next instruction. The work is taken over by an aio kernel thread, which triggers the callback mechanism back to the reading process once the read (or write) is finished. However, it is the aio thread that does the work now, so system-wise the load would be fairly the same. Only who does the work is different.
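
As a side note, you can get a rough idea of whether kernel-level async I/O is being used at all from these counters (values shown by them are per system, nothing here is specific to Oracle):

cat /proc/sys/fs/aio-nr       # async I/O requests currently allocated; 0 means nobody uses kernel AIO
cat /proc/sys/fs/aio-max-nr   # system-wide limit for such requests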

I checked various logs, /proc entries and, not to forget, dmesg. Fortunately, I saw a very enlightening hint there (it's not every day I get this kind of clue at first sight, to be honest):

INFO: task oracle:22763 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle D ffff81323c100600 0 22763 1 22779 22728 (NOTLB)
ffff810f0a3c3b88 0000000000000086 ffff812c3108e7b0 ffff81153c3ce100
ffff812c3108e7b0 0000000000000009 ffff812c3108e7b0 ffff8131ff133040
0000975e2fd08e7f 0000000000003e2c ffff812c3108e998 0000000c80063002
Call Trace:
[] do_gettimeofday+0x40/0x90
[] :jbd:start_this_handle+0x2e9/0x370
[] autoremove_wake_function+0x0/0x2e
[] :jbd:journal_start+0xcb/0x102
[] :ext3:ext3_dirty_inode+0x28/0x7b
[] __mark_inode_dirty+0x29/0x16e
[] do_generic_mapping_read+0x347/0x359
[] file_read_actor+0x0/0x159
[] __generic_file_aio_read+0x14c/0x198
[] generic_file_aio_read+0x36/0x3b
[] do_sync_read+0xc7/0x104
[] __dentry_open+0x101/0x1dc
[] autoremove_wake_function+0x0/0x2e
[] do_filp_open+0x2a/0x38
[] vfs_read+0xcb/0x171
[] sys_pread64+0x50/0x70
[] tracesys+0x71/0xdf
[] tracesys+0xd5/0xdf


Note: the above stack trace happens automatically when the kernel config CONFIG_DETECT_HUNG_TASK is enabled (=y), and it is enabled by default in RHEL/CentOS. So yes, this is an awesome feature and a life saver.
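
A quick way to confirm the detector is compiled in and to see (or tune) its timeout; the config path assumes the distro ships the kernel config under /boot, as RHEL/CentOS does:

grep CONFIG_DETECT_HUNG_TASK /boot/config-$(uname -r)
cat /proc/sys/kernel/hung_task_timeout_secs            # 120 seconds by default, matching the dmesg message
# echo 0 > /proc/sys/kernel/hung_task_timeout_secs     # would silence the warnings, not something I recommend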

What we can see is that it starts as a read operation (sys_pread). It then invokes the read operation of the virtual filesystem layer (vfs_read). A directory entry is opened along the way (dentry_open); this happens because before you can open and access a file, you need to locate the directory entry that points to it first.

Then the file is read in synchronous style. At this point, you might already suspect that this is the root cause. I partially agree. However, I am more concerned with the amount of data it is writing and how the Oracle database engine manages it.

OK, let's move on. After the usual marking of the inode as dirty (I am not sure about this; I guess it is dirty because the access time metadata is updated, hence the dirtied inode), journalling kicks in (jbd:journal_start), followed by fetching the current time (gettimeofday). And this is where things slow down.
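
If the inode really is being dirtied only because of access-time updates, mounting the data filesystem with noatime would avoid that on every read. A rough sketch, assuming /data is the Oracle data mount:

mount | grep /data                   # check the current mount options first
mount -o remount,noatime /data       # stop updating atime (and dirtying the inode) on reads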

First things first. Getting the local time shouldn't be that slow, even if you still count on the PIT; HPET or TSC should be very fast. So getting the time is out of the equation.
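
You can double check which clocksource the kernel actually uses (tsc and hpet are fast, pit/acpi_pm are the slow fallbacks); the sysfs path below exists on reasonably recent kernels:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource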

Journalling is the next thing to look at. Indeed, the filesystem Oracle puts its data on is ext3. So for every write (and, via the atime update we just saw, even for reads) the journal is updated, so that in case of an incident the changes can be replayed, avoiding data corruption as much as possible.
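
To see how the journal is set up on that filesystem (the device name below is an assumption; adjust it to whatever backs the data mount):

grep /data /proc/mounts                    # the data= journalling mode shows up here if set explicitly (ordered is the ext3 default)
dumpe2fs -h /dev/sda1 | grep -i journal    # confirms the filesystem has a journal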

Alright, so here is the early hypothesis: Oracle writes a huge amount of data, the filesystem gets stressed, and journalling tries to keep up at the same speed. So who's to blame?

In my opinion, for a database as sophisticated as Oracle, what it needs is to bypass the filesystem caching layer. That could be implemented by using the O_DIRECT flag when opening files for read/write, or by simply accessing the raw device (e.g. /dev/sda1 instead of /data, for example). I am not sure which one is doable here, but from the operating system's point of view, those are the options.
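
To get a feel for the difference, dd can do direct I/O straight from the shell; the file name and sizes below are made up, this is only a sketch to show the flags:

dd if=/dev/zero of=/data/ddtest bs=1M count=1024 oflag=direct   # write bypassing the page cache
dd if=/data/ddtest of=/dev/null bs=1M iflag=direct              # read bypassing the page cache
grep -i dirty /proc/meminfo                                     # with direct I/O, Dirty should stay low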

Also, it is well worth trying the noop I/O scheduler instead of CFQ or deadline. The premise is the same: instead of letting the kernel's I/O scheduler reorder the way Oracle works toward the disk, why not just let Oracle decide what is best for itself? Noop does precisely this. It still does some I/O merging (front and back merges), but it doesn't do I/O prioritization, timeouts, etc. So it feels almost like a plain "I want to do it now!" and the Linux kernel simply says "be my guest".
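
Checking and switching the scheduler is a one-liner per block device (sda below is an assumption; repeat for whichever devices hold the Oracle files):

cat /sys/block/sda/queue/scheduler           # the active scheduler is shown in [brackets]
echo noop > /sys/block/sda/queue/scheduler   # switch to noop on the fly, no reboot needed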

Another fact I found, which is also shocking, is this (taken from /proc/meminfo):
PageTables: 116032004 kB

That's around 110 GiB, ladies and gentlemen, just to hold page tables! And don't forget that those page tables need to stay in RAM and can't be swapped out. So, simply speaking, 110 GiB of your RAM can't be used by your applications, and it might be your applications that get kicked out to swap in the event of a shortage of free memory (before the Out of Memory killer kicks in).

Thus, although not the primary root cause, this must be solved too. The solution is to use huge pages. This is the term for using a 2 MiB or 4 MiB page size (depending on the kernel's paging setup) instead of the standard 4 KiB the kernel uses. Let's assume we later use a 2 MiB page size and all allocated pages fit into 2 MiB pages without fragmentation; then we could squeeze the PageTables usage by a 512:1 ratio (2 MiB / 4 KiB = 512), or down to roughly 0.2 GiB. Sounds good, right?
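
A minimal sketch of reserving huge pages from the OS side (the number of pages here is just an illustration; in real life it should be sized to Oracle's SGA, and Oracle itself must be configured to actually use them):

grep -i hugepages /proc/meminfo        # HugePages_Total/Free and Hugepagesize (2048 kB on x86_64)
sysctl -w vm.nr_hugepages=4096         # reserve 4096 x 2 MiB = 8 GiB worth of huge pages
# echo "vm.nr_hugepages = 4096" >> /etc/sysctl.conf   # make the reservation persistent across reboots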

The full story might be boring, so hopefully I have summarized it well enough as a light case study.

Cross-check results (by other teams)
  1. The related team confirms that using asynchronous I/O together with direct I/O within Oracle relieves the load stress.
  2. Huge pages are going to be tested. I'm not sure how that really went since I haven't been updated, but I am sure it will be helpful too.

Until next time, cheerio! :)

regards,

Mulyadi Santosa
