16 April 2018

Fixing failed FreeIPA web UI login

Dear readers

This time I met interesting case, thus I think it is worth sharing it with you all. But first, big disclaimer. I AM NOT FREE IPA EXPERT AT ALL. The investigation done here simply use common trouble shooting sense in UNIX/Linux world, no more, no less.

A calm weekend suddenly turned into detective work. A team mate contacted that he was unable to login to Free IPA web UI. The OS is CentOS 7 64 bit. For those who don't know what Free IPA is, think it's like LDAP-turned-into-ActiveDirectory-ish. What? You don't know what Active Directory is? You're not alone, me too :)

The fact I got was this. Two Free IPA instances were setup on two different virtual machines, say A and B. On A, my mate could log in. Whereas on B, he couldn't. He insist that he did the same exact steps to configure IPA. So what's wrong?

Since we're dealing with computer, we're dealing with basic logic concept: if you setup same software (and same version) using same exact steps on same exact OS version, library and hardware, then the result MUST be the same. There is no exception here. Hardware flaws? Ok, that might screw things, but I am talking about the general logic. We are on the same page up to this point, right?

There were also clue. On httpd error_log, I saw:
[:error] [pid 13588] [remote xx.xx.xx.xx:yy] mod_wsgi (pid=13588): Exception occurred processing WSGI script '/usr/share/ipa/wsgi.py'.
[:error] [pid 13588] [remote xx.xx.xx.xx:yy] Traceback (most recent call last):
[:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/share/ipa/wsgi.py", line 51, in application [:error] [pid 13588] [remote xx.xx.xx.xx:yy] return api.Backend.wsgi_dispatch(environ, start_response)
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 262, in __call__
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] return self.route(environ, start_response)
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 274, in route
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] return app(environ, start_response) [:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 929, in __call__
[:error] [pid 13588] [remote xx.xx.xx.xx:yy] self.kinit(user_principal, password, ipa_ccache_name) [:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 965, in kinit
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] pkinit_anchors=[paths.KDC_CERT, paths.KDC_CA_BUNDLE_PEM],
[:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/lib/python2.7/site-packages/ipalib/install/kinit.py", line 125, in kinit_armor
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] run(args, env=env, raiseonerr=True, capture_error=True)
 [:error] [pid 13588] [remote xx.xx.xx.xx:yy] File "/usr/lib/python2.7/site-packages/ipapython/ipautil.py", line 512, in run
[:error] [pid 13588] [remote xx.xx.xx.xx:yy] raise CalledProcessError(p.returncode, arg_string, str(output))
 [:error] [pid 15892] [remote xx.xx.xx.xx:yy] CalledProcessError: Command '/usr/bin/kinit -n -c /var/run/ipa/ccaches/armor_15892 -X X509_anchors=FILE:/var/kerberos/krb5kdc/kdc.crt -X X509_anchors=FILE:/var/lib/ipa-client/pki/kdc-ca-bundle.pem' returned non-zero exit status 1

Later, I found out that I just have to focus on that last line that said the kinit thingy. But before that, I do the usual mantra. I compare several latest modified file on A and use diff to see how it is compared to B. The find command is more or less:

sudo find / -path /proc -prune -o -path /dev -prune -o -path /sys -prune -o -path /run -prune -o type f -printf "%T@ %T+ %p\n" | sort -nr | less

 That way, I got sorted list of files, starting from recently modified ones. This is really manual work. I looked over files named .conf or similar. I ended up suspect two files:

I edited those files on B so the contents mimic the ones on A. The diffs are:
$ diff /etc/sysconfig/authconfig{.orig,}

$ diff /etc/nsswitch.conf{.orig,}
< hosts:      files dns myhostname
> hosts:      files dns

Finger crossed. Login again. F**k! Still not working.

Then I googled and found I need to install mod_auth_kerb rpm. I did that, but somehow after yum install, my hunch told me it was not the root cause. Okay. Comparing the enabled services on A and B. I found A have realmd, B not. Installing realmd, enabling and starting it. Also restarting ipa, certmonger and httpd. S**t, nothing changes. Comparing the installed rpms on A and B really bored me. So I tossed them out of the game.

I stepped back and checked the above call trace. Googling drove me to this site. It mentions that you need to make sure your CentOS is up-to-date, since there is bug in the IPA components. Great. But yum upgrade said it had all latest update. lsb_release also confirmed that A and B had same release number. Almost dead end here.....

Back to kinit error message above, I confirmed that the file exist. Ok, did nasty experiment. What if I removed it? back it up first somewhere else, trigger the login. Same call trace. man kinit clearly mentions that -c means it will create the file in the said directory. But none there. Something is really wrong here, but not exposed by error log.

How do you deal with error that is not seen? I chose to use strace. The pid in the error log was actually pid of wsgi process, so logically it was indeed the responsible process that invoked kinit.

ran strace and hook it to process id of wsgi. Something like:
strace -p 13588 -ff -o trace.txt

Refresh the web UI, wait a bit, login, boom , failed. Then stop strace.

Strace wrote several file named trace.txt suffixed with numbers. What we're going to see here? Wild guess, let's see if it fails to open or read some files (likely libraries):
grep open trace.txt*

Kazaaammmm. Lucky hit. I saw several EPERM (unable to read due to permission). The files are:

ls confirms that "other" has no permissions at all upon them. And on A, they have at least "r" on "other". Thus, simply run chmod o+r on them.

restart httpd, certmonger, ipa again. Login. F***k**f success dude!!!

Moral of the story: Complicated problem sometimes demands really simple solution. You just need to open your mind for many possibilities and know how to trim them down to limited suspect. If not, you will tangle yourself with endless headache.

Until next time, ciao....



How to execute multiple commands directly as ssh argument?

 Perhaps sometimes you need to do this: ssh user@ ls It is easy understand the above: run ls after getting into via ssh. Pi...