VINTELA AUTHENTICATION SERVICES Troubleshooting Training, Level I Last printed 10/26/2006 3:07:00 PM
VAS Troubleshooting Training, Level I

1: Outline and Purpose
2: Overview
    LDAP
    Kerberos
    NSS
    PAM
    LAM
    NIS
3: Using the vastool Application and Common Commands
4: The Steps of a vastool join
5: Integration Points
6: Files Installed, Location, and Purpose
7: Unix Personality Mode
8: The vasd Daemon Process
9: Troubleshooting: Kerberos
10: Troubleshooting: Loading Information into the Cache from AD
11: Troubleshooting: Providing Information to the System from the Cache
12: Troubleshooting: Authentication Issues
13: Troubleshooting: NIS
14: Finding and reporting bugs

File: VAS Troubleshooting Training Page: 2
1: Outline and Purpose

This document is meant for front line support: people who support multiple systems and programs, and need a primer on some basic issues and resolutions when dealing with the VAS product. For issues that are beyond this document, it describes what information to gather for those who will do advanced troubleshooting (outlined in the second document in this series). It also includes an overview of the major components VAS uses to provide AD (Active Directory) information and authentication to the *nix system. The focus is on maintaining an existing installation of VAS that previously worked, though an overview of the join process is given. It does not require much understanding of debug logs; only a few specific items from debug output are used.

2: Overview

LDAP: Lightweight Directory Access Protocol. This is how external machines access information stored in Active Directory. Search options include a filter, base, URI, and depth. For VAS, the main use is searching AD with LDAP search filters, as defined in RFC 2254 ( http://www.faqs.org/rfcs/rfc2254.html ).

Kerberos: The mythical three-headed dog guarding the gates of hell. Or in this case, a set of protocols and specifications for securely authenticating using a trusted third party. Here the third party is an AD KDC (Key Distribution Center; each DC in AD is one), which is trusted by both the user and the service the user is accessing (like logging into a computer). For troubleshooting, commands will be given that help verify this trust relationship is set up properly.

NSS: Name Service Switch. Used on a *nix system to obtain user/group information. getpwnam and getgrgid are examples of the relevant library calls. This provides Identity: who a user is, information about the account, group memberships. NSS can also provide other information, like services, hosts, netgroups, etc. The default location for the information is the files backend (backend is the name for a database source for NSS).
The files backend includes files such as /etc/passwd for users, /etc/group for groups, /etc/services for services information, and so on. Other possible repositories are NIS, LDAP (using an LDAP NSS implementation like PADL's), vas3 (VAS's nss module) or any other custom NSS backend.

PAM: Pluggable Authentication Modules. This provides Authentication: is a user who they claim to be, and should they have access to this system? Controlled by /etc/pam.conf on the Unix systems, and by files in the /etc/pam.d/ directory on Linux. When an authentication happens it has a service name, and that name is matched to an entry in /etc/pam.conf, or to a file with the same name in /etc/pam.d/. If there is no match, the OTHER or other entry is used for processing.

LAM: Loadable Authentication Module. Also known as I&A, Identification and Authentication. AIX's nss/pam implementation. Controlled mainly by the file /etc/security/user, by adding VAS to the SYSTEM line of the default: stanza. /usr/lib/security/methods.cfg is where the library location for VAS is specified.

NIS: Network Information Service. A network-based backend for NSS. Normally a system uses the files backend, which consists of local files on the machine such as /etc/passwd, /etc/group, and /etc/services. NIS allows that information to come from a central network location, removing the need to sync multiple local files when information changes.

3: Using the vastool Application and Common Commands

/opt/quest/bin/vastool is a general tool for setting up, maintaining, testing, and using a VAS installation. General format:

    /opt/quest/bin/vastool [authentication] <command> [options]

For example, here vastool is used to do an AD query of LDAP information:

    /opt/quest/bin/vastool -u host/ search "(cn=test_user)"
This is using the host/ object, also known as the computer object. It consists of a Computer object in AD, with a proper SPN (servicePrincipalName) that corresponds to the *nix machine's hostname/FQDN, and a local Kerberos keytab, /etc/opt/quest/vas/host.keytab. The keytab is the key to unlocking tickets meant for the computer object's service. Every machine can use the word host/. VAS takes that and makes a proper SPN from it by filling in host/<fqdn>@<domain>. The host.keytab file is readable only by root, so when using -u host/ for authentication on a vastool command, root access is needed.

Next is the command, search. It is used to query LDAP information from AD. It requires a valid search filter, here (cn=test_user). That says to return any object in AD whose cn attribute (short for common name; the actual attribute name is cn) is exactly test_user. Some examples of other valid searches:

    (|(cn=test_user)(uid=test_user))
Searches for anything with either a cn or a uid of test_user.

    (&(objectcategory=person)(userprincipalname=test_user@testdomain.com))
Searches for any user with the specified User Principal Name.

The first argument of a search command (aside from flags) is the filter. Anything after that is interpreted as attribute names to return. If none are given, all attributes are returned. If only specific ones are required, they can be listed like so:

    vastool -u host/ search "(cn=test_user)" uidnumber gidnumber gecos unixhomedirectory logonshell

That will return the unix attributes if they are enabled for the test_user object. The search command is useful for verifying information in AD when dealing with information not being cached.

Another useful command is the list command.

    vastool list -c user <username>
    vastool list -f user <username>
    vastool list users
    vastool list -c group <groupname>
    vastool list -f group <groupname>
    vastool list groups
These all pull from VAS's local cache of information. With -c, the result comes only from the cache. With -f, the entry is forced to update with the latest AD information. With neither, internal algorithms determine whether it is cache only or updated from AD, depending on the age of the entry and who is calling. Only root can use the -f option. This is useful for issues where information isn't provided to the system: the command helps determine whether VAS even knows the information in question.

The last command for troubleshooting at this time (there are many more; see vastool -h or man vastool) is the nss command.

    vastool nss getpwnam <username>

This command gives an interface for asking the system about users through NSS. These are the same calls programs use, so if a program doesn't know about a user, the nss command shows whether the system knows about the user. The possibilities are: getpwnam <user name>, getgrnam <group name>, getpwuid <user's uid>, getgrgid <group's gid>, getgrent, getpwent, and on AIX only, getgrset <username> for a list of the GIDs of the groups the user is a member of. These commands should return /etc/passwd and /etc/group style entries:

    <name>:<password hash>:<uid>:<gid>:<gecos>:<homedir>:<shell>

If the password hash field is not VAS, then something else is likely providing the information, such as an /etc/passwd entry or NIS. If it is *LK* or a hash, then VAS could still be providing it, with the account locked or with VAS configured to provide password hashes to NSS.
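The password-field check described above can be sketched in a few lines of Python. This is only an illustration, not part of VAS: the function names and the returned labels are made up for this example, and the sample entries in the test are hypothetical.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    """One /etc/passwd style entry, as printed by 'vastool nss getpwnam'."""
    name: str
    password: str
    uid: int
    gid: int
    gecos: str
    homedir: str
    shell: str


def parse_passwd_entry(line: str) -> PasswdEntry:
    """Split a <name>:<password hash>:<uid>:<gid>:<gecos>:<homedir>:<shell> line."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields, got %d" % len(fields))
    name, password, uid, gid, gecos, homedir, shell = fields
    return PasswdEntry(name, password, int(uid), int(gid), gecos, homedir, shell)


def likely_provider(entry: PasswdEntry) -> str:
    """Guess who answered the NSS request, based only on the password field.

    The labels are hypothetical names for the cases described in the text.
    """
    if entry.password == "VAS":
        return "vas"
    if entry.password == "*LK*":
        # Could be VAS with a locked account, or another backend.
        return "vas-locked-or-other"
    # A hash or placeholder: /etc/passwd, NIS, or VAS providing hashes.
    return "other-or-vas-with-hashes"
```

A support script could feed the output of the nss command to parse_passwd_entry and print likely_provider as a first hint about which backend to look at.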
4: The Steps of a vastool join

This section can be considered optional. It helps show how VAS pulls together the various parts to provide the AD -> *nix functionality. Here is a basic join:

    sethe:/home/sethe # /opt/quest/bin/vastool -u administrator -w Test1234 join -f baka.dev
    Checking whether computer is already joined to a domain... no
    Configuring forest root... baka.dev... OK
    Configuring site... Default-First-Site-Name... OK
    Selecting server to use for join... baka-dc.baka.dev... OK
    Stopping VAS client daemon: vasd... OK
    Joining computer to the domain as host/sethe.vintela.com... OK
    Joined using computer object "CN=sethe,CN=Computers,DC=baka,DC=dev"... OK
    Writing vas.conf... OK
    Populating misc cache... OK
    Detecting Schema Configuration... OK
    Preparing to apply Group Policy... OK
    Applying VAS Related Group Policy Settings... OK
    Loading users cache:... OK
    Loading groups cache:... OK
    Loading Domain Info cache:... OK
    Configuring Name Service Switch... OK
    Configuring PAM Authentication... OK
    Starting VAS client daemon: vasd... OK
    sethe:/home/sethe #

Step by step:

    Checking whether computer is already joined to a domain... no
Checking current status. If the machine isn't joined, then the -f flag (force) isn't needed. If it is joined and -f is not specified, the join will fail.

    Configuring forest root... baka.dev... OK
VAS needs to know the forest root, as certain information can only be found there, like a complete list of domains, and some server specific information.

    Configuring site... Default-First-Site-Name... OK
VAS will follow sites. Sites are a programmatic way of determining which server to communicate with. They are set up in AD, and link subnets to specific servers.

    Selecting server to use for join... baka-dc.baka.dev... OK
vastool tries to pick a global catalog, as some information is needed from there; if the host object were created on a different server, the global catalog might not have the object yet.
    Stopping VAS client daemon: vasd... OK
As the information is changing, the vasd process is stopped.

    Joining computer to the domain as host/sethe.vintela.com... OK
Information on the SPN (service principal name) being used.

    Joined using computer object "CN=sethe,CN=Computers,DC=baka,DC=dev"... OK
Location of the created computer object in AD. The keytab is written to /etc/opt/quest/vas/host.keytab.

    Writing vas.conf... OK
This takes information from the join and puts it in vas.conf, including any command line settings (workstation mode, UPM, search bases, etc.).

    Populating misc cache... OK
A cache of information, like the domain joined, forest root, site, and other information in vas.conf, stored in a format with faster access than vas.conf.

    Detecting Schema Configuration... OK
Where VAS decides whether to use RFC 2307 or SFU for the unix attributes for users.

    Preparing to apply Group Policy... OK
Testing for VGP.

    Applying VAS Related Group Policy Settings... OK
Running vgptool apply before loading users/groups, so any settings configured through AD GPOs are applied.

    Loading users cache:... OK
This is running vasd, something like /opt/quest/sbin/vasd -xugs (u for users, g for groups, and x for domain info).

    Loading groups cache:... OK
Continuation of above.

    Loading Domain Info cache:... OK
Continuation of above.

    Configuring Name Service Switch... OK
Equivalent to running vastool configure nss; this adds vas3 to /etc/nsswitch.conf on the passwd and group lines.

    Configuring PAM Authentication... OK
Equivalent to vastool configure pam; this adds a pam_vas.so entry for auth, session, password, and account.

    Starting VAS client daemon: vasd... OK
Starts up the vasd process in daemon mode, just like running /etc/init.d/vasd start.

5: Integration Points

This section lists the local files VAS modifies to integrate with a machine, with a bit of explanation. For most systems, these are /etc/nsswitch.conf and /etc/pam.conf.

In /etc/nsswitch.conf, vastool join (or vastool configure nss) adds the word vas3 after the files entry, before any other entries. Location is important: it determines which modules are asked first. If an entry is not found, then the next module is tried. It is not recommended that VAS be first. VAS will try very hard to find users it doesn't have cached. If the module that has the information comes after VAS, then VAS will do unnecessary work attempting to locate those users/groups. For example, if there are still a number of NIS users on the system, then a setting of

    passwd: files nis vas3

should give the best performance.

In /etc/pam.conf, there are sections for the individual services. On Linux, files are made in the /etc/pam.d/ directory, each one named for a service. Also on Linux, most services drop through to a central point, either /etc/pam.d/system-* or /etc/pam.d/common-* depending on Red Hat or SUSE. The exception is other, which is denied by default. If a new application is not working, it might need its own file in /etc/pam.d/. At this level of troubleshooting, all that needs to be checked is whether there is a pam_vas entry for the service being used.

On AIX, /etc/security/user is modified, adding VAS to the SYSTEM = line, and 3 lines for the VAS library are added to /usr/lib/security/methods.cfg.
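The ordering rule for /etc/nsswitch.conf can be checked mechanically. The following Python sketch reads the text of an nsswitch.conf file and reports where vas3 sits for a given database. It is a minimal illustration under stated assumptions: the function names and the "missing" / "first" / "ok" labels are invented for this example, and the parser only handles the common one-line-per-database layout.

```python
import re
from typing import List


def nss_sources(nsswitch_text: str, database: str) -> List[str]:
    """Return the ordered source list for one database (e.g. 'passwd')."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        db, _, rest = line.partition(":")
        if db.strip() == database:
            # Remove action items such as [NOTFOUND=return].
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


def vas3_placement(sources: List[str]) -> str:
    """Classify where vas3 sits in a source list (hypothetical labels)."""
    if "vas3" not in sources:
        return "missing"
    if sources[0] == "vas3":
        return "first"      # not recommended, per the text above
    return "ok"
```

For the example setting above, nss_sources returns files, nis, vas3 in that order, and vas3_placement reports ok.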
6: Files Installed, Location, and Purpose

For the first document in this series, only basic locations will be listed. The init scripts for controlling the daemons are placed in /etc/init.d/ (/sbin/init.d on HP, and /etc/rc.d/init.d/ on AIX). The /etc/opt/quest/vas directory holds configuration information and the keytab files for a VAS installation. Executables, daemons, libraries, and helper programs are placed in /opt/quest/bin, /opt/quest/sbin, /opt/quest/lib, and /opt/quest/libexec respectively.

7: Unix Personality Mode

With the R2 version of Windows 2003, and the inclusion of the RFC 2307 schema extensions, it is possible to have posix accounts (Personalities) for users and groups. These allow a separation of the actual AD user account from the unix user information. With this, a single AD user can have multiple sets of unix information. These accounts are separated into OUs (organizational units), and a *nix server is joined specifying a specific OU. On that machine, a user has the identity available from that OU. The main purpose is migrating information from multiple NIS domains before all the information is rationalized across the different domains (e.g. bob had a uid of 612 on the DB machine and 532 on the backup server, and needs to keep both while using the same password from a single AD account). Any given *nix machine can only see the information from the joined OU (Primary Container) and any Secondary containers.

8: The vasd Daemon Process

All updates to the cache for VAS happen through the vasd process. It needs to be running to respond to requests for new/updated information, and for authentications to go to AD instead of being handled in disconnected mode.
It runs in a parent/child model. vasd, once started, forks. The child process reads host.keytab, and then drops permissions to daemon. The parent monitors the child, and handles maintenance tasks. The child handles requests from nss_vas and pam_vas to update information. To make sure it is running properly, run

    ps -ef | grep vas

This should show all the vas-related processes, including vasd. There should be only two vasd processes, one a child of the other: the parent running as root, and the child running as daemon. There should be no defunct processes. If there are any issues, vasd can be restarted with the command:

    /etc/init.d/vasd restart        (Linux, Solaris)
    /etc/rc.d/init.d/vasd restart   (AIX)
    /sbin/init.d/vasd restart       (HPUX)

If that doesn't resolve the issue, escalate.

9: Troubleshooting: Kerberos

The first thing to check on any machine with issues (beyond checking vasd) is the following command (run as root):

    /opt/quest/bin/vastool -u host/ auth -S host/

This command verifies many things:

- Time is properly synced with AD. Kerberos is time-sensitive, and the clocks cannot be off by more than 5 minutes.
- The computer object in AD exists.
- The host.keytab authenticates the host/ service to AD.
- The computer object is not disabled, and can authenticate.
- The computer object has a valid SPN entry.
- A ticket for host/ as a service can be obtained.
- The ticket can be unlocked by the host.keytab entry for the service.
- The computer object service can handle an authentication.

If the command fails for any of those reasons, it can stop the local *nix machine from working properly, dropping it into disconnected mode.
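The time requirement in the list above comes down to a simple comparison. As a minimal sketch in Python, assuming the domain controller's current time has already been obtained by some other means (how to fetch it is outside this example, and the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Maximum clock difference stated in the text above.
MAX_SKEW = timedelta(minutes=5)


def within_kerberos_skew(local_time: datetime, dc_time: datetime,
                         max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than max_skew.

    Both values should be timezone-aware (for example UTC) so the
    subtraction compares the same instant scale.
    """
    return abs(local_time - dc_time) <= max_skew


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical DC clock that is three minutes ahead of this machine.
    print(within_kerberos_skew(now, now + timedelta(minutes=3)))
```

If this kind of check fails, fix time synchronization first; re-joining does not help with clock problems.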
Disconnected mode means the machine can only work off of what is currently cached. Any changes to AD will not be reflected. Authentication will only work for users who either logged in before (a SHA hash of the password is stored for this purpose), or are set up in vas.conf under the perm-disconnected-users entry.

Except for syncing the time, all of these issues can be handled by re-joining the server to AD. If that is not possible, the next best option is having someone with AD access reset the computer object, by right-clicking on it in AD and selecting Reset Account. Then run the following command:

    /opt/quest/bin/vastool -u host/ -w `hostname | cut -d. -f1` passwd -rk /etc/opt/quest/vas/host.keytab

This won't fix issues with the SPN, as without permissions those can't be changed. It will fix issues where the keytab is out of sync with AD (preauth failed messages).

10: Troubleshooting: Loading Information into the Cache from AD

All of this troubleshooting centers around information and authentication. There are 5 major parts:

1) Information is in AD.
2) Information is queried from AD, and put into the local cache.
3) The information in the local cache is made available to the system.
4) Authentication.
5) System allowing the user in.

This section covers parts 1 and 2, and how to determine if that is where the issue is. The first step is to determine if the information is available to the system.

    vastool nss <command relevant to the information>

Since we are in this section, it doesn't show up.
So the next step is to see if the information is in the cache.

    vastool list -c user|group <name>

Again, since we are in this section, it doesn't show up. The next step is to try to force it. Maybe something just changed in AD that makes the information available (the user was unix enabled, and the change has only now replicated to the DC VAS is talking with).

    vastool list -f user|group <name>

If it was just AD delay, the entry will show up now. If it doesn't, there are a few things to check first:

    vastool -u host/ search "(userprincipalname=<name>@*)"

You need to be root. Also, if another attribute is being used for naming, that attribute should be used in the filter instead. For example, if vas.conf has a [vasd] entry of username-attr-name = samaccountname, then search by samaccountname (AD is case insensitive; name, Name, and NAME all mean the same to AD). By default, VAS uses userprincipalname for a user's logon name (trimmed at the @). For a Personality, the uid attribute is used instead. Check that the unix attributes are all set on the user/group (groups just have gidnumber).

Another thing to check is location. If a user-search-path/group-search-path is set in vas.conf, then the user/group needs to be in that path to be loaded. (Users who log in should be found; groups might not be, depending on the VAS version.)

By this point the issue should be narrowed to three possibilities:

1) The information doesn't exist in AD. Get the information populated in AD.
2) The information exists in AD, but isn't getting into the cache. Time to escalate this to the team that handles in-depth issues.
3) The information was in AD, and is now in the cache. Nothing more to do; it just needed time.
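The checks in this section form a short decision sequence, which can be written down as a small Python function. This is only a sketch of the order of checks described above: the four boolean inputs stand for the results of the vastool nss, list -c, list -f, and search commands, and the enum and function names are hypothetical.

```python
from enum import Enum


class NextStep(Enum):
    NOT_A_LOADING_ISSUE = "information is already available to the system"
    CACHE_TO_SYSTEM = "in cache but not in NSS: continue with section 11"
    RESOLVED_AD_DELAY = "forced update loaded it: it was AD delay"
    POPULATE_AD = "not in AD, or unix attributes missing: populate AD"
    ESCALATE = "in AD but not reaching the cache: escalate"


def triage_cache_loading(in_nss: bool,
                         in_cache: bool,
                         in_cache_after_force: bool,
                         in_ad_with_unix_attrs: bool) -> NextStep:
    """Walk the checks in the same order as the text.

    in_nss:                 'vastool nss ...' returned the entry
    in_cache:               'vastool list -c ...' returned the entry
    in_cache_after_force:   'vastool list -f ...' returned the entry
    in_ad_with_unix_attrs:  the AD search found it with unix attributes set
    """
    if in_nss:
        return NextStep.NOT_A_LOADING_ISSUE
    if in_cache:
        return NextStep.CACHE_TO_SYSTEM
    if in_cache_after_force:
        return NextStep.RESOLVED_AD_DELAY
    if in_ad_with_unix_attrs:
        return NextStep.ESCALATE
    return NextStep.POPULATE_AD
```

Writing the sequence this way makes it easy to record in a ticket which check was the first to fail.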
11: Troubleshooting: Providing Information to the System from the Cache

This section covers the case where the information is in the local cache, but the vastool nss commands are not returning it. This involves using NSS debug. Run the nss command like this:

    NSS_VAS_STDERR_DEBUG=1 /opt/quest/bin/vastool nss getxxyyy <value>

This will print information from the nss_vas layer as it processes the request. The majority of the middle is nss_vas talking to vasd through IPC, telling it to do an update. Look near the end for a message about why the entry isn't returned.

Error 2 is ENOENT, not found, and shouldn't happen if the entry is in the cache. If this is the case, make sure the names being used match what VAS is using for the name attribute.

Error 16 is EBUSY, the database is busy. Run

    fuser /var/opt/quest/vas/vasd/*

Get the name of each of the processes listed, and escalate with that information.

Error 13 is EACCES, permission denied. Check the permissions on the databases and the directories leading up to them:

    ls -la /var/opt/quest /var/opt/quest/vas /var/opt/quest/vas/vasd

The /var/opt/quest/vas/vasd directory and the files vas_ident.vdb and vas_misc.vdb should be world readable. This issue is seen when users can't get nss information, but root can.

Another possibility is that nss_vas is deciding not to return the information due to OS limitations: users/groups with UIDs/GIDs larger than the system can handle, or a group with a membership list larger than the OS supplied buffer can handle (with 8 character names, about 531 members on Solaris; HP is also affected, AIX and Linux are not).
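The error numbers in the debug output are standard errno values, so they can be decoded with Python's errno module rather than from memory (on common Unix systems 2 is ENOENT, 13 is EACCES, and 16 is EBUSY). The sketch below also shows how to test a permission mode for the world-readable bit. The helper names are made up for this example.

```python
import errno
import os
import stat


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value seen in NSS debug output."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s: %s" % (name, os.strerror(code))


def world_readable(mode: int) -> bool:
    """True if the 'other' read bit is set in a file mode."""
    return bool(mode & stat.S_IROTH)


def path_world_readable(path: str) -> bool:
    """Check a real file or directory, e.g. one of the .vdb files above."""
    return world_readable(os.stat(path).st_mode)


if __name__ == "__main__":
    for code in (2, 13, 16):
        print(code, describe_errno(code))
```

Running path_world_readable against the cache directory and each database file as a non-root user gives the same answer as reading the ls -la output by eye.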
12: Troubleshooting: Authentication Issues

Authentication is a complex process; only the common and easily seen issues will be discussed. The first step of authentication is making sure the user exists, so the checks in the previous two sections should be done first.

The one exception is when a user can log into a system and use some, but not all, tools. For example, they can telnet onto a system, but can't ssh. The method that doesn't work should be verified to be PAM enabled. This can be done by running ldd against the binary. For example, with sshd, run:

    ldd /usr/local/bin/sshd     (if sshd is located in /usr/local/bin)

On HPUX, use the dump command:

    dump -H /usr/local/bin/sshd

Look for the pam library. If it isn't there, it is highly likely the program isn't PAM enabled, so it can't interact with VAS for authentication. In the case of sshd, it is also possible for it to be configured to not use PAM.

The easiest way to check an authentication issue is, as root, to su to the user without the - option. If you get a shell, you have confirmed that the information is present on the system and that the system accepts it. Then exit, and run su - <username>. This runs through the user's profiles, and also hits the home directory on the operating systems that care. If that worked, run su - <username> again and have the user enter their password. If that works, investigate the specific way the user is logging in that is denied, as opposed to the system in general not letting the user in. su tends to be the simplest authentication method, and is great for investigating the issue if it can reproduce it.

Once it has been determined that the application should be using PAM, and that the user's information is available to the system, the next step is to see what pam_vas is reporting about the user. Examine the /etc/syslog.conf file for an auth (authpriv on Linux) entry. A common one is /var/log/auth. If there isn't one, have the sys admin set up a syslog capture for that.
Examine the file pointed to and look for an entry like this:

    Sep 28 20:23:30 sethe su: pam_vas: Authentication <failed> for <Active Directory> user: <tu-1-b> account: <tu-1-b@baka.dev> service: <su> reason: <invalid password>
    Sep 27 11:40:28 sethe su: pam_vas: Authentication <failed disconnected> for <Active Directory> user: <tu-2-b> account: <tu-2-b@baka.dev> service: <su> reason: <invalid password>
    Sep 27 21:01:18 sethe su: pam_vas: Authentication <succeeded> for <Active Directory> user: <tu-1-b> account: <tu-1-b@baka.dev> service: <su> reason: <N/A>

There are four possibilities:

1) The entry doesn't exist, or doesn't exist for the time being investigated. Make sure the application is PAM enabled. sshd can decide a given user is invalid and still let them try to log in, reporting a failure when it didn't do any authentication. Beyond the scope of this document: run sshd in debug mode, or enable debug logging, and look for when it decided the user was illegal (the term it uses for a user that will never successfully authenticate).
2) There is an entry and the cause is apparent, such as the account being disabled, being outside logon hours, or local policy (users.allow) denying the user.
    a. If the reason is invalid password, have the user validate their password by running: /opt/quest/bin/vastool -u <username> kinit. That does just the AD authentication using Kerberos.
    b. If local policy, run vastool user checkaccess <username> /.
3) The reason given in the log says internal error. In this case, the issue will need to be escalated, as further debugging will be needed to find the cause.
4) The entry says it succeeded.

As for the last case, just because the pam_vas layer said to let the user in doesn't mean they got in. At this point, the OS still needs to allow the user. For all operating systems this means the user's shell must exist, work, and for some applications like ftp, be listed in /etc/shells.
For some operating systems the user's home directory must exist (pam_vas tries to make it by default), and be owned by the user. For AIX, the user's GID MUST resolve to a group. An su would fail with the message Unable to set terminal ownership for this issue.
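When an auth log is large, it can help to pull the pam_vas entries apart programmatically. The following Python sketch parses lines in the format of the sample entries shown earlier in this section. The regular expression is derived only from those three samples, so it is an assumption that other VAS versions use the same layout; the function and field names are made up for this example.

```python
import re
from typing import Dict, Optional

# Pattern based on the sample pam_vas log lines in this section.
_PAM_VAS_RE = re.compile(
    r"pam_vas:\s+Authentication\s+<(?P<result>[^>]*)>\s+"
    r"for\s+<(?P<source>[^>]*)>\s+"
    r"user:\s+<(?P<user>[^>]*)>\s+"
    r"account:\s+<(?P<account>[^>]*)>\s+"
    r"service:\s+<(?P<service>[^>]*)>\s+"
    r"reason:\s+<(?P<reason>[^>]*)>"
)


def parse_pam_vas_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a pam_vas authentication entry, or None."""
    match = _PAM_VAS_RE.search(line)
    return match.groupdict() if match else None


def needs_escalation(entry: Dict[str, str]) -> bool:
    """Per the list above, an 'internal error' reason is escalated."""
    return "internal error" in entry["reason"].lower()
```

Filtering a log by the user field and looking at result and reason gives a quick answer to which of the four possibilities applies.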
13: Troubleshooting: NIS

VAS can provide NIS information through vasypd, a NIS server application that gets its information from AD. The first step is to determine if a NIS client is properly running.

    ypwhich
    ypwhich -m

The first will show the yp server being talked to, and the second the available maps. If the first fails, check for the vasypd process, the same as you would for vasd. It runs in the same parent/child model. If it isn't running, or doesn't have two processes, restart it.

    /etc/init.d/vasypd restart        (Linux, Solaris)
    /etc/rc.d/init.d/vasypd restart   (AIX)
    /sbin/init.d/vasypd restart       (HPUX)

If that doesn't fix the issue, the sys admin for that machine should be brought in to get it configured properly. If ypwhich -m doesn't list all the expected maps, the vasypd cache can be flushed by running

    /opt/quest/sbin/vasypd -x

If that doesn't fix the issue, escalate. If the issue is with an individual entry, treat it the same as a missing map: flush vasypd, and if that doesn't fix the issue, escalate.

14: Finding and reporting bugs

So, after doing the troubleshooting and eliminating configuration or information issues, there is still a problem. You have likely found a bug, or a configuration issue that needs more investigation. Now it needs to be distilled to its essential components, so that when it is reported to higher support internally, or to Quest support, only the essentials are sent.

First off, is it reproducible? Can a series of steps be established that will reliably produce the issue? This doesn't mean that if it doesn't always happen it isn't a bug, but it is hard to do any work on a bug that can't be reproduced when needed.
That usually means not all aspects of the bug are known, and some other influencing factor has yet to be identified. Once it is reproducible, if the machine is cleaned and re-installed/joined with VAS, can it be reproduced? If not, there is probably a series of steps over time that contributes to the issue.

The next step is to eliminate the extraneous. If there are many vas.conf settings, which ones can be turned off (or on) and still have the issue appear? If it is an issue with a Personality, turning off UPM mode will make it go away, but that isn't useful information for debugging. But if a user is not able to log in, and it is found that turning off lowercase-name fixes the issue, that fact is very relevant.

I would highly suggest obtaining the snapshot script from Quest Support, and including the output in each bug report. It gathers various information about a system and its configuration files, to aid someone in investigating an issue. Always send steps to reproduce, as accurately as possible.

Sometimes the issue can look like VAS, but ends up being caused outside VAS. For example: on Solaris, VAS users with the /bin/csh shell were getting a crash whenever they tried to expand a tilde ~ to a username, like echo ~tu-1-b. The core trace showed the crash deep in the VAS cache. The final fix ended up being patching tcsh (what provides csh; patch 110898 level 13+ if you are interested), as csh had an issue where it was trashing its own internal stack, messing up some attached VAS items.

Finally, check the KB Articles at http://www.quest.com/support: choose Search Knowledge Base, and select the Product: Vintela Authentication Service. Put in a few key words related to the issue, and it is possible the issue is known and documented.