Integrating the Google Search Appliance with WebSphere Portal and Lotus Web Content Management Dave Hay Portal and Collaboration Architect IBM Software Services for Lotus (ISSL) david_hay@uk.ibm.com +44 7802 918423
About Me With IBM since 1992 Experienced with hardware, software and now services AS/400 and iSeries Network Station WebSphere and Lotus software Linux advocate Collaboration evangelist Infrastructure Architect With ISSL since 2009
Introduction The Project Major UK financial institution Internal and external websites Content held in Lotus Web Content Management Existing intranet and internet sites using Google Search Appliance IBM team and Google partner engaged Solution adoption programme
Requirements To To To To To To To deliver access to unsecured AND secured content maintain security of content within search results present content in context via search results deliver personalized results with variance and relevance integrate with WebSphere Portal maintain access to existing search facilities perform in line with non-functional requirements
Lotus Web Content Management Role-based content management system Built upon WebSphere Portal Workflow-driven authoring, approval and publishing process Content accessible via portlets, standalone websites, API, feeds etc. Content stored in standards-based Java Content Repository (JCR) database
Google Search Appliance Search in a box Self-contained appliance Only requires power and data Different models for different requirements The client uses the GB-7007, which can index 10,000,000 content items / documents GSAs can be scaled to meet non-functional requirements
Challenges Preserve existing search functionality Integrate with client's custom security solution Need to maintain segregation - the GSA should never interact with WCM directly WCM supports the standard Seedlist format; the GSA supports the Google Feeds format User experience: what and where
Terminology
Crawling: the process the GSA goes through to build its on-box search index (known as the default collection)
Serving: the GSA provides the search request form and search results to users
Searching: the process the users go through
Collections: provide views into the default collection based upon URL patterns
Front-Ends: define the user experience IN and OUT of the GSA
XSLT: Extensible Stylesheet Language Transformations, used to drive the user experience
Seedlists and Feeds
Google Feeds is the format that the GSA uses when crawling, and what our solution needed to produce
WCM automatically produces a Seedlist, albeit on-demand
Seedlist generation can also be scheduled and, perhaps, persisted
Question about where the Seedlist would be persisted e.g. file system, database
Both are XML structures
What are the differences? The IBM Seedlist format has features that Google Feeds doesn't offer:
Pre-filtering by user groups stored in meta-data in the index
Post-filtering at run-time
Pagination, useful for large content stores
Embedded seedlists (seedlists within seedlists)
Incremental indexing (what has changed since the last crawl)
Long-term objective is for standardization around the Seedlist format
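As a concrete sketch, a trimmed Google Feeds document of the kind our solution needed to produce is shown below. The datasource name, URL and meta-data field are illustrative (not the client's actual values); note how the user groups that secure the content ride along as a meta-data record:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<gsafeed>
  <header>
    <!-- Datasource name is an assumption for illustration -->
    <datasource>wcm_crawl_proxy</datasource>
    <!-- "incremental" feeds only what changed; "full" replaces the datasource -->
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://crawlproxy.example.com/content/news/article-123"
            mimetype="text/html" action="add">
      <metadata>
        <!-- Groups injected as meta-data so secured search can filter on them -->
        <meta name="wcm-groups" content="hr-staff,all-employees"/>
      </metadata>
    </record>
  </group>
</gsafeed>
```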
The Solution IBM team developed Crawling Proxy (CP) solution CP is based upon an established Google pattern, so not First Of A Kind (FOAK) CP is a standard JEE application deployed onto WebSphere Application Server 6.1 CP acts as broker between GSA and WCM The GSA never connects to WCM directly CP can be scaled across a clustered WebSphere environment to meet non-functional requirements
System Context Diagram [diagram: components include Core Web Security (CWS), Content Authoring Server, WCM Database, Web Server, Portal/Content Delivery Cluster, Portal Databases, Google Search Appliance and Existing Content (Insite); actors include the CWS, Portal, Database, GSA and Insite Administrators, the Content Author and End Users; arrows show admin, user data and security flows]
Crawling Process
GSA makes a crawl request to the Crawling Proxy via a specific URL
CP requests Seedlist from WCM
CP generates Jump Page - an HTML page of links, paged as needed
GSA crawls the Jump Page, requesting each URL from CP
CP returns content and meta-data to GSA
Injected into GSA using the Google Feeds format
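The jump-page step above can be sketched in a few lines of Java. This is an illustrative sketch, not the client's actual Crawling Proxy code; the class name, URL shape and paging scheme are assumptions. The idea is simply that one page of Seedlist entries becomes a plain HTML page of links the GSA can crawl, with a link to the next page so large Seedlists are walked page by page:

```java
import java.util.List;

// Hypothetical jump-page builder inside the Crawling Proxy.
public class JumpPageBuilder {

    /** Builds one jump page of content links for the GSA to crawl. */
    public static String build(List<String> contentUrls, int page, int pageSize) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        int from = page * pageSize;
        int to = Math.min(from + pageSize, contentUrls.size());
        for (int i = from; i < to; i++) {
            html.append("<a href=\"").append(contentUrls.get(i)).append("\">item</a>\n");
        }
        // Link to the next jump page, if any, so the GSA keeps crawling.
        if (to < contentUrls.size()) {
            html.append("<a href=\"/crawlproxy/jump?page=").append(page + 1)
                .append("\">next</a>\n");
        }
        return html.append("</body></html>").toString();
    }
}
```

Because each jump page only links back to the Crawling Proxy, the GSA fetches every content item through the proxy and never touches WCM directly.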
Crawling Process Jump Page
Crawling Process - Feeds
Delivering Secured Search
Content is secured in WCM using user groups
Crawling Proxy injects groups into the GSA as meta-data via the Feed process
GSA needs the user's groups to perform search across ACL-secured content in the index
How does the GSA know the identity and groups of the user?
The GSA can use LDAP, but the client doesn't use it; a custom authentication mechanism is used instead
The Cookie Cracker
Like the Crawling Proxy, this is another pattern that the GSA supports
The Cookie Cracker is used to decrypt and validate the user's security token
It then returns the user ID and groups to the GSA
The GSA can then perform search across ACL-secured content in the index
We also need a Redirect URL to force the user to authenticate if the session is anonymous or expired
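A minimal sketch of the Cookie Cracker's core logic is shown below. The class name and token format are assumptions for illustration, not the client's custom scheme. The GSA forwards the user's security cookie to this component; if the token decodes and has not expired, the component answers with the user ID and groups (typically carried back in response headers), otherwise it answers with nothing and the GSA sends the user to the Redirect URL to log on:

```java
import java.util.Base64;
import java.util.Optional;

// Hypothetical Cookie Cracker core: validates a security token and
// extracts the identity the GSA needs for ACL-secured search.
public class CookieCracker {

    /** Assumed token format: Base64 of "userId|group1,group2|expiryMillis". */
    public static Optional<String[]> crack(String token, long nowMillis) {
        try {
            String[] parts = new String(Base64.getDecoder().decode(token)).split("\\|");
            if (parts.length != 3 || Long.parseLong(parts[2]) < nowMillis) {
                return Optional.empty(); // expired or malformed: force re-authentication
            }
            return Optional.of(new String[] { parts[0], parts[1] }); // user ID, groups
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not valid Base64 / not a number: reject
        }
    }
}
```

In the real deployment the decrypt-and-validate step would use the client's custom security solution rather than a simple Base64 decode.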
Serving Process
User initiates a search request, either by accessing the GSA directly or via the portal
User indicates whether a secured or unsecured search is required
If unsecured, the GSA searches as usual
If secured, the GSA redirects the user request to the Cookie Cracker
If there is no valid token, the GSA redirects the user request to the Redirect URL to force logon
Once there is a valid token, the Cookie Cracker returns the user ID and groups to the GSA
GSA performs the search across ACL-secured content
The Multiple GSA Scenario May be needed for performance and/or resilience Multiple patterns including Active/Active Crawl, Active/Active Search, Active/Passive Crawl etc. Option to use mirroring to keep passive GSA in sync with active GSA Crawling Proxy needs to be designed to know which GSA is making a request Crawling Proxy also needs to persist timestamp of last Seedlist request
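The "which GSA is asking?" bookkeeping can be sketched as below; the class name and the idea of keying on an appliance identifier are assumptions, not the client's code. In an Active/Active Crawl pattern each appliance crawls independently, so the Crawling Proxy must remember, per appliance, when it last served a Seedlist in order to ask WCM for the right incremental delta:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-appliance crawl tracker inside the Crawling Proxy.
public class CrawlTracker {
    private final Map<String, Long> lastSeedlistByGsa = new ConcurrentHashMap<>();

    /** Returns the "changes since" timestamp for WCM, then records this crawl. */
    public long beginCrawl(String gsaId, long nowMillis) {
        Long previous = lastSeedlistByGsa.put(gsaId, nowMillis);
        return previous == null ? 0L : previous; // 0 means full, non-incremental crawl
    }
}
```

A production version would persist this map (e.g. to a database) so the timestamps survive a restart of the clustered proxy.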
GSA Security
The GSA can use security mechanisms such as NTLM and Form-Based Authentication to control crawler access; we chose to use NTLM
The GSA also supports solutions such as Kerberos and SAML for client authentication, essential for secure serving; we chose to use Cookie Cracking
We also needed to consider other aspects:
Using HTTPS to encrypt access from the GSA to the Crawling Proxy
Using IP whitelists and network ACLs to control access to GSA ports such as Feeds and Admin
Using HTTPS to encrypt data being fed into the Feed port
Using on-box user accounts (administrator, manager) rather than LDAP
End-user Experience Options to deliver UX from portal -or- from GSA GSA experience driven by front-end Front-end provides search request and search results Option to have multiple front-ends; each with different theme/style Front-ends delivered using Extensible Stylesheet Language Transformations (XSLT) Re-use existing styles e.g. CSS files, icons, logos etc.
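A front-end customization of the kind described above might look like the fragment below. This is an illustrative sketch, not the client's stylesheet; the CSS class names are assumptions. The GSA returns results as XML (each hit is an R element, with U holding the URL, T the title and S the snippet), and the front-end XSLT renders each hit using the organization's existing styles:

```xml
<!-- Hypothetical result template re-using existing CSS classes -->
<xsl:template match="R" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <div class="search-result">
    <a class="result-title" href="{U}">
      <xsl:value-of select="T" disable-output-escaping="yes"/>
    </a>
    <p class="result-snippet">
      <xsl:value-of select="S" disable-output-escaping="yes"/>
    </p>
  </div>
</xsl:template>
```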
Examples of UX
Component Design
Skills Client had previous experience with GSA Needed to acquire additional GSA administration experience Crawling Proxy, Cookie Cracker and Redirect URL applications realized in JEE XSLT skills needed to customize front-ends The GSA has on-box front-end tooling, but XSLT expertise is needed to modify front-ends over and above what the tooling offers
Project Lifecycle
Conduct requirements gathering exercise - we started with a baseline requirement for secured search in Portal
Equates to an agile project; we knew where we wanted to get to, but the way-points on the journey changed along the way
Work with Google partner to understand the art of the possible - patterns such as Crawling Proxy and Cookie Cracking came this way
Identify dependencies - need the GSA 6.8 software level to support content-level ACLs, plus an additional fix for SSL support
Develop and functionally test, iteratively
Plan for non-functional testing, to build a capacity model - using the Crawling Proxy against WCM was a known unknown
Plan to upgrade production GSAs to 6.8
Plan for administrator and developer training
The Future Client plans to make this Search Solution a standard part of all future Portal/WCM deployments This includes internal AND external web sites Option to re-use all/part of solution ( esp. Crawling Proxy ) for Collaboration project with Lotus Connections Extend solution to offer Personalization ( variance and relevance ) using meta-data Consider scheduling Seedlist generation, and caching across clusters Look at options to standardize XSLT across organization Consider search on mobile devices e.g. iPad, Android
Lessons Learned Need complete set of skills Portal/WCM GSA XSLT Security infrastructure Networking Project spans infrastructure, application and security disciplines Decide on UX as soon as possible Focus on requirements, requirements, requirements
Any questions?
How to contact me Lotus Sametime 07802 918423 Lotus Notes