Controls System UNIX Computing Environment and Data Management Facilities Review

Issues
- Data storage and backup
- Security
- Scalability and reliability
- System administration

Solutions
- Production data staging and distribution
- Local NFS system
Existing Production Unix System

[Diagram: SCS services (web server, NIS server, AFS/NFS, SUN staging system for backup, ORACLE server, SCS pubs) feeding the controls LANs. Controls-side nodes include the development hosts nlcdev and nlcdev-opi01 through nlcdev-opi05, slcsun1 (EPICS development), slcs2 (BaBar gateway, with slcs1 as its backup), the gateway servers opi00gtw00, opi00gtw01 and opi00gtw02, the NLCTA server opi00gtw04, MCC VMS (backup & DNS), the front-end servers opi04rfs00, opi04lfb00, opi04lfb01, opi04lfb02, opi08rfs00 and opi12rfs00, the system monitor opi99rfs02, the network monitor netmon-pep, the GPIB server mccux02, and NAS 0/NAS 1 storage, on the PEPII, LEB, BaBar, NLC and SLCLAVC LANs.]
Issues
- Standalone computing environment
  - Controls remain functional if disconnected
  - How to balance high security against availability?
  - With systems in production, security enforcement lags behind: reboots and system upgrades are often needed
  - Latest incident: GPIB server (kernel corrupted)
- How to balance reliability against scalability?
  - Standalone: redundant, robust, very reliable
  - Problem: scalability and system administration (local services)
  - A new model: localized yet transparent
- Data storage and backup
  - Limited capacity
  - Constraints
A Transparent Model

[Diagram: a router added between the SCS services (web server, NIS server, AFS/NFS, SUN staging system for backup, ORACLE server, SCS pubs) and the controls LANs (slclavc, nlcdev, leb, babar, pepii); the gateways become Production Server 0 and Production Server 1 serving the unix clients.]

- Model: router added; gateways become production servers
- SCS provides: OS, security, customized and tailored services
- We provide:
  - Local NIS and DNS
  - Automount: NFS locally, AFS from SCS
  - Data storage buffer
  - Production (and some development) software
  - Procedures (backup, data distribution, monitoring and tools)
- Localized yet transparent and scalable
  - Served by local NFS, NIS and DNS, and local production software
  - Controls remain functional if disconnected from SCS
  - Resources and data stay transparent
  - Development and production platforms are symmetric (same OS): easy development, deployment and maintenance
  - Timely monitoring (see the sketch after this list)
  - Simple upgrades, easy expansion
- Concerns
  - We: reliability, availability, controls network security
  - SCS: 24-hour on-call, more workload
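The "functional if disconnected" and "timely monitoring" points imply a periodic check that the locally provided services answer even with the SCS link down. A minimal monitoring sketch in Python, not from the slides: the host names opi00gtw00/opi00gtw04 and the /u1 mount appear in the diagrams, while /usr/local/prod is an invented placeholder.

```python
import os
import socket

# Hosts resolved via the local DNS/NIS, and mount points served by the local
# production NFS.  /usr/local/prod is a hypothetical example path.
LOCAL_HOSTS = ["opi00gtw00", "opi00gtw04"]
LOCAL_MOUNTS = ["/u1", "/usr/local/prod"]

def check_local_services() -> bool:
    """Return True if the locally provided name and file services respond."""
    ok = True
    for host in LOCAL_HOSTS:
        try:
            socket.gethostbyname(host)          # exercises local DNS/NIS lookup
        except OSError:
            print(f"name lookup failed for {host}")
            ok = False
    for mnt in LOCAL_MOUNTS:
        if not os.path.ismount(mnt):            # local NFS mount present?
            print(f"{mnt} is not mounted")
            ok = False
    return ok

if __name__ == "__main__":
    print("local services OK" if check_local_services() else "local services degraded")
```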
Storage Facilities Scattered
- PEPII -> /u1
- CMLOG -> NAS
- NLCTA -> /u1 and SCS NFS
- Channel Archiver -> NFS & AFS
- Luminosity -> AFS
- etc.
- Extra processes needed (CPU drain)
- Increased network traffic

[Diagram: PEPII, LEB, BABAR and LAVC gateway servers writing to /u1 (PEPII data) and NAS 0/NAS 1 (CMLOG data); the NLCTA gateway server sending Channel Archiver data to SCS NFS; the BABAR gateway server, Oracle server, SCS pubs, SCS backup, and MCC VMS backup on the SCS side.]
Current Archiver Architecture

[Diagram: the NLCTA Archive Engine and Archive Browser on OPI00GTW04 and the PEPII Archive Engine and Archive Browser on SLCS2; data flows to the NFS and AFS file systems for backup; RTR-MCC is the disconnect point from the SLAC public network; SLCLAVC, LEB, BABAR and PEPII networks shown.]
NLCTA
- Production computing resources shared with PEPII
  - The NLCTA server (opi00gtw04) is booted from the PEPII gateway servers
  - Data are logged into the PEPII gateway storage area (/u1)
  - Accumulating at a high rate of >100 MB/day (approx. 1600 binary files at 48 KB/file)
- Constraints
  - Limited storage capacity
  - Constraints on adding additional disks
  - Isolated production network
- A solution: data staging and distribution (back-of-the-envelope sketch below)

[Diagram: the NLCTA server logging to /u1 (PEPII data) on the PEPII gateway servers, with SCS NFS, SCS pubs and the backup facility on the SCS side; LEB and LAVC networks shown.]
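To make the storage pressure concrete, a quick arithmetic sketch in Python using the figures quoted above; the free-space value is a hypothetical assumption, not a number from the slides.

```python
# Back-of-the-envelope check of the NLCTA accumulation rate quoted above.
FILES_PER_DAY = 1600        # approx. binary fault files logged per day
FILE_SIZE_KB = 48           # approx. size of each file
QUOTED_RATE_MB = 100        # quoted accumulation rate (> 100 MB/day)
FREE_SPACE_GB = 10          # hypothetical free space in the shared /u1 area

rate_from_files = FILES_PER_DAY * FILE_SIZE_KB / 1024      # MB/day implied by file counts
days_until_full = FREE_SPACE_GB * 1024 / QUOTED_RATE_MB    # days until the space is gone

print(f"rate implied by file counts: ~{rate_from_files:.0f} MB/day")
print(f"{FREE_SPACE_GB} GB of free space lasts ~{days_until_full:.0f} days at {QUOTED_RATE_MB} MB/day")
```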
Data Distribution
- NLCTA data accumulate in the PEPII staging area ($STAG/fault)
- Process distnlctadata on the NLCTA server (sketched below):
  - Archives the data files older than one day
  - Creates a tar file: fault_mm-dd-yy.tar
  - Distributes it to SCS NFS
    - tar -> $NLCTA/fault/archive for safekeeping
    - Data restored -> $NLCTA/fault/data (mirrored) for analysis
    - Compressed and backed up via the SCS facility
  - distnlctadata is timed
- Process cleanup on the PEPII gateway server:
  - Triggered when archiving is complete
  - Offset by the time distnlctadata takes
  - Removes the archived files from the staging area
- Automatic: trscrontab.nlcta runs daily (extends the trscrontab token)
- Reliability:
  - All involved systems are validated
  - Any failure is tracked and reported via e-mail
  - On failure, cleanup is disabled, a local backup is kept, and the data are protected
- (Thanks, Kristi.)

[Diagram: distnlctadata and trscrontab on the NLCTA server; cleanup and the staging area on the PEPII gateway servers (/u1, PEPII data); SCS NFS, SCS pubs and the backup facility on the SCS side; LEB and LAVC networks shown.]
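A minimal Python sketch of the distnlctadata step described above. The real distnlctadata is a site script whose internals are not given here, so the fallback paths, helper names, and the use of Python rather than the shell are all illustrative assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the distnlctadata staging/distribution step."""
import os
import tarfile
import time
from datetime import date
from pathlib import Path

# $STAG/fault is the PEPII staging area and $NLCTA the SCS NFS destination,
# per the slide; the fallback paths are invented defaults for this sketch.
STAGING = Path(os.environ.get("STAG", "/u1/staging")) / "fault"
NLCTA = Path(os.environ.get("NLCTA", "/nfs/nlcta")) / "fault"
ONE_DAY = 24 * 3600

def distnlctadata() -> float:
    """Archive staged fault files older than one day, distribute the tar to
    SCS NFS, and return the elapsed time (used to offset the cleanup step)."""
    start = time.time()
    cutoff = start - ONE_DAY
    old_files = [p for p in STAGING.iterdir()
                 if p.is_file() and p.stat().st_mtime < cutoff]
    if not old_files:
        return time.time() - start

    archive_dir = NLCTA / "archive"   # tar kept here for safekeeping
    data_dir = NLCTA / "data"         # restored (mirrored) copy for analysis
    archive_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)

    tar_path = archive_dir / f"fault_{date.today().strftime('%m-%d-%y')}.tar"
    with tarfile.open(tar_path, "w") as tar:
        for p in old_files:
            tar.add(p, arcname=p.name)

    # Mirror the archived files into the analysis area.
    with tarfile.open(tar_path) as tar:
        tar.extractall(data_dir)

    return time.time() - start
```

Returning the elapsed time mirrors the "distnlctadata is timed" item: the cleanup step on the PEPII gateway is offset by exactly this value, as discussed on the next slide.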
Protection Against Data Loss
- Data loss (incident I)
  - Incremental fault trips occurred after archiving was done and cleanup had been triggered
  - Solution: archive only data files older than one day, on a daily basis (tar name: fault_mm-dd-yy.tar)
- Data loss (incident II)
  - Cleanup raced the elapsed time of distnlctadata
  - Solution: time distnlctadata and offset cleanup by that elapsed time
- Result: data older than one day are archived, then data older than one day are removed by cleanup, with the offset applied (a small sketch follows)
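A small sketch, in the same hypothetical Python terms as the previous block, of the two fixes: cleanup waits out the measured distnlctadata runtime so it cannot race the archiving, and it touches only files old enough to have been archived. Passing the archive cutoff directly is a simplification; in the slides the two processes run on different machines and the fix is the timed offset.

```python
import time
from pathlib import Path

def cleanup(staging: Path, archive_cutoff: float, distnlctadata_elapsed: float) -> None:
    """Incident II fix: wait out distnlctadata's measured runtime before touching
    the staging area.  Incident I fix: remove only files old enough to have been
    captured by the daily tar (older than distnlctadata's one-day cutoff)."""
    time.sleep(distnlctadata_elapsed)              # offset by distnlctadata's runtime
    for p in staging.iterdir():
        if p.is_file() and p.stat().st_mtime < archive_cutoff:
            p.unlink()                             # safe: already archived and mirrored
```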
Backup Facilities
- In different forms:
  - PEPII systems, applications and data -> VMS buffer -> tape
  - CMLOG data -> NAS disk (mirrored)
  - NLCTA data -> SCS tape staging facility
  - Channel Archiver data -> SCS AFS & CD
  - GPIB server system and applications -> local tape
  - Network monitor utilities -> none
- Restoring is done by different methods
- A dilemma
  - The VMS buffer is limited (image and incremental)
  - Problem writing files larger than 2 GB to the VMS buffer from Solaris: the Multinet NFS protocol has a file-size limitation of ~2 GB (see the guard sketch below)
  - Magneto-optical media is an option (KenB)
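A tiny guard sketch in Python for the Multinet NFS limitation noted above: files at or over ~2 GB are flagged for an alternative path such as the magneto-optical option rather than being written to the VMS buffer. The threshold constant, routing labels and function name are illustrative assumptions.

```python
from pathlib import Path

MULTINET_NFS_LIMIT = 2 * 1024**3        # ~2 GB Multinet NFS file-size limit

def backup_route(path: Path) -> str:
    """Decide where a backup file can go: the VMS buffer cannot take files
    at or above the ~2 GB limit when written from Solaris."""
    if path.stat().st_size >= MULTINET_NFS_LIMIT:
        return "magneto-optical"        # alternative media path (KenB's option)
    return "vms-buffer"                 # normal image/incremental path to tape
```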
A Long-Term Solution: Local NFS
- Local NFS file systems: two SUN Netra 20 servers
  - Redundant (master & failover)
  - Expandable
  - Optimized for serving files on the network
- Sun StorEdge Disk Array
  - RAID controller: disks mirrored
  - Dynamic reconfiguration of disk storage
  - Hot-swap redundant components
  - Optical fiber cable
- Backup facility: magneto-optical storage media (HP Optical Jukeboxes)
  - High-volume reference data
  - Write once, read forever
  - Scalable (up to 2 TB)
  - Transfer rate up to 10 MB/s
  - More in research (thanks, KenB); rough figures sketched below

[Diagram: SUN Netra 20 master server and SUN Netra 20 failover server connected over optical fiber to the Sun StorEdge Disk Array and the HP Optical Jukeboxes.]
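For scale, a quick Python check of the vendor figures quoted above against the earlier 100 MB/day data rate; pairing those two numbers is my assumption, for illustration only.

```python
# Rough capacity/throughput check for the HP Optical Jukebox backup option.
CAPACITY_TB = 2            # scalable up to 2 TB
TRANSFER_MB_S = 10         # transfer rate up to 10 MB/s
DAILY_DATA_MB = 100        # assumed daily production data rate (NLCTA figure)

capacity_mb = CAPACITY_TB * 1024 * 1024
print(f"days of data held at {DAILY_DATA_MB} MB/day: ~{capacity_mb / DAILY_DATA_MB:,.0f}")
print(f"time to write one day of data: ~{DAILY_DATA_MB / TRANSFER_MB_S:.0f} s")
print(f"time to fill 2 TB at {TRANSFER_MB_S} MB/s: ~{capacity_mb / TRANSFER_MB_S / 86400:.1f} days")
```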
New Storage System

[Diagram: all production data consolidated on the new local storage system: /u1 (PEPII & NLCTA data), NLCTA Channel Archiver data, and CMLOG data (formerly NAS 0/NAS 1); NLCTA, BABAR and PEPII gateway servers, MCC VMS backup, SCS NFS, the Oracle server and SCS pubs shown across the LEB, LAVC and PEPII networks.]
Conclusions
- The standalone controls computing environment is robust (e.g., no failures for two years)
- Dilemmas & solutions
- An enlightening direction: the local NFS solution
  - Centralized: all production data in one place; backup in one form and in one place
  - Higher volume and higher data rates, yet less network traffic
  - Fast data access and simple data restore
  - Scalable and reliable: systems expandable; redundant, robust systems; fewer network routes, fewer platforms, fewer procedures
  - No dependence on SCS: all services critical to production are provided locally, except for development and e-mail
- Trade-offs