The Paraná Digital Project

Similar documents

Proposal for Virtual Private Server Provisioning

A Better Approach to Backup and Bare-Metal Restore: Disk Imaging Technology

Blackboard Managed Hosting SM Disaster Recovery Planning Document

Blackboard Collaborate Web Conferencing Hosted Environment Technical Infrastructure and Security

Module 7: System Component Failure Contingencies

Business Continuity: Choosing the Right Technology Solution

AKIPS Network Monitor User Manual (DRAFT) Version 15.x. AKIPS Pty Ltd

Backup Retention and Data Availability

Service Name: Software Support Service

DATASHEET. Proactive Management and Quick Recovery for Exchange Storage

VDI can reduce costs, simplify systems and provide a less frustrating experience for users.

Designing, Optimizing and Maintaining a Database Administrative Solution for Microsoft SQL Server 2008

an introduction to networked storage

Distributed File Systems

Availability and Disaster Recovery: Basic Principles

MS Design, Optimize and Maintain Database for Microsoft SQL Server 2008

Selling Virtual Private Servers. A guide to positioning and selling VPS to your customers with Heart Internet

Creating A Highly Available Database Solution

High Availability and Disaster Recovery for Exchange Servers Through a Mailbox Replication Approach

How Routine Data Center Operations Put Your HA/DR Plans at Risk

AVLOR SERVER CLOUD RECOVERY

The Microsoft Large Mailbox Vision

HOSTING SERVICES AGREEMENT

The Importance of Software License Server Monitoring White Paper

Be ready for emergencies

Server Operations Managed Servers Service Level Agreement (SLA)

Explain how to prepare the hardware and other resources necessary to install SQL Server. Install SQL Server. Manage and configure SQL Server.

Backup and Recovery by using SANWatch - Snapshot

PLUMgrid Toolbox: Tools to Install, Operate and Monitor Your Virtual Network Infrastructure

SERVICE SCHEDULE DEDICATED SERVER SERVICES

Perforce Backup Strategy & Disaster Recovery at National Instruments

How To Write A Server On A Flash Memory On A Perforce Server

Oracle VM Server Recovery Guide. Version 8.2

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

Enhancements of ETERNUS DX / SF

What is RAID and how does it work?

Fault Tolerant Servers: The Choice for Continuous Availability on Microsoft Windows Server Platform

Meeting Management Solution. Technology and Security Overview N. Dale Mabry Hwy Suite 115 Tampa, FL Ext 702

CS615 - Aspects of System Administration

Server Operations Managed Servers Service Level Agreement (SLA)

A Project Summary: VMware ESX Server to Facilitate: Infrastructure Management Services Server Consolidation Storage & Testing with Production Servers

FileCruiser Backup & Restoring Guide

Managing and Maintaining a Windows Server 2003 Network Environment

Tk20 Backup Procedure

Created By: 2009 Windows Server Security Best Practices Committee. Revised By: 2014 Windows Server Security Best Practices Committee

Tiburon Master Support Agreement Exhibit 6 Back Up Schedule & Procedures. General Notes on Backups

Aljex Software, Inc. Business Continuity & Disaster Recovery Plan. Last Updated: June 16, 2009

Active Directory - User, group, and computer account management in active directory on a domain controller. - User and group access and permissions.

Maintaining a Microsoft SQL Server 2008 Database

MCSA Security + Certification Program

ORACLE DATABASE 10G ENTERPRISE EDITION

HAVmS: Highly Available Virtual machine Computer System Fault Tolerant with Automatic Failback and close to zero downtime

Sage CRM Technical Specification

Network Virtualization Platform (NVP) Incident Reports

Data Protection in a Virtualized Environment

Administering a Microsoft SQL Server 2000 Database

STORAGE CENTER. The Industry s Only SAN with Automated Tiered Storage STORAGE CENTER

How To Migrate To Redhat Enterprise Linux 4

Tk20 Network Infrastructure

Our Server Support. Looking after your servers giving you peace of mind. Document Version Revision Date Feb 2015

BME CLEARING s Business Continuity Policy

Cisco Active Network Abstraction Gateway High Availability Solution

system monitor Uncompromised support for your entire network.

ANNE ARUNDEL COMMUNITY COLLEGE ARNOLD, MARYLAND COURSE OUTLINE CATALOG DESCRIPTION

Backup and Redundancy

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

Operations Manager Comprehensive, secure remote monitoring and management of your entire digital signage network infrastructure

Administrator Manual

WHITE PAPER. VERITAS Bare Metal Restore AUTOMATED SYSTEM RECOVERY WITH VERITAS NETBACKUP

WHITE PAPER PPAPER. Symantec Backup Exec Quick Recovery & Off-Host Backup Solutions. for Microsoft Exchange Server 2003 & Microsoft SQL Server

Introduction to Microsoft Small Business Server

RAID systems within Industry

Router Recovery with ROM Monitor

ENTERPRISE STORAGE WITH THE FUTURE BUILT IN

Course 2788A: Designing High Availability Database Solutions Using Microsoft SQL Server 2005

System Availability and Data Protection of Infortrend s ESVA Storage Solution

LOCKSS on LINUX. CentOS6 Installation Manual 08/22/2013

The Aspect Unified IP Five 9s Environment

Snapshot Technology: Improving Data Availability and Redundancy

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

PC Proactive Solutions Technical View

This is when a server versus a workstation is desirable because it has the capability to have:

Disaster Recovery Strategies: Business Continuity through Remote Backup Replication

Business Process Desktop: Acronis backup & Recovery 11.5 Deployment Guide

automates system administration for homogeneous and heterogeneous networks

Cisco Unified Computing Remote Management Services

NETWORK SERVICES WITH SOME CREDIT UNIONS PROCESSING 800,000 TRANSACTIONS ANNUALLY AND MOVING OVER 500 MILLION, SYSTEM UPTIME IS CRITICAL.

Transcription:

Marcos Castilho Center for Scientific Computing and Free Software C3SL/UFPR September, 25th 2008 Managing a Grid of Computer Laboratories for Educational Purposes

Abstract We present a grid-based model for managing hundreds of distributed educational computing laboratories. The aim is to dispense the need of specialized staff in each place whilst maximizing the laboratories uptime The model was implemented in order to manage more than two thousand laboratories spread across the entire public school network of Paraná State (Brazil)

Motivation Computers and the Internet are important for educational purposes Cost reduction of computer hardware led to expansion of digital inclusion policies The aim is to provide laboratories for every public school

First problems The lack of specialized workforce to manage and maintain these laboratories frequently turns them useless Computers get obsolete, the software doesn t work properly, computers get infected, among other reasons In public schools this situation is worsened by the lack of funds for frequent updates

Problems that seem to have no solution It s impossible to have an expert to manage every school Educators and students should only worry about the educational use of the system, and not about its technical details This demands an effort in order to define a new administration model

Problems do have a solution! The Grid-based model Management tasks are automated Self-monitoring systems Maintenance costs are minimized Dispenses specialized staff in evety laboratory Maximises system uptime

The model was implemented in Brazil 2.100 schools 1.500.000 students 57.000 teachers 42.000 working points 399 cities 199,314km 2 GNU/Linux software, under Free Software licence (GPL) Up and running since June, 2006

It is a partnership between State Secretary of Education State Computer Company of Paraná State State Secretary of Science and Technology Inter-American Development Bank (IDB) United Nations Development Programme (UNDP) Electric Company of Electricity (COPEL) Universidade Federal do Paraná (C3SL)

View of Paraná State in Brazil

View of Paraná State

PRD s architecture

PRD s architecture In each school: a laboratory composed of 20 X-terminals one processing server called the school server The school server acts simultaneously as processing and storage unit, gateway to the network, firewall and access point to the Core it runs a Debian-based GNU/Linux distribution, all servers have the same software packages installed. At the Core, a proxy-controlled connection to the Internet is provided

The grid-based model The computer laboratories are interconnected forming a grid The hardware, operating system and main applications are fairly homogeneous Global tasks: are executed to guarantee the expected services, with maximum performance Local tasks: are related to specific aspects of each laboratory and those demanding physical access to the hardware Main idea: To implement self-management concepts

The local managers They are not computer experts They take high level and simple decisions They use a user-friendly interface to perform local tasks: User accounts, disk quotas, etc. Initial installation or re-installation procedures Contact the call center in case of problems

Global managers: the Core A small group of specialized system managers They are able to manage hundreds of laboratories They specify what are the system s features, software to be installed, what should be restricted or allowed Their decisions are propagated through self-management Self-configuration Self-optimization Self-recuperation Self-protection

Self-configuration Concerns the adaptation of new components or new execution environments without a significant human intervention Is a continuous process aiming to keep the system configured The configuration policies are defined for the entire grid by the Core

Implementing self-configuration Users have access to the server through the X-terminals The server is the only machine to be configured and maintained The software and configuration of the X-terminals are determined at the server The servers self-configuration is based on software package management and unified mirrors

Initial installation A CD-ROM containing a standard system image The server checks periodically for updates at the central mirror X-terminals must automatically recognize and configure hardware It allows automated installation and configuration by the local manager This automated procedure minimizes service disruption and data loss A regular user must have an account created by the local manager. The local manager does not have root powers This ensures that critical tasks are globally defined on the grid.

Self-optimization Allows the configuration of parameters that impact on the overall performance Monitoring systems are fundamental in a self-optimized system Collected data can be used to optimize system parameters New configurations can be automatically applied globally in the grid The identification of possible failure points can be used to avoid or to fix bugs

Self-optimization The analysis of historical data can also be used to optimize the system This helps to scale the hardware in each laboratory or to plan a hardware upgrade There is a web page with strategic information from the whole grid

Self-recuperation and Self-protection Failures in a system can cost many weeks of work It s possible to automatically detect, diagnose, treat and prevent many software and hardware problems The downtime can be reduced if some aspects of the hardware and the operating system are tracked Examples: Contact the Core depending on memory, hard disk and temperature sensors

Self-recuperation and Self-protection Redundant Array of Inexpensive Disks (RAID) prevent loss of data Downtime can be minimized by verifying the filesystem s integrity The system must automatically recover damaged files There must be a recuperation process to reinstall the system without user data loss If everything fails, enter in remote assistance mode (ssh from the Core)

System Recovery Automatic recovery: it formats only the root filesystem, preserving user data and local information. A collection of scripts saves important files, restoring them after system re-installation. The remote help: Live-CD allows the Core to connect to the school server and try to find the problem or to do a backup.

System Upgrade Frequent system upgrades are necessary to provide new functionalities, address security problems and propagate new software, tools, or policies from the Core automatic daily upgrades and triggered upgrades

The daily automatic upgrade Based on Debian s apt-get tools Every night, each school server looks for new software packages The single mirror ensures that all servers will install exactly the same software It is quite simple to propagate a new tool or configuration over the entire network

The triggered upgrade The Core can force all school servers to upgrade This happens as soon as the network link turns on Use cases: the Kernel exploit that allowed an ordinary user to become root the upgrade from sarge to etch (in February 2008)

System Monitoring Is an essential feature of autonomic systems, and provides information to allow the system s self-optimization and self-recuperation It is at the heart of the PRD network, revealing the real state of the whole grid In the PRD model, there are two different monitoring systems: the statistics center diagnosis system

The Statistics Center A web site with strategic information It allows an overview of the network s growth and provide data concerning the laboratories usage This is automatically stored in the central database This provides strategic information for decision support

Snapshot of the statistics center web page

Snapshot of the statistics center web page

Examples of collected data giving decision support 2500 Schools User/1000 Login time/1000 2000 1500 1000 500 0 Sep/07 Nov/07 Jan/08 Mar/08 May/08 Jul/08

The Instant Diagnosis System Conceived to provide real time diagnosis of the network s status Detects possible faults in the laboratories and triggers alarms at the Core Capable to detect whether the schools are online, identify old software, inconsistencies in a filesystem, checksum errors, hard disk problems and so on Allows prevention of critical failures, either by acting remotely or by timely providing hardware replacement

Manual System Inspection Manual interventions are undesired The core can remotely log in a server via SSH But just to inspect the server s behavior All changes should happen on a global manner in the grid This means: global management!

Main contributions Software 40 million dollars saved due to free software Software is free and has high quality More than 100 thousand packages are made available Hardware Carefully dimensioned hardware configuration Obsolete hardware can be reused 15 million dollars saved due to the multiseat Maximization of hardware lifetime

Conclusion This model allows the administration of thousands of computing laboratories with minimum human intervention A large scale implementation of this model is provided, in which a management team composed of 12 highly trained Unix managers is able to control all software-related issues from the entire network

Final considerations The system works pretty well...... but a working system does not mean it will be used!

Contact and links castilho@ufpr.br jefferson.s@lactec.org.br http://www.c3sl.ufpr.br/prd http://yoda.c3sl.ufpr.br/sdi http://www.prdestatistica.seed.pr.gov.br