Managing data. Introduction to Linux for Bioinformatics. Joachim Jacob 25 and 29 april 2016

Similar documents
Linux+ Guide to Linux Certification, Third Edition. Chapter 11 Compression, System Backup, and Software Installation

USTM16 Linux System Administration

USB 2.0 Flash Drive User Manual

How to Restore a Linux Server Using Bare Metal Restore

Unix Sampler. PEOPLE whoami id who

Training Day : Linux

USEFUL UNIX COMMANDS

Introduction to Mac OS X

Using TrueCrypt to protect data

Technical Note TN_146. Creating Android Images for Application Development

Linux Backups. Russell Adams Linux Backups p. 1

WES 9.2 DRIVE CONFIGURATION WORKSHEET

System administration basics

Two Parts. Filesystem Interface. Filesystem design. Interface the user sees. Implementing the interface

BackTrack Hard Drive Installation

Linux System Administration. System Administration Tasks

CLOUD INFRASTRUCTURE VIRTUAL SERVER (SHARED) DATA IMPORT GUIDE

VERSION 9.02 INSTALLATION GUIDE.

HW 07: Ch 12 Investigating Windows

How you configure Iscsi target using starwind free Nas software & configure Iscsi initiator on Oracle Linux 6.4

Installing MooseFS Step by Step Tutorial

Linux System Administration

Monitoring disk stats with Cacti

Acronis Backup & Recovery 10 Server for Linux. Command Line Reference

Linux Overview. The Senator Patrick Leahy Center for Digital Investigation. Champlain College. Written by: Josh Lowery

Cisco Networking Academy Program Curriculum Scope & Sequence. Fundamentals of UNIX version 2.0 (July, 2002)

Recover Data Like a Forensics Expert Using an Ubuntu Live CD

How to upload - copy PowerChute Network Shutdown installation files to VMware VMA from a PC

Buildroot for Vortex86EX (2016/04/20)

LSN 10 Linux Overview

Installing IBM Websphere Application Server 7 and 8 on OS4 Enterprise Linux

On Disk Encryption with Red Hat Enterprise Linux

WebPublish User s Manual

Installing Virtual Coordinator (VC) in Linux Systems that use RPM (Red Hat, Fedora, CentOS) Document # 15807A1-103 Date: Aug 06, 2012

User Manual for Data Backups

Capturing a Forensic Image. By Justin C. Klein Keane <jukeane@sas.upenn.edu> 12 February, 2013

LAE Enterprise Server Installation Guide

AlienVault Offline Key Activation

Paragon ExtFS for Mac OS X

Linux System Administration on Red Hat

4PSA Total Backup User's Guide. for Plesk and newer versions

RocketRAID 2640/2642 SAS Controller Ubuntu Linux Installation Guide

INSTALLING NOVA DATA ENCRYPTION SOFTWARE

Computer forensic science

StoreBackup April 20, Download the source from

Usage of the mass storage system. K. Rosbach PPS 19-Feb-2008

Instructions for downloading and installing the GPS Map update

UNIX (LINUX) PRACTICAL 1 INTRODUCTION

The 2013 Experimental Warning Program (EWP) Virtual Weather Event Simulator (WES) Windows & Linux Installation Documentation

ATT8231: Creating a Customized USB Thumb Drive for ZCM Imaging Methods for creating a customized bootable USB Thumb Drive

Copyright

UNIX / Linux commands Basic level. Magali COTTEVIEILLE - September 2009

Installation Guidelines for NS2 on Windows

USB Bare Metal Restore: Getting Started

Cloud Computing. Command Line Tools

Searching your Archive in Outlook (Normal)

Red Hat Certifications: Red Hat Certified System Administrator (RHCSA)

Online Backup Client User Manual

USB bootable Ubuntu Kickstart Howto

1. Product Information

Online Backup Client User Manual Linux

Local Caching Servers (LCS): User Manual

Computing Service G72. File Transfer Using SCP, SFTP or FTP. many leaflets can be found at:

Lab V: File Recovery: Data Layer Revisited

Digital Forensics Tutorials Acquiring an Image with Kali dcfldd

Module 16: Some Other Tools in UNIX

Linear Tape File System (LTFS) Linux and Mac User Guide

AlienVault. Unified Security Management x Offline Update and Software Restoration Procedures

Call Recorder Quick CD Access System

BACKUP YOUR SENSITIVE DATA WITH BACKUP- MANAGER

System Administration. Backups

RecoveryVault Express Client User Manual

In order to upload a VM you need to have a VM image in one of the following formats:

Replacing a Laptop Hard Disk On Linux. Khalid Baheyeldin KWLUG, September 2015

INSTALL ZENTYAL SERVER

File Transfer Protocol. What is Anonymous FTP? What is FTP?

Managed Backup Service - Agent for Linux Release Notes

Online Backup Linux Client User Manual

Online Backup Client User Manual

GNU/LINUX Forensic Case Study (ubuntu 10.04)

TCB No September Technical Bulletin. GS FLX+ System & GS FLX System. Installation of 454 Sequencing System Software v2.

How to partition your disk with the parted magic linux livecd

HP-UX Essentials and Shell Programming Course Summary

Fuse ESB Enterprise Installation Guide

HARFORD COMMUNITY COLLEGE 401 Thomas Run Road Bel Air, MD Course Outline CIS INTRODUCTION TO UNIX

Drobo How-To Guide. What You Will Need. Use a Drobo iscsi Array with a Linux Server

Installing Proview on an Windows XP machine

Image for Linux User Manual

9/3/2013. Storage Media. Storage Media

Adafruit's Raspberry Pi Lesson 1. Preparing an SD Card for your Raspberry Pi

Storage Management. in a Hybrid SSD/HDD File system

Recovering Data from Windows Systems by Using Linux

Tutorial 0A Programming on the command line

How To Store Data On A Computer (For A Computer)

SWsoft Plesk 8.2 for Linux/Unix Backup and Restore Utilities. Administrator's Guide

Recovering Data from Windows Systems by Using Linux

Epics_On_RPi Documentation

User s Manual Portable Hard Drives StoreJet 25

Understanding the Boot Process and Command Line Chapter #3

Digital Photo Bank / Portable HDD Pan Ocean E350 User Manual

Transcription:

Introduction to Linux for Bioinformatics Managing data Joachim Jacob 25 and 29 april 2016 This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Organise your system A Linux system is organised into drives into partitions into folders into files One of the most basic tasks in bioinformatics is handling and storing of big data. 2 of 47

Bioinformatics data Historically, bioinformatics has always used text files to store data. PDB file excerpt Genbank record 3 of 47 HMM profile

NGS data The NGS machines produce a lot of data, stored in plain text files. These files are multiple gigabytes in size. 4 of 47

Tips for managing (NGS) data 1. When you move the data, do it in its smallest form. Compress the data. 2. When you decompress the data, leave it where it is. Symbolic links point to the data in different folders. 3. Provide enough storage for your data. choose your file system type wisely 5 of 47

Compression: tools in Linux But some more exist (e.g. freearc, dtrx). gzip and bzip2 are the most used and fairly performant. 6 of 47 http://www.linuxlinks.com/article/20110220091109939/compressiontools.html

Tips Widely used compression tools: GNU zip (gzip) Block Sorting compression (bzip2) Typically, compression tools work on one file. How to compress complete directories and their contents? 7 of 47

Tar without compression Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tar ball. Syntax to create a tarball: $ tar -cf archive.tar file1 file2 Syntax to extract: $ tar -xvf /path/to/archive.tar 8 of 47

Compression: a typical case Archiving and compression mostly occur together. The most used formats are tar.gz or tar.bz. These files are the result of two processes. 1. Archiving (tar) 2. Compressing (gzip or bzip2) 9 of 47

Compression: on your desktop 10 of 47

Compression: on your desktop 11 of 47

Compression: on the command line Tar is the tool for creating.tar archives, but it can compress in one go, with the z or j option. Creating a compressed tar archive: $ tar cvfz mytararchive.tar.gz $ tar cvfj mytararchive.tar.bz create docs/ docs/ Compression technique Decompressing a compressed tar archive $ tar xvfz mytararchive.tar.gz $ tar xvfj mytararchive.tar.bz extract files verbose 12 of 47

De-/compression To compress one or more files: $ gzip [options] file $ bzip2 [options] file To decompress one or more files: $ gunzip [options] file(s) $ bunzip2 [options] file(s) Every file will be compressed, and tar.gz or tar.bz appended to it. 13 of 47

Tips 1. Do you have to uncompress a big text file to read it? No! Some tools allow to read compressed files (instead of first unpacking then reading). Time saver! $ zcat file(s) $ bzcat file(s) 2. Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder. If compression is important to you: benchmark! 14 of 47

Exercise a little compression exercise. 15 of 47

Symlinks Something very convenient! A symbolic link (or symlink) is a file which points to the location of another file. You can do anything with the symlink that you can do on the original file. But when you move the original file from its location, the symlink is 'dead'. Downloads/ ~ Annotation/ Rice/ Projects/ Sequences/ alignment.sam Butterfly/ 16 of 47

Symlinks To create a symlink, move to the folder in where the symlink must be created, and execute ln. ~/Projects $ cd Butterfly ~/Butterfly $ ln -s../rice/sequences/alignment.sam Link_to_alignment.sam Downloads/ ~ Annotation/ Rice/ Projects/ Sequences/ alignment.sam Butterfly/ 17 of 47

Symlinks The symlink is created. You can check with ls. To delete a symlink, use unlink. ~/Projects $ cd Butterfly ~/Butterfly $ ln -s../rice/sequences/alignment.sam Link_to_alignment.sam ~/Butterfly $ ls -lh Link_to_alignment.sam lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 Link_to_alignment.sam ->../Sequences/alignment.sam Downloads/ ~ Annotation/ Rice/ Projects/ Sequences/ alignment.sam Butterfly/Link_to_alignment.sam 18 of 47

Exercise a little symlink exercise 19 of 47

Disks and storage If you dive into bioinformatics, you will have to manage disks and storage. Two types of disks - solid state disks Low capacity, high speed, random writes - spinning hard disks High capacity, 'normal' speed, sequential writes. 20 of 47 http://en.wikipedia.org/wiki/solid-state_drive http://en.wikipedia.org/wiki/hard_disk

A disk is a device Via the terminal, show the disks using $ sudo fdisk -l [sudo] password for joachim: Disk /dev/sda: 13.4 GB, 13408141312 bytes... Disk /dev/sdb: 3997 MB, 3997171712 bytes... 21 of 47

A disk is divided into partitions A disk can be divided in parts, called partitions. An internal disk which runs an operating system is usually divided in partitions, one for each functions. An external disk is usually not divided in partitions. (but it can be partioned). 22 of 47

Check out the disk utility tool 23 of 47

The system disk Name of the disk 24 of 47

The system disk Name currently highlighted partition 25 of 47

The system disk Place in the directory structure where the partition can be accessed 26 of 47

An example of an USB disk - Place in the directory structure where the partition can be accessed 27 of 47

An example of an USB disk The USB disk is 'mounted' automatically on the directory tree under /media. 28 of 47

An example of an USB disk - This is the type of file system on the partition (i.e. the way data is stored on this partition) The partition is said to be formatted in FAT32 (in this case). 29 of 47

File system formats By default, many USB flash disks are formatted in FAT32. Other types are NTFS, ext4, ZFS. FAT32 max 4GB files NTFS maximum portability (also for use under windows) Ext4 default file system in Linux, Btrfs the next default file system in Linux in the near future. 30 of 47 http://en.wikipedia.org/wiki/file_system#file_systems_and_operating_systems

Example: formatting a USB disk First unmount the device (red box 1 below). Now the USB disk can not be accessed anymore. Next, choose format the device (red box 2 below). 1 2 31 of 47

Format disks with disk utility Choose the type of file system you want to be on that device. 32 of 47

Format disks with disk utility 33 of 47

Format disks with disk utility The program disk-utility put a lot of commands at work behind the scenes. Some of the command line tools used: - mount - umount - fdisk - mkfs You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course). 34 of 47

Checking storage space By default 'disk usage analyzer'. 35 of 47

Checking storage space Bonus: K4DirStat. Not installed by default. 36 of 47

Checking storage space Bonus: K4DirStat. Not installed by default. 37 of 47

K4Dirstat is a KDE package Rehearsal: what is KDE? Bonus: what happens when you install this package on our system? 38 of 47

Space left on disks with df To check the storage that is used on the different disks: df -h ~/ $ df -h Filesystem /dev/sda1 udev tmpfs none none /dev/sdb1 Size 12G 490M 200M 5.0M 498M 3.8G Used Avail Use% Mounted on 5.3G 5.7G 49% / 4.0K 490M 1% /dev 920K 199M 1% /run 0 5.0M 0% /run/lock 76K 498M 1% /run/shm 20M 3.7G 1% /media/test ~/ $ df -h. 39 of 47

The size of directories To check the size of files or directories: du ~/ $ du -sh * 520K bin 281M Compression_exercise 4.0K Desktop 4.0K Documents 5.0M Downloads 4.0K Music 4.0K Pictures 4.0K Public 373M Rice Example 4.0K Templates 4.0K test 17Mtest.img 114M ugene-1.12.2 4.0K Videos * means 'anything'. This is called 'globbing': * is a wild card symbol. 40 of 47

Wildcards on the command line Wildcards are used to describe the names of files/dirs. * On that position, any length of string is allowed e.g. s* matches: san, sdd, sanitisation, sam.alignment,...? On that position, any character is allowed. e.g. saniti?ation matches: sanitisation, sanitiration,... [] On that position, the character may be one of the characters between [ ], 41 of 47 e.g. saniti[sz]ation matches: sanitisation and sanitization

Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ du -sh Do* 42 of 47

Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ du -sh Do* 4.0K Documents 20GDownloads 43 of 47

Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ ls *.fastq 44 of 47

Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ ls *.fastq ERR148552_1.fastq testout.fastq ERR148552_1_prinseq_good_zzwI.fastq ERR148552_2.fastq test.fastq 45 of 47

Keywords Compression Archive Symbolic link mounting File system format partition Recursively df du unlink Write in your own words what the terms mean 46 of 47

Break 47 of 47