How to represent characters?




Copyright Software Carpentry 2010. This work is licensed under the Creative Commons Attribution License. See http://software-carpentry.org/license.html for more information.

How to represent characters?

American English in the 1960s:
26 characters {upper, lower}
+ 10 digits
+ punctuation
+ special characters for controlling teletypes (new line, carriage return, form feed, bell, ...)
= 7 bits per character (ASCII standard)
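As a small illustration of the 7-bit claim, the sketch below uses Python's built-in ord() and chr() to show the integer code behind a few ASCII characters, including some of the teletype control characters. The particular sample characters are chosen here only for illustration.

```python
# Every ASCII character maps to an integer in the range 0..127,
# which fits in 7 bits.
samples = ['A', 'a', '0', '!', '\n', '\r', '\f', '\a']

for ch in samples:
    code = ord(ch)                # character -> integer code
    bits = format(code, '07b')    # the same code written as 7 binary digits
    print(repr(ch), code, bits)

# chr() goes the other way: integer code -> character.
codes = [ord(ch) for ch in samples]
round_trip = [chr(code) for code in codes]
```

Here '\a' is the bell character and '\f' is form feed.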

How to represent text?

1. Fixed-width records

A crash reduces
your expensive computer
to a simple stone.

[Figure: the same text laid out one character per cell]

Easy to get to line N
But may waste space
What if lines are longer than the record length?
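A minimal sketch of fixed-width records, assuming the example text is split into three lines and an arbitrary record length of 24 characters. The names RECORD_LENGTH, pack_fixed and get_line are made up for this example. It shows all three points: jumping to line N is a single multiplication, padding wastes space, and an over-long line has to be rejected (or truncated).

```python
RECORD_LENGTH = 24  # assumed width; must be at least as long as the longest line

lines = ['A crash reduces', 'your expensive computer', 'to a simple stone.']

def pack_fixed(lines, width):
    """Pad every line with spaces to exactly `width` characters and join them."""
    for line in lines:
        if len(line) > width:
            raise ValueError('line longer than record length: %r' % line)
    return ''.join(line.ljust(width) for line in lines)

def get_line(packed, n, width):
    """Return line n (counting from 0) by jumping straight to its record."""
    start = n * width
    return packed[start:start + width].rstrip(' ')

packed = pack_fixed(lines, RECORD_LENGTH)

# Padding characters that carry no information.
wasted = len(packed) - sum(len(line) for line in lines)
```

One limitation of this sketch is that stripping the padding also strips any trailing spaces that were part of the original line.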

How to represent text?

1. Fixed-width records
2. Stream with embedded end-of-line markers

A crash reduces
your expensive computer
to a simple stone.

[Figure: the same text laid out one character per cell]

More flexible
Wastes less space
Skipping ahead is harder
What to use for end of line?
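For comparison, here is a sketch of the stream representation, assuming '\n' as the end-of-line marker and the same three-line split of the example text. The helper name get_line is hypothetical. Reaching line N now means scanning past N markers rather than computing an offset.

```python
lines = ['A crash reduces', 'your expensive computer', 'to a simple stone.']

# One marker per line, no padding.
stream = '\n'.join(lines) + '\n'

def get_line(stream, n):
    """Return line n (counting from 0) by scanning past n end-of-line markers."""
    start = 0
    for _ in range(n):
        end = stream.find('\n', start)
        if end == -1:
            raise IndexError('stream has fewer than %d lines' % (n + 1))
        start = end + 1
    if start >= len(stream):
        raise IndexError('stream has fewer than %d lines' % (n + 1))
    end = stream.find('\n', start)
    if end == -1:
        end = len(stream)
    return stream[start:end]
```

The stream costs one extra character per line, while the cost of skipping ahead grows with the amount of text that has to be scanned.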

Unix: newline ('\n')
Windows: carriage return + newline ('\r\n')
Oh dear...
Python converts '\r\n' to '\n' and back on Windows
To prevent this (e.g., when reading image files) open the file in binary mode
reader = open('mydata.dat', 'rb')
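The difference between the two modes can be seen with a throwaway file. This sketch assumes Python 3, where text mode translates '\r\n' to '\n' on reading on every platform, not only on Windows. The file contents are invented for the example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

# Bytes as a Windows program would have written them.
raw = b'first line\r\nsecond line\r\n'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'mydata.dat')

    with open(path, 'wb') as writer:
        writer.write(raw)

    # Text mode: line endings are translated to '\n'.
    with open(path, 'r', encoding='ascii') as reader:
        as_text = reader.read()

    # Binary mode: the bytes come back untouched.
    with open(path, 'rb') as reader:
        as_bytes = reader.read()
```

For image files and other non-text data, the untranslated bytes are the ones that matter, which is why binary mode is the safe choice there.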

Back to characters

How to represent ĕ, β, Я, ...?
7 bits = 0...127
8 bits (a byte) = 0...255
Different companies/countries defined different meanings for 128...255
Did not play nicely together
And East Asian "characters" won't fit in 8 bits
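To make the "did not play nicely together" point concrete, the sketch below decodes one byte value above 127 under two of the many 8-bit encodings that existed: Latin-1 and Windows code page 1251. The choice of these two is only an example.

```python
# One byte with the value 223: outside the 7-bit ASCII range.
byte = b'\xdf'

as_western = byte.decode('latin-1')    # Western European reading
as_cyrillic = byte.decode('cp1251')    # Cyrillic reading

# Plain ASCII has no meaning at all for this value.
try:
    byte.decode('ascii')
    fits_in_ascii = True
except UnicodeDecodeError:
    fits_in_ascii = False

print(as_western, as_cyrillic, fits_in_ascii)
```

The same byte is 'ß' to one program and 'Я' to another, so a file is only readable if both sides agree on the encoding in advance.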

1990s: Unicode standard

Defines mapping from characters to integers
Does not specify how to store those integers
32 bits per character will do it...
...but wastes a lot of space in common cases
Use in memory (for speed)
Use something else on disk and over the wire
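A short sketch of both ideas, assuming Python 3: ord() exposes the integer that Unicode assigns to a character, and encoding the same text as UTF-32 versus UTF-8 shows the space cost of a fixed 32 bits per character. The sample strings are arbitrary.

```python
# Unicode maps each character to an integer (a "code point").
text = '\u0115\u03b2\u042f'                # the characters ĕ, β, Я
code_points = [ord(ch) for ch in text]

# Storing every character in 32 bits is simple but bulky for plain English.
ascii_text = 'simple stone'
fixed = ascii_text.encode('utf-32-le')     # 4 bytes per character
compact = ascii_text.encode('utf-8')       # 1 byte per ASCII character

print(code_points, len(fixed), len(compact))
```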

(Almost) everyone uses a variable-length encoding called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each
Next 1920 stored in 2 bytes, etc.

0xxxxxxx                               7 bits
110yyyyy 10xxxxxx                     11 bits
1110zzzz 10yyyyyy 10xxxxxx            16 bits
11110www 10zzzzzz 10yyyyyy 10xxxxxx   21 bits

The good news is, you don't need to know
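You don't need to know, but for the curious, the bit patterns above can be turned into a few lines of code. This is a teaching sketch, not a replacement for the built-in encoder: the function name utf8_encode is made up, and it does not reject surrogate code points the way Python's own encoder does.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point using the four UTF-8 bit patterns."""
    if code_point < 0:
        raise ValueError('code point must not be negative')
    if code_point < 0x80:            # 0xxxxxxx (7 bits)
        return bytes([code_point])
    if code_point < 0x800:           # 110yyyyy 10xxxxxx (11 bits)
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110zzzz 10yyyyyy 10xxxxxx (16 bits)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:        # 11110www 10zzzzzz 10yyyyyy 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError('code point out of Unicode range')

# Compare against Python's built-in encoder for a few characters.
for ch in ['A', '\u03b2', '\u042f', '\u20ac', '\U0001F600']:
    print(hex(ord(ch)), utf8_encode(ord(ch)) == ch.encode('utf-8'))
```

In everyday code, calling the string's own encode('utf-8') method is the sensible choice.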

Python 2.* provides two kinds of string

Classic: one byte per character
Unicode: "big enough" per character
Write u'the string' for Unicode
Must specify encoding when converting from Unicode to bytes
Use UTF-8
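A brief sketch of converting between the two kinds of string. It is written to run under Python 3, where the same split exists as str (Unicode) and bytes, and where the u'...' prefix is still accepted. The sample text is arbitrary.

```python
# A Unicode string: three characters.
text = u'\u0115\u03b2\u042f'      # ĕ, β, Я

# Converting to bytes requires naming an encoding; UTF-8 is the usual choice.
data = text.encode('utf-8')

# Converting back requires naming the same encoding.
restored = data.decode('utf-8')

# Decoding with the wrong encoding "works" but produces garbage.
garbled = data.decode('latin-1')

print(len(text), len(data), restored == text, garbled == text)
```

Each of these three characters takes two bytes in UTF-8, so the byte string is twice as long as the character string.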

Created by Greg Wilson, October 2010.