IMDB Data Set Topics: Parsing Input using Scanner class. Atul Prakash




IMDB Data Set

Consists of several files:
  movies.list: contains <movies, year>
  actors.list: contains <actor, list of movies>
  actresses.list: contains <actress, list of movies>
  aka-titles.list: contains <title, aliases of the title>
  A bunch of other files, including other crew, fan ratings, awards, running times, etc.

Example Queries

  List the movies released in a given year
  Search for movies, given keywords
  Find the actors in a given movie (requires actors.list)
  Find actresses who have acted with a particular actor, e.g., Woody Allen (requires actresses.list and actors.list)

Some References

  The IMDB dataset itself, in plain-text downloadable files (download just movies.list.gz for now, and uncompress it):
  http://www.imdb.com/interfaces

  An example report of some analysis of the IMDB dataset (glance through it):
  http://had.co.nz/data/movies/description.pdf

  A research paper on correlating the Netflix dataset with IMDB to break the anonymity of reviewers in Netflix (optional read; shows an example of privacy risks):
  http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

Sample Contents (movies.list)

Format: <unique title, year>
Titles are made unique by including the year of release.
Also, articles are written after a comma. E.g., "The Godfather" is written as "Godfather, The".
Annotations after the title:
  (V): direct-to-video release
  (TV): made for TV
  (VG): video game
More details on title formatting: http://www.imdb.com/help/show_leaf?titleformat

    'Doctor Who': The Tom Baker Years (1991) (V) 1991
    'Doctor Who': The Troughton Years (1991) (V) 1991
    'Doctor Who': Then and Now (1987) (TV) 1987
    'Doctor Who': Thirty Years in the Tardis (1993) (TV) 1993
    'Doctor Zhivago': The Making of a Russian Epic (1995) (TV) 1995
    'Dog Day Afternoon': After the Filming (2006) (V) 2006
    'Dog Day Afternoon': Casting the Controversy (2006) (V) 2006
    'Dog Day Afternoon': Recreating the Facts (2006) (V) 2006
    'Dog Day Afternoon': The Story (2006) (V) 2006
    'Don't Talk' (1942) 1942
    'Donnie Darko': Production Diary (2004) (V) 2004
    'Dr. Who': Daleks - The Early Years (1993) (V) 1993
    'Duel': A Conversation with Director Steven Spielberg (2004) (V) 2004
    'Dune': Models and Miniatures (2006) (V) 2006
    'Dune': Special Effects (2006) (V) 2006
    'Dune': Wardrobe Design (2006) (V) 2006
    'E' (1981) 1981
    Dune (1973) 1973
    Dune (1984) 1984
    Dune (2010) 2010
    Dune 2000 (1998) (VG) 1998
    Dune 7 (2002) 2002
    Dune Buddies (1978) 1978
    Dune Bug (1969) 1969
    Dune II: The Building of a Dynasty (1992) (VG) 1992
    Dune Surfer (1988) 1988
    Dune Warriors (1990) 1990
    Dunechka (2004) 2004
    Dunera Boys, The (1985) (TV) 1985
    Dunes of Destiny (2005) 2005
    Dung che sai duk (1994) 1994
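The first example query (movies released in a given year) can be sketched against this format. This is only an illustration under stated assumptions: it assumes the year is the last whitespace-separated field on each line, it scans a small embedded sample instead of the real movies.list, and the class and method names (MoviesByYear, titlesInYear) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class MoviesByYear {
    // Returns the title part of every line whose last field equals the given year.
    // Assumption: the year is the last whitespace-separated field of each line.
    static List<String> titlesInYear(String listing, String year) {
        List<String> titles = new ArrayList<>();
        Scanner scanner = new Scanner(listing);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            // Position of the last space or tab, whichever comes later.
            int cut = Math.max(line.lastIndexOf(' '), line.lastIndexOf('\t'));
            if (cut < 0) {
                continue; // no separate year field on this line
            }
            String lastField = line.substring(cut + 1);
            if (lastField.equals(year)) {
                titles.add(line.substring(0, cut).trim());
            }
        }
        scanner.close();
        return titles;
    }

    public static void main(String[] args) {
        String sample = "Dune (1984)\t1984\n"
                + "Dune 2000 (1998) (VG)\t1998\n"
                + "Dunera Boys, The (1985) (TV)\t1985\n"
                + "Dune Warriors (1990)\t1990\n";
        System.out.println(titlesInYear(sample, "1984")); // prints [Dune (1984)]
    }
}
```

The year is compared as a string rather than parsed as an int, so a line whose last field is not a number is simply not matched instead of causing an error.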

Reading in Data

Several classes in Java help read and parse input:
  String.split methods
  Scanner class
  StringTokenizer class
  Pattern and Matcher classes

Reference with examples: http://www.javapractices.com/topic/TopicAction.do?Id=87
Or do a Google search for: Java practices Parse text input

Example

Sample input file:

    height = 167cm
    mass = 65kg
    disposition = "grumpy"
    this is the name = this is the value

Format: key = value
Strategy: read the file line by line. Use "=" as a separator to extract the key and the value. Store the pairs in a list.
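A minimal sketch of this strategy follows. It reads the sample from a string rather than a file, splits each line at the first "=", and stores each pair as a two-element array. The class name KeyValueParser and the choice to skip lines without "=" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class KeyValueParser {
    // Splits each "key = value" line at the first '=' and trims both halves.
    static List<String[]> parse(String input) {
        List<String[]> pairs = new ArrayList<>();
        Scanner scanner = new Scanner(input);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // assumption: lines without a separator are skipped
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            pairs.add(new String[] { key, value });
        }
        scanner.close();
        return pairs;
    }

    public static void main(String[] args) {
        String input = "height = 167cm\n"
                + "mass = 65kg\n"
                + "disposition = \"grumpy\"\n"
                + "this is the name = this is the value\n";
        for (String[] pair : parse(input)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Splitting at the first "=" only means a value may itself contain an "=" character without being cut short.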

Scanner Class

The Scanner class is very powerful for string parsing and may be all you need in many cases.
For the examples illustrated here, see: http://www.javabeat.net/tips/24-parsing-input-using-scanner.html

    import java.util.Scanner; // import the Scanner class from the java.util package

    public class ScannerDemo {
        public static void example1() {
            Scanner scanner = new Scanner("EECS 282 is a great class");
            while (scanner.hasNext()) {
                System.out.println(scanner.next());
            }
        }

        public static void main(String[] args) {
            example1();
        }
    }

Output:

    EECS
    282
    is
    a
    great
    class

Field-separated values

    public static void stringScanningWithSeparator() {
        Scanner scanner = new Scanner("George Washington, President, born 1732");
        scanner.useDelimiter(",");
        while (scanner.hasNext()) {
            System.out.println(scanner.next()); // will print 3 items
        }
    }

Output:

    George Washington
     President
     born 1732

Here, a comma is used as the separator, so the space after each comma stays at the start of the next item. useDelimiter treats its argument as a regular expression; simple strings such as "\t" (tab) work as is. Default separator: whitespace.
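With a plain "," delimiter, the text after each comma keeps its leading space. Because useDelimiter accepts a regular expression, one way to drop that whitespace is to make it part of the delimiter. The sketch below compares the two; the class name DelimiterDemo and the helper fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Collects all tokens of the text, using the given regular expression as delimiter.
    static List<String> fields(String text, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        Scanner scanner = new Scanner(text);
        scanner.useDelimiter(delimiterRegex);
        while (scanner.hasNext()) {
            out.add(scanner.next());
        }
        scanner.close();
        return out;
    }

    public static void main(String[] args) {
        String text = "George Washington, President, born 1732";
        // Brackets make any leading whitespace visible.
        for (String f : fields(text, ",")) {
            System.out.println("[" + f + "]");
        }
        // A comma together with any surrounding whitespace acts as the delimiter.
        for (String f : fields(text, "\\s*,\\s*")) {
            System.out.println("[" + f + "]");
        }
    }
}
```

The first loop prints the second and third items with a leading space inside the brackets; the second loop prints them without it.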

What does the following print?

    public static void stringScanningWithSeparator() {
        Scanner scanner = new Scanner("George Washington, President, born 1732");
        // scanner.useDelimiter(",");
        while (scanner.hasNext()) {
            System.out.println(scanner.next());
        }
    }

Are spaces printed? How are commas treated?

Reading Typed Values

    public static void scanInts() {
        Scanner scanner = new Scanner("1 3");
        int first = scanner.nextInt();
        int second = scanner.nextInt();
        System.out.println("first = " + first + "; second = " + second);
    }

Output:

    first = 1; second = 3

Similar methods: nextBoolean, nextByte, nextDouble, etc. To read a String, just use next().

Checking Input Type Before Reading

Use the hasNextType method, where Type is the type being read, e.g., hasNextInt, hasNextBoolean, hasNext, hasNextLine.

    public static void scanTypes() {
        Scanner scanner = new Scanner("1 3 5 7 9 11 abc 13 def 15");
        while (scanner.hasNextInt()) {
            System.out.println(scanner.nextInt());
        }
    }

hasNextInt() returns false on "abc", so the loop stops there and the later integers are never read.

Output:

    1
    3
    5
    7
    9
    11
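If the goal is to read every integer rather than stop at the first non-integer token, one possible variation is to loop on hasNext() and discard tokens that are not ints. This is a sketch, not part of the original slides; the class name SkipNonInts is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class SkipNonInts {
    // Returns every integer token in the text, skipping tokens that are not ints.
    static List<Integer> allInts(String text) {
        List<Integer> ints = new ArrayList<>();
        Scanner scanner = new Scanner(text);
        while (scanner.hasNext()) {
            if (scanner.hasNextInt()) {
                ints.add(scanner.nextInt());
            } else {
                scanner.next(); // consume and discard the non-integer token
            }
        }
        scanner.close();
        return ints;
    }

    public static void main(String[] args) {
        // prints [1, 3, 5, 7, 9, 11, 13, 15]
        System.out.println(allInts("1 3 5 7 9 11 abc 13 def 15"));
    }
}
```

The call to next() in the else branch matters: without it the scanner never moves past "abc" and the loop would not terminate.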

Reading Files

    Scanner s = new Scanner(fileObject);

Creates a scanner on a file object. Examples of what can be passed:
  System.in (an InputStream): standard input; by default, terminal input
  A File object: File in = new File(filepath);

Example

    public static void scanLinesFromFile() {
        File in = new File("/Users/aprakash/Documents/282/workspace/282/input.txt");
        Scanner scanner = new Scanner(in);
        int linenum = 1;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            System.out.println(linenum + ": " + line);
            linenum++;
        }
        scanner.close(); // important to close scanners on files.
    }

File Paths on Windows

    // parse a file containing lines with part1,part2
    // For Windows, you would do something like
    File in = new File("C:\\282\\input.txt");

Two backslashes are needed because "\" is also used in other ways in string literals, e.g., \n, \t. It must be "escaped" with another backslash.

    // or you can use forward slashes; Java accepts / as a separator on Windows:
    File in = new File("C:/282/input.txt");

    // You can also use relative path names, e.g., "282/input.txt".
    // In Eclipse, go to Run->Run Configurations...->Arguments to
    // set the current (working) directory. Relative paths are resolved
    // with respect to the current directory.

Closing files

scanner.close() is important. It closes the scanner object as well as the file bound to it. Otherwise, the file remains open and consumes an operating system resource. Always close files when you are done reading them.
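Since Java 7, a try-with-resources statement can close the scanner automatically, even if reading fails partway through. This is an alternative to calling close() by hand, not the approach used in the slides. The sketch below writes a temporary file first so that it is self-contained; the class name AutoCloseDemo and the method countLines are hypothetical, and the throws clause is needed for the reason covered in the slides on exceptions.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

class AutoCloseDemo {
    // Counts the lines in a file; the scanner is closed automatically.
    static int countLines(File file) throws FileNotFoundException {
        int count = 0;
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                count++;
            }
        } // scanner.close() is called here, whether or not an exception occurred
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary input file so the example runs anywhere.
        File temp = File.createTempFile("scanner-demo", ".txt");
        temp.deleteOnExit();
        try (PrintWriter out = new PrintWriter(temp)) {
            out.println("first line");
            out.println("second line");
        }
        System.out.println(countLines(temp)); // prints 2
    }
}
```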

Exceptions in I/O

Unfortunately, Eclipse reports an error in the above code about an unhandled exception: FileNotFoundException

Reason

    File in = new File("/Users/aprakash/Documents/282/workspace/282/input.txt");
    Scanner scanner = new Scanner(in);

What if the file does not exist? FileNotFoundException is a checked exception: unlike run-time errors (e.g., divide by 0), Java requires you to either catch I/O exceptions or declare them with throws.

Declaring the thrown exception

    public static void scanLinesFromFile() throws FileNotFoundException {
        // parse a file containing lines with part1,part2
        // For Windows, you would do something like
        // File in = new File("C:\\282\\input.txt");
        // or you can use forward slashes on Windows; Java accepts them:
        // File in = new File("C:/282/input.txt");
        // You can also use relative path names, e.g., "282/input.txt".
        // (In Eclipse, go to Run->Run Configurations...->Arguments to set the current directory.)
        File in = new File("/Users/aprakash/Documents/282/workspace/282/input.txt");
        Scanner scanner = new Scanner(in);
        int linenum = 1;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            System.out.println(linenum + ": " + line);
            linenum++;
        }
        scanner.close(); // important to close scanners on files.
    }

Alternative -- Catch Exceptions

    public static void scanLinesFromFile2() {
        File in = new File("/Users/aprakash/Documents/282/workspace/282/input.txt");
        Scanner scanner;
        try {
            scanner = new Scanner(in);
        } catch (FileNotFoundException e) {
            System.out.println("File not found");
            return;
        }
        int linenum = 1;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            System.out.println(linenum + ": " + line);
            linenum++;
        }
        scanner.close(); // important to close scanners on files.
    }

Summary

The Scanner class is useful for reading files and parsing them.
Other useful methods for parsing are the String class functions, e.g., the trim(), lastIndexOf(String), substring(int start, int end), and split() methods.
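As a small illustration of these String methods, the sketch below undoes the movies.list convention of writing a leading article after a comma ("Godfather, The" becomes "The Godfather") using trim(), lastIndexOf(String), and substring(). It is an assumption-laden example: the class name TitleFormat is hypothetical, only the English articles "The", "A", and "An" are handled, and the input is a bare title without the year or annotations.

```java
class TitleFormat {
    // Moves a trailing ", The" / ", A" / ", An" back to the front of the title.
    static String restoreArticle(String title) {
        String t = title.trim();
        int comma = t.lastIndexOf(", ");
        if (comma < 0) {
            return t; // no comma, nothing to move
        }
        String tail = t.substring(comma + 2);
        // Assumption: only these three English articles are recognized.
        if (tail.equals("The") || tail.equals("A") || tail.equals("An")) {
            return tail + " " + t.substring(0, comma);
        }
        return t; // the comma belongs to the title itself
    }

    public static void main(String[] args) {
        System.out.println(restoreArticle("  Godfather, The "));  // prints The Godfather
        System.out.println(restoreArticle("Dunera Boys, The"));   // prints The Dunera Boys
        System.out.println(restoreArticle("Dune Warriors"));      // prints Dune Warriors
    }
}
```

Using lastIndexOf rather than indexOf means a title that contains an earlier comma of its own is still handled, since only the final ", Article" suffix is examined.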