Uncovering Plagiarism, Authorship, and Social Software Misuse

Similar documents
Plagiarism Detection Using Stopword n-grams

Improved implementation for finding text similarities in large collections of data

YouTube Comment Classifier. mandaku2, rethina2, vraghvn2

Blogs and Twitter Feeds: A Stylometric Environmental Impact Study

Report on the Evaluation-as-a-Service (EaaS) Expert Workshop

Instructor: See information provided in the Syllabus link in the classroom

Identifying Online Sexual Predators by SVM Classification with Lexical and Behavioral Features 1

WHITE PAPER Plagiarism and the Web: Myths and Realities

STOP. THINK. CONNECT. Online Safety Quiz

Determining if Two Documents are by the Same Author

RICH TOWNSHIP HIGH SCHOOL Adopted: 7/10/00 DISTRICT 227 Olympia Fields, Illinois

Summer Term Prof. Dr. Roland Kirstein Economics of Business and Law Faculty of Economics and Management

Project 2: Web Security Pitfalls

Plagiarism and Anti-Plagiarism Software

Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters


Technology Acceptable Use Policy. Introduction

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

Helpdesk Ticketing User Guide

COMPSCI 760 S2 C 2014 Machine Learning and Data Mining Computer Science Department

Understand the purpose of a writing sample 1. Understand the writing sample requirements for this job. Provide exactly what the posting requests.

TS3: an Improved Version of the Bilingual Concordancer TransSearch

DEPARTMENTAL POLICIES

PSYC 1000A General Psychology , 1 st Term Department of Psychology The Chinese University of Hong Kong

CRIM 200: Introduction to Criminal Justice

Real-Time Identification of MWE Candidates in Databases from the BNC and the Web

Search and Information Retrieval

Online Social Culture: Does it Foster Original Work or Encourage Plagiarism?

DGT-TM: A freely Available Translation Memory in 22 Languages

Running head: SAMPLE APA 1

International Baccalaureate Diploma and DP Course Academic Honesty Policy

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

The effectiveness of plagiarism detection software as a learning tool in academic writing education

Getting Started with SandStorm NoSQL Benchmark

THE CHINESE UNIVERSITY OF HONG KONG DEPARTMENT OF TRANSLATION COURSE OUTLINE

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

User (Student) Instruction Manual Local Document Archive Maintenance System (SOLAD)

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION

Running head: WRITING RESEARCH PAPERS 1. A Guide for Writing APA Style Research Papers. Susan B. Smith. Capital Community College

MGMT S-5033 Course Syllabus Supply Chain Management Online. Spring Harvard University Cambridge, MA. Zal Phiroz, MBA Instructor

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Personality Psychology (PSYC 330) Summer 2015

entigral whitepaper 10 Success Factors for RFID Asset Tracking Deployments

CHAT: A System for Stylistic Classification of Hebrew-Aramaic Texts

HIT 240 HEALTHCARE QUALITY PERFORMANCE IMPROVEMENT AND MANAGEMENT SYLLABUS

Minnesota Virtual Academy Online Syllabus for Eng403 British and World Literature Course Instructor and Communications Name: Mrs.

A Performance Analysis of Distributed Indexing using Terrier

Terminology Extraction from Log Files

STUDENT HANDBOOK. Policies and Procedures. 1 of 11

Students will know and be able to: 1.1. Basic Operations

Ventura Charter School of Arts & Global Education Board Policy for Acceptable Use and Internet Safety

National Initiative for Cybersecurity Careers and Studies (NICCS) Cybersecurity Training and Education Catalog Training Provider Instruction Guide

INTRODUCTION TO CRIMINAL JUSTICE 101- Hybrid

Desire2Learn Learning Environment

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

HIT 284 Coding Exam Prep Course Syllabus

How To Cluster Of Complex Systems

Collecting Polish German Parallel Corpora in the Internet

Identifying free text plagiarism based on semantic similarity

Guide to Knowledge Elicitation Interviews

Comprehensive i-safe Curriculum International Scope of Lessons and Language Availability

Kayako 4.0 Helpdesk Upgrade Questionnaire Form 1: Users.

Report on the embedding and evaluation of the second MT pilot

Realist A Quick Start Guide for New Users

Course Materials Required Text:

Minnesota Virtual Academy Online Syllabus for Web Design

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting

Security 8.0 User Guide

The Slingshot Project

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

Course Delivery Educational Program Development Onsite Blackboard Training

A Statistical Text Mining Method for Patent Analysis

Security Testing with Selenium

Accelerating Change & Transformation (ACT) TM Training

MICROSOFT OUTLOOK TIPS AND TRICKS

Machine Learning Log File Analysis

Haddon Township School District Acceptable Use of Information and Communication Technology for Students

B2B opportunity predictiona Big Data and Advanced. Analytics Approach. Insert

Transcription:

Uncovering Plagiarism, Authorship, and Social Software Misuse Bauhaus-Universität Weimar Universitat Politècnica de València University of Lugano Duquesne University Universitat Politècnica de Catalunya University of the Aegean Bar-Ilan University Illinois Institute of Technology Martin Potthast, Tim Gollub, Maik Anderka, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, and Benno Stein Parth Gupta and Paolo Rosso Giacomo Inches and Fabio Crestani Patrick Juola Alberto Barrón-Cedeño Efstathios Stamatatos Moshe Koppel Shlomo Argamon

Uncovering Plagiarism, Authorship, and Social Software Misuse Outline Plagiarism Detection Author Identification and Sexual Predator Identification Wikipedia Quality Flaw Prediction Summary

Plagiarism Detection To plagiarize means to reuse someone else s work, pretending it to be one s own. Suspicious document Plagiarism Detection Candidate retrieval Candidate documents Detailed comparison Knowledge-based post-processing Document collection Suspicious passages Contributions: q Manually written plagiarism from the ClueWeb q ChatNoir search engine for candidate retrieval q Software submissions for detailed comparison 3 [^] c pan.webis.de 2012

Plagiarism Detection Candidate retrieval (search for source documents): Total Time to Reported Downloaded Team Workload 1st Detection Sources Sources Queries Dwnlds Queries Dwnlds Precision Recall Precision Recall Jayapal 67 174 9 14 0.66 0.28 0.07 0.43 Suchomel 13 95 6 2 0.52 0.21 0.08 0.35... 3 more... Detailed comparison (alignment of plagiarized passages): Team PlagDet Precision Recall Granularity Avg. Runtime (s) Kong 0.70 0.82 0.68 1.01 5.91 Suchomel 0.68 0.89 0.55 1.00 5.36 Grozea 0.67 0.77 0.64 1.03 4.82... 7 more... 4 [^] c pan.webis.de 2012

Lessons Learned Plagiarism detection: q Software submissions are manageable, provide repeatability. q Task-wise evaluation allows for more tailored evaluation. q Fully automatic plagiarism detection evaluation within reach. 5 [^] c pan.webis.de 2012

Author Identification and Sexual Predator Identification An author s personal traits are encoded in her writing. Task: A 11 A 12 A 1... q Given (part of) a document, who wrote it? A 10 A2 q The task covers 8 variants of this problem (closed vs. open class, author clustering, intrinsic plagiarism detection) A9 A8 A? A4 A3... A7 A5 A 6 6 [^] c pan.webis.de 2012

Author Identification and Sexual Predator Identification An author s personal traits are encoded in her writing. Task: A 11 A 12 A 1... q Given (part of) a document, who wrote it? A 10 A2 q The task covers 8 variants of this problem (closed vs. open class, author clustering, intrinsic plagiarism detection) A9 A8 A? A4 A3... A7 A5 A 6 Evaluation Results: Team Avg. Correct Decisions Team Overall Correct Decisions Grozea 86.37% Ryan 88.38% Akiva 83.40% Akiva 81.74% Ryan 82.41% Grozea 81.33% Tanguy 70.81% Tanguy 77.59% Castillo 62.13% Vartapetiance 75.93%... 20 more... 7 [^] c pan.webis.de 2012

Author Identification and Sexual Predator Identification 8 [^] c pan.webis.de 2012

Author Identification and Sexual Predator Identification Task: q Given a chat log, identify a sexual predator, if there is one. q Given chat logs, identify all lines coming from sexual predators. Corpus: 152k adult chats (8k of which predator/victim chats), 70k other chats. 9 [^] c pan.webis.de 2012

Author Identification and Sexual Predator Identification Task: q Given a chat log, identify a sexual predator, if there is one. q Given chat logs, identify all lines coming from sexual predators. Corpus: 152k adult chats (8k of which predator/victim chats), 70k other chats. Evaluation Results: Predator Identification Predator Line Flagging Team Precision Recall F 0.5 Team Precision Recall F 3 Villatoro-Tello 0.98 0.79 0.94 Grozea 0.09 0.89 0.48 Snider 0.98 0.72 0.92 Kontostathis 0.17 0.50 0.42 Parapar 0.94 0.67 0.87 Peersman 0.36 0.26 0.27 Morris 0.97 0.61 0.87 Sitarz 0.33 0.23 0.24... 12 more... 10 [^] c pan.webis.de 2012

Lessons Learned Plagiarism detection: q Software submissions are manageable, provide repeatability. q Task-wise evaluation allows for more tailored evaluation. q Fully automatic plagiarism detection evaluation within reach. Author identification: q Lack of corpora is still a major obstacle to evaluation. q Performance measures are rudimentary; their weighting is not clear. q Large variety of problem classes adds to the difficulties. 11 [^] c pan.webis.de 2012

[citation needed] The PAN Competition Wikipedia Quality Flaw Prediction This presentation does not cite any references or sources. Please help improve this presentation by adding citations to reliable sources. Unsourced material may be challenged and removed. (September 2012) 12 [^] c pan.webis.de 2012

[citation needed] The PAN Competition Wikipedia Quality Flaw Prediction Task: This presentation does not cite any references or sources. Please help improve this presentation by adding citations to reliable sources. Unsourced material may be challenged and removed. (September 2012) q Given a sample of Wikipedia articles containing a specific quality flaw, decide whether or not a previously unseen article contains the same flaw. Corpus: q 170k Wikipedia articles, each tagged with one of 10 quality flaws. q 50k random untagged articles. 13 [^] c pan.webis.de 2012

[citation needed] The PAN Competition Wikipedia Quality Flaw Prediction Task: This presentation does not cite any references or sources. Please help improve this presentation by adding citations to reliable sources. Unsourced material may be challenged and removed. (September 2012) q Given a sample of Wikipedia articles containing a specific quality flaw, decide whether or not a previously unseen article contains the same flaw. Corpus: q 170k Wikipedia articles, each tagged with one of 10 quality flaws. q 50k random untagged articles. Evaluation Results: Team Precision Recall F 1 Ferretti 0.74 0.92 0.82 Ferschke 0.75 0.85 0.80 Pistol 0.04 0.58 0.08 14 [^] c pan.webis.de 2012

[citation needed] The PAN Competition Wikipedia Quality Flaw Prediction Task: This presentation does not cite any references or sources. Please help improve this presentation by adding citations to reliable sources. Unsourced material may be challenged and removed. (September 2012) q Given a sample of Wikipedia articles containing a specific quality flaw, decide whether or not a previously unseen article contains the same flaw. Corpus: q 170k Wikipedia articles, each tagged with one of 10 quality flaws. q 50k random untagged articles. Evaluation Results: Team Precision Recall F 1 Ferretti 0.74 0.92 0.82 Ferschke 0.75 0.85 0.80 Pistol 0.04 0.58 0.08 15 [^] c pan.webis.de 2012

Lessons Learned Plagiarism detection: q Software submissions are manageable, provide repeatability. q Task-wise evaluation allows for more tailored evaluation. q Fully automatic plagiarism detection evaluation within reach. Author identification: q Lack of corpora is still a major obstacle to evaluation. q Performance measures are rudimentary; their weighting is not clear. q Large variety of problem classes adds to the difficulties. Wikipedia quality flaw prediction: q This task subsumes the vandalism detection task of previous years. q Doznes of more flaw types need to be investigated. q Promising performance for some flaws; automatic tagging possible. 16 [^] c pan.webis.de 2012

Lessons Learned Plagiarism detection: q Software submissions are manageable, provide repeatability. q Task-wise evaluation allows for more tailored evaluation. q Fully automatic plagiarism detection evaluation within reach. Author identification: q Lack of corpora is still a major obstacle to evaluation. q Performance measures are rudimentary; their weighting is not clear. q Large variety of problem classes adds to the difficulties. Wikipedia quality flaw prediction: q This task subsumes the vandalism detection task of previous years. q Doznes of more flaw types need to be investigated. q Promising performance for some flaws; automatic tagging possible. A lot to accomplish for PAN 2013 and beyond! 17 [^] c pan.webis.de 2012