Mining Text Data for Useful Information in Higher Education
John Zilvinskis
Indiana University
Institutional Researcher's Credo
  "We have not succeeded in answering all our problems; indeed, we sometimes feel we have not completely answered any of them. The answers we have found have only served to raise a whole set of new questions. In some ways we feel that we are as confused as ever, but we think we are confused on a higher level and about more important things."
  Earl C. Kelley, Professor of Secondary Education at Wayne University, 1951
Presentation Overview
  1. Describe basic concepts of text mining
  2. Invite presentation attendees to ask questions and discuss application of this technology
  3. List the differences in text mining software
  4. Apply this technique to two real-life examples
  5. Provide implications and considerations
Raise your hand if
  You have a general understanding of text mining
Keep your hand up if
  You have, or someone you know has, participated in a text mining project
  You have played a significant role in at least one project that used text mining
  You have written code for or worked on several text mining projects
Learning Outcomes
As a result of attending this session, participants will be able to:
  List fundamental methodologies for organizing text data.
  Describe how one could integrate mined text in student learning and performance analytics.
  Compare the differences between text mining software packages.
  Use text mining methods to refine survey questions.
Big Data & Data Mining
Big Data (Laney)
  volume (amount of data)
  velocity (speed of data)
  variety (range of data types and sources)
Data Mining
  Applying algorithms to big data to generate new information
Analytics
  Predictive, Automated, Scale, Real time
  Data mining to create actionable intelligence (Campbell, DeBlois, & Oblinger, 2007, p. 42)
  Learning v. Student Analytics
Text Mining
  The need to turn text into numbers so powerful algorithms can be applied to large document databases (Miner, Delen, Elder, Fast, Hill, & Nisbet, 2012, p. 30)
Text analytics
  volume (amount of data)
  velocity (speed of data)
  variety (range of data types and sources)
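
As a minimal sketch of what "turning text into numbers" can look like, the snippet below builds a document-term matrix of word counts using only the Python standard library. The sample responses and the function names (tokenize, document_term_matrix) are hypothetical and chosen for illustration; they are not taken from the presentation or from any of the packages it compares.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, rows); each row holds one document's term counts."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))  # every distinct term, alphabetized
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows


# Hypothetical open-ended responses, for illustration only.
docs = ["I tutor math", "I love math and art", "Art club treasurer"]
vocab, rows = document_term_matrix(docs)

for doc, row in zip(docs, rows):
    print(row, doc)
```

Each row is now a numeric vector over the same vocabulary, which is the form that classification and clustering algorithms generally expect.
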
Citation
  Miner, Delen, Elder, Fast, Hill, & Nisbet (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications.
Text Mining Processes
  Define project and identify data
  Process data:
    Establish a corpus
    Pre-process data
    Extract knowledge
  Develop models
  Evaluate results
  Disseminate results
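
The "establish a corpus" and "pre-process data" steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow of any particular package: the corpus is a hypothetical dictionary of student responses, and STOP_WORDS is a tiny illustrative list rather than a complete stop-word dictionary.

```python
import re

# Illustrative stop-word list only; real projects use much larger dictionaries.
STOP_WORDS = {"a", "an", "and", "the", "i", "in", "of", "to", "was", "my"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


# Hypothetical corpus keyed by a made-up student identifier.
corpus = {
    "s1": "I was the treasurer of my club",
    "s2": "Tutoring in the math lab",
}

processed = {doc_id: preprocess(text) for doc_id, text in corpus.items()}
print(processed)
```

Stemming or lemmatization would usually follow this step; it is left out here to keep the sketch short.
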
Extract Knowledge
  Classification
  Clustering
  Association
  Trend analysis
Why Not Qualitative Research?
  Requires extensive resources
  Data must be processed in a timely fashion
  Might not be practical with big data
  Information must integrate with other data
What Kind of Text Can We Mine? For What Purpose Should We Mine?
  "Perhaps attendees could share what type of text-based datasets are available to them or which ones they would like to have access to. This may help IR staff recognize what text they have access to and can analyze, in addition to learning how they may conduct such analyses."
  AIR Program Reviewer
How Can We Mine Text in IR?
  Kind of Data                      For What Purpose
  Application essays                Acceptance, enrollment
  Written assignments               Likelihood of passing
  CMS postings                      Participation
  Student blogs                     Change in student major
  Course evaluations                Faculty success
  Surveys                           Open-ended questions
  E-portfolios                      Student success
  Early alert, course drop text     Student performance
Software
Freeware
  RapidMiner: easy user interface, inverse document frequencies, some aspects for purchase
  Weka/KEA: applicable to machine learning, some resources
  R: computer science heavy, many online resources
Commercial Software
  Modeler Premium (SPSS, IBM): strong user interface, other analytics tools, easy to use and comprehensive dictionary
  Enterprise Miner (SAS): moderate user interface, comprehensive data manipulation, and integrated clustering function
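
The slide above mentions inverse document frequencies. As a rough illustration of the idea, the sketch below computes the unsmoothed textbook weighting, tf x log(N / df), in plain Python. It is an assumption that this variant is a useful stand-in; the exact formula used by RapidMiner or any other package listed here may differ. The function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Weight each term by its count times log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized_docs
    ]


# Hypothetical, already tokenized documents.
tokenized_docs = [["math", "tutor"], ["math", "art"], ["math", "club", "club"]]
weights = tf_idf(tokenized_docs)
print(weights)
```

A term that appears in every document (here "math") receives a weight of zero, while a term concentrated in one document (here "club") is weighted up.
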
Classifying Open-Ended Responses
  National Survey of Student Engagement
  Experimental item set: leadership
  Formal leadership core item
  1,482 of 4,836 students listed "other"
  Classified 830 (56%) entries
Classifying Open-Ended Responses
  Position                n     % of other
  Tutoring              145      9.8%
  Teaching Assistant     87      5.9%
  Research Assistant     60      4.0%
  Secretary              55      3.7%
  Treasurer              57      3.8%
  Mentor                 54      3.6%
  Member                 51      3.4%
  Editor                 25      1.7%
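
One simple way to classify write-in "other" responses is a keyword dictionary, sketched below. The slides do not state which method or software produced the counts above, so this is a rule-based stand-in rather than the presenter's procedure. CATEGORY_PATTERNS, classify, and the sample responses are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; a real dictionary would be built iteratively.
CATEGORY_PATTERNS = {
    "Tutoring": r"\btutor",
    "Teaching Assistant": r"\bteaching assistant\b|\bta\b",
    "Research Assistant": r"\bresearch assistant\b",
    "Treasurer": r"\btreasurer\b",
}


def classify(response):
    """Return the first category whose pattern matches, or None."""
    text = response.lower()
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.search(pattern, text):
            return category
    return None


# Made-up write-in responses, for illustration only.
responses = ["Math tutor", "TA for intro biology", "Club treasurer", "Volunteer"]
labels = [classify(r) for r in responses]
counts = Counter(label for label in labels if label is not None)
share_classified = sum(counts.values()) / len(responses)

print(counts)
print(f"classified: {share_classified:.0%}")
```

The share classified is the same kind of figure as the 830 of 1,482 entries (56%) reported on the earlier slide; responses that match no pattern stay unclassified for manual review.
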
Classifying Open-Ended Responses
                          Did Not Complete      Completed
                          Formal Leadership     Formal Leadership
  Position                  n       %             n       %
  Original Option
    Resident Assistant    206    34.3%          395    65.7%
    Diversity Advocate     28    38.9%           44    61.1%
    Judicial Officer       20    37.7%           33    62.3%
    President              41     4.6%          846    95.4%
  Write-In Other
    Tutoring               77    53.1%           68    46.9%
    Teaching Assistant     44    50.6%           43    49.4%
    Treasurer              13    23.6%           42    76.4%
    Editor                  5    20.0%           20    80.0%
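
The percentages in this table are row percentages: each count divided by the total for that position. A small sketch of that arithmetic, using four rows of counts from the table (the function name row_percentages is hypothetical):

```python
def row_percentages(table):
    """Map each position to (% did not complete, % completed), to one decimal."""
    result = {}
    for position, (did_not, completed) in table.items():
        total = did_not + completed
        result[position] = (
            round(100 * did_not / total, 1),
            round(100 * completed / total, 1),
        )
    return result


# Counts copied from the table: (did not complete, completed).
counts = {
    "Resident Assistant": (206, 395),
    "President": (41, 846),
    "Tutoring": (77, 68),
    "Editor": (5, 20),
}

percentages = row_percentages(counts)
for position, (pct_no, pct_yes) in percentages.items():
    print(f"{position}: {pct_no}% / {pct_yes}%")
```
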
Clustering E-Portfolio Submissions
  City University of New York (CUNY) Guttman
  High touch, block scheduling, learning communities, summer bridge
  Bill and Melinda Gates grant
  163 student e-portfolio introductions
Clustering E-Portfolio Submissions
  Concept                  Clustered Terms
  Family                   family, york, high school, college, child
  Learning                 class, teacher, art, math, subject
  Everyday                 know, day, love, life
  College participation    high school, school, attend, guttman
  Gaming                   game, movie, favorite, watch, video
  Making friends           shy, person, friend, know, quiet
  Recreation               art, basketball, play, sport, travel
  Society                  social, worker, work, believe, help
  Technology               technology, information, art, health, mind
  Business                 guttman, business, manhattan, administration, graduate
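
To illustrate how texts with overlapping vocabulary end up grouped together, the sketch below links any two texts whose cosine similarity on word counts reaches a threshold (single-link grouping). This is an assumption-laden toy, not the clustering routine that produced the concepts above; the slides do not name the algorithm or its settings. The sample texts, the threshold of 0.3, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Merge the clusters of any two texts with similarity >= threshold."""
    vectors = [vectorize(t) for t in texts]
    labels = list(range(len(texts)))  # each text starts in its own cluster
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    return labels


# Made-up snippets echoing two of the concepts in the table.
texts = [
    "family high school college",
    "my family and my child in college",
    "favorite video game movie",
    "watch a movie or play a video game",
]
labels = cluster(texts)
print(labels)
```

Texts that share a label belong to the same cluster; the most frequent terms within each cluster could then be inspected and named, as in the Concept column.
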
Regression of Academic Preparation and Clustered Text Related to Credit Hours
  Independent Variable       β      Sig.
  SATV                     -0.23    0.02
  SATM                      0.22    0.02
  WritProf                  0.20    0.02
  Age                       0.08    0.31
  Connection to family     -0.15    0.06
  R² = 0.12
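
Written out as an equation, the table above corresponds to something like the following. This assumes the β values are standardized coefficients and that the outcome is credit hours, as the slide title suggests; neither is stated explicitly, so each z denotes a hypothetical standardized variable.

```latex
\hat{z}_{\text{CreditHours}}
  = -0.23\, z_{\text{SATV}}
  + 0.22\, z_{\text{SATM}}
  + 0.20\, z_{\text{WritProf}}
  + 0.08\, z_{\text{Age}}
  - 0.15\, z_{\text{Family}},
\qquad R^2 = 0.12
```

Under that reading, the text-derived "connection to family" term carries a negative coefficient of a size comparable to the academic preparation measures, although its significance level (0.06) is above the 0.05 convention.
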
Implications
  Process of automation
  Considering text source
  Weight of sentiment
Considerations
  Theoretical v. A-theoretical
  Ethical considerations
  Creepy treehouse
  Use of language
Thank You