Scholarly Big Data: CiteSeerX Insights
|
|
|
- Franklin Miles
- 9 years ago
- Views:
Transcription
1 Scholarly Big Data: CiteSeerX Insights C. Lee Giles The Pennsylvania State University University Park, PA, USA Funded in part by NSF & Qatar.
2 Contributors/Collaborators: recent past and present (incomplete list) Projects: CiteSeer, CiteSeer X, Chem X Seer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, CSSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, J. Yen, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, K. Bollacker, D. Pennock, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, Y. Song, J. Z. Wang, K. Mueller, J.Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, S. Choudhury, H-H. Chen, N. Li, D. Miller, A. Kirk, W. Huang, S. Carman, J. Wu, L. Rokach, C. Caragea, K. Williams. Z. Wu, S. Das, A. Ororbia, others.
3 What is Scholarly Big Data All academic/research documents (journal & conference papers, books, theses, TRs) Related data: Academic/researcher/group/lab web homepages Funding agency and organization grants, records, reports Research laboratories reports Patents Associated data presentations experimental data (very large) video course materials other Social networks Examples: Google Scholar, Microsoft Academic Search, Publishers/repositories, CiteSeer, ArnetMiner, Funding agencies, Universities, Mendeley, others
4 Scholarly Big Data Most of the data that is available in the era of scholarly big data does not look like this Or even like this It looks more like this Courtesy Lise Getoor NIPS 12
5 Where do you get this data? Web (Wayback machine, Heritrix) Repositories (arxiv, CiteSeer) Bibliographic resources (PubMed, DBLP) Funding sources/laboratories Publishers Data aggregators (Web of science) Patents API s (Microsoft Academic) How much is there & how much freely available?
6 Estimate of Scholarly Big Data Use two public academic search engines estimate the size of scholarly articles on the web using capture/recapture (Lincoln Petersen) methods Google Scholar Microsoft Academic Search Find a paper that both search engines have Extract the list of citations for that paper in both search engines and compare overlap The list of citations for a paper is representative of the coverage of a search engine Using the size of one of the search engines, estimate the total size on the web Limit to English articles only
7 Web Size Estimation - Capture/Recapture Analysis Lawrence, Giles, Science 98 Consider the web page coverage of search engines a and b p a probability that engine a has indexed a page, p b for engine b, p a,b joint probability p a, b pa b p b p a p b s a number of unique pages indexed by engine a; N number of web pages s sa, b sa s s b b a N s p a N n b number of documents returned by b for a query, n a,b number of documents returned by both engines a&b for a query sb nb s n Lower bound estimate of size of the Web: N N N Nˆ known random sampling assumption extensions - bayesian estimate, more engines (Bharat, Broder, WWW7 98), etc. a, b s a sa, b n n a, b queries b a ; o a, b queries s a o
8 Freely available by scholarly field
9 Lower bound on potential sources of scholarly data At least 114 million scholarly articles available on the web At least 24% of them are publicly available 27 million Varies significantly based on academic field Computer science! Google Scholar has nearly 100 million articles Other things to do: Distinguish between publication types: paper, thesis, tech report, etc More estimates Longitudinal and geographical study Duplicates Languages besides english Khabsa, Giles, PLoSONE, 2014
10
11 IARPA FUSE Program
12 IARPA FUSE Program
13 SeerSuite Toolkit & Applications SeerSuite: open source search engine and digital library tool kit used to build academic search engines/digital libraries CiteSeer X, Chem X Seer, AckSeer,CSSeer, CollabSeer, RefSeer, etc. Built on commercial grade open source tools (Solr/Lucene; mysql) Our features automated specialized metadata extraction Information extraction tools for PDF documents Authors, titles, affiliations, citations, acknowledgements, etc Entity disambiguation Tabular data Figures & graphs Chemical formulae Wu, IAAI 2014 Equations Teregowda, IC2E 2013 Author ethnicity detection Teregowda, USENIX 2010 Data and search built on top of these CiteSeerX open source Google Scholar (ScholarSeer) ChemXSeer chemical search engine RefSeer citation recommendation system CSSeer expert recommendations CollabSeer- collaboration recommendation AckSeer acknowledgement search ArchSeer archaeology map search
14 CiteSeer (aka ResearchIndex) Project of NEC Research Institute Hosted at Princeton, from Moved to Penn State after collaborators left NEC Provided a broad range of unique services including Automatic metadata extraction Autonomous citation indexing Reference linking Full text indexing Similar documents listing Several other pioneering features Impact First scholarly search engine? Changed access to scientific research Shares code and data C. Lee Giles Kurt Bollacker Steve Lawrence
15 CiteSeer X CiteSeer X actively crawls researcher homepages & archives on the web for scholarly papers, formerly in computer science Converts PDF to text Automatically extracts and tags OAI metadata and other data Automatic citation indexing, links to cited documents, creation of document page, author disambiguation Software open source can be used to build other such tools All data shared 5+ M documents Ms of files 87 M citations 12 M authors 1.3 M disambig 2 to 4 M hits day 100K documents added monthly 300K document downloaded monthly 800K individual users ~40 Tbytes
16 Focused Crawling getting the documents Maintain a list of parent URLs where documents were previously found Parent URLs are usually academic homepages. >1,000,000 unique parent URLs, as of summer 2013 Parent URLs are stored in a database Crawled weekly. Crawling process starts with the scheduler selecting all parent URLs Crawling batch with Heritrix Most discovered documents have been crawled before. Use hash table comparison for detection of new documents Normally retrieve a 10K NEW documents per day, sometimes less than 1K. Very ethical crawler Use whitelist and blacklist policy. Zheng, CIKM 09; Wu Webscience 12
17 Highlights of AI/ML Technologies in CiteSeerX Document Classification Document Deduplication and Citation Graph Metadata Extraction Header Extraction Citation Extraction Table Extraction Figure Extraction Author Disambiguation Wu, et.al IAAI 2014
18 Automatic Metadata Information Extraction (IE) - CiteSeerX Header title, authors, affiliations, abst Table Converter IE Figure Databases Search index PDF Text Formulae Body Other open source academic document metadata extractors available workshop, metadata hackathon, Citations
19 TableSeer Table extraction & search engine Liu, et al, AAAI07, JCDL06,
20 Must scale!! Efficient Large Scale Author Disambiguation CiteSeer X & PubMed Motivation Correct attribution Manually curated databases still have errors DBLP, medical records Entity disambiguation problem documents Actors, entities Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, address, information from crawling such as host server, etc. Entity normalization Challenges Accuracy Scalability Expandability Han, et.al JCDL 2004 Huang, et.al PKDD 2006 Treeratpituk, et.al JCDL 2009 Khabsa, et.al JCDL 2015 Key features Learn distance function Random Forest others DBSCAN clustering Ameliorate labeling inconsistency (transitivity problem) Efficient solution to find name clusters N logn scaling Recently all of PubMed authors, 80M mentions
21 csseer.ist.psu.edu Expert search for authors H-H Chen, JCDL 2014
22
23 Experimental Collaborator recommendation system CollabSeer currently supports 400k authors HH Chen, JCDL 2011
24 Automated Figure Data Extraction and Search Large amount of results in digital documents are recorded in figures, time series, experimental results (eg., NMR spectra, income growth) Extraction for purposes of: Further modeling using presented data Indexing, meta-data creation for storage & search on figures for data reuse Current extraction done manually! Documents Extracted Plot Extracted Info. Document Index Merged Index Plot Index Digital Library User
25 Chem X Seer Figure/Plot Data Extraction and Search Numerical data in scientific publications are often found in figures. Tools that automate the data extraction from figures provide the following: Increases our understanding of key concepts of papers Provides data for automatic comparative analyses. Enables regeneration of figures in different contexts. Enables search for documents with figures containing specific experiment results. X. Lu, et.al, JCDL 2006, Kataria, et.al, 2008 Choudhury, DocEng 2005
26 An Approach to Plot Data Extraction Identify and extract figures from digital documents Ascii and image extraction (xpdf) OCR - bit map, raster pdfs Identify figures as images of 2D plots using SVM (Only for Bit map images) Hough transform Wavelets coefficients of image Surrounding text features Binarization of the 2D plots identified for preprocessing (No need for Vectorized Images) Adaptive Thresholding Image segmentation to identify regions Profiling or Image Signature Text block detection Nearest Neighbor Data point detection K-means Filtering Data point disambiguation for overlapping points Simulated Annealing
27 Automatic Citation (or paper) Recommendation Built on top of millions of papers Never miss a citation and know about the latest work Several recommendations models Huang, AAAI 2015 Huang, CIKM 2013 He, WWW 2010
28 Sun TOIS 2011 Chem X Seer
29 Scholarly Document Size & Numbers Large # of academic/research documents, all containing lots of data Many millions of documents 50M records Microsoft Academic (2013) 25M records, 10 million authors, 3 times mentions PubMed Google scholar (english) estimated to be ~100M records Total online estimate ~120M records ~25 million full documents freely available 100s of millions of authors, affiliations, locations, dates Billions of citation mentions 100s millions of tables, figures, math, formulae, etc. Related & linked data Raw data > petabytes Khabsa, Giles, PLoSONE, 14
30 Challenges Tables, figures, formula, equations, methodologies, etc. How do we effectively integrate and utilize this data for search? Natural language generation? Ontologies for scholarly data Scholarly knowledge vault Big Mechanism approaches
31 Summary/observations Scholarly big data petabytes, billions of objects Rich content large related data Applications more common than most realize Science, research, education, patents, policy, sociology, economics, business, MOOCs, etc Growth of associated data: tables, figures, chemical & drug entities, equations, methodologies, slides, video, etc Many issues AI and ML very useful: Focused NLP Information extraction still an art; domain dependent Data is not always easy to move around or share Some data still not readily available but is changing ¼ of all digital documents freely available Data(s) integration issues Meta analysis big mechanism opportunties Observations Large amount and growing scholarly related data Big scholarly data creates new research opportunities Big scholarly data creates other big data
32 The future ain t what it used to be. Yogi Berra, catcher, NY Yankees. For more information clgiles.ist.psu.edu [email protected] SourceForge.com
33 Online or invisible, Nature, 01, Steve Lawrence, Google Desktop creator 5 times more likely to be cited if your paper is freely available online For more information Most of our papers [email protected] SourceForge.com (github)
Scholarly Big Data Information Extraction and Integration in the CiteSeer χ Digital Library
Scholarly Big Data Information Extraction and Integration in the CiteSeer χ Digital Library Kyle Williams,1, Jian Wu,2, Sagnik Ray Choudhury,3, Madian Khabsa,4, C. Lee Giles,,5 Information Sciences and
CiteSeer x in the Cloud
Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar
Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities
Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities Zhaohui Wu, Jian Wu, Madian Khabsa, Kyle Williams, Hung-Hsuan Chen, Wenyi Huang, Suppawong Tuarob, Sagnik Ray Choudhury,
MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS 2014. November 7, 2014. Machine Learning Group
Big Data and Its Implication to Research Methodologies and Funding Cornelia Caragea TARDIS 2014 November 7, 2014 UNT Computer Science and Engineering Data Everywhere Lots of data is being collected and
CAN A PORTABLE, OPEN SOURCE JOURNAL MANAGEMENT/PUBLISHING SYSTEM IMPROVE THE SCHOLARLY AND PUBLIC QUALITY OF RESEARCH? A WORKSHOP
CAN A PORTABLE, OPEN SOURCE JOURNAL MANAGEMENT/PUBLISHING SYSTEM IMPROVE THE SCHOLARLY AND PUBLIC QUALITY OF RESEARCH? A WORKSHOP JOHN WILLINSKY 1 1 Department of Language and Literacy Education University
CiteSeer x : acloud Perspective
CiteSeer x : acloud Perspective Pradeep B. Teregowda Pennsylvania State University C. LeeGiles Pennsylvania State University Bhuvan Urgaonkar Pennsylvania State University Abstract Information retrieval
Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113
CSE 450 Web Mining Seminar Spring 2008 MWF 11:10 12:00pm Maginnes 113 Instructor: Dr. Brian D. Davison Dept. of Computer Science & Engineering Lehigh University [email protected] http://www.cse.lehigh.edu/~brian/course/webmining/
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Keyphrase Extraction for Scholarly Big Data
Keyphrase Extraction for Scholarly Big Data Cornelia Caragea Computer Science and Engineering University of North Texas July 10, 2015 Scholarly Big Data Large number of scholarly documents on the Web PubMed
HydroDesktop Overview
HydroDesktop Overview 1. Initial Objectives HydroDesktop (formerly referred to as HIS Desktop) is a new component of the HIS project intended to address the problem of how to obtain, organize and manage
ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search
Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti
Reference Software Workshop Tutorial - The Basics
Reference Software Workshop Introduction These tools have the potential to make research easier for you. Reference software is sometimes called citation management tools or bibliographic software. Those
Chapter-1 : Introduction 1 CHAPTER - 1. Introduction
Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet
How To Use Data Analysis To Get More Information From A Computer Or Cell Phone To A Computer
Applying Big Data approaches to Competitive Intelligence challenges THOMSON REUTERS IP & SCIENCE PHARMA CI EUROPE CONFERENCE & EXHIBITION TIM MILLER 19 FEBRUARY 2014 BIG DATA, NOT JUST ABOUT VOLUMES Patient
Web Archiving and Scholarly Use of Web Archives
Web Archiving and Scholarly Use of Web Archives Helen Hockx-Yu Head of Web Archiving British Library 15 April 2013 Overview 1. Introduction 2. Access and usage: UK Web Archive 3. Scholarly feedback on
Scholarly Use of Web Archives
Scholarly Use of Web Archives Helen Hockx-Yu Head of Web Archiving British Library 15 February 2013 Web Archiving initiatives worldwide http://en.wikipedia.org/wiki/file:map_of_web_archiving_initiatives_worldwide.png
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Approximate Object Location and Spam Filtering on Peer-to-Peer Systems
Approximate Object Location and Spam Filtering on Peer-to-Peer Systems Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph and John D. Kubiatowicz University of California, Berkeley The Problem
Design and Implementation of Domain based Semantic Hidden Web Crawler
Design and Implementation of Domain based Semantic Hidden Web Crawler Manvi Department of Computer Engineering YMCA University of Science & Technology Faridabad, India Ashutosh Dixit Department of Computer
Blog Post Extraction Using Title Finding
Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School
Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)
HIDDEN WEB EXTRACTOR DYNAMIC WAY TO UNCOVER THE DEEP WEB DR. ANURADHA YMCA,CSE, YMCA University Faridabad, Haryana 121006,India [email protected] http://www.ymcaust.ac.in BABITA AHUJA MRCE, IT, MDU University
Data Mining in Web Search Engine Optimization and User Assisted Rank Results
Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA
Dr Billy Wong University Research Centre
Dr Billy Wong University Research Centre 12 March 2015 1 Outline 1. Features of EndNote 2. Demonstration of using EndNote 2.1 Adding references 2.2 Maintenance of your bibliographic library 2.3 The "Cite
Probabilistic User Behavior Models
Probabilistic User Behavior Models Eren Manavoglu Pennsylvania State University 001 Thomas Building University Park, PA 16802 [email protected] Dmitry Pavlov Yahoo Inc. 701 First Avenue Sunnyvale, California
Semantic Search in Portals using Ontologies
Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br
DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2
DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.
The Application Research of Ant Colony Algorithm in Search Engine Jian Lan Liu1, a, Li Zhu2,b
3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC 2016) The Application Research of Ant Colony Algorithm in Search Engine Jian Lan Liu1, a, Li Zhu2,b
Survey of Canadian and International Data Management Initiatives. By Diego Argáez and Kathleen Shearer
Survey of Canadian and International Data Management Initiatives By Diego Argáez and Kathleen Shearer on behalf of the CARL Data Management Working Group (Working paper) April 28, 2008 Introduction Today,
Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web
Visualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure
Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure PENG LIU 1,2, GUANGLIANG CHEN 2, LIANG YE 3, WEIMING ZHONG 4 1 Department of Computer Science and Technology, Tsinghua University,
A QoS-Aware Web Service Selection Based on Clustering
International Journal of Scientific and Research Publications, Volume 4, Issue 2, February 2014 1 A QoS-Aware Web Service Selection Based on Clustering R.Karthiban PG scholar, Computer Science and Engineering,
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012 Overview The tools: Heritrix crawler Wayback browse access Lucene/Hadoop utilities:
Functional Requirements for Digital Asset Management Project version 3.0 11/30/2006
/30/2006 2 3 4 5 6 7 8 9 0 2 3 4 5 6 7 8 9 20 2 22 23 24 25 26 27 28 29 30 3 32 33 34 35 36 37 38 39 = required; 2 = optional; 3 = not required functional requirements Discovery tools available to end-users:
Integration of Polish National Bibliography within the repository platform for science and humanities
Marcin Roszkowski Integration of Polish National Bibliography within the repository platform for science and humanities The best thing to do to your data will be thought of by somebody else W3C LLD Agenda
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition
Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition Michael A. Schuh1, Rafal A. Angryk2 1 Montana State University, Bozeman, MT 2 Georgia State University, Atlanta, GA Introduction
Web Mining using Artificial Ant Colonies : A Survey
Web Mining using Artificial Ant Colonies : A Survey Richa Gupta Department of Computer Science University of Delhi ABSTRACT : Web mining has been very crucial to any organization as it provides useful
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
Metasearch Engines. Synonyms Federated search engine
etasearch Engines WEIYI ENG Department of Computer Science, State University of New York at Binghamton, Binghamton, NY 13902, USA Synonyms Federated search engine Definition etasearch is to utilize multiple
American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access
Plagiarism. Dr. M.G. Sreekumar UNESCO Coordinator, Greenstone Support for South Asia Head, LRC & CDDL, IIM Kozhikode
Digital Rights Management & Plagiarism Dr. M.G. Sreekumar UNESCO Coordinator, Greenstone Support for South Asia Head, LRC & CDDL, IIM Kozhikode Intranet / Internet K-Assets/Objects, Practices, CoP, Collaborative
Record Deduplication By Evolutionary Means
Record Deduplication By Evolutionary Means Marco Modesto, Moisés G. de Carvalho, Walter dos Santos Departamento de Ciência da Computação Universidade Federal de Minas Gerais 31270-901 - Belo Horizonte
Scientific Knowledge and Reference Management with Zotero Concentrate on research and not re-searching
Scientific Knowledge and Reference Management with Zotero Concentrate on research and not re-searching Dipl.-Ing. Erwin Roth 05.08.2009 Agenda Motivation Idea behind Zotero Basic Usage Zotero s Features
#jenkinsconf. Jenkins as a Scientific Data and Image Processing Platform. Jenkins User Conference Boston #jenkinsconf
Jenkins as a Scientific Data and Image Processing Platform Ioannis K. Moutsatsos, Ph.D., M.SE. Novartis Institutes for Biomedical Research www.novartis.com June 18, 2014 #jenkinsconf Life Sciences are
MENDELEY USING GUIDE CITATION TOOLS CONTENT. Gain access Mendeley Institutional account
MENDELEY USING GUIDE CONTENT Gain access Mendeley Institutional account Create new Mendeley Institutional account Upgrade existing free Mendeley account Start Using Mendeley Download the desktop program
Indexing Repositories: Pitfalls & Best Practices. Anurag Acharya
Indexing Repositories: Pitfalls & Best Practices Anurag Acharya Web search & Scholar Web search indexes all documents Scholar indexes scholarly articles Web search needs document text Scholar also needs
Keywords social media, internet, data, sentiment analysis, opinion mining, business
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction
Data Publishing Workflows with Dataverse
Data Publishing Workflows with Dataverse Mercè Crosas, Ph.D. Twitter: @mercecrosas Director of Data Science Institute for Quantitative Social Science, Harvard University MIT, May 6, 2014 Intro to our Data
Information access through information technology
Information access through information technology 1 Created to support an invited lecture at the International Conference MDGICT 2009 in Tamil Nadu, India, December 2009 by [email protected]
Our Data & Methodology. Understanding the Digital World by Turning Data into Insights
Our Data & Methodology Understanding the Digital World by Turning Data into Insights Understanding Today s Digital World SimilarWeb provides data and insights to help businesses make better decisions,
Baidu: Webmaster Tools Overview and Guidelines
Baidu: Webmaster Tools Overview and Guidelines Agenda Introduction Register Data Submission Domain Transfer Monitor Web Analytics Mobile 2 Introduction What is Baidu Baidu is the leading search engine
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks
A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks Text Analytics World, Boston, 2013 Lars Hard, CTO Agenda Difficult text analytics tasks Feature extraction Bio-inspired
Virginia Commonwealth University Rice Rivers Center Data Management Plan
Virginia Commonwealth University Rice Rivers Center Data Management Plan Table of Contents Objectives... 2 VCU Rice Rivers Center Research Protocol... 2 VCU Rice Rivers Center Data Management Plan... 3
So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages
ISBN 978-952-5726-07-7 (Print), 978-952-5726-08-4 (CD-ROM) Proceedings of the Second Symposium International Computer Science and Computational Technology(ISCSCT 09) Huangshan, P. R. China, 26-28,Dec.
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Big Data: Image & Video Analytics
Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)
Graph Database Performance: An Oracle Perspective
Graph Database Performance: An Oracle Perspective Xavier Lopez, Ph.D. Senior Director, Product Management 1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. Program Agenda Broad Perspective
A Comparative Approach to Search Engine Ranking Strategies
26 A Comparative Approach to Search Engine Ranking Strategies Dharminder Singh 1, Ashwani Sethi 2 Guru Gobind Singh Collage of Engineering & Technology Guru Kashi University Talwandi Sabo, Bathinda, Punjab
OCLC CONTENTdm. Geri Ingram Community Manager. Overview. Spring 2015 CONTENTdm User Conference Goucher College Baltimore MD May 27, 2015
OCLC CONTENTdm Overview Spring 2015 CONTENTdm User Conference Goucher College Baltimore MD May 27, 2015 Geri Ingram Community Manager Overview Audience This session is for users library staff, curators,
Grid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
Reference Management Software
KTH ROYAL INSTITUTE OF TECHNOLOGY Reference Management Software Basic Introduction Sara Akramy [email protected] Reference management software is a tool that allows you to: Design an own library by collecting,
Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
ProteinQuest user guide
ProteinQuest user guide 1. Introduction... 3 1.1 With ProteinQuest you can... 3 1.2 ProteinQuest basic version 4 1.3 ProteinQuest extended version... 5 2. ProteinQuest dictionaries... 6 3. Directions for
Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies
Semantic Data Management Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies 1 Enterprise Information Challenge Source: Oracle customer 2 Vision of Semantically Linked Data The Network of Collaborative
Search Engine optimization
Search Engine optimization for videos Provided courtesy of: www.imagemediapartners.com Updated July, 2010 SEO for YouTube Videos SEO for YouTube Videos Search Engine Optimization (SEO) for YouTube is just
ASWGF! Towards an Intelligent Solution for the Deep Semantic Web Problem
ASWGF! Towards an Intelligent Solution for the Deep Semantic Web Problem Mohamed A. Khattab, Yasser Hassan and Mohamad Abo El Nasr * Department of Mathematics & Computer Science, Faculty of Science, Alexandria
Exploring Big Data in Social Networks
Exploring Big Data in Social Networks [email protected] ([email protected]) INWEB National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG May 2013 Some thoughts about
Citebase Search: Autonomous Citation Database for e-print Archives
Citebase Search: Autonomous Citation Database for e-print Archives Tim Brody Intelligence, Agents, Multimedia Group University of Southampton Abstract Citebase is a culmination
María Elena Alvarado gnoss.com* [email protected] Susana López-Sola gnoss.com* [email protected]
Linked Data based applications for Learning Analytics Research: faceted searches, enriched contexts, graph browsing and dynamic graphic visualisation of data Ricardo Alonso Maturana gnoss.com *Piqueras
An Imbalanced Spam Mail Filtering Method
, pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia
