A neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann

Similar documents
Data sharing in the Big Data era

Visualizing a Neo4j Graph Database with KeyLines

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

In Memory Accelerator for MongoDB

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

CiteSeer x in the Cloud

Bibliometric Big Data and its Uses. Dr. Gali Halevi Elsevier, NY

Real-time Big Data Analytics with Storm

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

LDBC Social Network Neo Technology

InfiniteGraph: The Distributed Graph Database

Invenio: A Modern Digital Library for Grey Literature

Rich Media & HD Video Streaming Integration with Brightcove

A Performance Analysis of Distributed Indexing using Terrier

Peters & Heinrich GFKL An Introduction

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Connecting Basic Research and Healthcare Big Data

Web Performance. Sergey Chernyshev. March '09 New York Web Standards Meetup. New York, NY. March 19 th, 2009

Spark Application Carousel. Spark Summit East 2015

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Abstract. Description

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Moving From Hadoop to Spark

Introduction to DISC and Hadoop

Graph Databases: Neo4j

Google Web Toolkit. Introduction to GWT Development. Ilkka Rinne & Sampo Savolainen / Spatineo Oy

Cloud Computing and Advanced Relationship Analytics

Performance Monitoring and Tuning. Liferay Chicago User Group (LCHIUG) James Lefeu 29AUG2013

SQL Azure and SqlBulkCopy

!!!!!!!! The Internet of (Connected) Things. by Grace Andrews and Huston Hedinger

Big Data and Apache Hadoop s MapReduce

How to Easily Integrate BIRT Reports into your Web Application

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

Microblogging Queries on Graph Databases: An Introspection

Using Big Data Analytics

160 Numerical Methods and Programming, 2012, Vol. 13 ( UDC

Log Mining Based on Hadoop s Map and Reduce Technique

State of Cloud Storage Providers Industry Benchmark Report:

RevoScaleR Speed and Scalability

Rich User Interfaces for Web-Based Corporate Applications

NoSQL Roadshow Berlin Kai Spichale

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

SYSTAP / bigdata. Open Source High Performance Highly Available. 1 bigdata Presented to CSHALS 2/27/2014

Information Retrieval Elasticsearch

Surviving the Big Rewrite: Moving YELLOWPAGES.COM to Rails. John Straw YELLOWPAGES.COM

Big Graph Processing: Some Background

Cloud Big Data Architectures

Using Solr search in a Dot Net environment

Zabbix for Hybrid Cloud Management

Scientific Knowledge and Reference Management with Zotero Concentrate on research and not re-searching

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Deploying and Managing SolrCloud in the Cloud ApacheCon, April 8, 2014 Timothy Potter. Search Discover Analyze

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

Realtime

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Case Study : 3 different hadoop cluster deployments

Understanding Neo4j Scalability

White Paper. Java versus Ruby Frameworks in Practice STATE OF THE ART SOFTWARE DEVELOPMENT 1

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

Augmented Search for Software Testing

Big Data and Data Science: Behind the Buzz Words

Big Data and Scripting Systems beyond Hadoop

Are you ready for your Journey to the cloud? Maybe some of you are already using some cloud- based services?

Framework as a master tool in modern web development

GeoManitoba Spatial Data Infrastructure Update. Presented by: Jim Aberdeen Shawn Cruise

The Cloud to the rescue!

Evaluation of Open Source Data Cleaning Tools: Open Refine and Data Wrangler

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

How to Ingest Data into Google BigQuery using Talend for Big Data. A Technical Solution Paper from Saama Technologies, Inc.

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Speak<geek> Tech Brief. RichRelevance Infrastructure: a robust, retail- optimized foundation. richrelevance

The Learn-Verified Full Stack Web Development Program

Framework Adoption for Java Enterprise Application Development

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Unified Big Data Processing with Apache Spark. Matei

Semantic Search in Portals using Ontologies

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

A Comparison of Open Source Application Development Frameworks for the Enterprise

Scalability of web applications. CSCI 470: Web Science Keith Vertanen

Reference Architecture, Requirements, Gaps, Roles

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Who? Wolfgang Ziegler (fago) Klaus Purer (klausi) Sebastian Gilits (sepgil) epiqo Austrian based Drupal company Drupal Austria user group

opennms reporting generation tool

REST-based Offline System

Industrial Adoption of Automatically Extracted GUI Models for Testing

Measuring CDN Performance. Hooman Beheshti, VP Technology

Experimenting in the domain of RIA's and Web 2.0

Large-Scale Web Applications

Graph Mining and Social Network Analysis

Developing modular Java applications

Software Engineering 4C03

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Performance Analysis of Lucene Index on HBase Environment

Science Gateway Services for NERSC Users

1-Oct 2015, Bilbao, Spain. Towards Semantic Network Models via Graph Databases for SDN Applications

Hadoop and Big Data Research

Old and New Building Blocks Come Together For Big Data. MapR Technologies - Confiden6al

Open Source Technologies on Microsoft Azure

Transcription:

A neo4j powered social networking and Question & Answer application to enhance scientific communication. René Pickhardt, Heinrich Hartmann related-work.net

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

Central problems in Academia 1. Finding new relevant publications 2. Connecting people interested in the same topic

Central problems in Academia Solution: The academic - Social contacts - Citation between papers graph No Open Access to academic data: Citations/Fulltext!

We need open Access in the scientific community Our society pays 1. scientists to do the research 2. scientists to do the peer reviewing 3. Conference fees (not here (: ) 4. subscription fees for scientific journals (through libraries) ==> Open access could give us the citation graph.

Tim Berners Lee - 2012 at WWW conference: "Think of how you want the world to be. Just imagine it! and then: Go geek an code it up"

No open platform for scientists exists on the web Open Source Open Data User Generated Content Google Scholar NO NO Not really Microsoft Academic NO NO Not really SciVerse/Elsevier NO NO NO CiterSeerX YES YES NO (Quality problematic) Mendeley NO Not really YES API (no citation data) ResearchGate NO NO YES Sciweavers NO NO YES RelatedWork YES YES YES

Our vision on blog.related-work.net/proposal Social network for scientists Q&A System to enhance scientific communication Open Access (to all publications) in particular: Open Citation Network Linked Open Data of discussions and meta data Strong recommenders More Trust Altmetrics Personalized news Feed for hot publications (last year)

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

A graph model for discussions

Relational model for discussions

Discussions in document stores

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

Datasets from StackExchange -- OPEN DATA!

Graph dbs are competitive

Who likes Python?

Who likes Neo4J?

Neo4J/Python performance???

Neo4J/Python -- WTF?? By marysia_@flickr

Python and Neo4j Please give us a fast API!!

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

Papers and Authors make up our graph

Caclulating Co-Authorship is BFS depth 2

Caclulating Co-Authorship is BFS with depth = 2

We are also interested in Citations

Authors citing other authors is BFS with depth = 3

We want to see how similar two authors are

Co citations as a similarity measure

Calculating Co-Citations is BSF with depth = 4

All queries are local Graph traversals Papers written by author? Which papers cite a paper? BFS 1 Which papers are cited by a paper? What are frequent co-authors of an author? Which authors does author X like to cite? 2 3 By which authors is a given author often cited? Similar papers. Who are the similar authors? 4

Did you guys see the tables? Did you guys see the tables?

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

Add Title In.

Add Title

Insert new edges for traversal results 1 mio. traversals per second seems a lot but be aware: FOAF query (co-authors) <==> BFS depth = 2: 2200 requests per second on average Similar authors (co-citation) <==> BFS depth 4 5 requests per sec. With existing similar author edges: about 4000 requests per sec!

Try to avoid hitting the disc during data mining We calculate Page rank iteratively for ranking in search We store page Rank as a node Property While calculation keep. HashMap<Node, Float> pagerankmap; If you need to hit the disc try to group about 50k writes to a single transaction.

4 Lessons Just don't use python with neo4j Use neo4j's Java Core API instead of Cypher Query Language Store the results of big and expensive Traversals in the Graph calculate via cronjob as a data mining task have a processing priority queue coming from your application Writing properties is slow. Try to avoid this

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

GWT(P) powers our application GWT + clear separation of server and client side code + AJAX: Only transfer data + Robust Java server application which integrates smoothly with neo4j, lucene, tomcat,... + powerfull API's for web programming including html5 GWTP + MVP framework + Injection Handling + History Support + Code Split + Faster testing (if we ever start doing this this) ++ we promise it's on our todo

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

Roadmap Introduction Data structures for Q & A Systems Benchmarking Q & A Systems Our data model and requirements Lessons learnt from handling graphs in neo4j GWT(P) Demo Time (Personalized) Auto Completion

Some facts about our search architecture Search is based on lucene 1.) its well integrated in neo4j 2.) good search infrastructure also coming in java Autocompletions: We didn't go for solr SuggestTree Class by Nicolai Diethelm. (Source Forge) Personalized autocompletion: gwtp: do as much computation on the client as possible Pagerank projected onto extended ego network Completely cached on the client.

Further reading and resources https://github.com/renepickhardt/related-work.net https://github.com/opencitations/opencitationscorpus https://github.com/heinrichhartmann/discussionbenchmark http://blog.related-work.net/data/ (data sets later LOD) http://blog.related-work.net/proposal/ (Proposal) http://www.rene-pickhardt.de/get-the-full-neo4j-power-byusing-the-core-java-api-for-traversing-your-graph-data-baseinstead-of-cypher-query-language/ http://www.rene-pickhardt.de/graphity

2 way - Q & A Did we miss some technology? How do we scale beyond 1 machine? Any ideas on absolutly mandetory Features we should not miss? What about your Questions?

Thank you very much Thank you very much!