From Distributed Computing to Distributed Artificial Intelligence
Dr. Christos Filippidis, NCSR Demokritos
Dr. George Giannakopoulos, NCSR Demokritos
Big Data and the Fourth Paradigm
The two dominant paradigms for scientific discovery: theory and experiments.
Large-scale computer simulations emerged as the third paradigm in the 20th century.
The fourth paradigm, which seeks to exploit information buried in massive datasets, has emerged as an essential complement to the three existing paradigms.
The complexity and challenge of the fourth paradigm arise from the increasing rate, heterogeneity, and volume of data generation:
- The Large Hadron Collider (LHC) currently generates tens of petabytes of reduced data per year.
- Observational and simulation data in the climate domain are expected to reach exabytes by 2021.
- Light source experiments are expected to generate hundreds of terabytes per day.
LHC Data Challenge
Starting from this event (a particle collision): Data Collection, Data Storage, Data Processing.
You are looking for this signature.
Selectivity: 1 in 10^13.
Like looking for 1 person in a thousand world populations! Or for a needle in 20 million haystacks!
Amount of data from the LHC detectors (CMS, ATLAS, LHCb)
~15 petabytes / year
~10^10 events / year
~10^3 batch and interactive users
~20,000,000 CDs / year: a stack of CDs holding one year of LHC data would be ~20 km high, taller than Concorde's cruising altitude (15 km) and Mt. Blanc (4.8 km), approaching a high-altitude balloon (30 km).
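A back-of-the-envelope check of the CD comparison above (a sketch; the 700 MB per CD and 1.2 mm per disc are assumed constants, not from the slide):

```python
# Back-of-the-envelope check of the "CD stack" comparison.
# Assumed constants (not from the slide): 700 MB per CD, 1.2 mm per disc.
data_per_year_pb = 15                           # ~15 PB of LHC data per year
cd_capacity_mb = 700                            # assumed CD capacity
cd_thickness_mm = 1.2                           # assumed thickness of one disc in a stack

data_per_year_mb = data_per_year_pb * 1000**3   # PB -> MB (decimal units)
cds_per_year = data_per_year_mb / cd_capacity_mb
stack_height_km = cds_per_year * cd_thickness_mm / 1e6   # mm -> km

print(f"{cds_per_year:,.0f} CDs / year")         # ~21 million CDs
print(f"stack height ~{stack_height_km:.0f} km") # ~26 km, same order as the slide's ~20 km
```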
Grid / Cloud Technologies
Definition of Grid systems
A collection of geographically distributed, heterogeneous resources.
The most generalized, globalized form of distributed computing.
"An infrastructure that enables flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources" (Ian Foster and Carl Kesselman)
Information about sites: http://goc.grid.sinica.edu.tw/gstat/
Exascale Challenges
Current petascale systems are unlikely to scale to exascale environments, due to the disparity among computational power, machine memory, and I/O bandwidth.
Exascale simulations will not be able to write enough data out to permanent storage to ensure reliable analysis.
Current Grid infrastructures are not user friendly and are far from efficient for small groups and individuals.
Grid infrastructures, as implemented by HEP VOs, tend to be centralized from the data point of view.
Users demand mobility, efficient data sharing, and at the same time autonomy.
IKAROS Platform
Data/Metadata Collector; IKAROS-EG plugin for job creation; content providers plus mobile devices forming a mobile grid over Wi-Fi / 3G (android.apk clients).
Elastic Transfer (et)
Create your personal storage cloud.
Transfer your files directly from your workstation to another PC.
Third-party data transfer.
Flexible data and storage sharing.
You are on the road, behind fifteen firewalls, and want to share a web application you are developing locally, or just share a set of files with someone real quick (Reverse HTTP).
http://www.et-js.org/
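A minimal sketch of the "reverse HTTP" idea: the machine behind the firewalls only ever makes outbound requests, pushing the file to a public relay from which the recipient can download it. The relay URL and the `download_url` response field below are hypothetical, not the et-js.org API.

```python
# Sketch of reverse-HTTP style sharing: only OUTBOUND requests leave the
# firewalled machine; the recipient fetches the file from a public relay.
# The relay endpoint and its response format are hypothetical.
import requests  # assumes the 'requests' package is installed

RELAY = "https://relay.example.org/drop"        # hypothetical relay endpoint

def share(path: str) -> str:
    """Upload a local file via one outbound POST; return the download URL."""
    with open(path, "rb") as f:
        resp = requests.post(RELAY, files={"file": f})
    resp.raise_for_status()
    return resp.json()["download_url"]          # hypothetical response field

if __name__ == "__main__":
    print("Share this link:", share("some_local_file.pdf"))  # example file name
```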
Nice! So, now can I...
Discover whether corruption in politics is a location-based issue?
Check the best route to a house by the sea, with low rent?
Find the ideal husband/wife?
Determine how to improve my economy, relying on agriculture?
Well, you kind of can... if you can:
- read through petabytes of information
- determine what is useful and what is not
- contact 30 different organizations hosting the data
- have experts combine the data
- visualize it in a meaningful way
I hope you got the point by now...
So, did we fail?
Bits and pieces
What if you had individual people producing simple statements, decipherable by machines?
People need food -> <people, need, food>
Souvlaki is food -> <souvlaki, is, food>
Souvlaki contains meat -> <souvlaki, contains, meat>
Could computers combine such knowledge to be intelligent?
<?, need, meat>: Who needs meat?
<souvlaki, contains, ?>: What do I need to make a souvlaki?
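A minimal sketch in plain Python of the idea on this slide: store the statements as triples and answer the "?" questions by pattern matching, plus one naive combination step. The `match` helper and the wildcard convention are illustrative, not from any particular system.

```python
# Toy triple store built from the slide's statements.
triples = {
    ("people", "need", "food"),
    ("souvlaki", "is", "food"),
    ("souvlaki", "contains", "meat"),
}

def match(pattern, facts):
    """Return facts matching a (subject, predicate, object) pattern;
    None plays the role of '?' (a wildcard)."""
    return [f for f in facts
            if all(p is None or p == v for p, v in zip(pattern, f))]

# <souvlaki, contains, ?> : what do I need to make a souvlaki?
print(match(("souvlaki", "contains", None), triples))
# -> [('souvlaki', 'contains', 'meat')]

# One naive combination step: if X needs Y and Z is (a kind of) Y,
# then X also needs Z -- e.g. people need food, souvlaki is food.
derived = {(x, "need", z)
           for (x, p, y) in triples if p == "need"
           for (z, q, w) in triples if q == "is" and w == y}
print(match((None, "need", "souvlaki"), derived))
# -> [('people', 'need', 'souvlaki')]
```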
Distributed Artificial Intelligence to the rescue! You start with something like this RDF graph:
You end up with something like...
How does it work?
You use MACHINES (agents will do fine...)!
You query LOTS of resources...
...with BILLIONS of small statements.
You REASON upon them.
You provide answers in realistic time.
You visualize the results.
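A small sketch of the query step, using the rdflib Python library: the same toy facts loaded into an RDF graph and queried with SPARQL. The http://example.org/ namespace is made up for illustration.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")       # made-up namespace for the toy facts
g = Graph()
g.add((EX.people,   EX.need,     EX.food))
g.add((EX.souvlaki, EX.is_a,     EX.food))  # 'is' written as is_a to keep a valid name
g.add((EX.souvlaki, EX.contains, EX.meat))

# SPARQL: what does souvlaki contain?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?ingredient WHERE { ex:souvlaki ex:contains ?ingredient }
""")
for row in results:
    print(row.ingredient)                   # http://example.org/meat
```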
Challenges
Data providers speak different languages.
Data providers can go offline.
Even knowing who to ask is a problem.
Responding in time can be challenging.
The (data) world changes.
SemaGrow: Distributed, Heterogeneous, Semantic Query Processing
Distributed queries over SPARQL endpoints
On-the-fly mapping across data provider languages
Adaptive to problematic data providers
Allows complex queries
Support for streaming data (sensors!)
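Not the SemaGrow API itself, but a generic sketch of what querying several SPARQL endpoints from one client looks like, using the SPARQLWrapper Python library against two public endpoints (DBpedia and Wikidata). The example resources are assumptions and the output depends on the live endpoints.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def ask(endpoint: str, query: str):
    """Send one SPARQL query to one endpoint and return the JSON bindings."""
    client = SPARQLWrapper(endpoint, agent="federation-sketch/0.1")
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]

# Two independent data providers, queried separately by the client.
dbpedia = ask("https://dbpedia.org/sparql", """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Souvlaki> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    } LIMIT 1
""")

wikidata = ask("https://query.wikidata.org/sparql", """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?item WHERE { ?item rdfs:label "souvlaki"@en . } LIMIT 1
""")

# The client then combines what the two endpoints returned; a federation
# layer like SemaGrow aims to do this planning and merging transparently.
for row in dbpedia:
    print(row["abstract"]["value"][:80], "...")
for row in wikidata:
    print(row["item"]["value"])
```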
Summary
Distributed computing allows:
- Generating amazing amounts of data
- Handling amazing amounts of data
- Computational availability and fail-over
- On-demand computation power
- Security
Distributed artificial intelligence allows:
- Asking complex questions over data
- Combining data
- Generating knowledge
- Exploiting knowledge
From Distributed Computing to Distributed Artificial Intelligence
Dr. Christos Filippidis, NCSR Demokritos
Dr. George Giannakopoulos, NCSR Demokritos
Thank you!