ポストムーア 時 代 に 向 けた プログラミングモデルと 実 装 技 術 Programming Models and Implementation After the End of Moore s Law

Similar documents
Quality of. Leadership. Quality Students of Faculty. Infrastructure

世 界 のデブリ 関 連 規 制 の 動 向 Current situation of the international space debris mitigation standards

TS-3GA (Rel10)v Telecommunication management; File Transfer (FT) Integration Reference Point (IRP); Requirements

Procedures to file a request to the JPO for Patent Prosecution Highway Pilot Program between the JPO and the HPO

Introducing DIGNIO and WEB DIGNIO Targeting Male Executives in Japan

Application Guidelines for International Graduate Programs in Engineering

Grant Request Form. Request Form. (For continued projects)

レッドハット 製 品 プライスリスト Red Hat Enterprise Linux2013 新 製 品 (ベースサブスクリプション) 更 新 :2015 年 4 22

宇 宙 機 および 衛 星 のワイヤレス 化 を 支 えるマイクロ 波 半 導 体 デバイス 集 積 回 路 技 術 と 無 線 電 力 伝 送

Unwillingness to Use Social Networking Services for Autonomous Language Learning among Japanese EFL Students

2015 ASPIRE Forum Student Workshop Student Reports 参 加 学 生 報 告 書

Local production for local consumption( 地 産 地 消 ) Worksheet 1. Task: Do you agree with this idea? Explain the reasons why you think so.

Miyagi Prefecture Earthquake Disaster Recovery Plan

How To Write A Program For A Microprocessor (Fem) With A Nonzero Power (Fmi) And A Non Zero Power (M) (Fami) (Or A Non-Zero Power) (For Microprocessor) (Program

Quality Risk Management as a Tool for Improving Compliance and E ciency

レッドハット 製 品 プライスリスト Red Hat Enterprise Linux 製 品 (RHEL for HPC) 更 新 :2015 年 4 22

Graduate School of Engineering. Master s Program, 2016 (October entrance)

Financial Modeling & Valuation Training

Cost Accounting 1. B r e a k e v e n A n a l y s i s. S t r a t e g y I m p l e m e n t a t i o n B a l a n c e d S c o r e c a r d s

Kumiko Sakamoto, Ms., Acad, Social Science, Japan: Women and men in changing societies: Gender division of labor in rural southeast Tanzania [A2]

2, The "Leading Initiative for Excellent Young Researchers" application screen (Home)

JAPAN PATENT OFFICE AS DESIGNATED (OR ELECTED) OFFICE CONTENTS

自 計 式 農 家 経 済 簿 記 とその 理 論

グローバル 人 材 育 成 言 語 教 育 プログラム. Global Linkage Initiative Program 2015 GLIP 履 修 ガイド. 夏 学 期 集 中 英 語 プログラム Summer Intensive English が 始 まります

Request for Taxpayer Identification Number and Certification 納 税 者 番 号 および 宣 誓 の 依 頼 書

JF Standard for Japanese-Language Education 2010

英 語 上 級 者 への 道 ~Listen and Speak 第 4 回 ヨーロッパからの 新 しい 考 え. Script

Integrating Elements of Local Identity into Communication Activities

Frontier of Private Higher Education Research in East Asia

Japan s Civil-Military Coordination in U.N. Peacekeeping Operations

Let s ask Tomoko and Kyle to the party. A. accept B. except C. invite D. include

Energy Environmental Strategy and CCS エネルギー 環 境 問 題 とCCS

Introduction to the revised Child Care and Family Care Leave Law

To disseminate TRIZ into the business world

Disaster Management System for Wide-Area Support and Collaboration in Japan

Panasonic AC-DC Power Supply Design Support Service

Even at an early age, students have their own ideas about school,

California Subject Examinations for Teachers

Linc English: A Web-Based, Multimedia English Language Curriculum Implementation and Effectiveness Report

IS MAC Address Restriction Absolutely Effective?

平 成 26 年 度 生 入 学 選 考 試 験 英 語 [ 特 待 生 入 試 ]

Isolation and identification of human skin fibroblast cell growth factor (SFGF) from the scallop shells

Data Mining for Risk Management in Hospital Information Systems

Special Program of Engineering Science 21 st Century. for Graduate Courses in English. Graduate School of Engineering Science, OSAKA UNIVERSITY

Study on the Diagnostics and Projection of Ecosystem Change Associated with Global Change

Hanako: What do you like to listen to? 2 I am still waiting to hear. 3 I can t hear it. 4 I d like you to listen. 会 津 大 学. 1 All kinds of music.

Stormy Energy Future and Sustainable Nuclear Power

育 デジ (Iku-Digi) Promoting further evolution of digital promotion

The acute osteoporotic vertebral compression fracture

ࢪ 㛤 ᅜ ࠉ Ẽ ኚ ᨻ ሗ 㞟 ㄆㄪᰝ ࢪ 㛤 ᅜ Ẽ ኚ ᨻ ሗ 㞟 ㄆㄪᰝ ࠉ ሗ ሗ ᖹᡂ ᖺ ᖹᡂ 㸦 ᖺ㸧 ᖺ ᨻἲ ᨻἲ ᅜ㝿༠ຊᶵᵓ㸦-,&$㸧 ࠉ ᅜ㝿༠ຊᶵᵓ 㝈 ࢫ ࠉᮏࠉᕤࠉႠࠉ ࠉᘧࠉ ࠉ ቃ JR ࠉ ࠉసᴗ㸹 ᕝ

Instructional Design and Teachers and Learners Factors Affecting Active and Meaningful Participation in Online Discussion

JASE-world. Presentation of Japanese technology of waste to energy

A Large-Scale Self-Organizing Map for Metagenome Studies for Surveillance of Microbial Community Structures

Groundwater Temperature Survey for Geothermal Heat Pump Application in Tropical Asia

Communication Strategies and their Role in English Language Learning

Studies of Large-Scale Data Visualization, GPGPU Application and Visual Data Mining

第 9 回 仮 想 政 府 セミナー Introduction Shared Servicesを 考 える ~Old but New Challenge~ 東 京 大 学 公 共 政 策 大 学 院 奥 村 裕 一 2014 年 2 月 21 日

Document and entity information

Optimizing Configuration and Application Mapping for MPSoC Architectures

Industrial Safety and Health Act ( Act No. 57 of 1972)

EDB-Report. 最 新 Web 脆 弱 性 トレンドレポート( ) ~ Exploit-DB( 公 開 されている 内 容 に 基 づいた 脆 弱 性 トレンド 情 報 です

Direct Marketing Production Printing & Value-Added Services: A strategy for growth

(2) 合 衆 国 では, 人 口 の 増 加 のほぼ 三 分 一 が 移 民 [ 移 住 ]によってもたらさている

TS-3GB-S.R v1.0 VoIP Supplementary Services Descriptions: Call Forwarding - Unconditional

Course Material English in 30 Seconds (Nan un-do)

Information Security Policies in the Telecommunications Field

Current Situation of Research Nurse Education and Future Perspectives in Japan

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

T s u n a m i s a n d S t o r m S u r g e s

Graduate Program in Japanese Language and Culture (Master s Program) Application Instructions

A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network

October Agency, Food Labeling Division

Why today? What s next?

Discover the power of reading for university learners of Japanese in New Zealand. Mitsue Tabata-Sandom Victoria University of Wellington

Standard specifications SRA210D-01-FD11

Education, English and Positive Psychology

医 学 生 物 学 一 般 問 題 ( 問 題 用 紙 1 枚 解 答 用 紙 2 枚 )

2016 Hosei University Graduate Schools Business School of Innovation Management Global MBA Program Credited Auditor Admission Guideline


SUPPLEMENT 1-2 第 52 回 年 会 講 演 予 稿 集 主 催 日 本 生 物 物 理 学 会 ( 木 ) ~ 27 ( 土 ) 札 幌 コンベンションセンター. Vol.54 ISSN CODEN : SEBUAL

Using the Moodle Reader Module to Facilitate an Extensive. Reading Program

Six degrees of messaging

Universities are coming under more pressure to prove

Offenders Rehabilitation Act (Act No. 88 of June 15, 2007)

Grade 9 Language and Literature (English) Course Description

The meaning of (Second) Life

Research on Support for Persons with Mental Disability to Make Transition to Regular Employment

レッドハット 製 品 プライスリスト 標 準 価 格. Red Hat Enterprise Linux 製 品 (RHEL Server)

Civilization, Nation and Modernity in East Asia By Chih-yu Shih

APPLICANT GUIDELINES for. The International Course on Systems Life Sciences in THE GRADUATE SCHOOL OF SYSTEMS LIFE SCIENCES, KYUSHU UNIVERSITY

Microsoft SQL Server PDW 新世代 MPP 資料倉儲解決方案

<HNAS_SNMP 監 視 取 扱 説 明 書 別 紙 2 イベントリスト> Event ID

Editors: Kim Bradford-Watts Robert Chartrand Eric M. Skier

2009 年 3 月 26 日 2 泊 3 日 スーザン 田 辺 先 生 引 率 のセーレム 市 の 高 校 生 が 当 会 の 会 員 宅 や 会 員 友 人 宅 にホームステイをしました 以 下 は 彼 らの 感 想 文 です

中 国 石 化 上 海 石 油 化 工 研 究 院 欢 迎 国 内 外 高 层 次 人 才 加 入

INTERNATIONAL PACIFIC RESEARCH CENTER APRIL 2005 MARCH 2006 REPORT

MPLS Configration 事 例

JAXA's research strategy to orbital debris mitigation

Transcription:

ポストムーア 時 代 に 向 けた プログラミングモデルと 実 装 技 術 Programming Models and Implementation After the End of Moore s Law 田 浦 健 次 朗 ( 東 京 大 学 情 報 理 工 学 系 研 究 科 ): 議 論 参 加 者 八 杉 昌 宏 ( 九 州 工 業 大 学 情 報 工 学 研 究 院 ) 佐 藤 幸 紀 ( 東 京 工 業 大 学 学 術 国 際 情 報 センター) 平 石 拓 ( 京 都 大 学 学 術 情 報 メディアセンター) 滝 澤 真 一 朗 ( 理 化 学 研 究 所 計 算 科 学 研 究 機 構 )

Latency/Bandwidth Hierarchy A deep and continuous spectrum Bandwidth-increasing technologies not depending on process shrink registers L1 1 ns volatile L2 10 ns intra core 3Ds tacked L3 (2~1 B/F?) DRAM main mem (0.2~0.1 B/F) NVRAM 100 ns 1 s intra chip intra board intra rack (0.1~0.01 B/F) (0.01~0.003 B/F) NAND Flash 10 s 100 s across rasks Optical Circuit Switch (> 1 B/F?) non-volatile 1 ms HDD 10 ms Latency

Desired Programming Model Supports for Future Hardware Hierarchical algorithms (vs. flat SPMD) To access memory along memory hierarchy (blocking) To extract parallelism and exploit locality along time axis (parallel-in-time, temporal blocking) Transparent data accesses (vs. hybrid program) Latency within a board and latency to the next node differ only a few times Cf. latency to L1 and latency within a board are two orders of magnitude difference Difference betwen intra-node and inter-node is no longer especially significant Latency hiding Latency within a board is already too long to waste High-bandwidth memory and networks can support increased traffic

Basic Approach: Global Task Parallelism and Global Address Space Global task parallelism build programs with many fine grain tasks the system automatically load-balances tasks Global address space (GAS) a single data structure can span across nodes

Task parallel and GAS: benefits and challenges Benefit: easy to program Load/data distribution done by the system One can reason about program correctness from a global view point Benefit: (potential) high performance Naturally support nested/recursive parallelism blocking for deep mem/comm hierarchies parallelism involving time-axis Hide latencies by task switching (unsolved) challenges: Fundamental difficulties in making automatic load/data distribution efficiently enough

Our Proposals Theme 1: Global task parallel/address space models utilizing Optical Circuit Switching (OCS) Theme 2: Latency hiding algorithms for task parallel programs: controlled context switching Theme 3: Fault tolerant task parallel systems utilizing NVRAM: local and fine-grain checkpointing algorithms

Theme 1: Global task parallel/address space utilizing Optical Circuit Switching

Optical Circuit Switching (OCS) Network Strengths: High bandwidth Dozens Tbps via Wavelength Division Multiplexing (WDM) Low power Does not need E-O-E conversion Weaknesses Communication degree is limited; need a circuit setup (reconfiguration) to talk to manynodes A reconfiguration takes some time Dozens ms 2D MEMS Wavelength Selection Switching (WSS) 10s [Farrington et al. 2013] need an efficient circuit management by software

OCS and its application to computing toady Feasibility study of OCS for HPC [Barker et al. 2006] Fast reconfigurable OCS [Farrington et al. 2013] Demonstrate 11.5s reconfiguration time with 2D MEMS WSS Managing fast reconfiguring OCS and [Rodrigues et al. 2015] Propose pseudo round-robin switching between topologies Evaluate with KV store and MPI apps A 10240 hosts testbed [Christodoulopoulos et al. 2014] Our past work: OCS-enabled MPI [Takizawa et al. 2008] [Rodrigues et al. 2015] [Christodoulopoulos et al. 2014]

Proposed agenda in global task parallelism and GAS utilizing OCS networks Challenges: Automatic load distribution tends to have random communication patterns How to cope with the limited communication degree? Clues: The communication is induced by load and data balancing decided by the system can we develop a load-balancing algorithm that generates OCS-friendly communication patterns?

Optimizing data processing systems with OCS networks Data processing frameworks utilizing optical paths Reconfigure optical routes depending on traffic volumes Automate latency hiding/reduction for reconfiguration Implement in MapReduce and other data processing models Apply to In-situ visualization, genome, data assimilation in HPC Hybrid network architecture EPS for latency sensitive communication and disk IO OCS for bulk large transfer

Performance profiling for task parallel/gas Data dependence analysis by Exana analysis tool Extract data dependence through memory among loop iterations and functions Discover regions that can be executed independently or in a pipelined manner legend: Solid edges: Control flow Dashed edges: children: nested regions data dependencies siblings: independent regions Dashed edges: Oval nodes: Loops intra-node dependencies Rectangular nodes: Procedure calls Evolve to performance analysis tool for PGAS programs 1) Capture data accesses to various hierarchy levels (caches, local memory, remote memory, other nodes) 2) Analyze how tasks access data 3) Analyze/predict performance under various task/data mapping schemes Dependencies of hmmer (protein homology search based on HMM )

Classifying access patterns Four base patterns More complex access sequences are described as compound patterns R4@4006ab = { Base pattern Example SeqStr Access: R4@4005ac = SeqStr:[4x3](_8_[4x3])*2 } +offset+ +offset+ Base pattern Base pattern Four access patterns currently identified by Exana Currently, Exana automatically inserts prefetching for nearsequential access patterns Extend it to PGAS programs for communication aggregation and prefetching 1) Extend to accommodate communication among tasks 2) Cooperate with performance models and simulaters

Theme 2: Latency hiding algorithms for task parallel programs : controlled context switching

Throughput (M create+join/sec) Task parallel systems today (intra-node) Languages and libraries for shared memory programming Cilk, TBB, OpenMP task, Recent efforts in HPC contexts Qthreads, Argobots (ANL), Plasma/Quark (Tennessee), ParalleX (Indiana), Open Common Runtime, Legion (Stanford), Our work MassiveThreads: Latency of common ops Dual Broadwell E5-2699 2.3GHz (unit: cycles) 700 Task creation/join throughput Operation Create 38 latency (local) Intra-core (across hwt) Intra-socket (across core) Across socket 600 500 400 300 Join (fall through) 24 200 100 Steal 300 1390 2537 Wake up 87 464 807 0 0 20 40 60 80 workers used

Proposed agenda in Latency hiding and task parallelism Challenges Uncontrolled context switches may backfire increase working sets Screw up access patterns Tools in our disposal Hardware threads (oversubscription) Software switch (already < intraboard memory latency) Can we design an algorithm that decides, based on predicted cost/benefits, whether to sit idle or switch, and how (HW or SW)? thread swtich by hardware Task switch by software 遅 延 1 ns 10 ns 100 ns 1 s 10 s 100 s 1 ms 10 ms Intra core Intra chip Intra board Intra rack Across racks

Theme 3:fault tolerant task parallel systems utilizing NVRAM : local and fine-grain checkpointing algorithms

Fault tolerant task parallel systems today Basic approach : obviates the need for globally consistent snapshots The exact order of task execution is irrelevant State of the art Retry failed tasks [Mehmet et al. 14] Only for transient errors Unclear how to deal with sideeffects on shared states

Proposed agenda A truly crash tolerant algorithm supporting side-effects to shared states What if each task is executed like a transaction? Execute(task) { Execute a task (induce side effects) Remove predecessor tasks if no more tasks depend on it } Can we utilize NVRAM to make it efficient?

Summary Proposed three research agendas aligned to bandwidth-enhancing technologies to come, without relying on process shrink OCS-friendly load balancing for tasks Carefully controlled latency hiding NVRAM for Fine grain checkpointing of tasks

共 同 研 究 国 内 産 総 研 工 藤 OCSネットワークアーキテクチャ アルゴ アプリ: アプリ 班 でターゲットとなるもの 東 工 大 横 田 理 央 ExaFMM 国 外 Cray Chapel (Brad Chamberlain) User Level Threads 標 準 化 WG (Pavan Balaji et al.) Locality aware task scheduler (Miquel Pericas)

田 浦 グループ 提 案 Holy Grail: 高 性 能 な 大 域 負 荷 分 散 大 域 アドレス 空 間 光 サーキットスイッチ(OCS)の 効 率 的 利 用 チャレンジ: 通 信 次 数 が 限 られたOCSに friendlyな 動 的 負 荷 分 散 データアクセスの ルーティングアルゴリズム HW/SWを 組 み 合 わせた 包 括 的 遅 延 隠 蔽 チャレンジ: HWTのoversubscription, SWタ スク 切 り 替 えの 是 非 を 性 能 に 基 づき 正 しく 決 定 するアルゴリズム 共 通 基 盤 技 術 としてのタスク 並 列 大 域 ア ドレス 空 間 プログラムのメモリアクセスト レーサ, 性 能 シミュレータ NVRAMを 活 用 した 耐 故 障 タスク 並 列 処 理