Parallel & Cloud Computing (PCC)

A Multi-Core Pipelined Architecture for Parallel Computing

Duoduo Liao *1, Simon Y. Berkovich
Computing for Geospatial Research Institute
Department of Computer Science, George Washington University
22nd Street NW, Washington, DC, USA
*1 dliao@gwu.edu; berkov@gwu.edu

Abstract- Parallel programming on multi-core processors has become the industry's biggest software challenge. This paper proposes a novel parallel architecture for executing sequential programs using multi-core pipelining based on program slicing by a new memory/cache dynamic management technology. The new architecture is well suited to processing large geospatial data in parallel without parallel programming. This paper presents a new architecture for parallel computation that addresses the problem of having to relocate data from one memory hierarchy to another in a multi-core environment. A new memory management technology inserts a layer of abstraction between the processor and the memory hierarchy, allowing the data to stay in one place while the processor effectively migrates as tasks change. The new architecture can make full use of the pipeline and automatically partition data, then schedule them onto multiple cores through the pipeline. The most important advantage of this architecture is that most existing sequential programs can be used directly with nearly no change, unlike conventional parallel programming, which has to take into account scheduling, load balancing, and data distribution. The new parallel architecture can also be successfully applied to other multi-core/many-core architectures and heterogeneous systems. In this paper, the design of the new multi-core architecture is described in detail. The time complexity and performance analysis are discussed in depth. The experimental results and performance comparison with existing multi-core architectures demonstrate the effectiveness, flexibility, and diversity of the new architecture, in particular for Big Data parallel processing.

Keywords- Multi-Core Architecture;
Pipelining; Sequential Programs; Program Slicing; Crossbar Switching; Parallel Computing; Big Data

I. INTRODUCTION

As multi-core architectures gain widespread use, it becomes increasingly important to be able to harness their additional processing power to achieve higher performance. However, exploiting parallel cores to improve single-program performance is difficult from a programmer's perspective, because most existing programming languages dictate a sequential method of execution. Parallel programming on multi-core processors has become the industry's biggest software challenge. Because multi-core hardware architectures have changed to parallel structures, single-processor-based software has to be optimized or even rewritten, with much work, to meet the hardware constraints. However, if we can change the hardware to eliminate the constraints, most existing single-processor software programs may be used directly with minimal changes or even without any change. For this purpose, this paper proposes a new parallel architecture using multi-core pipelining based on program slicing by crossbar switching and a new memory/cache dynamic management technology. The new architecture can automatically partition data and schedule them onto multiple cores through the pipeline. This architecture provides a simple and effective solution to on-the-fly computations by transferring the operating states from core to core. The most important advantage is that it requires practically the same software as currently used on single-processor systems, instead of conventional parallel computing methods such as threading, load balancing, and scheduling. The rest of this paper is organized as follows: Section 2 gives a broad overview of the background and related work thus far. Section 3 describes the detailed design of the new parallel architecture using multi-core pipelining based on program slicing by switching. It covers the crossbar-switch-based multi-core memory and cache architecture, the multi-core pipeline organization, the timing diagram, and the program
requirements. Section 4 gives the time complexity, performance, and experimental analysis. In particular, examples of large geospatial data processing, such as Digital Elevation Model (DEM) generation from Light Detection and Ranging (LIDAR) datasets, are discussed. Finally, the summary and advantages are presented in Section 5.

II. BACKGROUND AND RELATED WORK

A. Conventional Parallel Computing

In general, there are three major approaches used for multiprocessor processing [9] [10]:
- Data-parallel: partition the data and schedule them onto the multiple processors.
- Task-parallel: partition a program into functions/tasks and schedule them onto the multiple processors.
- Pipeline-parallel: decompose a program and run each stage simultaneously on sequential elements of the data flow.

The first approach is suitable for data-independent circumstances. However, the scheduling may be complicated, depending on the application programs.

PCC, 2013, pp. 49-57. www.vkingpub.com © 2013 American V-King Scientific Publishing
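The three styles above can be made concrete with a minimal sketch (my own illustration, not from the paper); `process`, `stage1`, and `stage2` are invented stand-ins for real per-chunk computation:

```python
def process(chunk):
    # Stand-in for the per-chunk computation run by every processor.
    return [x * x for x in chunk]

# Data-parallel: partition the data, apply the same program to each chunk.
data = list(range(12))
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
data_parallel_result = [y for c in chunks for y in process(c)]

# Pipeline-parallel: decompose the program into stages and stream the
# chunks through them in order.
def stage1(c):
    return [x + 1 for x in c]

def stage2(c):
    return [x * 2 for x in c]

pipeline_result = [stage2(stage1(c)) for c in chunks]
```

In a real system each chunk (data-parallel) or each stage (pipeline-parallel) would run on its own core; the sketch only shows how the work is divided.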
For the second method, in practice it is often difficult to divide a program in such a way that separate CPUs can execute different portions without interfering with each other. Furthermore, this type of parallel processing requires very sophisticated software. For the third parallel computing method, a pipeline is a common paradigm for very high-speed computation. Pipeline parallelism allows for parallelization of a single task when there is a partial or total order in the dataset, implying the need for state and therefore preventing the use of data parallelism. This approach is limited by the sequential decomposability of the task and the length of the longest stage. In this paper, the new architecture uses a new parallel mechanism in combination with data parallelism and pipeline parallelism. It can automatically partition data and schedule them onto multiple cores through a pipeline without changing the original single-processor program, instead of decomposing the entire program as the conventional pipeline-parallel method does or scheduling as the conventional data-parallel method requires.

B. Multiprocessor Pipeline by Program Slicing

Another promising parallel computing method is based on the multi-processor pipeline architecture, which divides the program into slices of equal duration by forced interrupts, as described in [2] [4] [5]. This technology has US Patent No. 6,145,071, issued to and owned by The George Washington University. It can automatically schedule the program onto the multiple processors. The architecture processes an information flow progressively in a helicoidal pattern by relocating portions of incoming data. This pattern ensures that the incoming data flow will not be interrupted. The multi-processor pipeline allows an arbitrary algorithm to be performed on-the-fly on a data chunk, given a sufficient number of processors. If an algorithm can be performed by a conventional microprocessor under static conditions, it can be performed on the multiprocessor pipeline. Another advantage of this architecture is to use
practically the same software as a sequential computer and to be able to continuously process intensive information flows. This multi-processor pipeline is a simple and effective solution to the problem of continuous processing of intensive flows of information. The technique has been proposed for effectively processing the challenging problem of very intensive continuous flows of data [3]. However, there are several limitations for some applications. Frequent data relocation, especially for large-data-block applications like 3D graphics and image processing, can cause big overhead costs, leading to an overall performance decrease [1]. The system described in [2] uses data overlapping to solve the problem of processed data chunks crossing the segmentation imposed by the buffer size. Parts of the memories of the processors have to be occupied by these duplicate (i.e. overlapped) data. Moreover, to reduce bus traffic, the data relocations, i.e. the loadings/unloadings, are arranged so that only one data chunk relocates on the one shared bus at a particular time. Additionally, for data requiring a longer pipeline to process, the pipeline needs special handling, such as an overflow facility, or accumulation and then sending back when the data stream ceases. In this paper, the core difference between the new multiprocessor pipeline and the original one [4] [5] is that the new pipeline is driven by crossbar switching instead of forced interrupts. Hence, the novel crossbar-switching-based multiprocessor architecture with a new memory/cache management technology significantly overcomes all the above limitations of the original multiprocessor pipeline.

C. Multi-core Architectures and Programming

Currently, multi-core processor architectures are divided into two basic categories: generic multi-core CPUs and Graphics Processing Units (GPUs). Intel [11] and AMD [1] provide a large number of multi-core CPUs on the market. Both released dual-core chips in 2005 and quad-core chips in 2007. However, GPUs were originally designed with
a special purpose for 3D graphics applications, and the hardware GPU architecture differs from multi-core CPUs significantly. Intel's latest multi-core graphics chip, known as Larrabee [15], is one of the boldest graphics projects in the world; it offers full compatibility with graphics APIs and is capable of processing the entire x86 instruction set implemented in modern processor architectures. Currently, there are several major multi-core programming development platforms. RapidMind's [1] Multicore Development Platform supports multiple processor architectures, including NVidia's GPUs, ATI's GPUs, IBM's Cell BE, and Intel's and AMD's x86 CPUs. CUDA (Compute Unified Device Architecture) [7, 8] is a software platform for massively parallel high-performance computing on NVidia's powerful GPUs. It does require programmers to exert some manual effort and write some explicit code. OpenCL (Open Computing Language) [13] is an open industry standard for general-purpose parallel programming of heterogeneous systems. However, all of the current multi-core architectures need multi-threading-based parallel programming. In this paper, a completely different way is proposed to design multi-core CPU architectures without conventional parallel programming.

D. Crossbar-Based Technologies

The crossbar switch [6] has been used in many areas, such as telephone exchanges. As computer technologies have improved, crossbar switches have found uses in systems such as the multistage interconnection networks that connect the various processing units in a Uniform Memory Access (UMA) parallel processor to the array of memory elements. Crossbar switches have also been designed to link entire computer systems. Crossbar-based memory architecture has been used on mainframe computers to increase memory bandwidth in
multi-processor systems for decades. Companies such as Unisys, SGI, and Sun have brought the technology down to server and workstation platforms. NVidia's patented Lightspeed Memory Architecture (LMA) [1] employs a crossbar to maximize the efficiency of data transfer between the graphics processing unit and the graphics memory on the GPU. A memory crossbar can eliminate bottlenecks associated with existing memory architecture, as it replaces the conventional system bus architecture. Instead of sharing a bus, communication between the processor and the memory uses dedicated connections. In this paper, a distinct dynamic memory/cache management technology using crossbar techniques is proposed for the new multi-core architectures.

III. A NEW PARALLEL ARCHITECTURE

A. A New Multi-Core Memory and Cache Architecture Based on Crossbar Switching

Memory and cache management is very important for multi-core systems. The novel memory/cache management technology described in this section is one of the core parts of this new architecture design. It significantly improves total performance in both space and time. Therefore, this part is introduced first.

Fig. 1 Crossbar-switch-based multi-core memory and cache CPU architecture (Memories 1..n and Caches 1..n connected through a crossbar switch to Controllers 1..n and Cores 1..n with their PSWs; *PSW: Program Status Words)

Since data blocks can be very large in parallel applications, if DMA (Direct Memory Access) is used to move big data blocks from the memory of the previous processor to the next processor frequently each time, the accumulated overhead costs cannot be ignored. To reduce the data relocation costs and bus limitations, the new architecture uses a crossbar-switch-based dynamic management technique for both memories and caches to avoid relocating the data down the pipeline from one processor to another. The new memory and cache architecture for this technique is illustrated in Fig. 1. Each processor/core does not
have a fixed memory and cache as an ordinary processor does. Instead, each of them is assigned to connect to a given memory and cache at a given time. There is only one bus between such a memory and cache, and the two of them can be regarded as a group. The number of groups is the same as the number of cores. Each core has its own controller, which is employed to switch the entire memory and cache, i.e. one group, from the previous processor to the next processor. The previous processor also passes the Program Status Words (PSW) to the next processor, so that the next processor can resume operations where the previous processor stopped. Once the group of the previous processor has migrated to the next processor, the current group of the previous processor needs to be cleared for use. A crossbar switch is the key to carrying out this dynamic memory/cache management technology. It moves two memory and cache groups between two processors at a given time. Correspondingly, the bus crossbar is used to connect all the cores to all the groups, guaranteeing data transmission between them at full speed and with no contention. Although the crossbar used in this architecture is similar to NVidia's LMA, they differ in principle. The purpose of the crossbar in our architecture is mainly memory and cache switching, apart from high-bandwidth data transmission. Furthermore, from the view of the cores, the N-to-N connection relationship covers all cores and groups. This differs from NVidia's LMA, in which one core has multiple memory controllers connecting to their corresponding memory banks; there, from the view of the core, a 1-to-N connection relationship relates a given core to all its memory banks. Additionally, the bus crossbar used in the new architecture can also reduce large bus traffic, because each memory and cache group has a dedicated connection to one processor. This not only reduces the bus traffic, but also eliminates the constraint that data movement has to be arranged at a particular time, as mentioned before.

B. New Multi-Core Pipeline Organization

The new
multi-core pipeline architecture does not relocate the data down the pipeline, as would be the case for the original multi-processor pipeline when a switch is applied. Instead, it only switches the memory and cache group of the previous processor, where the data have been loaded and/or processed, to become the group of the next processor. These data also include the small amount of operating-state data, such as the Program Status Word (PSW) and all registers with the program counter, so that the next processor can resume operations where the previous processor stopped.

Fig. 2 New multi-core pipeline organization (INPUT/OUTPUT flow of groups G1-G3 across processors P1-P3 over time; G = group (memory and cache); S = switching (on/off for the connection between each processor and each group); P1, P2, P3 = processors)
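The group-switching idea described above can be sketched in software; this is a hypothetical toy model (class and function names are my own invention), in which a switch reassigns whole memory/cache groups to the next core instead of copying any data:

```python
class Group:
    """A memory-and-cache group; the data chunk lives here and never moves."""
    def __init__(self, gid):
        self.gid = gid
        self.data = None   # resident data chunk
        self.psw = None    # saved operating state (PSW) for resuming

def switch(assignment, n_cores):
    # Crossbar switching: every core i simultaneously hands its group to
    # core (i + 1) mod N. Only the core-to-group mapping changes.
    return {(core + 1) % n_cores: grp for core, grp in assignment.items()}

n_cores = 3
assignment = {core: Group(core) for core in range(n_cores)}
assignment[0].data = "chunk-A"            # chunk loaded into group 0 by core 0
assignment[0].psw = {"pc": 100}           # state saved before the switch
assignment = switch(assignment, n_cores)  # core 1 now owns group 0; no copy
```

After the switch, core 1 finds both the chunk and the saved PSW already in place, mirroring how the next processor resumes where the previous one stopped.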
The new multi-processor pipeline organization and data flow are depicted in Fig. 2. The major improvement is to replace Unloading (U) with Switching (S) in the organization. However, Loading (L), Processing (P), and Switching (S) do NOT rotate through the columns in the cycle L, P, S, L in a helicoidal pattern as in the original pipeline. Instead, the Switching (S) for all the processors occurs at the same time in the new multi-processor pipeline. Accordingly, Loading (L), Processing (P), and Switching (S) may rotate through the columns in the cycles L-P-S, P-P-S, or P-L-S, as shown in Fig. 3. In Fig. 2, the vertical orange dashed lines show the cycles. Each small rectangle represents one operation in one cycle. All blocks and arrows with the same colors describe the flows of the data chunks. P1, P2, and P3 denote the processors or cores. In theory, there could be any number of processors, depending on the applications and user requirements. Because the architecture does not specify a particular processor to perform a particular operation, data-dependent branching of the algorithm does not require special handling. A processor working on a particular data chunk behaves just as a standard processor, resulting in variable-length processing times for the data chunks. If a chunk becomes fully processed before the end of the pipeline, the result can be withdrawn. Data which require longer processing can be switched back to connect to the first processor to continue processing. This is totally different from the original multiprocessor pipeline [4], which needs to send such data to some sort of overflow processing facility, or accumulate them and send them back through the pipeline when the incoming data stream ceases. This new method is advantageous for highly variable data processing times. The next section gives more details, explained with an example in Fig. 3.

Fig. 3 Timing diagram of a 3-core pipeline (rows P1-P3 show the per-cycle operations; L = Load (a new data chunk), P = Process, S = Switch (on/off))

C. Timing Diagram

Fig. 3 shows the timing diagram of an example 3-core pipeline. It is assumed that Switching (S) takes one cycle. In fact, S may take more than one cycle or less than one cycle, depending on the hardware and software. Because of the dynamic memory/cache management with the crossbar-based techniques, all the Switchings (S) occur at the same time. That is, all the processed data from previous processors to next processors are switched at the same time. In Fig. 3, each colored block represents a data chunk being processed on different processors at different times. Blocks of the same color indicate the data flow for one chunk. A new data chunk can be loaded (L) by any one of the processors as soon as the previous data chunk has finished processing. In other words, any one of the processors can load data as long as it is free. Data chunks of various lengths or processing times can thus be automatically scheduled onto the processors without considering any load balancing or scheduling issues. In fact, one data chunk is always processed within one memory and cache group, although the core connected to this group changes at every switching time. That is to say, the data do not move while the core or processor moves. This is the major difference from the original multiprocessor pipeline, in which the data chunk has to be relocated when a switch is applied. The new architecture is much more efficient in both space and time, and the overall performance can be improved significantly. This is totally different from the multiprocessor pipeline system in [4, 5], which allows only one data chunk relocation at a particular time on one shared bus.

D. Program Requirements

An important feature of this architecture is that it uses practically the same software as a sequential computer. A program for this system can be developed on an
ordinary sequential computer. Each processor is provided with the same application program. To run on the system, the program just has to incorporate some interrupts. The interrupt is triggered by the crossbar switching. A processor working on a particular piece of data will, upon crossbar switching, move its memory to the next processor. The processor will also pass the PSW, so that the next processor can resume operations where the previous processor stopped. The formation of the PSW in this system is similar to the routine procedure of formatting the PSW for interrupts in ordinary microprocessors.

IV. PERFORMANCE AND EXPERIMENTAL ANALYSIS

A. Performance

Clearly, the time complexity of an algorithm on the new multi-core pipelined architecture only affects the total length of the pipeline, including the overhead of the memory and cache group switches between processors. Let us analyse how this new multi-core pipelined architecture improves the overall performance for parallel computing. The principle of the N-core pipeline is described in Fig. 4.
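The PSW hand-off just described can be illustrated with a small sketch; the assumptions are mine (a toy accumulation loop and an invented PSW record holding just a program counter and registers):

```python
from dataclasses import dataclass

@dataclass
class PSW:
    """Minimal stand-in for a Program Status Word: resume point + registers."""
    program_counter: int
    registers: dict

def run_slice(psw, steps, data):
    # Execute `steps` iterations of a toy accumulation, then stop as if
    # interrupted by the crossbar switching, returning the saved state.
    acc = psw.registers.get("acc", 0)
    pc = psw.program_counter
    for _ in range(steps):
        if pc >= len(data):
            break
        acc += data[pc]
        pc += 1
    return PSW(pc, {"acc": acc})

data = list(range(10))
psw = PSW(0, {})
psw = run_slice(psw, 4, data)   # first core processes elements 0..3
psw = run_slice(psw, 4, data)   # next core resumes at element 4
psw = run_slice(psw, 4, data)   # third core finishes the chunk
```

Each call plays the role of one processor's slice: the returned PSW is all the next processor needs to continue exactly where the previous one stopped.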
Fig. 4 The principle of the N-core pipeline (processors P1, P2, P3, ... over time in cycles; S = switching; n = per-core processing cycles; Lp = length of the pipeline)

For the multi-core pipeline system shown in Fig. 4, to better explain the performance, the pattern of the system operations is assumed to be discretized into cycles, each of duration C. In fact, it can be discretized into much smaller time units in modern architectures, so that the architectures can be programmable. Let us consider a program that processes a dataset containing data chunks of equal or various sizes. Processing the entire dataset requires K cycles. On a conventional processor, the execution time of the program, Tc, can be expressed by

Tc = KC    (1)

In Fig. 4, let N be the number of processors/cores. If the entire dataset can be processed on all N cores, then on average the number of cycles on each processor/core, n, is given by

n = K / N    (2)

Thus, the total time spent on the conventional processor can also be described, combining the above two equations, as

Tc = nNC    (3)

For the N-core pipeline, let m be the number of cycles in each interval of equal duration in the pipeline. Thus, the length of the N-core pipeline for processing the data, Lp, can be described by

Lp = n + m(N - 1)    (4)

Let q be the number of intervals for the entire pipeline:

q = Lp / m = n/m + (N - 1)    (5)

Since switching by the crossbar may take some extra time, let d be the number of cycles of processing and delays or latencies for each switching, depending on the hardware. The total number of cycles of overhead of the context switching in the N-core pipeline, Hc, can be expressed as

Hc = (q - 1)d = (n/m + N - 2)d    (6)

Thus, in the N-core pipeline system, the total length of the pipeline, denoted by L, to process the same size of data, including the overhead of context switching, can be described by

L = Lp + Hc = n + m(N - 1) + (n/m + N - 2)d    (7)

The total time spent on this multi-core pipelined system can be obtained by

T = LC    (8)

Let us first calculate the speed-up of the execution time on the new N-core pipeline system compared to the time on one conventional processor. Combining Equation (1), Equation (3), Equation (7), and Equation (8), the speed-up is given by

Speedup = Tc / T = nN / [n + m(N - 1) + (n/m + N - 2)d]    (9)

B. Simulation Analysis

We developed a tool to evaluate this new architecture. In this section, some experimental results are reported based on this architecture, with various overheads taken into account. In the first four experiments, shown in Fig. 5, Fig. 6, Fig. 7, and Fig. 8, we fix the total execution length K of a given sequential program and the one-time switching overhead d.

Fig. 5 The performance for different intervals (speedup versus the number of cycles of the interval m)

Fig. 5 shows that the performance increases with increasing interval time m for a fixed number of cores N and switching time d, reaches a maximum, then decreases slowly. This is because the pipeline becomes longer as m becomes bigger, and the performance is affected by the length of the pipeline. Fig. 6 shows that the performance increases with an increasing number of cores for a fixed interval time m and switching time d, reaches a maximum, then decreases slowly. This is also because the pipeline may lengthen
slowly as the number of cores becomes big. That is to say, for a given program with fixed interval time m and switching time d, the performance may not reach the maximum for some numbers of cores. Put another way, given an appropriate interval time m and switching time d, the performance can reach the maximum speedup. In practice, since d is usually fixed for such a multi-core system due to the crossbar switch, the maximum speedup can be obtained with an interval time appropriate to the number of cores, which is the maximum performance for any multi-core system in theory.

Fig. 6 The performance for different numbers of cores (speedup versus the number of cores)

Fig. 7 Maximum performance for different switching times (best interval cycle number and maximum speedup versus the number of cycles of the switching time d)

Fig. 7 shows the maximum performance for different switching times d. In the figure, the blue line shows the interval time m, and the red line shows the maximum performance, max_speedup. For a given program and a fixed number of cores N, the best interval time m_b needed to reach the maximum performance max_speedup increases as the switching time d increases. However, max_speedup decreases slowly accordingly. This clearly indicates that the pipeline length grows with the increase of both the interval time and the overhead of context switching (i.e. m and d); accordingly, the performance decreases.

Fig. 8 The performance for different dataset sizes (several values of K) with different numbers of cores

Fig. 8 shows that the performance Speedup increases with an increasing number of cores and increasing data size. Note that the Speedup reported is the maximum, max_speedup, in this experiment. The figure clearly indicates that when the application program becomes larger, the Speedup increases linearly with the number of cores. That is, the Speedup is almost equal to the number of cores for large applications.

Fig. 9 Maximum performance for different intervals and numbers of cores (best interval time versus the number of cores for several program sizes K)

Fig. 9 shows that the best interval time m_b needed to reach the maximum performance decreases and then becomes stable as the number of cores increases. Furthermore, the best interval times m_b for different sizes of application programs are close to each other when the number of cores increases. This implies that the most appropriate interval time m can be chosen to maximize the performance for all application programs on a multi-core system with a certain number of cores. This is a trade-off for a multi-core system to ensure the best performance for all applications.

In summary, as the number of cores increases, the overall performance increases. However, choosing the best interval time m_b is the key to making full use of all the cores to maximize the performance for all application programs. Such a possible m_b can be found for all application programs based on the analysis of Fig. 9. In addition to selecting one fixed best interval time m_b for the entire system, a dynamic best interval time m_b can also be automatically assigned to each application program according to the total application and data size, the number of cores, and the switching time on a multi-core system when the program is compiled.

C. Experimental Results and Performance Comparison

We developed a function simulator to evaluate this
multi-core pipelined parallel system and compare it with existing multi-core systems. The test multi-core programs are based on the algorithms for DEM generation from LIDAR datasets that we designed and developed previously.

Fig. 10 Stripes of the LIDAR data: (a) the LIDAR data divided into stripes; (b) rendering of the combined DEM

The tests were run on a Dell PC with a single Pentium(R) D processor at 3 GHz and GB RAM, equipped with an NVidia GeForce GTX GPU featuring CUDA cores. The LIDAR dataset consists of points in (x, y, z, value) form. In order to test it on the multi-core architectures, it is divided into stripes along the Y-direction, as shown in Fig. 10(a). The DEM generation algorithm is employed for each LIDAR stripe; all DEM stripes have the same size. The experimental results are reported in Figs. 12-13. Fig. 12 and Fig. 13 show the experimental results of using the algorithms to generate a DEM based on the LIDAR data. Fig. 12 indicates the time spent on DEM generation for each LIDAR stripe on the single-processor system. Obviously, the processing time for each LIDAR stripe is different. Fig. 13 shows the comparison of the new multi-core pipelined GPUs and existing multi-core GPUs generating the same DEM from the same LIDAR stripes. It clearly indicates that the performance of the new multi-core GPU architecture is better than that of the existing ones. This is because data partitioning, load balancing, and scheduling need to be considered for the existing multi-core GPU systems. Moreover, these conventional parallel methods cannot be applied easily on current multi-core systems, and the performance may be affected by the particular data partitioning or load balancing and scheduling methods. However, the new multi-core system can directly use the original sequential program for parallel computing. Thus, it does not need to take into account data distribution, load balancing, and scheduling issues. All of the data can be automatically partitioned and scheduled onto the different cores through the pipeline. Furthermore, as the number of cores increases, the performance of the new architecture increases much more than that of the existing ones with the same number of cores.

Fig. 12 Time spent on each LIDAR stripe (DEM generating time in seconds versus stripe number)

Fig. 10(b) shows the corresponding rendering effects of the combined DEM pieces. Fig. 11(a) shows that the generated DEM matches the original LIDAR data very well in 3D space; the 3D terrain based on the DEM is rendered in color ramps, as indicated in Fig. 11(b).

Fig. 11 (a) The LIDAR data matches the generated DEM in 3D space; (b) generated DEM rendering in color ramps

Fig. 13 Comparison of the new multi-core pipelined architecture and existing multi-core architectures (speedup versus number of cores)
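The performance model of Section IV.A can be checked numerically. The sketch below is my own transcription, using the symbol reading n = K/N (per-core cycles), m (interval length in cycles), and d (switching overhead in cycles); it implements the speedup formula of Eq. (9) and sweeps m for the best interval m_b:

```python
def speedup(K, N, m, d):
    """Eq. (9): speedup of the N-core pipeline over one conventional core."""
    n = K / N                                  # Eq. (2): cycles per core
    pipeline_length = n + m * (N - 1)          # Eq. (4): Lp
    overhead = (n / m + N - 2) * d             # Eq. (6): context-switch cost Hc
    return (n * N) / (pipeline_length + overhead)

def best_interval(K, N, d, m_max=500):
    """Sweep the interval length m for the maximum of Eq. (9)."""
    return max(range(1, m_max + 1), key=lambda m: speedup(K, N, m, d))

# With negligible switch cost and a short interval, speedup approaches N,
# matching the near-linear curves reported for large programs.
s = speedup(1_000_000, 10, 1, 0)
# A costlier switch pushes the best interval m_b upward, as in Fig. 7.
m_cheap = best_interval(100_000, 16, 1)
m_costly = best_interval(100_000, 16, 64)
```

The sweep also reproduces the trade-off discussed above: enlarging m stretches the pipeline (Eq. (4)) while shrinking the number of switches (Eq. (6)), so an interior optimum m_b exists.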
V. CONCLUSIONS AND FUTURE WORK

This paper presents a new multi-core pipelined architecture based on crossbar-switching-driven multiprocessor pipelining and dynamic memory and cache management techniques. The new architecture is well suited to processing large data in parallel without parallel programming, in the geospatial domain as well as other high-performance computing areas. The proposed architecture provides a simple and effective implementation of on-the-fly parallel computing by switching the entire data from core to core through the crossbar switch. This architecture makes full use of the pipeline. It can automatically partition data and schedule them onto multiple cores through the pipeline. It does not need conventional, complicated parallel computing methods, such as load balancing, scheduling, and data distribution. This is exactly the advantage of the proposed architecture. Obviously, the new multi-core pipeline architecture significantly overcomes the limitations of the original multiprocessor pipeline described in Section 2. In particular, the core difference between these two multiprocessor architectures is that the new architecture relocates the cores instead of moving the data as the original one does. More specifically, the advantages of the new pipeline over the original pipeline's limitations are summarized as follows:
- No data relocation. Context switching is employed to minimize the big overhead costs of data relocation when a switch is applied.
- No data overlapping. Due to context switching, no processed data chunks cross the segmentation.
- All the Switches (S) are forced by the crossbar switching control at a regular time, and all the data are switched to their corresponding processors; the original multiprocessor pipeline allows only one data relocation at a particular time, due to the one shared bus.
- No special handling for data requiring a longer pipeline to process. The new architecture does not need special handling for such data, whereas the
original multiprocessor pipeline needs an overflow facility for accumulating and sending data back when the data stream ceases.

Additionally, more specific advantages of the new switch-based dynamic memory and cache management technology in the new architecture are as follows:

- It avoids moving the data from one memory to another.
- It allows more than one switching operation at a time; all the switchings occur at the same time.
- It reduces bus limitations and heavy bus traffic by using the crossbar instead of a shared bus.
- It improves performance in both space and time.

Finally, to summarize, the overall advantages of this new multi-core pipelined architecture are as follows:

- It provides continuous data processing of intensive information flows.
- It requires essentially the same software as an ordinary sequential algorithm.
- It avoids load balancing and scheduling.
- It avoids the need for synchronization among the processors.
- It avoids busy waiting of processors on a spin-lock.
- It avoids duplication of the incoming data stream.
- It is suitable for highly variable data processing times.

Another advantage worth mentioning is that this multiprocessor/multi-core pipelining technology provides an important solution for processing continuous, intensive information flows without limitation by size. Consequently, it would be very helpful for real-time massive data processing, especially geospatial computing.

Simon Y. Berkovich earned an M.S. in Applied Physics from the Moscow Physical-Technical Institute and a Ph.D. in Computer Science from the Institute of Precision Mechanics and Computer Technology of the USSR Academy of Sciences. He is a Professor in the School of Engineering and Applied Science at George Washington University. Prof. Berkovich has played a leading role in a number of research and development projects on the design of advanced hardware and software systems, including the construction of superconductive associative memory, the development of large information systems for economics, the investigation of computer communications for multiprocessor systems, and the enhancement of information retrieval procedures. Prof. Berkovich has several hundred professional publications in various areas of physics, electronics, computer science, and biological
cybernetics. He is the author of six books and holds 3 patents. Among his inventions is a method for dynamic file construction that later became known as B-tree and extendible hashing. He was elected a member of the European Academy of Sciences for an outstanding contribution to computer science and the development of fundamental computational algorithms.

Duoduo Liao earned a Ph.D. and an M.S. in Computer Science from George Washington University and Purdue University, respectively. She has worked for federal government agencies and universities on PC-clustered highway driving simulator systems, 3D graphics and visualization, virtual reality, GIS, traffic simulation, air traffic management, multi-core architectures, heterogeneous computing, and related topics. She worked at ESRI, where she developed the first version of the Stereo Viewer for ArcGIS, and she pioneered the product development of PC-based, high-resolution, quad-buffered 3D stereographic accelerators using the earliest PC graphics chips invented by 3D Labs. Dr. Liao has authored numerous technical publications and two professional books on GPU-based research and OpenGL programming. She has been invited to give talks by federal government agencies, leading industry companies, and universities. She was an adjunct professor at George Mason University. She is a member of ACM and IEEE, and serves as a conference chair and on the editorial boards and committees of several international conferences.