A Tensor-based Approach for Big Data Representation and Dimensionality Reduction

Liwei Kuang, Fei Hao, Laurence T. Yang, Man Lin, Changqing Luo, and Geyong Min

Abstract: Variety and veracity are two distinct characteristics of large-scale and heterogeneous data. Efficiently representing and processing big data with a unified scheme has been a great challenge. In this paper, a unified tensor model is proposed to represent unstructured, semi-structured and structured data. With a tensor extension operator, various types of data are represented as sub-tensors and then merged into a unified tensor. In order to extract the core tensor, which is small but contains valuable information, an Incremental High Order Singular Value Decomposition (IHOSVD) method is presented. By recursively applying the incremental matrix decomposition algorithm, IHOSVD is able to update the orthogonal bases and compute the new core tensor. Analyses of the time complexity, memory usage and approximation accuracy of the proposed method are provided. A case study illustrates that approximate data reconstructed from a core set containing 18% of the elements can guarantee 93% accuracy in general. Theoretical analyses and experimental results demonstrate that the proposed unified tensor model and IHOSVD method are efficient for big data representation and dimensionality reduction.

Index Terms: Tensor, HOSVD, Dimensionality Reduction, Data Representation

L. Kuang, F. Hao and C. Luo are with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China. L. T. Yang is with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, and the Department of Computer Science, St. Francis Xavier University, Antigonish, NS, Canada. M. Lin is with the Department of Computer Science, St. Francis Xavier University, Antigonish, NS, Canada. G. Min is with the College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, EX4 4QF, United Kingdom.

1 INTRODUCTION

Big data are a collection of datasets consisting of massive unstructured, semi-structured, and structured data. The four main characteristics of big data are volume (amount of data), variety (range of data types and sources), veracity (data quality), and velocity (speed of incoming data). Although many studies have been done on big data processing, very few have addressed the following two key issues: (1) how to represent the various types of data with a simple model; (2) how to extract core data sets that are smaller but still contain valuable information, especially for streaming data. The purpose of this paper is to explore these two issues, which are closely related to the variety and veracity characteristics of big data.

Logic and Ontology [1], two knowledge representation methodologies, have been investigated widely. Composed of syntax, semantics and proof theory, Logic is used for making statements about the world. Although Logic is concise, unambiguous and expressive, it works only with statements that are true or false, and is therefore hard to use for reasoning about unstructured data. Ontology is the set of concepts and relationships that help people communicate and share knowledge. It is definitive and exhaustive, but it also causes incompatibility among different application domains, and thus is not suitable for representing and integrating heterogeneous big data.
The study of data dimensionality reduction has been widely reported in the literature. Previous approaches include Principal Component Analysis (PCA) [2], Incremental Singular Value Decomposition (SVD) [3], and Dynamic Tensor Analysis (DTA) [4]. These methods are suitable for low-dimensional reduction but suffer from some limitations: they are time-consuming when performed on high-dimensional data, and they fail to extract core data sets from streaming big data. This paper presents a unified tensor model for big data representation and an incremental dimensionality reduction method for high-quality core set extraction. Data with different formats are employed to illustrate the representation approach, and equivalence theorems are proven to support the proposed reduction method. The major contributions are summarized as follows.

Unified Data Representation Model: We propose a unified tensor model to integrate and represent unstructured, semi-structured, and structured data. The tensor model has extensible orders to which new orders can be dynamically appended through the proposed tensor extension operator.

Core Tensor Equivalence Theorem: To tackle the recalculation and order inconsistency problems in big data processing with the tensor model, we prove a core tensor equivalence theorem which
can serve as the theoretical foundation for designing incremental decomposition algorithms.

Recursive Incremental HOSVD Method: We present a recursive Incremental High Order Singular Value Decomposition method for streaming data dimensionality reduction. Detailed analyses in terms of time complexity, memory usage and approximation accuracy are also provided.

The remainder of this paper is organized as follows. Section 2 recalls the preliminaries of tensor decomposition. Section 3 presents a framework for big data representation and processing. A unified tensor model for big data representation is proposed in Section 4. Section 5 presents a novel incremental dimensionality reduction method. A case study of intelligent transportation is investigated in Section 6. After reviewing the related work in Section 7, we conclude the paper in Section 8.

2 PRELIMINARIES

This section reviews the preliminaries of singular value decomposition [5] and tensor decomposition [6]. The core tensor and truncated bases described below can be employed to make big data smaller.

Definition 1: Singular Value Decomposition (SVD). Let M ∈ R^{m×n} denote a matrix; the factorization

M = U Σ V^T    (1)

is called the SVD of M. Matrices U and V refer to the left singular vector space and the right singular vector space of M respectively. Both U and V are unitary orthogonal matrices. Matrix Σ = diag(σ_1, σ_2, ..., σ_k, ..., σ_l), l = min{m, n}, is a diagonal matrix that contains the singular values of M. In particular,

M_k = U_k Σ_k V_k^T    (2)

is called the rank-k truncated SVD of M, where U_k = [u_1, ..., u_k], V_k = [v_1, ..., v_k], Σ_k = diag(σ_1, ..., σ_k), k < l. The truncated SVD of M is much smaller to store and faster to compute. Among all rank-k matrices, M_k is the unique minimizer of ||M - M_k||_F.

Definition 2: Tensor Unfolding. Given a P-order tensor T ∈ R^{I_1×I_2×...×I_P}, the mode-p tensor unfolding [7] T_(p) ∈ R^{I_p×(I_{p+1} I_{p+2} ... I_P I_1 I_2 ... I_{p-1})} contains the element t_{i_1 i_2 ... i_p i_{p+1} ... i_P} at the position with row number i_p and column number equal to

(i_{p+1} - 1) I_{p+2} I_{p+3} ... I_P I_1 I_2 ... I_{p-1} + (i_{p+2} - 1) I_{p+3} ... I_P I_1 I_2 ... I_{p-1} + ... + (i_1 - 1) I_2 I_3 ... I_{p-1} + (i_2 - 1) I_3 I_4 ... I_{p-1} + ... + i_{p-1}.

Example 1. Consider a three-order tensor T ∈ R^{2×4×3}; Fig. 1 shows the three unfolded matrices T_(1), T_(2) and T_(3).

Fig. 1. Three-order tensor unfolding; tensor T is unfolded to three matrices.

Definition 3: p-mode product of a tensor by a matrix. Suppose a tensor T ∈ R^{I_1×I_2×...×I_{p-1}×I_p×I_{p+1}×...×I_P} and a matrix U ∈ R^{J_p×I_p}; the p-mode product (T ×_p U) ∈ R^{I_1×I_2×...×I_{p-1}×J_p×I_{p+1}×...×I_P} is defined as

(T ×_p U)_{i_1 i_2 ... i_{p-1} j_p i_{p+1} ... i_P} = Σ_{i_p=1}^{I_p} (a_{i_1 i_2 ... i_{p-1} i_p i_{p+1} ... i_P} u_{j_p i_p}).    (3)

The p-mode product is a key linear operation for dimensionality reduction, and the truncated left singular vector matrix U ∈ R^{J_p×I_p} (J_p < I_p) is used to reduce the dimensionality of order p from I_p to J_p.

Fig. 2. Tensor dimensionality reduction with p-mode product; the dimensionality of the 2nd order is reduced from 8 to 2 by a 2×8 matrix.

Definition 4: Core Tensor and Approximate Tensor. For an initial tensor T, the core tensor S [8] and the approximate tensor T̂ are defined as

S = T ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T,    (4)

and

T̂ = S ×_1 U_1 ×_2 U_2 ... ×_P U_P.    (5)

The core tensor S is viewed as a compressed version of the initial tensor T. By keeping only the leading k unitary orthogonal vectors of each unfolded matrix, the principal characteristics are preserved. Big data applications can simply keep the core tensor S and the truncated bases U_1, U_2, ..., U_P.
When needed, the data can be reconstructed by generating the approximate tensor with Eq. (5). The right singular vector matrices V_1, V_2, ..., V_P and the singular values are folded into the core tensor, which contains the coordinates of the approximate tensor with respect to the left singular vector matrices.
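To make Definitions 2-4 concrete, the following minimal NumPy sketch (ours, not from the paper) implements unfolding, the p-mode product, and a truncated HOSVD. The unfolding below orders columns differently from Definition 2, but it spans the same set of columns, so the left singular vectors, and hence the core tensor, are unaffected.

```python
import numpy as np

def unfold(T, p):
    """Mode-p unfolding (Definition 2, up to column order)."""
    return np.moveaxis(T, p, 0).reshape(T.shape[p], -1)

def fold(M, p, shape):
    """Inverse of unfold for a target tensor shape."""
    rest = [s for i, s in enumerate(shape) if i != p]
    return np.moveaxis(M.reshape([shape[p]] + rest), 0, p)

def mode_product(T, U, p):
    """p-mode product T x_p U (Definition 3, Eq. (3))."""
    shape = list(T.shape)
    shape[p] = U.shape[0]
    return fold(U @ unfold(T, p), p, shape)

def hosvd(T, ranks):
    """Core tensor S (Eq. (4)) and truncated bases U_1..U_P."""
    Us = [np.linalg.svd(unfold(T, p), full_matrices=False)[0][:, :k]
          for p, k in enumerate(ranks)]
    S = T
    for p, U in enumerate(Us):
        S = mode_product(S, U.T, p)
    return S, Us

T = np.random.rand(2, 4, 3)          # the tensor of Example 1
S, Us = hosvd(T, T.shape)            # full ranks: lossless
T_hat = S                            # reconstruction, Eq. (5)
for p, U in enumerate(Us):
    T_hat = mode_product(T_hat, U, p)
print(np.allclose(T, T_hat))         # True
```

With truncated ranks the same loop yields the compressed core data sets described above, at the cost of a reconstruction error.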
In general, the reconstructed data are more useful than the original data, as noise, inconsistency and redundancy are removed.

Fig. 3. Illustration of the core tensor and the approximate tensor. The core tensor S and the truncated unitary orthogonal bases (U_1, U_2, U_3) are called core data sets that can be used to make big data smaller, while the reconstructed approximate tensor is a substitute for the initial tensor.

3 DATA REPRESENTATION AND PROCESSING FRAMEWORK

In this section, a tensor-based data representation and processing framework is proposed. Fig. 4 depicts a three-tier framework in which different modules are enabled in each layer. We elaborate the functions and responsibilities of each module in a bottom-up view.

Fig. 4. Data representation and processing framework; unstructured data (e.g., video, audio), semi-structured data (e.g., XML, HTML, GPS, EHR) and structured data flow through the data collection, data tensorization, data dimensionality reduction, data analysis (mining algorithms, inference methods, data visualization) and data service layers, serving applications such as transportation, finance and healthcare.

1) Data Collection Module. This module is in charge of collecting various types of data from different areas, for example, video clips, XML documents and GPS data. The streaming data incrementally arrive and are temporarily agglomerated together without changing their original format.

2) Data Tensorization Module. Since the collected unstructured, semi-structured and structured data are not uniform, these data need to be represented with a unified tensor model. Sub-tensors with various orders are generated to model the data according to their initial format. Then, all the sub-tensors are integrated into a unified heterogeneous tensor.

3) Data Dimensionality Reduction Module. This module efficiently processes the high-dimensional tensorized data and extracts the core data sets, which are much smaller for storage and computation. The reduction is carried out by the proposed IHOSVD algorithm, which incrementally updates the orthogonal bases of each unfolded matrix.

4) Data Analysis Module. Numerous algorithms, such as clustering algorithms and multi-aspect prediction algorithms, are included in this module. The module helps uncover the potential value behind large-scale heterogeneous data. The data visualization module in this layer helps users easily understand the data values.

5) Data Service Module. The data service module provides services according to the requirements of different applications. For instance, with smart monitoring appliances, proactive health care services can be provided to users based on a thorough understanding of their physical status.

This paper mainly focuses on the data tensorization module and the data dimensionality reduction module.

4 A UNIFIED DATA REPRESENTATION MODEL

This section proposes a tensor-based data representation model and a tensorization approach for transforming heterogeneous data into a unified model. Firstly, an extensible order tensor model and a tensor extension operator are presented. Secondly, we illustrate how to tensorize unstructured, semi-structured and structured data as sub-tensors. Thirdly, the integration of sub-tensors into a unified tensor is studied. Tensor order and tensor dimension, two easily confused concepts, are discussed at the end.

4.1 Extensible Order Tensor

In general, time and space are two basic characteristics of data collected from different areas, while users are the major recipients of data services.
Therefore, a general tensor-based data model is defined as

T ∈ R^{I_t × I_s × I_u × I_1 × ... × I_P}.    (6)

Eq. (6) shows a (P + 3)-order tensor which contains two parts, namely the fixed part R^{I_t × I_s × I_u} and the extensible part R^{I_1 × ... × I_P}. The tensor orders I_t, I_s and I_u denote time, space and user respectively.
In the tensor model, data characteristics are represented as tensor orders. For example, the color space characteristic of unstructured video data can be modeled as an order I_c. For heterogeneous data, various characteristics are represented as tensor orders and attached to the fixed part using the proposed tensor extension operator.

Definition 5: Tensor Extension Operator. Let A ∈ R^{I_t × I_s × I_u × I_1} and B ∈ R^{I_t × I_s × I_u × I_2}; the tensor extension operator ⊕ is given by the following function:

f_⊕ : A ⊕ B → C, C ∈ R^{I_t × I_s × I_u × I_1 × I_2}.    (7)

Operator ⊕ satisfies the associative law; in other words, (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C). By virtue of Eq. (7), heterogeneous data can first be tensorized as low-order sub-tensors and then extended to a high-order unified tensor. The operator merges the identical orders while keeping the diverse orders; elements of an identical order are accumulated together. For instance, suppose sub-tensors T_sub1 and T_sub2 have time orders denoted as I_t^1 and I_t^2, where I_t^1 = {i_1, i_2} and I_t^2 = {i_1, i_3}. After extension, the time order of the new tensor T = T_sub1 ⊕ T_sub2 becomes I_t = {i_1, i_2, i_3}.

4.2 Tensorization Method

Examples of unstructured data include video data and audio data, while semi-structured data comprise XML documents, ontology data, etc. Representative structured data are numbers and character strings stored in relational databases. In this paper, a video clip, an XML document and GPS data are employed to illustrate the tensorization process.

Video data can be represented as a four-order tensor or a three-order tensor. To represent a video clip of MPEG-4 format, 25 frames per second and RGB color space, a four-order tensor T ∈ R^{I_f × I_w × I_h × I_c} is adopted, with I_f, I_w, I_h, I_c indicating frame, width, height and color space. For instance, a 750-frame MPEG-4 RGB video clip can be tensorized as a four-order tensor with I_f = 750 and I_c = 3. In some applications, RGB color is transformed to gray level using the equation Gray = 0.299R + 0.587G + 0.114B, and the representation is replaced by a three-order tensor. Fig. 5 shows the process of transforming a video clip to a four-order tensor.

Fig. 5. Represent a video clip as a four-order tensor (frames I_f, width I_w, height I_h, and color I_c with red, green and blue channels).

Extensible Markup Language (XML) is semi-structured. Fig. 6 shows a simple XML document with seven elements and one attribute. The elements contain tags and contents, both consisting of characters from the Unicode repertoire. An XML document has a hierarchical structure and can be parsed as a tree; Fig. 6(b) is the parsed tree of Fig. 6(a). An XML document can be tensorized as a three-order tensor, where I_er and I_ec indicate the row and column orders of the markup matrix, and I_en represents the content vector order. For example, the XML document in Fig. 6(a) is tensorized as a three-order tensor whose content order has 28 dimensions, 28 being the length of the element Focus. Relationships among element, attribute and text are represented as numbers; in Fig. 6(c), the number 1 is used to indicate the parent-child relationship.

<?xml version='1.0' encoding='UTF-8'?>
<University>
  <Student Category='doctoral'>
    <ID></ID>
    <Name>Liang Chen</Name>
    <Research>
      <Area>Internet of Things</Area>
      <Focus>Architecture;Sensor Ontology</Focus>
    </Research>
  </Student>
</University>

Fig. 6. Represent XML document data as a three-order tensor; (a) gives the initial XML document shown above, (b) is the parsed tree, (c) shows the relationships between elements, attributes and text, and (d) illustrates the three-order tensor.

Relational databases are widely used to manage structured data. In a database table, simple fields with number or character string types can be modeled as a matrix. For complex fields, e.g., BLOBs, new orders are needed for representation. In Fig. 7, structured GPS data and student data are unified as a five-order tensor.

Fig. 7. The upper table (GPS records with student ID, longitude, latitude and time) is modeled as a four-order sub-tensor over I_t, I_y, I_x, I_id; the lower table (student ID and student name) is modeled as a two-order sub-tensor over I_id, I_name; the two sub-tensors are unified as a five-order tensor.
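As a sketch of the video tensorization in this subsection (the clip's true resolution was lost from the text, so the 64x48 size below is a made-up placeholder), the gray-level conversion collapses the color order of the four-order tensor into a three-order tensor:

```python
import numpy as np

# Hypothetical clip: four-order tensor (frame I_f, width I_w, height I_h, color I_c).
clip = np.random.randint(0, 256, size=(750, 64, 48, 3)).astype(np.float64)

# Gray = 0.299 R + 0.587 G + 0.114 B removes the color order I_c.
weights = np.array([0.299, 0.587, 0.114])
gray = np.tensordot(clip, weights, axes=([3], [0]))
print(clip.shape, '->', gray.shape)   # (750, 64, 48, 3) -> (750, 64, 48)
```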
4.3 Unified Tensor Representation Model

Big data are composed of unstructured data d_u, semi-structured data d_semi and structured data d_s. Due to the requirement of processing all types of heterogeneous data, a unified data tensorization operation is performed using the following equation:

f : (d_u, d_semi, d_s) → T_u ⊕ T_semi ⊕ T_s.    (8)

With Eq. (7) and Eq. (8), d_u, d_semi and d_s are transformed into sub-tensors T_u, T_semi and T_s, which are later integrated into a unified tensor. For example, on the basis of the transformed video clip, XML document and structured tables described in Figs. 5-7, the final tensor is obtained as

T ∈ R^{I_t × I_s × I_u × I_w × I_h × I_er × I_ec × I_en × I_id × I_na}.    (9)

In Eq. (9), order I_f is identical to order I_t, orders I_x and I_y are combined into order I_s, and order I_c is unnecessary because the gray level is adopted. Since too many orders may increase the decomposition complexity, fewer orders are preferable at the data representation stage. An element of the ten-order tensor in Eq. (9) is described as an eleven-tuple

e = (TI, SP, U, W, H, ER, EC, EN, ID, NA, V),    (10)

where TI, SP and U refer to the fixed orders time, space and user, W and H denote the orders from video data, ER, EC and EN are XML document characteristics, ID and NA are for GPS data, and V is the value of element e. Tuples generated from a heterogeneous tensor are usually sparse, and only the nonzero elements are essential for storage and computation. The generalized tuple format according to Eq. (6) is defined as

e = (TI, SP, U, i_1, ..., i_P, V).    (11)

Fig. 8 illustrates the extensible order tensor model from another point of view. The fixed part containing TI, SP and U is seen as an overall layer, while the extensible part is deemed an inner layer. The tensor is simplified as a two-layer model in which the inner model is embedded into the three-order (I_t × I_s × I_u) overall model. Using the tensorization method, the heterogeneous data are modeled as sub-tensors that are inserted into the two-layer model to generate the unified tensor.

Fig. 8. Visualization of the two-layer model for data representation; GPS, video and XML document sub-tensors are embedded into the (I_t × I_s × I_u) overall model.

4.4 Tensor Order and Tensor Dimension

As tensor order and tensor dimension are two key concepts for data representation, we give a brief comparison between them. Tensor T ∈ R^{I_1 × I_2 × ... × I_P} has P orders, and order i (1 ≤ i ≤ P) has I_i dimensions. A P-order tensor can be unfolded to P matrices. For the mode-i unfolded matrix T_(i), the number of rows is equal to I_i, while the number of columns is equal to ∏_{1 ≤ j ≤ P, j ≠ i} I_j. In many big data applications, it is impractical to store all dimensions of big data, which contain redundancy, uncertainty, inconsistency and incompleteness; thus it is essential to extract the valuable core data. During the extraction of the core data set, the number of tensor orders remains the same while the dimensionality is significantly reduced.
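A minimal sketch of the sparse tuple storage of Eqs. (10)-(11) in Section 4.3: the field names follow Eq. (10), while the coordinates and value below are made up for illustration.

```python
from typing import NamedTuple

class Element(NamedTuple):
    """Eleven-tuple of Eq. (10): fixed orders, extensible orders, value."""
    TI: int; SP: int; U: int          # fixed part: time, space, user
    W: int; H: int                    # video orders
    ER: int; EC: int; EN: int         # XML orders
    ID: int; NA: int                  # GPS orders
    V: float                          # element value

# Only nonzero elements of the sparse unified tensor are stored.
store = [Element(TI=3, SP=1, U=2, W=10, H=20, ER=0, EC=0, EN=0, ID=0, NA=0, V=0.87)]
print(store[0].V)
```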
5 INCREMENTAL TENSOR DIMENSIONALITY REDUCTION

In this section, a novel method is proposed for dimensionality reduction on streaming data. Firstly, two problems of tensor decomposition are defined. Then two equivalence theorems are proven, and an Incremental High-Order Singular Value Decomposition (IHOSVD) method that can efficiently compute the core data sets on streaming data is presented. Finally, the complexity and accuracy of the proposed method are discussed.

5.1 Problems Definition

Two important problems related to incremental tensor dimensionality reduction are: (1) the recalculation problem; (2) the order inconsistency problem. They are formally defined below.

Problem 1: Tensor Decomposition Recalculation. Let S_1 denote the core tensor obtained from the previous tensor T_1, and let T denote a new tensor. Combining T_1 with T, we obtain T_2 = T_1 ⊕ T. According to Eq. (4), the new core tensor S_2 of tensor T_2 is computed as

S_2 = T_2 ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T.    (12)
Decomposition recalculation occurs in Eq. (12) because the previous decomposition results obtained while computing core tensor S_1 are not reused. Problem 1 can be solved using Algorithm 1 and Algorithm 2, which are designed with the proposed recursive incremental singular value decomposition method.

Problem 2: Tensor Order Inconsistency. Assume T_1, S_2 and T_2 are defined as the previous tensor, the new core tensor and the new combined tensor. To compute S_2 with Eq. (4), the row number of each truncated orthogonal matrix U must be consistent with the dimensionality of the corresponding tensor order. However, one order dimensionality of the combined tensor T_2 is not equal to the row number of the truncated orthogonal matrix U. For instance, let T_1 ∈ R^{2×2×2} be a three-order tensor, whose three unfolded matrices are T_1(1) ∈ R^{2×4}, T_1(2) ∈ R^{2×4} and T_1(3) ∈ R^{2×4}. Given a new tensor χ ∈ R^{2×2×2}, combining it with the previous tensor T_1 along the third order I_3, we obtain T_2 ∈ R^{2×2×4}. The third order dimensionality of T_2 is 4, while the row number of the truncated orthogonal basis computed from matrix T_1(3) is 2. This leads to order inconsistency. In this paper, Theorem 1, Theorem 2 and Algorithm 3 are presented to address Problem 2.

5.2 Basis and Core Tensor Equivalence Theorems

The left singular vector matrix U plays a key role in dimensionality reduction and data reconstruction. Similarly, the truncated rank-k unitary orthogonal bases U_1, U_2, ..., U_P of the unfolded matrices construct the most basic coordinate axes of a P-order tensor. For heterogeneous big data dimensionality reduction, the major difficulty lies in computing the bases over variable dimensions. Our approach extends dimensions to a fixed length and finds equivalent bases. Two theorems are presented and proven to support this approach.

Theorem 1: Basis Equivalence of SVD. Let M_1 be an m × n_1 matrix, and let M_2 be an m × n_2 matrix whose left n_1 columns contain matrix M_1 and whose right n_2 - n_1 columns are zeros. Namely, M_2 = [M_1 0], M_1 ∈ R^{m×n_1}, M_2 ∈ R^{m×n_2}, n_1 < n_2. If the singular value decompositions of matrix M_1 and matrix M_2 are expressed as

M_1 = U_1 Σ_1 V_1^T, M_2 = U_2 Σ_2 V_2^T,    (13)

then the unitary orthogonal basis U_1 is equivalent to U_2.

Proof. From Eq. (13), we obtain

M_2 M_2^T = [M_1 0] [M_1 0]^T = M_1 M_1^T.    (14)

Considering

M_2 M_2^T = U_2 Σ_2 V_2^T V_2 Σ_2^T U_2^T = U_2 (Σ_2 Σ_2^T) U_2^T,    (15)

and

M_1 M_1^T = U_1 Σ_1 V_1^T V_1 Σ_1^T U_1^T = U_1 (Σ_1 Σ_1^T) U_1^T,    (16)

we obtain

U_1 (Σ_1 Σ_1^T) U_1^T = U_2 (Σ_2 Σ_2^T) U_2^T.    (17)

Note that both sides of Eq. (17) are spectral decompositions of two equal symmetric matrices. Additionally, the diagonal matrices Σ_1 Σ_1^T and Σ_2 Σ_2^T consist of the eigenvalues of these equal matrices. According to the uniqueness of eigenvalues, Σ_1 Σ_1^T and Σ_2 Σ_2^T are equal. It can be concluded that U_1 is equivalent to U_2. The equivalence implies that U_1 can be calculated by multiplying U_2 with a series of elementary matrices [9].

Based on Theorem 1, the following two corollaries can be derived.

Corollary 1: Let M_1 = [v_1, v_2, ..., v_n] and M_2 = [v_1, v_2, ..., 0, ..., 0, ..., v_n], where each v_i is a column vector; then the two matrices have equivalent left singular vector bases.

Corollary 2: Suppose M_2 = [M_1; 0], i.e., M_2 is M_1 with rows of zeros appended at the bottom; then matrices M_1 and M_2 have equivalent left singular vector bases. With Corollary 2, the orthogonal basis U_1 can be obtained by trimming the bottom zeros of the orthogonal basis U_2.
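A quick numerical check of Theorem 1 (our sketch, random data, assuming distinct singular values): padding zero columns onto M_1 leaves the left singular vectors unchanged up to column signs.

```python
import numpy as np

m, n1, pad = 5, 3, 4
M1 = np.random.rand(m, n1)
M2 = np.hstack([M1, np.zeros((m, pad))])          # M2 = [M1 0]

U1 = np.linalg.svd(M1, full_matrices=False)[0]
U2 = np.linalg.svd(M2, full_matrices=False)[0][:, :n1]

# Corresponding columns agree up to sign: |<u1_i, u2_i>| = 1.
print(np.allclose(np.abs(np.sum(U1 * U2, axis=0)), 1.0))  # True
```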
Theorem 1 and Corollaries 1 and 2 are employed to prove Theorem 2, defined as follows. Before the proof, we introduce a special matrix which will be used in Theorem 2.

Definition 6: Extension Matrix. An extension matrix is defined as M = [I; 0] ∈ R^{J_p × I_p}, J_p > I_p, i.e., an I_p × I_p identity matrix stacked on top of J_p - I_p rows of zeros. Multiplying a P-order tensor T ∈ R^{I_1 × I_2 × ... × I_p × ... × I_P} by the extension matrix M along order p extends the dimensionality of this order from I_p to J_p.

Theorem 2: Core Tensor Equivalence of HOSVD. Let T and G be P-order tensors, where T ∈ R^{I_1 × I_2 × ... × I_P} and G ∈ R^{I_1 × I_2 × ... × (l I_p) × ... × I_P} is obtained from T by padding zeros along order p, with l a positive integer. Define the trimming counterpart of the extension matrix as M = [I_{I_p} 0] ∈ R^{I_p × (l I_p)}, so that tensors T and G satisfy

T = G ×_p M = G ×_p [I_{I_p} 0].

Then the core tensors of T and G are equivalent in the sense that S_T = S_G ×_p M.

Proof. Unfold tensors T and G into P matrices T_(1), T_(2), ..., T_(P) and G_(1), G_(2), ..., G_(P). According to Theorem 1 and Corollaries 1 and 2, the corresponding unfolded matrices of tensors T and G have equivalent left singular vector bases. Besides, the p-mode product of a tensor by matrices A and B possesses the following properties:

T ×_i A ×_j B = T ×_j B ×_i A (i ≠ j),    (18)

and

T ×_i A ×_i B = T ×_i (BA).    (19)
Employing Eq. (4), the core tensors S_T and S_G are calculated with the following equations:

S_T = T ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T,    (20)

and

S_G = G ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T.    (21)

With Eqs. (18)-(21), we obtain

S_T = T ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T
    = (G ×_p M) ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T
    = G ×_1 U_1^T ×_2 U_2^T ... ×_P U_P^T ×_p M
    = S_G ×_p M.    (22)

Theorem 2 reveals that extending a tensor by padding zero elements does not transform the core tensor. After the unified representation of big data, the order numbers of the incremental tensor and the initial tensor are equal, but the dimensionalities are different. Theorem 2 can be used to solve this problem by resizing the dimensionality.

5.3 Incremental High Order Singular Value Decomposition

We propose an IHOSVD method for incremental dimensionality reduction on streaming data. The IHOSVD method consists of three algorithms that are used for recursive matrix singular value decomposition and incremental tensor decomposition. The three algorithms are described in detail below.

Algorithm 1 is a recursive algorithm with the recursive function given in Eq. (23). During execution, function f calls itself (Step 4) over and over again to decompose matrices M_i and C_i. Each successive call reduces the size of the matrix and moves closer to a solution; when matrix M_1 is finally reached, the recursion stops and the function exits.

f(M_i, C_i) = { svd(M_1),                         i = 1
              { mix(f(M_{i-1}, C_{i-1}), C_i),    i > 1    (23)

Algorithm 1 Recursive matrix singular value decomposition, (U, Σ, V) = R_MSvd(M_i, C_i).
Input: Initial matrix M_i. Incremental matrix C_i.
Output: Decomposition results U, Σ, V of matrix [M_i C_i].
1: if (i == 1) then
2:   [U, Σ, V] = svd(M_1).
3: else
4:   [U_j, Σ_j, V_j] = R_MSvd(M_{i-1}, C_{i-1}).
5:   [U, Σ, V] = mix(M_{i-1}, C_{i-1}, U_j, Σ_j, V_j).
6: end if
7: return U, Σ, V.

Algorithm 1 calls function mix (Step 5) to merge the column vectors of the incremental matrix with the decomposed components of the initial matrix. The additional vectors are projected onto the orthogonal bases, and the coordinates are combined with the singular values. The detailed procedure of function mix is described in Algorithm 2. For most tensor unfoldings, the number of rows is less than the number of columns; for such matrices, Algorithm 1 can efficiently compute the singular values and singular vectors by splitting the columns for recursive decomposition.

Fig. 9. (a) Incrementally incoming column vectors C are projected on the unitary orthogonal bases U_j, yielding coordinates L and the orthogonal directions J with coordinates K; (b) the middle quasi-diagonal matrix is diagonalized and the previous singular vector matrices are updated.

Algorithm 2 Merge an incremental matrix with previous decomposition results, (U, Σ, V) = mix(M_{i-1}, C_{i-1}, U_j, Σ_j, V_j).
Input: Initial matrix M_{i-1} and incremental matrix C_{i-1}. Decomposition results U_j, Σ_j, V_j of matrix M_{i-1}.
Output: New decomposition results U, Σ, V.
1: Project C_{i-1} onto the orthogonal space spanned by U_j: L = U_j^T C_{i-1}.
2: Compute H, which is orthogonal to U_j: H = C_{i-1} - U_j L.
3: Obtain the unitary orthogonal basis J from matrix H.
4: Compute the coordinates of matrix H: K = J^T H.
5: Execute SVD on the quasi-diagonal middle matrix of Eq. (25): [Ū, Σ̄, V̄] = svd([Σ_j L; 0 K]).
6: Obtain the new decomposition results per Eq. (26): ([U_j J] Ū) → U, Σ̄ → Σ, ([V_j 0; 0 I] V̄) → V.
7: return U, Σ, V.

Algorithm 2 applies the SVD-updating technique [3] for incremental matrix factorization. The additional columns in matrix C_{i-1} are projected onto the unitary orthogonal bases of the previous matrix M_{i-1} (Step 1).
Some column vectors are linear combinations of the orthogonal unitary basis U_j; the others have components orthogonal to the space spanned by U_j. As illustrated in Fig. 9, these two types of vectors are separated
to obtain the bases U_j, J and the coordinates L, K. These operations are implemented as Steps 2-4. The column space of the singular vector matrix U is spanned by the direct sum of the above two unitary orthogonal bases:

CS(U) = span([U_j J]).    (24)

Combining the coordinates with the previous singular values, we obtain a quasi-diagonal sparse matrix which is easy to decompose. The new equation consisting of the above orthogonal bases and coordinates is

[M_{i-1}, C_{i-1}] = [U_j, J] [Σ_j L; 0 K] [V_j^T 0; 0 I].    (25)

Let Ū and V̄ denote the left and right unitary orthogonal bases of the quasi-diagonal matrix in Eq. (25); the updated singular vector matrices are

U = [U_j J] Ū,  V = [V_j 0; 0 I] V̄.    (26)

Eq. (4) suggests that only the left singular vector matrix U is essential for tensor decomposition; therefore, the computation of matrix V can be omitted in Step 6 of Algorithm 2.

Employing the above two algorithms, we propose Algorithm 3 for incrementally computing the core tensor. In this algorithm, the extension matrix is used to ensure order consistency (Step 1). The unitary orthogonal bases U_(1), ..., U_(P) are updated in Steps 2-4, and the new core tensor S is obtained in Step 6. For demonstration purposes, Fig. 10 shows a simple example with a three-order tensor.

Algorithm 3 Incremental tensor singular value decomposition, (S, [U, Σ, V]_new) = IT_Svd(χ, T, [U, Σ, V]_initial).
Input: New tensor χ. Previous tensor T ∈ R^{I_1 × I_2 × ... × I_P}. SVD results [U, Σ, V]_initial of the previous unfolded matrices.
Output: New truncated SVD results [U, Σ, V]_new. New core tensor S.
1: Extend tensor χ and tensor T to identical dimensionalities.
2: Unfold the new tensor χ to matrices χ_(1), ..., χ_(P).
3: Call algorithm R_MSvd to update the above unfolded matrices.
4: Truncate the new orthogonal bases.
5: Combine the new tensor χ with the initial tensor T.
6: Obtain the new core tensor S with the p-mode products.
7: return S and [U, Σ, V]_new.

Fig. 10. Example of incremental tensor decomposition on a three-order tensor: after extension, unfolding and HOSVD of the increment, the truncated orthogonal bases U_1, U_2, U_3 of the new tensor are updated incrementally and the new core tensor S is computed.
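The following sketch condenses Algorithm 2 into one NumPy function (the name mix follows the paper; the rank-trimming tolerance is our addition). New columns C are folded into an existing factorization, and only U and the singular values are maintained, since Eq. (4) needs just U.

```python
import numpy as np

def mix(U, s, C, tol=1e-10):
    """One incremental SVD update in the spirit of Algorithm 2 (a sketch)."""
    L = U.T @ C                       # coordinates of C in span(U)      (Step 1)
    H = C - U @ L                     # component orthogonal to span(U)  (Step 2)
    J, K = np.linalg.qr(H)            # orthonormal basis of H           (Step 3)
    keep = np.abs(np.diag(K)) > tol   # drop numerically null directions
    J, K = J[:, keep], K[keep, :]     # coordinates K = J^T H            (Step 4)
    # Quasi-diagonal middle matrix of Eq. (25), rediagonalized (Step 5).
    Q = np.block([[np.diag(s), L],
                  [np.zeros((K.shape[0], s.size)), K]])
    Ub, sb, _ = np.linalg.svd(Q, full_matrices=False)
    return np.hstack([U, J]) @ Ub, sb  # updated basis and values, Eq. (26)

# Usage: decompose a matrix column-block by column-block.
M = np.random.rand(6, 10)
U, s, _ = np.linalg.svd(M[:, :4], full_matrices=False)
U, s = mix(U, s, M[:, 4:])
print(np.allclose(s, np.linalg.svd(M, compute_uv=False)))  # True
```

Applying mix repeatedly over column blocks realizes the recursion of Eq. (23) without ever re-decomposing the previously seen columns.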
5.4 Complexity and Approximation Accuracy

5.4.1 Time Complexity

The execution time of the proposed IHOSVD method consists of matrix unfolding, incremental singular value decomposition of each unfolded matrix, and the product of the tensor by the truncated bases. Let Time_unf, Time_isvd and Time_prod denote the time used by these processes respectively; the total time consumption satisfies

Time = Time_unf + Time_isvd + Time_prod.    (27)

Tensor unfolding is a simple transformation with O(1) time complexity. Time_isvd is equal to Time_1 + Time_2 + ... + Time_P = Σ_{i=1}^P Time_i, where Time_i refers to the time consumed by unfolded matrix T_(i). According to Eq. (23), Time_isvd can be obtained with

Time(i) = { C_1,                  i = 1
          { Time(i-1) + C_2,      i > 1,    (28)

where C_1 and C_2 are constants. The recursive calling process first adds columns and then updates them with the previous decomposition results. The time complexity of decomposing one unfolded matrix is O(k²n), where k refers to the number of truncated left singular vectors. For a truncated orthogonal basis U with k column vectors, the time complexity of the product of a tensor by a matrix is also O(k²n). To decompose a p-order tensor with p unfolded matrices, the time complexity of the proposed IHOSVD method is O(1) + O(pk²n) + O(pk²n), namely O(pk²n).

5.4.2 Memory Usage

Let Mem_u denote the memory used to store all truncated orthogonal bases, and let Mem_rmsvd and Mem_mix refer to the memory usage of the recursive process in Algorithm 1; then the total memory used by the proposed IHOSVD method is

Mem = Mem_u + Mem_rmsvd + Mem_mix.    (29)

The complexity of Mem_u is equal to O(kn). To incrementally compute the core tensor, the IHOSVD method needs to keep all the truncated orthogonal bases, and the
memory usage is Σ_{i=1}^P k_i I_i. According to Eq. (23), the memory needed during the recursive process is equal to

|M_i| + |C_i| + |M_{i-1}| + |C_{i-1}| + ... + |M_1| + |C_1|.    (30)

The complexity of this memory usage is O(kn). Therefore, the complexity of the total memory usage is O(kn) + O(kn), i.e., O(kn). For a p-order tensor with p unfolded matrices, the complexity is O(pkn).

5.4.3 Approximation Accuracy

The reconstruction error between the initial tensor T and the approximate tensor T̂ can be exactly measured with the Frobenius norm [10] as

||T - T̂||_F = ( Σ_{i_1=1}^{I_1} ... Σ_{i_P=1}^{I_P} (a_{i_1,...,i_P} - â_{i_1,...,i_P})² )^{1/2}.    (31)

For the unfolded matrix T_(i) of the initial tensor, the approximate matrix is T̂_(i) = U_i Σ_i V_i^T. The reconstruction error is caused by the approximation of all unfolded matrices. To clearly analyze the degree of tensor dimensionality reduction and tensor approximation, we present two ratios.

Definition 7: The Dimensionality Reduction Ratio of tensor T is defined as

ρ = (nnz(S) + Σ_{i=1}^P nnz(U_i)) / nnz(T),    (32)

where S denotes the core tensor and U_i is the mode-i truncated orthogonal basis. The core data sets of tensor T are composed of S (the core tensor) and U_1, U_2, ..., U_P. Because only nonzero elements of the core data sets are stored, the ratio ρ accurately reflects the degree of dimensionality reduction.

Definition 8: The Reconstruction Error Ratio of tensor T is defined as

e = ||T - T̂||_F / ||T||_F.    (33)

Ratio e reflects the degree of reconstruction error under the tensor Frobenius norm. In this paper, the pair (ρ, e) is employed to describe the dimensionality reduction degree and the reconstruction error degree; the two ratios are inversely related.

Computation accuracy is important for tensor data approximation, and in most applications HOSVD-type algorithms can find a good approximation. To obtain higher accuracy, the High-Order Orthogonal Iteration (HOOI) [11] method can be utilized to find the best low-rank approximation. The High-Order Singular Value Decomposition (HOSVD) and the Higher-Order Orthogonal Iteration (HOOI) of a tensor can be viewed as extensions of the Singular Value Decomposition (SVD).
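The two ratios of Definitions 7-8 can be computed directly from the core data sets; a sketch reusing the hosvd and mode_product helpers from the Section 2 example:

```python
import numpy as np

def ratios(T, S, Us, T_hat):
    """Dimensionality reduction ratio rho (Eq. (32)) and error ratio e (Eq. (33))."""
    rho = (np.count_nonzero(S)
           + sum(np.count_nonzero(U) for U in Us)) / np.count_nonzero(T)
    e = np.linalg.norm(T - T_hat) / np.linalg.norm(T)   # Frobenius, Eqs. (31), (33)
    return rho, e
```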
6 CASE STUDY

In this section, we illustrate the proposed unified data representation model and incremental dimensionality reduction method with an intelligent transportation case. The test data used in the experiments consist of unstructured video data collected with fixed cameras and mobile phones, semi-structured XML documents about traffic information, and structured trajectory data. After dimensionality reduction, the core tensor and the truncated bases are small to store, but accurate and fast for reconstructing the big data.

6.1 Demonstration of Tensor Unfolding

We construct a five-order tensor by extracting three frames from the unstructured video clip and three users from the semi-structured XML document. Fig. 11(a) shows the five unfolded matrices of the tensor; the five orders represent height, width, color space, time and user respectively.

Fig. 11. Heterogeneous tensor unfolding and incremental tensor unfolding; (a) the five unfolded matrices of the five-order tensor, (b) incremental video and user data on the unfolded matrices of the eight-order tensor, where appending along the time order causes order inconsistency.

To demonstrate incremental tensor unfolding, an eight-order tensor T ∈ R^{I_t × I_s × I_u × I_h × I_w × I_c × I_ec × I_er} is constructed. Incremental data are appended along the time order I_t. The unfolded matrices of the combined new tensor (initial tensor plus incremental tensor) are shown in Fig. 11(b). Order inconsistency of the new tensor occurs in order I_t, because the incremental data are appended as rows at the bottom of the unfolded matrix.
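A toy illustration of the append just described (the sizes below are hypothetical): growing the time order adds rows to the mode-t unfolding, so the previously computed basis has too few rows until Theorem 2 and Algorithm 3 resize the dimensionality.

```python
import numpy as np

old = np.random.rand(3, 4, 5)                                  # I_t = 3
new = np.concatenate([old, np.random.rand(2, 4, 5)], axis=0)   # I_t -> 5
unfold_t = lambda T: T.reshape(T.shape[0], -1)                 # mode t is axis 0 here
print(unfold_t(old).shape, '->', unfold_t(new).shape)          # (3, 20) -> (5, 20)
```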
Fig. 11(a), Fig. 11(b) and Fig. 8 in Section 4 illustrate the tensor model from different viewpoints and demonstrate how the heterogeneous data are stacked together. Fig. 8 shows the procedure of embedding unstructured video data and a semi-structured XML document into a three-order tensor, while Fig. 11(a) and Fig. 11(b) show the inner elements of the unified tensor model.

6.2 Dimensionality Reduction and Approximation Error

There exists a tradeoff between dimensionality reduction and approximation error. Fig. 12 shows two video frames reconstructed from the above five-order tensor under three different approximation error ratios, namely 4%, 7%, and 24%.

Fig. 12. Video frames reconstructed with different approximation error ratios.

Fig. 13(a) plots the two ratios together and illustrates that the reconstruction error ratio increases gradually as the dimensionality reduction ratio decreases. The core data sets are composed of the core tensor S and the truncated orthogonal bases U_1, ..., U_5. Fig. 13(b) shows their proportions within the dimensionality reduction ratio; generally, the proportion of the core tensor is bigger than that of the truncated bases.

Fig. 13. (a) Tradeoff between dimensionality reduction and reconstruction error; (b) proportion of the core tensor to the truncated orthogonal bases.

Diverse data types can result in different dimensionality reduction ratios and approximation error ratios. Repeated experiments on video clips, XML documents and GPS data show that a core set containing 18% of the elements can guarantee 93% accuracy in general. In practice, the balance between dimensionality reduction and computation accuracy is determined by the application requirements.

6.3 Time and Memory Comparison

Compared with the general High Order Singular Value Decomposition method, the proposed incremental High Order Singular Value Decomposition method is efficient and memory-saving. To evaluate the two decomposition methods, we run them on computers with an Intel Core(TM) i5 CPU at 3.2 GHz (4 cores) and 8 GB RAM. We divide the unified tensor into four blocks and normalize the tensor size as well as the decomposition time for better comparison. During the process of dimensionality reduction, the general HOSVD method integrates the additional tensor blocks with the previous tensor blocks to generate a new tensor, which is then repeatedly decomposed. Different from this repeated HOSVD method, the incremental HOSVD method updates the truncated orthogonal bases and dynamically computes the core tensor. Fig. 14 demonstrates that the decomposition time of the repeated HOSVD method is greater than that of the incremental HOSVD method. Additionally, the decomposition time of the incremental HOSVD method increases more gently than that of the repeated HOSVD method as the normalized tensor size grows. As the normalized tensor size grows beyond 0.75, the repeated HOSVD method runs out of memory while the incremental HOSVD method continues to run.
From a theoretical point of view, as more orthogonal bases are appended to the left singular vector matrix, the middle quasi-diagonal matrix contains fewer columns that still require orthogonalization, and the time consumed by the diagonalization process decreases. In brief, the incremental HOSVD method is more efficient because it projects the additional tensor unfoldings onto the previously truncated orthogonal bases rather than directly executing the orthogonalization procedure.
Fig. 14. Comparison between the repeated HOSVD method and the incremental HOSVD method: normalized decomposition time versus normalized tensor size; the repeated method runs out of memory on large tensors.

7 RELATED WORK

This section reviews related work on data representation and high order singular value decomposition.

Data Representation: Big data are composed of unstructured, semi-structured and structured data. In particular, multimedia, as unstructured data, is mostly encoded as MPEG-4 or H.264. MPEG-4 [12] is a method for defining compression of audio and visual digital data. H.264 [13] is a widely used standard for video compression. The semi-structured Extensible Markup Language (XML) [14] is a flexible text format that defines a set of rules for encoding documents. XML is both human-readable and machine-readable. The characteristics making up an XML document are divided into markup and content. Kim and Candan [15] proposed a tensor-based relational data model that can process multi-dimensional structured data. Ontology, such as the Resource Description Framework (RDF) [16] and the Web Ontology Language (OWL) [17], is playing an ever more important role in the exchange of a wide variety of data.

Higher Order Singular Value Decomposition: A tensor [6, 7] is the generalization of a matrix and is usually called a multidimensional array. The tensor is an effective data representation model from which valuable information can be extracted using the high order singular value decomposition (HOSVD) [8] method. Because HOSVD imposes orthogonality constraints on the truncated column bases, it may be considered a special case of the commonly used Tucker [18] decomposition. Although the low-rank truncation of the HOSVD is not the best approximation of the initial data, it is considered sufficiently good for many applications. Analysis and mining of data with HOSVD has been adopted in many applications such as tag recommendation [19, 20], trajectory indexing and retrieval [21], and hand-written digit classification [22].

Studies of data representation and dimensionality reduction have been reported in the literature. However, a unified model for heterogeneous data representation has been neglected, and the decomposition problems arising during incremental data processing have not been considered. The contributions of this paper are a unified tensor model for representing large-scale heterogeneous data and an efficient approach for extracting the high-quality core tensor, which is small but contains valuable information.

8 CONCLUSION

This paper aims at representing and processing large-scale heterogeneous data generated from multiple sources. Firstly, we presented a unified tensor-based data representation model that can integrate unstructured, semi-structured and structured data. Secondly, based on the proposed model, an incremental high order singular value decomposition (IHOSVD) method was proposed for dimensionality reduction on big data. We proved two theorems that solve the problems of decomposition recalculation and order inconsistency. Finally, an intelligent transportation case was investigated to evaluate the method. Theoretical analyses and the experimental results of the case study provide evidence that the proposed data representation model and incremental dimensionality reduction method are promising, and they pave the way for efficient mining and analysis in big data applications.
9 ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China and by the Fundamental Research Funds for the Central Universities, HUST: CXY13Q017 and 2013QN122.

REFERENCES

[1] I. F. Cruz and H. Xiao, "Ontology Driven Data Integration in Heterogeneous Networks," in Complex Systems in Knowledge-Based Environments: Theory, Models and Applications. Springer, 2009.
[2] H. Abdi and L. J. Williams, "Principal Component Analysis," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, 2010.
[3] M. Brand, "Incremental Singular Value Decomposition of Uncertain Data with Missing Values," in Computer Vision: ECCV 2002. Springer, 2002.
[4] J. Sun, D. Tao, and C. Faloutsos, "Beyond Streams and Graphs: Dynamic Tensor Analysis," in Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
[5] E. Henry and J. Hofrichter, "Singular Value Decomposition: Application to Analysis of Experimental Data," Essential Numerical Computer Methods, vol. 210.
[6] C. M. Martin, "Tensor Decompositions Workshop Discussion Notes," American Institute of Mathematics.
[7] T. G. Kolda and B. W. Bader, "Tensor Decompositions and Applications," SIAM Review, vol. 51, no. 3, 2009.
[8] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A Multilinear Singular Value Decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, 2000.
[9] H. Anton, Elementary Linear Algebra. Wiley.
[10] C. Meyer, Matrix Analysis and Applied Linear Algebra. SIAM, 2000, vol. 2.
[11] L. De Lathauwer, B. De Moor, and J. Vandewalle, "On the Best Rank-1 and Rank-(R_1, R_2, ..., R_N) Approximation of Higher-Order Tensors," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, 2000.
[12] I. E. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia. Wiley.
[13] D. Marpe, T. Wiegand, and G. J. Sullivan, "The H.264/MPEG4 Advanced Video Coding Standard and Its Applications," IEEE Communications Magazine, vol. 44, no. 8, 2006.
[14] E. van der Vlist, XML Schema: The W3C's Object-Oriented Descriptions for XML. O'Reilly Media, Inc.
[15] M. Kim and K. S. Candan, "Approximate Tensor Decomposition within a Tensor-Relational Algebraic Framework," in Proc. of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011.
[16] I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen, "From SHIQ and RDF to OWL: The Making of a Web Ontology Language," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 1, no. 1, 2003.
[17] D. L. McGuinness and F. van Harmelen, "OWL Web Ontology Language Overview," W3C Recommendation, vol. 10, 2004.
[18] L. R. Tucker, "Some Mathematical Notes on Three-Mode Factor Analysis," Psychometrika, vol. 31, no. 3, 1966.
[19] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos, "Tag Recommendations Based on Tensor Dimensionality Reduction," in Proc. of the 2008 ACM Conference on Recommender Systems. ACM, 2008.
[20] R. Wetzker, C. Zimmermann, C. Bauckhage, and S. Albayrak, "I Tag, You Tag: Translating Tags for Advanced User Models," in Proc. of the 3rd ACM International Conference on Web Search and Data Mining. ACM, 2010.
[21] Q. Li, X. Shi, and D. Schonfeld, "A General Framework for Robust HOSVD-Based Indexing and Retrieval with High-Order Tensor Data," in Proc. of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2011.
[22] B. Savas and L. Eldén, "Handwritten Digit Classification Using Higher Order Singular Value Decomposition," Pattern Recognition, vol. 40, no. 3, 2007.

Liwei Kuang is currently pursuing the PhD degree in the School of Computer Science and Technology at Huazhong University of Science and Technology, Wuhan, China. He received the master's degree in computer science from Hubei University of Technology, Wuhan, China. From 2004 to 2012, he was a Research Engineer with FiberHome Technologies Group, Wuhan, China. His research interests include big data, pervasive computing and cloud computing.

Fei Hao is an assistant professor at Huazhong University of Science and Technology. He received the B.S. and M.S.
degrees from the School of Mathematics and Computer Engineering, Xihua University, Chengdu, China, in 2005 and 2008, respectively. He was a research assistant at the Korea Advanced Institute of Science and Technology and at the Hangul Engineering Research Center, Korea. He has published over 30 research papers in international and national journals and conferences. His research interests include social computing, big data analysis and processing, and mobile cloud computing.

Laurence T. Yang received the B.E. degree in Computer Science and Technology from Tsinghua University, China, and the PhD degree in Computer Science from the University of Victoria, Canada. He is a professor in the School of Computer Science and Technology at Huazhong University of Science and Technology, China, and in the Department of Computer Science, St. Francis Xavier University, Canada. His research interests include parallel and distributed computing, embedded and ubiquitous/pervasive computing, and big data. His research has been supported by the Natural Sciences and Engineering Research Council and the Canada Foundation for Innovation.
Man Lin received the B.E. degree in Computer Science and Technology from Tsinghua University, China. She received the Lic. and Ph.D. degrees from the Department of Computer and Information Science at Linköping University, Sweden, in 1997 and 2000, respectively. She is currently an associate professor in Computer Science at St. Francis Xavier University, Canada. Her research interests include system design and analysis, power-aware scheduling, and optimization algorithms. Her research is supported by NSERC (the Natural Sciences and Engineering Research Council, Canada) and CFI (the Canada Foundation for Innovation).

Changqing Luo received his B.E. and M.E. degrees from Chongqing University of Posts and Telecommunications in 2004 and 2007, respectively, and the Ph.D. from Beijing University of Posts and Telecommunications in 2011, all in Electrical Engineering. After graduation, he joined the School of Computer Science and Technology, Huazhong University of Science and Technology in 2011, where he currently works as an Assistant Professor. His current research focuses on algorithms and optimization for wireless networks, cooperative communication, green communication, resource management in heterogeneous wireless networks, and mobile cloud computing.

Geyong Min is a Professor of High Performance Computing and Networking in the Department of Mathematics and Computer Science within the College of Engineering, Mathematics and Physical Sciences at the University of Exeter, United Kingdom. He received the PhD degree in Computing Science from the University of Glasgow, United Kingdom, in 2003, and the B.Sc. degree in Computer Science from Huazhong University of Science and Technology, China. His research interests include next generation Internet, wireless communications, multimedia systems, information security, high performance computing, ubiquitous computing, and modelling and performance engineering.
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
MUSIC-like Processing of Pulsed Continuous Wave Signals in Active Sonar Experiments
23rd European Signal Processing Conference EUSIPCO) MUSIC-like Processing of Pulsed Continuous Wave Signals in Active Sonar Experiments Hock Siong LIM hales Research and echnology, Singapore hales Solutions
Solving Systems of Linear Equations
LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how
Search Engine Based Intelligent Help Desk System: iassist
Search Engine Based Intelligent Help Desk System: iassist Sahil K. Shah, Prof. Sheetal A. Takale Information Technology Department VPCOE, Baramati, Maharashtra, India [email protected], [email protected]
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015
W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction
Data Storage 3.1. Foundations of Computer Science Cengage Learning
3 Data Storage 3.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: List five different data types used in a computer. Describe how
Server Load Prediction
Server Load Prediction Suthee Chaidaroon ([email protected]) Joon Yeong Kim ([email protected]) Jonghan Seo ([email protected]) Abstract Estimating server load average is one of the methods that
Lecture 5: Singular Value Decomposition SVD (1)
EEM3L1: Numerical and Analytical Techniques Lecture 5: Singular Value Decomposition SVD (1) EE3L1, slide 1, Version 4: 25-Sep-02 Motivation for SVD (1) SVD = Singular Value Decomposition Consider the system
Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research
Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured
Load Distribution on a Linux Cluster using Load Balancing
Load Distribution on a Linux Cluster using Load Balancing Aravind Elango M. Mohammed Safiq Undergraduate Students of Engg. Dept. of Computer Science and Engg. PSG College of Technology India Abstract:
FCE: A Fast Content Expression for Server-based Computing
FCE: A Fast Content Expression for Server-based Computing Qiao Li Mentor Graphics Corporation 11 Ridder Park Drive San Jose, CA 95131, U.S.A. Email: qiao [email protected] Fei Li Department of Computer Science
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
The Image Deblurring Problem
page 1 Chapter 1 The Image Deblurring Problem You cannot depend on your eyes when your imagination is out of focus. Mark Twain When we use a camera, we want the recorded image to be a faithful representation
On the Standardization of Semantic Web Services-based Network Monitoring Operations
On the Standardization of Semantic Web Services-based Network Monitoring Operations ChenglingZhao^, ZihengLiu^, YanfengWang^ The Department of Information Techonlogy, HuaZhong Normal University; Wuhan,
Low-resolution Character Recognition by Video-based Super-resolution
2009 10th International Conference on Document Analysis and Recognition Low-resolution Character Recognition by Video-based Super-resolution Ataru Ohkura 1, Daisuke Deguchi 1, Tomokazu Takahashi 2, Ichiro
Keywords: Image Generation and Manipulation, Video Processing, Video Factorization, Face Morphing
TENSORIAL FACTORIZATION METHODS FOR MANIPULATION OF FACE VIDEOS S. Manikandan, Ranjeeth Kumar, C.V. Jawahar Center for Visual Information Technology International Institute of Information Technology, Hyderabad
Process Mining by Measuring Process Block Similarity
Process Mining by Measuring Process Block Similarity Joonsoo Bae, James Caverlee 2, Ling Liu 2, Bill Rouse 2, Hua Yan 2 Dept of Industrial & Sys Eng, Chonbuk National Univ, South Korea jsbae@chonbukackr
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER
HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER Gholamreza Anbarjafari icv Group, IMS Lab, Institute of Technology, University of Tartu, Tartu 50411, Estonia [email protected]
Linear Algebra Review. Vectors
Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka [email protected] http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length
A simple and fast algorithm for computing exponentials of power series
A simple and fast algorithm for computing exponentials of power series Alin Bostan Algorithms Project, INRIA Paris-Rocquencourt 7815 Le Chesnay Cedex France and Éric Schost ORCCA and Computer Science Department,
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. [email protected] Mrs.
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
Chapter 6. Orthogonality
6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be
Binary Image Scanning Algorithm for Cane Segmentation
Binary Image Scanning Algorithm for Cane Segmentation Ricardo D. C. Marin Department of Computer Science University Of Canterbury Canterbury, Christchurch [email protected] Tom
Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries
First Semester Development 1A On completion of this subject students will be able to apply basic programming and problem solving skills in a 3 rd generation object-oriented programming language (such as
P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition
P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition K. Osypov* (WesternGeco), D. Nichols (WesternGeco), M. Woodward (WesternGeco) & C.E. Yarman (WesternGeco) SUMMARY Tomographic
Solution of Linear Systems
Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start
Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning
Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning By: Shan Suthaharan Suthaharan, S. (2014). Big data classification: Problems and challenges in network
ADVANCED APPLICATIONS OF ELECTRICAL ENGINEERING
Development of a Software Tool for Performance Evaluation of MIMO OFDM Alamouti using a didactical Approach as a Educational and Research support in Wireless Communications JOSE CORDOVA, REBECA ESTRADA
Internet Video Streaming and Cloud-based Multimedia Applications. Outline
Internet Video Streaming and Cloud-based Multimedia Applications Yifeng He, [email protected] Ling Guan, [email protected] 1 Outline Internet video streaming Overview Video coding Approaches for video
MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.
MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column
Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015
Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015 Lecture: MWF: 1:00-1:50pm, GEOLOGY 4645 Instructor: Mihai
Similar matrices and Jordan form
Similar matrices and Jordan form We ve nearly covered the entire heart of linear algebra once we ve finished singular value decompositions we ll have seen all the most central topics. A T A is positive
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
A Robust and Lossless Information Embedding in Image Based on DCT and Scrambling Algorithms
A Robust and Lossless Information Embedding in Image Based on DCT and Scrambling Algorithms Dr. Mohammad V. Malakooti Faculty and Head of Department of Computer Engineering, Islamic Azad University, UAE
Speed Performance Improvement of Vehicle Blob Tracking System
Speed Performance Improvement of Vehicle Blob Tracking System Sung Chun Lee and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA [email protected], [email protected] Abstract. A speed
Operation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
Performance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder
Performance Analysis and Comparison of 15.1 and H.264 Encoder and Decoder K.V.Suchethan Swaroop and K.R.Rao, IEEE Fellow Department of Electrical Engineering, University of Texas at Arlington Arlington,
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
Masters in Human Computer Interaction
Masters in Human Computer Interaction Programme Requirements Taught Element, and PG Diploma in Human Computer Interaction: 120 credits: IS5101 CS5001 CS5040 CS5041 CS5042 or CS5044 up to 30 credits from
Big Data Driven Knowledge Discovery for Autonomic Future Internet
Big Data Driven Knowledge Discovery for Autonomic Future Internet Professor Geyong Min Chair in High Performance Computing and Networking Department of Mathematics and Computer Science College of Engineering,
Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization
Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization
Data Warehouse Snowflake Design and Performance Considerations in Business Analytics
Journal of Advances in Information Technology Vol. 6, No. 4, November 2015 Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Jiangping Wang and Janet L. Kourik Walker
RN-coding of Numbers: New Insights and Some Applications
RN-coding of Numbers: New Insights and Some Applications Peter Kornerup Dept. of Mathematics and Computer Science SDU, Odense, Denmark & Jean-Michel Muller LIP/Arénaire (CRNS-ENS Lyon-INRIA-UCBL) Lyon,
SURVEY REPORT DATA SCIENCE SOCIETY 2014
SURVEY REPORT DATA SCIENCE SOCIETY 2014 TABLE OF CONTENTS Contents About the Initiative 1 Report Summary 2 Participants Info 3 Participants Expertise 6 Suggested Discussion Topics 7 Selected Responses
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
NETCONF-based Integrated Management for Internet of Things using RESTful Web Services
NETCONF-based Integrated Management for Internet of Things using RESTful Web Services Hui Xu, Chunzhi Wang, Wei Liu and Hongwei Chen School of Computer Science, Hubei University of Technology, Wuhan, China
RN-Codings: New Insights and Some Applications
RN-Codings: New Insights and Some Applications Abstract During any composite computation there is a constant need for rounding intermediate results before they can participate in further processing. Recently
by the matrix A results in a vector which is a reflection of the given
Eigenvalues & Eigenvectors Example Suppose Then So, geometrically, multiplying a vector in by the matrix A results in a vector which is a reflection of the given vector about the y-axis We observe that
General Framework for an Iterative Solution of Ax b. Jacobi s Method
2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Three Effective Top-Down Clustering Algorithms for Location Database Systems
Three Effective Top-Down Clustering Algorithms for Location Database Systems Kwang-Jo Lee and Sung-Bong Yang Department of Computer Science, Yonsei University, Seoul, Republic of Korea {kjlee5435, yang}@cs.yonsei.ac.kr
Applied Linear Algebra I Review page 1
Applied Linear Algebra Review 1 I. Determinants A. Definition of a determinant 1. Using sum a. Permutations i. Sign of a permutation ii. Cycle 2. Uniqueness of the determinant function in terms of properties
Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.
Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry
Latent Semantic Indexing with Selective Query Expansion Abstract Introduction
Latent Semantic Indexing with Selective Query Expansion Andy Garron April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville PA 19426 Abstract This article describes
Statistical Modeling of Huffman Tables Coding
Statistical Modeling of Huffman Tables Coding S. Battiato 1, C. Bosco 1, A. Bruna 2, G. Di Blasi 1, G.Gallo 1 1 D.M.I. University of Catania - Viale A. Doria 6, 95125, Catania, Italy {battiato, bosco,
Greedy Column Subset Selection for Large-scale Data Sets
Knowledge and Information Systems manuscript No. will be inserted by the editor) Greedy Column Subset Selection for Large-scale Data Sets Ahmed K. Farahat Ahmed Elgohary Ali Ghodsi Mohamed S. Kamel Received:
