Chapter 3 Data Warehouse - technological growth

Computing began with data storage in conventional file systems. In that era data volumes were small and easy to manage. As volumes grew, managing data in traditional file systems led to serious challenges with regard to data integrity. Research into this problem produced the database management system (DBMS). Database management systems have shown considerable growth in capability and in the services their technology supports, scaling from OLTP systems over small data volumes to OLTP systems over very large databases. The proliferation of networked systems extended databases from centralized systems to distributed database systems that manage data efficiently at multiple locations.

The highly dynamic market situation pushed the use of data beyond generating timely operational reports: analytical reports and forecasts were also needed to withstand the dynamic changes taking place in the business environment. Business organizations hold large volumes of data but have found it increasingly difficult to access and analyze, because the data exist in different formats, on different platforms, and in databases developed by different vendors. To transform these data into a unified form for analysis and reporting, hundreds of programs were written and maintained. Once initial findings were made, decision makers wanted to dig deeper into the data, which required modifying existing programs or developing new ones. This process is inefficient and very time consuming. It brought in online analytical processing, OLAP. OLAP comprises analysis techniques used to explore data, and it needs not only transactional data but also historical data. This generated the need for the data warehouse.

A data warehouse is a very large class of database: a central repository in which large amounts of data from heterogeneous sources are stored. A data warehouse needs a well designed analytical system whose purpose is to provide analysts with an integrated and consistent view of all the data relevant to the organization. A database is an application-oriented collection of data that is organized, structured, and coherent, with minimal and controlled redundancy, and is usually accessed by several users; a data warehouse, in contrast, is a subject-oriented collection of data specifically designed to support analysis for the decision making process.

According to [69], the data modeling paradigm for a data warehouse must satisfy requirements that are entirely different from those of data models in OLTP environments. Data warehouse systems are designed to facilitate data analytics with visualization.

3.1 Data Warehouse

A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decision making process [77, 36].

Subject-oriented. A data warehouse is used to analyze particular subject areas, so its data is organized around specific subjects, for example sales, customers, or products.

Integrated. A data warehouse is built by integrating data from heterogeneous or homogeneous sources such as relational databases, flat files, etc. Integration takes place as data passes from one or more data sources to the data warehouse; it defines a single, unique representation for data coming from different sources with data and attribute inconsistencies.

Non-volatile. Data in a data warehouse is not changed or updated. Retaining the data makes analysis over long time periods possible.

Time-variant. A data warehouse stores historical data. Time variance implies that every unit of data in the data warehouse is accurate as of some moment in time.

The storage, access, usage, performance requirements, and responsiveness to queries in a data warehouse differ from those in an OLTP environment. A data warehouse contains historical and summarized data covering a long period of time. The size of a data warehouse can vary from gigabytes to terabytes.

Information in the data warehouse is organized around major subject areas and is modeled to allow precomputation of, and fast access to, summarized data [7]. Business decision makers generate regular predefined reports and execute ad hoc or complex queries in the decision making process. These queries require numerous scan, join, and aggregate operations across the data warehouse to access millions of records, so query response time is a central concern. The data warehouse must therefore have an architecture that allows data to be gathered, organized, manipulated, and presented quickly and efficiently. Figure 3.1 shows the generic data warehouse architecture.

A data warehouse is built by integrating large amounts of data from multiple heterogeneous or homogeneous sources. The source data can come from operational systems, data stores, files, or external sources. An ETL process extracts these data from the multiple sources; the extracted data are then integrated, cleaned, and transformed into a format and structure compatible with the data warehouse, and finally loaded into it. At this stage the data is restructured for query optimization. The loaded data are then used for querying, reporting, and data analysis by OLAP and data mining tools.

Figure 3.1 Generic architecture of a data warehouse: source systems feed the ETL process, which loads the data warehouse accessed through end-user tools

To build a data warehouse, Inmon proposed an architecture that follows a top-down approach, while Kimball proposed one that follows a bottom-up approach. A hybrid approach, combining top-down and bottom-up, is also used.

Inmon's top-down architecture uses an ETL tool to extract and transform data from the source systems. After transformation, the data is loaded into the data warehouse. Once the data warehouse has been created, data marts can be derived from it if necessary, each serving a specific purpose or subject domain. Figure 3.2 shows the top-down approach.

Figure 3.2 Top-down approach architecture: source systems, ETL process, data warehouse, then data marts

Kimball's bottom-up architecture starts by building data marts from individual departments' data. An ETL tool extracts and transforms data from the source systems and loads it into the data marts; these individual data marts are then used to build the data warehouse. The idea of this architecture is to construct the data warehouse incrementally. Figure 3.3 shows the bottom-up approach.

Figure 3.3 Bottom-up approach architecture: source systems, ETL process, data marts, then data warehouse

3.2 Building the Data Warehouse

The generic architectural model for building a data warehouse can be described in three phases:

- Data extraction and integration
- Data modeling
- Data analysis

3.2.1 Data Extraction and Integration

Data for the data warehouse can be extracted from homogeneous or heterogeneous sources. These data need to be integrated, transformed, and cleansed before being stored in the data warehouse. In the ETL process, data are collected from the different sources and placed in a data staging area for integration and transformation; after transformation, the data are loaded into the data warehouse. Figure 3.4 shows data extraction and integration from homogeneous and heterogeneous sources using ETL; a minimal code sketch of such a pipeline follows the figure.

Figure 3.4 ETL process for homogeneous and heterogeneous sources
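To make the extract-transform-load flow concrete, the following is a minimal sketch, not the implementation discussed in this chapter; the source file names, column names, and the SQLite target are assumptions made for the example.

```python
import csv
import sqlite3

def extract(paths):
    """Extract: read raw rows from heterogeneous CSV sources into a staging list."""
    staged = []
    for path in paths:
        with open(path, newline="") as f:
            staged.extend(csv.DictReader(f))
    return staged

def transform(staged):
    """Transform: cleanse and integrate rows into one consistent representation."""
    clean = []
    for row in staged:
        # Resolve attribute inconsistencies between sources (column names assumed).
        amount = row.get("amount") or row.get("sale_amt")
        if amount is None:  # drop rows that cannot be cleansed
            continue
        clean.append((row["product"].strip().upper(), float(amount)))
    return clean

def load(rows, db="warehouse.db"):
    """Load: write the transformed rows into the warehouse table."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract(["source_a.csv", "source_b.csv"])))
```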

3.2.2 Data Modeling Phase

To best support the needs of data warehouse users, the data warehouse databases should be carefully designed. A well designed data model allows the data warehouse to grow easily and provides good performance. Different levels of models are used to design the data warehouse database. Data modeling is a technique that records the inventory, shape, size, contents, and rules of the data elements used in the scope of a business process [71]; it provides a kind of map that describes the data used in the process. To support the requirements of a data warehouse, the warehouse can be designed with the help of three levels of model: conceptual, logical, and physical. Figure 3.5 shows these levels.

Figure 3.5 Data model levels for the data warehouse: conceptual data model, logical data model, physical data model

Conceptual Data Model. The conceptual-level data model gives a high-level view of the data warehouse. It is a brief description of the users' data requirements that does not take implementation details into account; this model is closer to the real world than to the implementation level. Conceptual data models are typically expressed using the ER model or the Unified Modeling Language (UML) [37, 62, 43, 63].

Logical Data Model. The logical data model is built from the user requirements and is then translated into the physical data model. It shows entities and their relationships in a logically sound manner, serving as the model for the physical implementation. It includes all entities and the relationships among them, all attributes, the primary key for each entity, and the associated foreign keys. The goal of this model is to describe the data in as much detail as possible.

Physical Data Model. The physical data model shows the actual representation of the physical tables in the database as they are implemented; it represents the actual design of the database. This model also covers techniques such as indexes, materialized views, and partitioning.

3.2.2.1 Dimensional Modeling

A data warehouse uses a dimensional modeling structure to store large volumes of integrated data in a form suited to answering analytical queries: data is stored so that users can analyze it from multiple perspectives. Dimensional modeling uses the concepts of a fact table and dimension tables. The fact table contains measures and related data, together with the keys of the associated dimension tables, which appear as foreign keys in the fact table. Compared with dimension tables, the fact table usually has a small number of columns and a large number of rows. Dimension tables contain attributes that describe the fact records in the fact table, giving context to its numerical values; compared with the fact table, they have a large number of columns and a small number of rows. Facts are considered the dynamic part of the warehouse, while dimensions are considered static entities because dimensions are computed once, during the ETL process.

Multidimensional modeling uses a star schema or a snowflake schema to store data in the warehouse.

3.2.2.1.1 Star Schema

A star schema consists of a fact table surrounded by a number of single-level, collapsed or consolidated dimension tables. Each dimension is represented as a single table, and the primary key in each dimension table corresponds to a foreign key in the fact table. The fact table is joined directly to every dimension table, with only a single join line connecting the fact table to each dimension table, which leads to better query performance. A star schema can be simple or complex: a simple star has one fact table, while a complex star can have more than one fact table.

Figure 3.6 Star schema: a central fact table F joined directly to dimension tables D1, D2, D3, and D4

3.2.2.1.2 Snowflake Schema

The snowflake schema is an extension of the star schema: a central fact table is surrounded by hierarchies of dimension tables. The dimensions usually relate to the facts in one-to-many relationships, and the snowflake schema exposes them as fully normalized structures, usually consisting of many entities with often complex intra-dimensional relationships.

Figure 3.7 Snowflake schema: the fact table F joined to dimension tables (D1.1 to D4.1) that are further normalized into hierarchy tables (D1.2 to D4.2)
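To make the two layouts concrete, here is a minimal sketch of a star-style sales schema in SQLite; the table and column names are assumptions for illustration, and the final comment indicates how the snowflake variant would normalize a dimension further.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: wide, descriptive, relatively few rows.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY,
                          day TEXT, month TEXT, year INTEGER);

-- Fact table: narrow, numeric measures, many rows; its foreign keys
-- point directly at the surrounding dimensions (the 'star').
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);

-- Snowflake variant: normalize the category out of dim_product into its
-- own table, producing a dimension hierarchy instead of one flat table.
-- CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
""")
con.close()
```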

3.2.3 Data Analysis Phase

Data from the data warehouse is retrieved and analyzed using ad hoc queries, OLAP tools, reporting tools, or data mining tools. Data warehouses have traditionally stored numeric and textual data for the decision making process, and most industry applications are designed to operate with data warehouses of this nature; the majority of data warehouse systems help in analyzing numeric data. Much research has gone into designing data warehouses for storing, aggregating, and summarizing such data, and good performance is achieved when accessing and analyzing them. Data warehouse technology for numeric data is considered mature [9].

3.3 Multimedia Analysis

In today's business scenario, data is no longer limited to numeric or textual forms but includes a wide variety of images, audio, video, etc. Multimedia data is widely used in science, engineering, medicine, modern biology, geography, biometrics, weather forecasting, digital libraries, manufacturing and retailing, art and entertainment, journalism, the social sciences, and distance learning. These data come in various formats such as image, audio, video, text, and signal data. As the usage of multimedia grows, users require multimedia data to be sorted, combined, and analyzed in innovative ways, which leads to building multimedia data warehouses.

Multimedia analysis has focused on images, audio, and video, with differing goals and objectives. Multimedia Information Retrieval (MIR) draws on computer vision, machine learning, digital image processing, pattern recognition, database management, and information retrieval, and includes multimedia data analytics, feature extraction, information visualization, and more. Gathering multimedia information from different sources and for different goals is difficult. Multimedia information retrieval started in the late 1970s. In the 1980s, edge finding, boundary and curve detection, region growing, shape identification, and feature extraction from individual images or frames were studied. In the 1990s, Content Based Image Retrieval (CBIR) [10, 30] and Content Based Video Clip Retrieval (CBVR) were accomplished; during that era multimedia data grew with the wide use of the WWW. The computer vision community uses a visual-based approach, while the database management community uses a text-based approach.

In the visual-based approach, visual characteristics are identified from the multimedia object by a program, and objects are searched based on these characteristics. In the text-based approach, the multimedia object is annotated manually, and objects are then searched using text-based attributes in a database management system. Traditional data management systems are designed for, and suited to, structured data and exact query match results, whereas multimedia data is unstructured or semi-structured and its query results may be exact or non-exact matches. Traditional database management systems cannot keep up with these demanding requirements, so effective yet efficient methods for managing ever growing multimedia data are needed. The content-based approach is a promising way to store and retrieve multimedia data.

3.3.1 Multimedia Data Features

Multimedia data can be described by the content, characteristics, or features they provide. These features, also known as descriptors, can be extracted from the data manually, semi-automatically, or automatically, and can be classified by level of abstraction into low-level, mid-level, and high-level features.

Low-level Feature. Low-level features can be extracted automatically by a computer program with minimal human intervention; they can therefore be extracted and processed automatically or semi-automatically.

Mid-level Feature. Mid-level features sit between low-level features and high-level, domain-oriented, semantically rich features. The identification of objects in multimedia is a mid-level feature. These features can be extracted automatically to a certain extent within a particular domain; they require little domain knowledge, though some general knowledge is needed. Shape matching and object recognition belong to this type.

High-level Feature. High-level features carry semantically rich information that is close to human perception. These description-based descriptors are difficult to extract automatically and are usually specified manually, as text annotations, by domain experts. Domain concepts, text annotations, events, emotions, identification of objects, and keywords are categorized as high-level features, also known as semantic features. Humans use high-level features to interpret, recognize, and analyze multimedia data. The following table shows the levels of feature used to represent multimedia data:

Level of feature      Description
Low-level feature     Color, texture, shape
Mid-level feature     Objects in the image
High-level feature    Keywords, events, emotions, identification of objects

Table 3.1 Feature levels of multimedia data

Earlier methods for representing multimedia data used low-level features. Such data seldom represent the semantic content and have little or nothing to do with human perception, so representations built only on low-level features lack semantic meaning, and retrieval results are unsatisfactory when retrieval or analysis is based on low-level features alone. In addition, high-level semantic features may be defined differently by different people, because each human interprets what they see from their own point of view. Calculated features, derived from the retrieved features, can also be used. A small sketch of low-level feature extraction follows.
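As a concrete example of a low-level descriptor from Table 3.1, the sketch below computes a color histogram feature vector for an image; it assumes the Pillow and NumPy libraries, and the input file name is chosen purely for illustration.

```python
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """Low-level feature: quantize each RGB channel into `bins` buckets
    and return a normalized 3*bins-dimensional descriptor."""
    rgb = np.asarray(Image.open(path).convert("RGB"))
    parts = []
    for channel in range(3):
        hist, _ = np.histogram(rgb[..., channel], bins=bins, range=(0, 256))
        parts.append(hist)
    vector = np.concatenate(parts).astype(float)
    return vector / vector.sum()  # normalize so image size does not matter

# Hypothetical usage: descriptors like this are what an extraction stage
# would compute and load into the warehouse alongside the image metadata.
# features = color_histogram("example_image.png")
```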

An image and video information retrieval system is presented in [4], covering traditional video analytics, video parsing, and video abstraction: video analytics from color, texture, shape, and spatial similarities; video parsing including temporal segmentation, object motion analysis, framing, and scene analysis; and video abstraction including skimming, key frame extraction, content based retrieval of clips, indexing, and annotation. A taxonomy for image retrieval systems is described in [86], including feature extraction using color, texture, shape, color layout, and segmentation, and image indexing such as dimensionality reduction and multidimensional indexing. [73] extracted features from an image and formed a hierarchy of global features, salient features, signs, shapes, and object features; for similarity matching they used machine learning and semantic interpretation, and they also addressed image indexing, storage, and database querying. [50] focused on shot boundary detection in video, and [86] on face detection. The video indexing process is described in [74] as a hierarchy that groups different index types, characterizes different genres and sub-genres according to their prominent layout and contents, and splits the hierarchy into named events and logical units. In [49], multimedia data are extracted from different sources and stored in a data archive for information retrieval; the authors tried to bridge the semantic gap between users and their multimedia information by translating computable low-level content based media features into high-level concepts or terms usable from the user's perspective. [23] used data mining techniques for knowledge discovery in image databases for content based information retrieval (CBIR). [45] proposed a semantics-based image retrieval system, and [53] proposed content based image retrieval system modeling.

The study of multimedia data warehouses is rooted in the traditional areas of multimedia analysis and data warehousing, and started in the late 1990s to early 2000s. To date, new models, architectures, and frameworks continue to emerge from the multimedia data warehouse research and development (R&D) community for efficiently storing, accessing, and processing multimedia data in a warehouse environment. There remains much to do with regard to complex, multimedia data warehousing [32]. Challenges prevail particularly in building multimedia data warehouses that integrate multimedia databases holding static images, motion pictures, audio, and video, while achieving an optimum level of performance for data storage, access, and analysis.

3.4 Multimedia Data Warehouse

Multimedia data can be stored in a data warehouse from homogeneous or heterogeneous sources. Multimedia databases have to be integrated, transformed, and cleansed to develop a centralized multimedia repository, i.e., a multimedia data warehouse. A multimedia data warehouse should be designed so that data can be extracted and analyzed quickly and easily. Multimedia data management needs efficient storage and access mechanisms to support the multimedia data warehouse.

Storage and access of multimedia data is a critical issue for the overall system's performance and functionality, so the deployment of new techniques to store, retrieve, and process multimedia data is essential.

3.5 Building Multimedia Data Warehouse

Building a multimedia data warehouse involves three major aspects: multimedia feature extraction, dimensional model design, and the use of techniques that help achieve performance. Among these, multimedia feature extraction is key to accuracy, dimensional model design is critical for data storage and representation, and performance techniques are important for achieving storage and retrieval performance.

3.5.1 Multimedia Data Extraction

To build a multimedia data warehouse, the data must be integrated, transformed, and cleansed, since they come from different sources in semi-structured or unstructured formats. These data therefore need to be integrated and transformed into a structured form, or placed in an environment where they can be stored properly so that they can be efficiently accessed and analyzed later. Along with extracting the multimedia data from the source systems, multimedia features should also be extracted. Representing multimedia data is an imperative task: multimedia data are represented by features, and these constitute the data to be analyzed. The features are specific to the domain area, so the relevant characteristics of the multimedia data should be extracted according to the analysis goal.

To integrate multimedia data, [37, 62, 43] build a generic UML model that represents multimedia data through low-level and semantic descriptors. [7, 8, 9] represent multimedia data through content based and description based descriptors. Researchers also perform content based image clustering [2]. [62, 63, 15] use low-level features and meta-information, while [5, 62, 63, 82] use semantic data. The described multimedia data is stored at the operational level or at the data staging level in relational and/or XML-native databases; these stored data can then be loaded into a dimensional structure, in a data warehouse, data marts, or data cubes. After the feature extraction and data transformation process, the dimensional modeling process takes place.

The type of feature extracted from multimedia data is specific to the application domain. For example, when extracting features from face images, low-level features such as color and texture are not effective for face recognition, because many parameters affect the image, such as the angle from which it was taken and the lighting conditions. For geographic images, the parameters affecting the image are satellite altitude, angle of acquisition, and climatic circumstances. e-learning video data are videos based on specific course content, which are most effectively accessed and recognized through high-level features such as title, course content, and keywords.

3.5.1.1 Biometrics Image Data

Face image data in biometrics can be described at different levels of feature extraction. Low-level features are not effective for face recognition, since a human face is better captured by verbal description, i.e., by high-level semantic features: human beings perceive facial images and compare their similarity using high-level features such as gender, hair color, race, etc. It is therefore essential to describe a face using high-level semantic features or a combination of both feature levels. [3] retrieved images by integrating CBIR and FERET with semantic features, using eigenfaces to extract low-level features. [67] used low-level features with high-level attributes and proposed attribute-enhanced sparse coding and attribute-embedded inverted indexing for image retrieval. [15] proposed methods that use automatically detected human attributes containing semantic cues of the face, constructing semantic codewords together with low-level features. [89] used local and global features.

3.5.1.2 Geographic Image Data

Geographic image data provide information about objects in the real world. These objects can be represented by low-level features and high-level semantic features: low-level features include color in RGB or HSV, shapes, polygons, and textures, while high-level semantic features include application-oriented semantic classes such as river, forest, desert, etc. [61] designed a region based image retrieval system in which the similarity between two images was measured from individual region-to-region similarity, extended to image-to-image similarity over all segmented regions within the image.

[51] retrieved images by classifying them into predefined semantic classes such as cloud, water, forest, farmland, bare soil, rock, and urban area, using gray scale images. The same concept was applied to image retrieval from multispectral isolated images [46], using multi-band isolated JPEG 2000 codec images to retrieve an area of interest with the hue, saturation, and value color model. [84] retrieved predefined classes from isolated images and databases, with the classes defined as city, cloud, desert, field, forest, and sea.

3.5.1.3 e-learning Video Data

e-learning data are video data used in online courses for teaching purposes, combining audio, video, and text. Because e-learning videos are created for a specific course or topic, they are usually represented and retrieved through high-level semantic features. [90] proposed a system that uses a domain ontology, defining academic elements such as introduction, definitions and theorems, theory, diagrams and discussions, review, question and answer, and subtopics.

3.5.2 Dimensional Modeling for Multimedia Data

Dimensional models for multimedia data warehouses use the star schema [81, 78, 79, 64, 58], the starflake schema [87, 42], and the snowflake schema [64, 9]. [78, 79] use a star schema together with features of an object-relational database to meet the requirements of integrating heterogeneous types of data. XML based multidimensional approaches [78, 79, 44, 63, 43, 62, 88, 55] have also been used for storing multimedia and complex data. [9] uses a snowflake schema, called the multiversion model, which stores measures in the fact table and the descriptors of the multimedia data in dimension tables; it presents the concept of a multiversion dimension, composed of several versions of dimensions, each being a dimension for a given version with its own schema, the schemas of the various dimensions being described through hierarchical levels. [14] proposed a temporal data model for semi-structured data. [8] uses a data mining technique, decision trees, to select the relevant data to be modeled according to the analysis goals. [82] designed a visual cube and proposed an algorithm for visual cube construction, introducing a Multi-Dimension scheme in which the cube has three dimension schemas and a Single Dimension scheme in which the cube has two dimension schemas.

[5] presented a hierarchical way of structuring the data and the extracted information; the data model represents facts and dimensions according to the hierarchical structure of the entities captured in the multimedia objects. XML based approaches are also used to store data in an XML data warehouse in terms of XML documents, XML databases [24], and XML cubes [17, 29, 31, 32, 47, 48, 80]. Once the data is modeled in the data warehouse as data cubes, it is analyzed, and relevant information extracted, using appropriate tools.

3.5.3 Multimedia Data Analysis

Multimedia data can be accessed and analyzed by supplying different dimensional criteria. To analyze multimedia data, [9] presented a prototype model from which aggregated ECG data is calculated. Xin Jin et al. [82] constructed a visual cube and extracted and counted images from the presented prototype. [19] proposed a prototype for multidimensional image retrieval.

3.6 Performance Factors

Performance factors are core features that can be used to improve the warehouse's storage performance and query performance. The main performance pillars for the data warehouse are:

- Compression
- Indexing
- Partitioning
- Materialized views

3.6.1 Compression

Data compression improves storage performance. Its aim is to minimize the amount of data to be stored and transmitted: compressed files occupy less disk space than uncompressed files, so compression reduces storage costs. Data compression also increases the speed of data transfer, because a smaller file transfers faster than a larger one, and its use in databases improves system performance by reducing I/O cost. It likewise reduces the number of bits required to store and/or transmit digital media. Compression can be performed on a data warehouse at four levels [25]: file-level compression, page-level compression, record-level compression, and attribute-level compression.

File-level and page-level compression achieve better compression, but they do not perform well for query processing, since an entire file or page has to be compressed or decompressed, which increases CPU overhead and degrades performance. Record-level and attribute-level compression perform well for queries but do not give as good a compression ratio as the first two types [25].

Image compression refers to reducing the irrelevance and redundancy of image data; the redundancy and similarity among different regions of an image make compression feasible. Storing compressed data in the data warehouse is very convenient for saving disk storage [22]. Other reasons for storing data in compressed form are [52]:

- It reduces query execution time, as the data stored in the data warehouse is static.
- It reduces CPU overhead, since data is searched within less space.
- It reduces data redundancy.
- It reduces the probability of transmission errors, since fewer bits are transferred [54].

Two types of compression technique are used with multimedia data: lossless compression and lossy compression. Each further divides into different methods, used by different file formats and achieving different results. The classification of compression techniques is shown in Figure 3.8.

Figure 3.8 Classification of compression techniques: lossless compression (run-length encoding, Huffman encoding, arithmetic encoding, entropy encoding, area coding) and lossy compression (predictive coding, and transform coding using FT, DCT, or DWT)
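As a minimal illustration of the lossless branch in Figure 3.8, the following sketch implements run-length encoding, the simplest of the listed methods; it is an illustrative toy, not a production codec.

```python
def rle_encode(data: bytes):
    """Run-length encoding: collapse each run of identical bytes into a
    (count, value) pair. Lossless: decoding restores the input exactly."""
    runs = []
    for b in data:
        if runs and runs[-1][1] == b:
            runs[-1][0] += 1
        else:
            runs.append([1, b])
    return [tuple(r) for r in runs]

def rle_decode(runs) -> bytes:
    return b"".join(bytes([value]) * count for count, value in runs)

sample = b"AAAABBBCCD"
encoded = rle_encode(sample)          # [(4, 65), (3, 66), (2, 67), (1, 68)]
assert rle_decode(encoded) == sample  # no information was lost
```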

Lossless Compression. A lossless compression algorithm reduces file size with no loss in quality: the file is compressed when saved, and the original data is recovered exactly on decompression. Lossless methods include run-length encoding, Huffman encoding [40], arithmetic encoding, entropy encoding, and area coding. This kind of compression suits drawings, logos, text, and other simple images that would not look good when compressed with lossy compression. [11] discussed the Huffman and arithmetic algorithms for multimedia compression. Examples of lossless image compression formats are PNG and GIF.

Lossy Compression. A lossy compression algorithm permanently discards redundant pixel information, so the original data is not recovered when the file is decompressed. It uses predictive and transform coding techniques; as shown in Figure 3.8, transform coding uses the FT, DCT [1], or DWT methods. This technique suits photographs and videos. Examples of lossy compression formats are JPEG and MPEG. The compression format can be chosen according to the type of image or video data and the application domain in which it is used.

Like images, videos can be compressed with lossless or lossy techniques. Lossless video codecs achieve only average compression, while lossy video compression provides a better compression ratio, and the majority of video compression algorithms are lossy. At the same time, highly compressed video may exhibit visible or distracting artifacts. Video compression uses different techniques to reduce redundancy in the video data; lossy techniques include the MPEG formats and many others.

Mohd. Fraz et al. [59] proposed lossless and lossy compression techniques for relational databases. The technique is applied at the attribute level of the data warehouse, with lossless compression on three attribute types (string, integer, and float) and lossy compression on image attributes, using the JPEG coding algorithm to compress images; they obtained a 13.5% compression ratio on images. P. Singh et al. [68] surveyed features for image and video compression, concluding that, for image compression, JPEG is an excellent way to store 24-bit photographic images and was designed to compress color or gray scale continuous-tone images or graphics, whereas vector graphics do not compress well under JPEG.
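The lossless/lossy trade-off above can be observed directly. This small sketch, assuming the Pillow library and an arbitrary input file name, encodes the same image as lossless PNG and as JPEG at two quality settings and compares the resulting sizes.

```python
import io
from PIL import Image

def compressed_size(img, fmt, **opts):
    """Encode the image in-memory and return the byte count."""
    buf = io.BytesIO()
    img.save(buf, format=fmt, **opts)
    return buf.tell()

img = Image.open("photo.png").convert("RGB")  # input file name assumed
print("PNG  (lossless):", compressed_size(img, "PNG"))
print("JPEG quality 90:", compressed_size(img, "JPEG", quality=90))
print("JPEG quality 30:", compressed_size(img, "JPEG", quality=30))
# Lower JPEG quality discards more pixel information: the file shrinks,
# but decompression can no longer reproduce the original data exactly.
```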

[26] analyzed lossy compression algorithms for medical images, while [28] analyzed multimedia compression algorithms, and [57] provides a comparative study of lossy image compression in a multimedia data warehouse.

3.6.2 Indexing

Indexing in a data warehouse environment reduces query execution time. With too few indexes the data loads quickly but query response is slow; with too many indexes the data loads slowly and storage requirements increase, but query response is good. Selecting the right index structure for the right columns improves query performance [42]. The B-tree index [21], bitmap index [18, 60, 66], and join index [34, 63] are the indexing techniques used in data warehouses, each suited to a particular situation.

B-tree indexes should only be used for high-cardinality data and predictable queries; in the warehouse they are used to enforce unique keys. The bitmap index is best suited to columns with low cardinality and should only be considered for low-cardinality data [72, 70]: bitmap indexing is useful for low-cardinality domains because comparison, join, and aggregation operations are reduced to bit arithmetic, which reduces processing time as well as space and I/O. The join index is useful for maintaining the relationship between a foreign key and its matching primary key; the star schema model of the data warehouse exploits the join index through the linkage between the fact table and the corresponding dimension tables, maintaining relationships between the attribute values of a dimension and the corresponding rows in the fact table.

Multimedia data, or the data that represents it, can be indexed [20], and multimedia data warehouses [15, 65, 79, 85, 16, 42] use indexing to speed up query processing. Ankush Mittal et al. [6] designed a system for indexing videos using audio, video, and PowerPoint slides, segmenting them into various lecture components. A content server, a repository of multimedia content, maintains the indexes for the metadata; the physical records are then retrieved by comparing their indexes with the domain-specific indexes stored in the content server [54].
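The reduction of selection to bit arithmetic mentioned above can be sketched in a few lines. This toy bitmap index over a low-cardinality column uses Python integers as bit vectors; the column values are assumptions for the example, and it is an illustration rather than a DBMS implementation.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """One bitmap per distinct value; bit i is set if row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return index

# Low-cardinality dimension column (e.g., a region attribute).
region = ["east", "west", "east", "north", "west", "east"]
idx = build_bitmap_index(region)

# Predicate evaluation becomes bitwise arithmetic: rows where region is
# 'east' OR 'west' is a single OR of two bitmaps.
mask = idx["east"] | idx["west"]
matching_rows = [i for i in range(len(region)) if mask >> i & 1]
print(matching_rows)  # [0, 1, 2, 4, 5]
```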

3.6.3 Partitioning

The data warehouse houses tables that are sometimes millions of rows deep and thousands of columns wide, which increases access time and maintenance cost. Partitioning is done to enhance performance and ease management, and can be applied horizontally or vertically. Horizontal partitioning distributes the rows of a table into groups; it decreases maintenance cost because the number of index levels drops as the number of rows per partition decreases. Vertical partitioning groups the columns of a table and divides them into a number of tables; it improves data access time because only the required columns are accessed. Partitioning optimizes hardware performance and simplifies the management of the data warehouse. The fact table in a data warehouse grows to many hundreds of gigabytes, and at such sizes it is very hard to manage as a single entity. Partitioning the fact table into sets of data enhances query processing: query performance improves because a query scans only the relevant partitions rather than the full volume of data. A small sketch of horizontal partitioning follows.
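As a minimal illustration of horizontal partitioning and partition pruning, the following sketch splits fact rows by year and answers a query by scanning only the relevant partition; the row layout and year boundaries are assumptions for the example.

```python
from collections import defaultdict

# Fact rows as (date, product, amount); the layout is assumed for illustration.
facts = [("2012-03-01", "P1", 10.0), ("2013-07-12", "P2", 5.5),
         ("2013-11-30", "P1", 7.2), ("2014-01-15", "P3", 3.1)]

# Horizontal partitioning: distribute rows into groups by year.
partitions = defaultdict(list)
for row in facts:
    partitions[row[0][:4]].append(row)

# Partition pruning: a query restricted to 2013 scans one partition,
# not the whole fact table.
total_2013 = sum(amount for _, _, amount in partitions["2013"])
print(total_2013)  # 12.7
```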

3.6.4 Materialized View

A materialized view contains aggregated data derived from a fact table in order to answer user queries quickly. It requires additional space to store the aggregated, pre-calculated data, and the views must be refreshed whenever the associated tables on which they are built are updated. When working with large amounts of joined and aggregated data, materialized views help improve overall performance. A materialized view can also be created with a unique clustered index to improve query performance, because the view is then stored in the database the same way a table with a clustered index is stored. An indexed materialized view enhances performance because:

- Aggregations are pre-calculated and stored in the index.
- Pre-joined tables are stored.

Materialized views can be built up front to support frequent queries; for unanticipated queries, however, the system must still scan and access the actual data. [13] proposed several algorithms for optimized, cost-effective selection of materialized views, and [33] created materialized views by query clustering selection in an XML data warehouse.
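To close, a minimal sketch of the materialized-view idea using SQLite; SQLite has no native materialized views, so the pre-aggregated result is stored as an ordinary table with an index, which is an assumption of the example rather than standard warehouse syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (product TEXT, amount REAL);
INSERT INTO fact_sales VALUES ('P1', 10.0), ('P2', 5.5), ('P1', 7.2);

-- 'Materialize' the aggregate: pre-compute it once and store it as a table,
-- so frequent queries read the small summary instead of scanning the facts.
CREATE TABLE mv_sales_by_product AS
    SELECT product, SUM(amount) AS total FROM fact_sales GROUP BY product;

-- A unique index plays the role of the clustered index described above.
CREATE UNIQUE INDEX idx_mv_product ON mv_sales_by_product(product);
""")
print(con.execute("SELECT * FROM mv_sales_by_product ORDER BY product").fetchall())
# [('P1', 17.2), ('P2', 5.5)]
con.close()
```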