The Import & Export of Data from a Database

Introduction

The aim of these notes is to investigate a conceptually simple model for importing and exporting data into and out of an object-relational database, and to use it in RAQUEL.

The Database

The database is considered as a mathematical set of relations, which may also be organised into subsets, each subset corresponding to the relations of a database schema. (Strictly speaking, the relations are relational variables, since their contents may vary over time.) There will also be an associated meta DB which holds data dictionary information about the DB. The meta DB is used by the DBMS to handle the DB and is only ever updated by the DBMS, but may also be used by applications and users via the DBMS. It is self-referential, to avoid any need for a meta-meta DB. Logically, the meta DB can be considered either as completely separate from the DB or as a permanent subset of the DB.

In fact, the DB contents must be stored in operating system physical files. However, as applications and users only ever view the DB as a set of relations, the physical files are ignored. (Of course, the DB Administrator must manage the files, but the provision by the DBMS of physical data independence means that they can be ignored from the viewpoint of importing and exporting data to and from the DB [1].)

Importing & Exporting Data

The DB will always be accessed via the Database Management System and never directly, in order to provide data independence to applications and users of the DB. Thus the overall system architecture can be viewed as follows:

[Figure: system architecture. Labels: Source, Current Application, Sink; DBMS; Import of Relations; Export of Relations; DB = Set of Relations.]

[1] This does not deny the potential value of (say) Load tools that physically import large quantities of data into DB relations and set up physical indexes, etc. for them.

The current application is the one whose instructions the DBMS is currently executing (in principle there could be many applications that the DBMS is serving in parallel), and may be a traditional application program or a human user interface. Relations may be imported/exported from/to the current application and/or other data sources/sinks; in this context, the current application is just another source/sink.

The underlying model is the same as that of the Unix system, where data can be piped from one application program to another. However, in Unix the flow of data is always a sequence of bytes. This could be inadequate for a database system. For example:

1. To import a large file of data, considerable work could be caused by the need to convert the file to a byte stream for import and then back to its original structure.

2. To export the results of a query to a user interface usually requires formatting the data so as to obtain an acceptable screen presentation.

However, one could not expect the DBMS to be able to cope with an unlimited range of data structures. Therefore an appropriate strategy would be for the DBMS to handle a limited range of data structures, namely those suitable for its own internal operation, plus conversion facilities to/from byte streams, and to leave external applications to handle the data structures that they need, e.g. formatting of relations for presentation.

Relational DB Languages

Following C. J. Date, a relational DB language is regarded as comprising two parts: a sub-language to define relations (in this case relational algebra is used), and a sub-language to specify actions to be carried out on the defined relations. In RAQUEL, actions are modelled as assignments.

Most of the actions in a relational DB language concern the internal manipulation of the DB. However, there must be at least one action that exports data from the DB to users and/or applications, or the DB will be of no use.
Thus RAQUEL currently has the Retrieve action specifically to do this. Data must also be imported into a DB or it will remain empty. Thus RAQUEL currently has the Insert action specifically to do this.

In fact, as with most relational DB languages (e.g. SQL), the data inserted is typically embedded as literals in the Insert statement. This corresponds to the importation being from the current application. The literal data must be in an appropriate format so that it can be accepted by the DBMS, possibly after prior processing by the application. However, there is no reason why all inserted data must come from the application itself, and it would also be useful if the application could use the Insert action to import data from other external sources, e.g. files, peripheral devices, Unix pipes, etc. Likewise it would be useful if the Retrieve action could export data to other data sinks.

Traditional SQL interfaces allow embedded SQL in an application program. The SQL is extended to allow data to be retrieved into program variables. Where the retrieval imports more than one row of data into the program, SQL is extended by means of cursors to allow each retrieved row to be accessed by the program each time through a program loop. Rows can also be deleted and updated using embedded SQL, but not inserted.
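
The two SQL mechanisms just described, literal data embedded in an insert statement and row-by-row retrieval through a cursor, can be sketched with Python's standard sqlite3 module. This is only an illustration under assumptions: these notes do not mention SQLite or Python, and the table `person` and its columns are hypothetical.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT PRIMARY KEY, telno TEXT)")

# Import from the current application: the data is embedded as
# literals in the INSERT statement itself.
conn.execute(
    "INSERT INTO person VALUES ('Ann', '0800 111'), ('Bob', '0191 222')"
)

# Export to the current application: the cursor hands the program one
# retrieved row each time through the loop.
retrieved = []
cursor = conn.execute("SELECT name, telno FROM person ORDER BY name")
for name, telno in cursor:
    retrieved.append((name, telno))

conn.close()
```

In both directions the only party exchanging data with the DB is the program itself, which is the limitation the rest of these notes set out to remove.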

ODBC interfaces are for programming languages that have no facility for embedded SQL. They allow any SQL statement to be passed to an SQL DB, and therefore allow any normal SQL insertion, deletion or amendment. However, importing or exporting data between the DB and any other source or sink using SQL is not permitted, except via vendor-supplied utilities, which generally must be handled in a completely different way from the normal use of SQL.

The aim here is to generalise the insertion, deletion and amendment of data in relations so as to easily accommodate the importation and exportation of data from and to a much greater range of sources and sinks. This in turn is to facilitate the handling of object classes as data types, e.g. for sound and video, and to facilitate the handling of data over the Internet.

The DB Language RAQUEL

All RAQUEL actions are monadic or dyadic, as required, i.e. they take one or two relational operands respectively. Where an action requires additional input(s) to specify precisely what it should do, it also takes parameter(s). (This makes it equivalent to RAQUEL's algebra operators, and achieves simplicity.)

Currently Insert is a dyadic action, which takes an algebra expression that evaluates to a relational value on the RHS, and inserts that value by assignment into a relational variable on the LHS. All the imported data is embedded in the expression, and so the only genuinely new data that can be imported into the DB is literal data. Hence some development is required to allow the import of other external data. For simplicity, this should make an external data source look like a relation so that it can easily be incorporated into an expression.

Currently Retrieve is a monadic action, which takes an algebra expression that evaluates to a relational value on the RHS, and exports that value, typically to a screen for display, but in principle to any suitable file, pipe or device. Thus on the face of it, Retrieve is the export action.
However, Retrieve is anomalous, because all other monadic actions assign relational values to a relation on the LHS, whereas Retrieve assigns a relational value to an external entity. This is a semantic matter rather than a syntactic one, and is therefore of concern even if one required a different syntax for RAQUEL, e.g. a graphical syntax. Again for simplicity, making an external data sink look like a relation would solve the problem.

RAQUEL currently has three categories of actions, to help simplify the language. (Each category can be reflected in an action's syntax to facilitate its use.) The categories are:

1. Conventional value assignments, where values are assigned to relational variables.

2. Constraint assignments, where constraints are assigned to relational variables.

3. Binding assignments, where bindings to physical storage are assigned to relational variables.

The Problem: What Action(s) Should Import/Export Data?

For simplicity, the design decision is taken that external sources and sinks will appear as relational variables. This ensures that there continues to be only one kind of value structure and one kind of variable structure in RAQUEL, namely the relation. This also fits in with the Insert and Retrieve actions (and incidentally the Delete and Amend actions too) as described above. Thus the question arises as to what additional actions are required in RAQUEL to make this possible.

There are two possible models for the operation of import and export actions:

1. Have an action that formally binds a relational variable to a data source or sink outside the DB, say a data file or a computer screen respectively. When data is inserted into or retrieved from such a relation using a normal action, data is actually imported or exported.

2. Be able to use a data source or sink explicitly as if it were a relation, but without binding. This would work like the first possibility, but would not require a binding action. However, it would be necessary to ensure that RAQUEL could distinguish a data source or sink from any other version of a relation, presumably by some syntactic convention.

If a variety of types of source and sink are to be made available, then the first option is necessary in order to have a means of declaring what type each source/sink is. There is no necessity for the second option in addition to the first. So for simplicity just the first option is used.

Monadic Source and Sink assignments are proposed, each of which, together with its parameters, would assign a type of source or sink respectively to the variable name on its LHS.

The question arises as to which of the three categories Source and Sink should fit into. They cannot sensibly be put into the value assignment category, as a relational value is not being assigned.
They cannot sensibly be put into the constraint category, as a constraint here is used only to constrain the permissible values that can be held in a relation. They can be put into the binding category, since that category is designed to associate a relation with its storage mechanism, which corresponds to what sources and sinks are.

A source/sink differs from internal DB physical storage since it is external to the DB, by definition. However, they are the same in that logical relations are being mapped onto physical storage. Logical data independence is provided in that relational algebra can be applied to a source/sink relation in the same way as to any other. Binding actions still need never be known by DB users or applications, unless a user specifically wants to associate something external as a source/sink.

Thus relational values are generalised so that they can now be expressed in four ways:

1. The name of a relational variable.

2. A relational literal (or constant).

3. A source or sink that contains or will contain a relational value.

4. A relational algebraic expression, which can involve any of the above three forms of relational value.
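
The binding model can be made concrete with a small sketch in Python. It is not RAQUEL and not an implementation of it: the classes `Relation` and `Source`, the function `insert`, the method `project_away`, the CSV file format and the sample data are all assumptions made for illustration. The point shown is only that a name bound to an external file can be evaluated and used in an expression exactly like any other relational value.

```python
import csv
import os
import tempfile


class Relation:
    """A relational value: attribute names plus a set of tuples."""

    def __init__(self, heading, rows=()):
        self.heading = tuple(heading)
        self.rows = {tuple(r) for r in rows}

    def value(self):
        return self

    def project_away(self, attr):
        # Drop one attribute; the set discards duplicates that result.
        keep = [i for i, a in enumerate(self.heading) if a != attr]
        return Relation(
            [self.heading[i] for i in keep],
            [tuple(r[i] for i in keep) for r in self.rows],
        )


class Source:
    """Hypothetical binding of a relation name to an external CSV file."""

    def __init__(self, path):
        self.path = path

    def value(self):
        # The external data is read only when the name is evaluated.
        with open(self.path, newline="") as f:
            reader = csv.reader(f)
            heading = next(reader)
            return Relation(heading, reader)


def insert(target, operand):
    """Insert action: add the operand's tuples to a relational variable."""
    target.rows |= operand.value().rows


# Set up an external file to act as the data source (sample data only).
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    csv.writer(f).writerows(
        [["Name", "TelNo"], ["Ann", "0800 111"], ["Bob", "0191 222"]]
    )

much_data = Source(path)            # binding assignment
my_relation = Relation(["Name"])    # ordinary relational variable
insert(my_relation, much_data.value().project_away("TelNo"))
os.remove(path)
```

The `insert` function never needs to know whether its operand originated in the DB or in a file, which is the sense in which logical data independence is retained.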

There is no logical reason why a relation defined as a source should not also be defined as a sink, although there may be practical or physical constraints that prevent this.

In practice it will probably be very useful to have certain default sources and sinks, in the Unix manner. In particular, a sink that takes query results and formats and displays them on a screen for a user would be helpful. Alternatively, required defaults could be automatically declared when the DB is opened.

Questions Arising

Under this data model, is there a logical difference between inserting a relation into a sink using the Insert action and using a Retrieve action to put data into a sink? Do the differences arise solely due to the nature of the sink? A sink which accumulates data will act like a traditional relation that has data inserted into it. A sink that throws its data away after using it, as a display screen might do, would only contain what was inserted into it (and that possibly only transiently). Therefore can the Retrieve action be removed to simplify the language?

Note that exporting to sinks and importing from sources specify particular directions of data movement. Assignment specifies movement to the relation on its LHS, regardless of whether it is a source or sink.

Currently the only thing that the Retrieve action can do that the Insert action cannot is sort the retrieved relation. Since a relation is a set of tuples, it is not meaningful to sort it as a logical action within the DB, but it can be of great practical importance to sort a retrieved relational result as it is exported out of the DB. Should the sorting be left to the external interface? Should the relation be turned into a sequence relation before being retrieved, in order to get the tuples in the desired order?

Examples

1.  Result ==Sink[ /usr/db/result ]
    Result <--Retrieve AlgebraExpression

    Assumes Unix file. Other parameters could be added.

    These two can be combined into one statement :-
    ( Result ==Sink[ /usr/db/result ] ) <--Retrieve AlgebraExpression

    Result is permanently defined.

2.  Result <--Retrieve AnotherAlgebraExpression

    Overwrites contents of file Result with new result.

3.  Result <--Insert YetAnotherAlgebraExpression

    Inserts a relational value into Result, which also retains its original content; checks no duplicate tuples are inserted.

4.  MuchData ==Source[ /usr/files/muchdata ]
    MyRelation <--Insert MuchData Project[ ~ TelNo ]

    Removes TelNo attribute from relation in file MuchData and inserts result into MyRelation.

5.  Result <--Retrieve MuchData Restrict[ Telno Like 0800% ]

    This retrieval just passes through the DB. Abnormal, but not invalid!

David Livingstone, 4 January 2002
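
The difference between Examples 2 and 3, and the question of sorting on export, can be illustrated with a minimal Python sketch of a file sink. Everything here is an assumption made for illustration: the class `FileSink`, its methods, the tab-separated file layout and the sample tuples are hypothetical and are not part of RAQUEL.

```python
import os
import shutil
import tempfile


class FileSink:
    """Hypothetical sink: a relation name bound to a text file,
    one tab-separated tuple per line."""

    def __init__(self, path):
        self.path = path

    def read(self):
        if not os.path.exists(self.path):
            return set()
        with open(self.path) as f:
            return {tuple(line.rstrip("\n").split("\t")) for line in f}

    def _write(self, rows, sort_key=None):
        # A relation is unordered; any ordering is applied only on export.
        ordered = sorted(rows, key=sort_key) if sort_key else list(rows)
        with open(self.path, "w") as f:
            for row in ordered:
                f.write("\t".join(row) + "\n")

    def retrieve(self, rows, sort_key=None):
        """Retrieve: replace the sink's contents with the new result."""
        self._write(set(rows), sort_key)

    def insert(self, rows):
        """Insert: keep existing content; set union drops duplicates."""
        self._write(self.read() | set(rows))


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "result")
result = FileSink(path)  # plays the role of the Sink binding

# Retrieve with sorting applied as the tuples leave the DB.
result.retrieve(
    {("Bob", "0191 222"), ("Ann", "0800 111")}, sort_key=lambda r: r[0]
)
with open(path) as f:
    after_retrieve = f.read().splitlines()

# Insert retains the original content and adds only new tuples.
result.insert({("Ann", "0800 111"), ("Cy", "0800 333")})
after_insert = result.read()

# A second Retrieve overwrites whatever the sink held.
result.retrieve({("Dee", "0800 444")})
after_overwrite = result.read()

shutil.rmtree(tmpdir)
```

In this sketch the two actions differ only in whether existing content is kept and whether an ordering is imposed on the way out, which is one way of framing the question of whether Retrieve is needed as a separate action.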