International Language Character Code

pp. 161-166
http://dx.doi.org/10.14257/astl.2015.81.33

International Language Character Code with DNA Molecules

Wei Wang, Zhengxu Zhao, Qian Xu

School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, China
{wangwei, zhaozx, xuqian}@stdu.edu.cn

Abstract. In 1994, Dr. Adleman solved a combinatorial problem using DNA as the computational mechanism, proving in principle that DNA computing can be used to solve computationally complex problems. Over the past 20 years, with the rapid development of biomolecular computing, scientists have established a series of theoretical models and succeeded in biochemical experiments. DNA computing has become an important research direction in computer science and molecular biology. This research presents a novel approach in which characters are encoded by permutations and combinations of the four nitrogenous bases (adenine, guanine, cytosine and thymine) in DNA molecules. The character encoding should support multiple languages and assign each character a unique identifier.

Keywords: DNA Storage, Character Encoding, DNA Computing

1 Introduction

With the rapid development of science and the information industry, especially multimedia technology, cloud computing and computer networks, computer storage equipment is expected to offer not only larger capacity, higher data transmission rates and more reliable storage quality, but also more economical and safer storage that is extensible in both time and space. The inherent defects of current computer storage systems have become apparent, and their lack of momentum for further development has become one of the bottlenecks in the advancement of computing. Neither HDD nor optical storage technology can cope with the future demand for computer storage.
It is estimated that the data storage density of semiconductor, magnetic disk and CD-ROM media will eventually reach its physical limit [1], so a new generation of alternative storage technology is urgently needed.

Meanwhile, biomolecular computing, for which Adleman [2] completed the first experimental verification, has developed rapidly. Over nearly two decades, a variety of theoretical models and experimental methods have emerged, such as the Adleman model, the Splicing System model, the Insertion-Deletion System model and the DNA-EC model [3]. DNA storage is an important branch of biomolecular computing because it offers high storage density, low hardware cost, parallelizable access, good scalability and integration, and long-term storage. In the foreseeable future, DNA storage systems are likely to replace traditional storage systems [4] [5].

ISSN: 2287-1233 ASTL Copyright 2015 SERSC

The DNA molecule is a powerful and effective natural information storage medium, and it has been widely used since 1985, when a DNA molecule was synthesized for the first time. There are obvious similarities between a DNA storage system and a traditional storage system: both are sequential storage devices, both use special symbols to mark the beginning and end of a single information section, and both use error correction coding to ensure the integrity of their information. As a result, DNA molecules can be used as a medium for storing information.

DNA storage technology is based on the DNA molecule as the storage medium. The four nitrogenous bases (adenine, guanine, cytosine and thymine) contained in the DNA molecule can be used to encode information. With existing biochemical methods, it is easy to clone DNA molecules and to modify the nitrogenous bases encoded in them; these operations are analogous to the read and write operations of a traditional storage system. Because of its advantages, such as stable and reliable operation, no wear, huge information capacity, long life, high quality, low cost per bit and parallelizable access, the DNA storage system is regarded as a high-density, large-capacity form of storage.

Although the DNA molecule has been proposed as a data storage medium, how to encode the information to be stored in it has not yet been settled. Character encoding is one of the most important foundations of a computer system, so this paper presents an exploratory study that uses permutations and combinations of the four nitrogenous bases of the DNA molecule to encode character information. The research addresses two major problems: selecting the storage medium and defining the coding rules.
2 Storage Medium

A DNA molecule used as an information storage medium can take many forms. It can be single-stranded or double-stranded, and it can be a long (linear) chain or a circular strand; some circular strands with special biological meaning are called plasmids [6]. These forms have different advantages and disadvantages as storage media, so these factors must be considered when choosing the medium, in order to bring the advantages of DNA storage and its simplicity of operation into full play. The DNA storage system uses circular single-stranded DNA molecules as the storage medium.

Single-stranded and double-stranded DNA each have advantages and disadvantages. Double-stranded DNA is more stable than single-stranded DNA, which is one of the most important reasons why most living organisms use double-stranded DNA as their genetic material, but data stored in double-stranded DNA are difficult to read: the two attached chains must be unzipped into single strands before reading and cloning. Single-stranded DNA can be read using the Watson-Crick complement principle, but it is less stable: it not only breaks more easily than double-stranded DNA but also readily forms self-complementary hairpin structures. We choose single-stranded DNA because it is easier to read and clone than double-stranded DNA. In addition, hairpin structures can be avoided through special design of the single strand.

Comparing long-chain (linear) DNA with circular DNA, a single cut by an endonuclease splits a long chain into two independent segments, whereas a circular strand remains in one piece and, under certain conditions, can even be joined back into a circle. Moreover, a long chain is easily degraded from its ends by certain exonucleases, whereas a circular strand is less susceptible to such degradation.

3 Coding Rules

The DNA molecule is composed of four nitrogenous bases, so permutations and combinations of the four bases can be used to encode the information to be stored in the DNA storage system. The coding rules are as follows.

3.1 Unique Code

To be compatible with different countries, languages and multi-language environments, each character must be defined by a unique code. The coding handles characters abstractly as combinations of adenine, guanine, cytosine and thymine (A, G, C and T for short), and leaves visual rendering, such as font size, shape, typeface, form and style, to application software such as a web browser or word processor.

3.2 Use of Permutation and Combination of Nitrogenous Bases

The coding rule is built from permutations and combinations of the four nitrogenous bases. To cover the characters of as many countries and languages as possible, the encoding adopts the Unicode range from 0 to 0x10FFFF, a total of 1114112 code points. Representing 1114112 code points with permutations and combinations of nitrogenous bases, so that each character has a unique code, requires 11 bases per code point.
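The figure of 11 bases can be checked with a short calculation. Each base takes one of four values, so a sequence of n bases distinguishes 4^n code points; the symbol n_min below is introduced here only for illustration.

```latex
\[
4^{10} = 1\,048\,576 \;<\; 1\,114\,112 \;\le\; 4^{11} = 4\,194\,304,
\qquad
n_{\min} = \left\lceil \log_{4} 1\,114\,112 \right\rceil = 11 .
\]
```

Ten bases fall short of the Unicode range, while eleven bases cover it with room to spare.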
Adenine (A) represents '00', guanine (G) represents '01', cytosine (C) represents '10' and thymine (T) represents '11'. To economize on storage space, redundant bases are dropped from the high-order end toward the low-order end, that is, leading A bases are omitted. Table 1 is the mapping table of nitrogenous bases.
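Before turning to the table, the two-bit assignment and the dropping of high-order bases can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the names `BASES`, `WIDTH`, `to_bases` and `simplify` are hypothetical, and the sketch assumes that "reducing duplication" means stripping leading A ('00') bases while keeping at least one base.

```python
# Assumed helper names; the base-to-bits assignment is the one given in the text.
BASES = "AGCT"  # index in this string equals the two-bit value: A=00, G=01, C=10, T=11
WIDTH = 11      # number of bases needed to cover code points 0..0x10FFFF


def to_bases(code_point: int) -> str:
    """Return the fixed-width 11-base sequence for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    digits = []
    for _ in range(WIDTH):
        digits.append(BASES[code_point & 0b11])  # take the lowest two bits
        code_point >>= 2                         # move to the next pair of bits
    return "".join(reversed(digits))             # high-order base first


def simplify(sequence: str) -> str:
    """Drop leading 'A' bases (value 00), keeping at least one base."""
    return sequence.lstrip("A") or "A"
```

With these definitions, `simplify(to_bases(0xAF))` yields `CCTT`, matching the corresponding row of Table 1.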

Table 1. The mapping table of nitrogenous bases

Unicode     Binary                        Sequence
0           0                             A
0x1         1                             G
0x2         10                            C
0x3         11                            T
0xA         1010                          CC
0xAF        1010 1111                     CCTT
0x10FFFF    1 0000 1111 1111 1111 1111    GAATTTTTTTT

3.3 Latin Letters

Computer systems support the basic Latin letters. ISO 8859-1 defines a set of 256 code positions for commonly used characters, such as digits and uppercase and lowercase Latin letters. Therefore, the first 256 positions in the character encoding are reserved for the characters included in ISO 8859-1, in order to improve encoding efficiency and compatibility.

3.4 Multi-Language Environment

To improve efficiency and compatibility across languages, the character encoding provides an independent zone for each language. The Unicode planes are a good reference for this design.

5 Algorithm

The algorithm describes how to encode characters with nitrogenous bases. First, the text file to be transformed is imported into memory. Then, following the order of the characters in the text, the Unicode code point of each character is obtained one by one. Following the coding rules, each code point is transcoded to nitrogenous bases. Finally, the result is output and stored as a DNA sequence. For example, the Unicode code point of the character "A" is 0x41 (01000001); the corresponding 11-base sequence is AAAAAAAGAAG, and the simplified sequence is GAAG. In an encryption round, the nitrogenous bases (DNA sequence) undergo the add round key, sub bytes, shift rows and mix columns operations, and the final ciphertext is stored.

Initialization
Import the plaintext file
for each character do
    Get the Unicode code point of the character, C_unicode
    Transcode C_unicode to C_DNA
    Output C_DNA to the stored DNA sequence
end for
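The loop above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `encode_text` and `decode_text` are hypothetical, file import is replaced by an in-memory string, and the encryption round is omitted. It also assumes that each character is written as a fixed 11-base block, without the leading-base simplification, so that the stream can be split back into characters; the paper does not state how simplified, variable-length codes would be delimited.

```python
BASES = "AGCT"                                   # A=00, G=01, C=10, T=11
VALUE = {base: i for i, base in enumerate(BASES)}
WIDTH = 11                                       # assumed fixed block width per character


def encode_text(text: str) -> str:
    """Transcode each character's Unicode code point to an 11-base block."""
    blocks = []
    for ch in text:                              # for each character
        cp = ord(ch)                             # get its Unicode code point
        block = ""
        for _ in range(WIDTH):                   # base-4 digits, low-order first
            block = BASES[cp % 4] + block
            cp //= 4
        blocks.append(block)                     # output to the DNA sequence
    return "".join(blocks)


def decode_text(dna: str) -> str:
    """Inverse operation: recover the text from concatenated 11-base blocks."""
    if len(dna) % WIDTH != 0:
        raise ValueError("sequence length is not a multiple of the block width")
    chars = []
    for start in range(0, len(dna), WIDTH):
        cp = 0
        for base in dna[start:start + WIDTH]:
            cp = cp * 4 + VALUE[base]            # accumulate base-4 digits
        chars.append(chr(cp))
    return "".join(chars)
```

A round trip such as `decode_text(encode_text(text)) == text` is a quick way to check that the mapping is invertible for a given input. The fixed width costs storage compared with the simplified codes, but it keeps decoding unambiguous without extra delimiter symbols.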

6 Verification of Algorithm

A text file that includes Latin letters, Chinese characters, Japanese characters, numbers and symbols is imported. The application software (Fig. 1 shows an example) first obtains the Unicode code point of each character in binary, and then transcodes it to nitrogenous bases following the coding rules. By inverting this operation, the application software can also recover the raw text from the DNA sequence.

Fig. 1. Example of the character encoding

7 Conclusions

This paper puts forward a character encoding for DNA storage systems. The encoding converts characters to sequences of nitrogenous bases and thereby implements the encoding and decoding of character information. It is compatible with multi-language environments, and every character code is unique.

Acknowledgment. Dr. Yang Guo is gratefully acknowledged for supporting this study. The Laboratory of Complex Network and Visualization made the publication of this article possible.

References

1. Wei Dan, "Review of magnetic information storage technology," in Physics, vol. 33(9), 2004, pp. 646-651.
2. Adleman L. M., "Molecular Computation of Solutions to Combinatorial Problems," in Science, vol. 266(11), 1994, pp. 1021-1023.
3. Zingel T., "Formal models of DNA computing: a survey," in Proc Estonian Acad Sci Phys Math, vol. 49(2), 2000, pp. 90-99.
4. Dietrich A. and Been W., "Memory and DNA," in J theor Biol, vol. 208, 2001, pp. 145-149.
5. Garzon M. H., Neel A., Chen H., "Efficiency and Reliability of DNA Based Memories," in GECCO, 2003, pp. 379-389.
6. Robert F. W., Molecular Biology, 2nd ed., Beijing: Science Press, 2003, pp. 642-682.