Introduction. The XML-XSLT Workflow: Building a Digital Edition

This chapter is for absolute newcomers to digital editions and XML; please skim or skip it if you are more experienced.

There are a few terms we will use in discussing a digital scholarly edition that is built using TEI, a set of XML tags designed by the Text Encoding Initiative Consortium (http://www.tei-c.org). The first is really a set of terms, all of which are used when talking about XML and HTML encoding: tag, root, element, path, and attribute.

Tag: anything inside <>
Root: document tag
Element: name for a structure to be coded
Path: the road to documents (and even inside)
Attribute: information about an element

In XML and HTML encoding, angle brackets (<>) indicate that something is a tag: the word inside them is not part of the text of the document but is to be read as code. The root of a document and the elements that structure an XML document always occur inside tags in that document. The root tag names the kind of document that it is: <html> is the first and the last tag in an HTML document, a web page; <TEI> is the first and last tag in a TEI/XML document. I say first and last because in XML there is always an open and a close tag, the close tag indicated by a forward slash (/). The open root tag says, the document begins here: <TEI> or <html>. The close root tag says, the document ends here: </TEI> or </html>. Similarly, all of a document's structural elements must begin with an open tag (<body>, <div>, <p>, e.g.) and end with a close tag (</body>, </div>, </p>). Sometimes elements have attributes. These specify something about the element: <div type="poem">. Attribute names are typically indicated by the @ sign.

The TEI P5 Guidelines contain an alphabetical list of all the TEI elements (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-ELEMENTS.html) and attributes (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-ATTS.html), but not all attributes can be used by all elements, so it is best to find the list of attributes available for any element on that element's own page:

Figure 1: Attributes of the <div> element shown in the TEI Manual

The eyes of your computer don't really see folders and documents, only the paths to things:

Figure 2: Paths visible in the Mac Finder Window, two ways.
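For readers who like to see terms in action, here is a minimal sketch of tag, root, element, path, and attribute, using Python's standard library rather than the tools this book teaches. The sample document is invented for illustration and is deliberately simplified: a real TEI file also has a header and a namespace, both omitted here.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document; the poem text is placeholder content.
SAMPLE = """<TEI>
  <text>
    <body>
      <div type="poem">
        <p>First line of a poem.</p>
      </div>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Root: the element whose open and close tags wrap the whole document.
print(root.tag)

# Path: the road from the root down to an element inside the document.
div = root.find("text/body/div")

# Attribute: information about an element, here the @type of <div>.
print(div.get("type"))

# Element content: the text between <p> and </p>.
print(div.find("p").text)
```

The path "text/body/div" plays the same role inside a document that a folder path plays on your computer: it names each step on the way to the thing you want.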

When you search for documents and folders on your computer, it follows a path. It only knows the difference between a folder and a document because a document name ends with an extension indicating that it is a separate document:

1. .docx (a Word document)
2. .pdf (an Adobe Acrobat document; it stands for portable document format and is opened by Adobe Reader or other Acrobat products)
3. .xml (a TEI or other XML document; it stands for a document coded in eXtensible Markup Language and is opened by oXygen if you have downloaded it, but it can also be opened by a plain-text editor and by some browsers, though in a browser it can only be viewed, not edited)
4. .html (a web page; it stands for a document coded in hypertext markup language and is opened by a browser such as Internet Explorer, Safari, Firefox, or Chrome)

Your computer knows which programs to use in order to open each kind of document, sometimes after you have set the program as your default.

The second term is metadata (Fig. 3):

Figure 3: card from a card catalog.

If you are old enough, you'll remember the card catalog. Metadata is every factual element about a book that can be expressed: its title, author, publisher, place of publication, size, and numbers identifying it in various ways.

The third term we will use is digital surrogate, which is a stand-in for a printed document that is available on the computer. Here (Fig. 4), you can see an entry for The Life of John Buncle by Thomas Amory.

Figure 4: entry in the Oxford Text Archive (OTA)

The entry page for this particular document allows you to download it and use it in multiple ways, according to a Creative Commons License (https://creativecommons.org/). Each digitized version of this document available for download (XML, HTML, epub, mobi (Kindle), plain text) is a digital surrogate of the original document. (You can see that the XML link is in purple because I have downloaded it: as I will explain shortly, the XML is the master that was used to generate the other surrogates in the list.) But first, in order to understand the difference between a master and other surrogates, let's talk about the conceptual and material difference between XML and HTML by looking at an original book, a document (Fig. 5), and its digital expressions.

Figure 5: a picture of the actual document, the North Point Press Expanded Edition of The Senses of Walden, by Stanley Cavell.

Here (Fig. 6) is a web page containing the book:

Figure 6: The beginning of Senses of Walden

This (Fig. 6) is an HTML page that one might come to, presenting the text online. Browsers (the one I'm using here is Chrome) are software that render code to the screen, and so the coded document that the browser is reading is an HTML document. Here (Fig. 7) is what the code looks like, behind the scenes:

Figure 7: HTML code informing Figure 6

Whatever browser you use, be it Internet Explorer, Firefox, Safari, or Chrome, it transforms HTML code into the web pages that you see when you surf the net. An XML surrogate of the book (Fig. 8) is quite different:

Figure 8: XML encoding of the first part of Senses of Walden

I have not used TEI to XML-encode this document. I made up my own XML tags, which anyone can do in creating a valid XML document: that's the power of XML. It is code written in human language, readable by us as well as by machines. XML is called semantic markup because the code has meaning in human language. I have created the tags I wish to give my surrogate: book, title, author, chapter, and p for paragraph. Any XML document that you create with any words or names you wish to use is valid (in XML terminology, "well-formed") as long as it has a single root tag or element enclosing everything else, including any elements inside that root (Fig. 9).
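As a hedged illustration of that rule (separate from the figures, which show oXygen), the sketch below asks Python's standard XML parser whether two short texts are acceptable XML. The tag names are the made-up ones listed above; the helper name is_well_formed is my own invention, and the check is only for well-formedness, not for conformance to TEI or any other schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Made-up tags: a root element (<book>) with elements inside it.
good = "<book><title>The Senses of Walden</title><author>Stanley Cavell</author></book>"

# The same beginning, but the closing </book> tag has been left off.
bad = "<book><title>The Senses of Walden</title>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first text is accepted and the second is rejected, because every open tag, the root included, needs its close tag.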

Figure 9: Minimal XML Document

This is a minimal XML document in oXygen, the software that we will use for XML encoding throughout this book. Notice the green box toward the top right: that green box means that this is a valid document. The software reads the document declaration at the top, enclosed by angle brackets and question marks (<?xml version="1.0"?>), and knows that it can have any tags inside its angle brackets, any words whatsoever, but that it must have a root element/tag enclosing whatever other elements/tags it contains.

XML documents are idiosyncratic: any word can be used to name an element or tag. Consequently, browsers cannot render them as web pages. No software could process any and every word in our natural language used any and every way. XML documents are only behind-the-scenes documents. The only reason to create an XML document is because its elements/tags are semantically meaningful to human readers, not to software.

Let me compare the surrogate of The Senses of Walden that browser software can translate into a web page (HTML) with the surrogate of it that cannot be rendered by your browser (XML), Figures 10 and 11:

Figure 10: HTML

Figure 11: XML

Whereas the HTML (Fig. 10) encloses the title of the book with an <h2> tag, the XML (Fig. 11) encloses it with a <title> tag. In both HTML and XML, we have an open tag (<h2>, <title>) and a close tag (</h2>, </title>); of course, the names inside the angle brackets differ. To the browser software, h2 says, make these words a certain size; to human readers of the tags, title says, the words inside this tag are a title.

Semantic markup, markup that is meaningful to humans, allows for using digital tools to explore sets of documents that use the same elements. Someone making a web page might choose to make the author's name normal size, using a <p> tag, or vary the size using <h1>, <h2>, <h3>, or <h4>, and might make such choices inconsistently across documents. In contrast, in semantic markup, one would always use the <author> tag for an author. This consistency matters within one coder's set of documents, and it matters even more when documents come from multiple coders and document sets. If everyone encoded in XML, and every XML encoder had an <author> tag, we could search through thousands or tens of thousands of documents coded by many people and say, give me all the author names for these documents.
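To make the "give me all the author names" request concrete, here is a small sketch in Python's standard library. It is an illustration only: the three documents are invented stand-ins for files made by different encoders, and the function name author_names is hypothetical. The point is that the request works because every document uses the same <author> tag, whatever else differs.

```python
import xml.etree.ElementTree as ET

# Three made-up documents standing in for files by different encoders.
documents = [
    "<book><title>The Senses of Walden</title><author>Stanley Cavell</author></book>",
    "<book><author>Thomas Amory</author></book>",
    "<letter><author>Robert Bloomfield</author><p>Placeholder text.</p></letter>",
]

def author_names(xml_documents):
    """Collect the text of every <author> element in every document."""
    names = []
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        # iter("author") visits every <author> element, at any depth.
        for author in root.iter("author"):
            names.append(author.text)
    return names

print(author_names(documents))
```

Had one encoder marked the author with <h2> and another with <p>, no such simple request could have found the names.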

The TEI Consortium came into existence precisely to specify XML elements so that encoders can all use the same names for things, making it possible to text-mine documents or put them all into a database no matter who has encoded them.

And then there is the future. If we were only encoding digital surrogates to be shown on web pages, now and forever, HTML would be sufficient. We are not. We are encoding library-quality digital surrogates to last forever: to be usable by whatever software comes into existence in the future. An XML document cannot be read by a browser, but it can easily be transformed into HTML so that browsers can read it. This book teaches you how to create code that will perform such transformations, as well as code that will process documents to get a list of authors or load information from myriad TEI documents into a database.

Just as a print edition has parts (Fig. 12),

Figure 12: Parts of a Print Edition

so does a digital edition (Fig. 13):

Figure 13: the parts of a digital edition

I will come back to describe all of the parts, but for the moment, I will discuss first the relationship between a cascading stylesheet, or CSS, and HTML, the focus of Chapter 1. A server that is accessible to the World Wide Web contains the HTML code of your digital edition. That server sends the code to any user who visits your web site, and the user's browser processes the HTML code into the web pages you typically see on the Internet (Fig. 14, 15, 16):

Figure 14

Figure 15: Server Side (the code that a URL goes to get)

Figure 16: Client Side, what you see on the Internet (the code as it has been read and transformed by a browser)

But wait, you may say: that's not what I see when I go to Robert Bloomfield's poem To Immagination (he spelled it that way) in the Romantic Circles edition of Robert Bloomfield's Letters (https://www.rc.umd.edu/editions/bloomfield_letters/); there, I see this (Fig. 17):

Figure 17: an early version of the web edition

If you were to look into the code, you would see that the HTML code for this version calls a cascading stylesheet. That is, more code, in another coding language that is not HTML, is also on the server, and a link to it is put into the HTML code so that, when you as a viewer go to Bloomfield's poem using your browser, the page no longer comes up looking like a plain white web page: a CSS file is used to style the plain web page into something else. Early in the history of making web pages, coders simply added the CSS code to the HTML code, but now they make styling a separate file so that you can easily use the same style in all your web pages (without retyping the code), and just as easily change the look of the page, as Romantic Circles has done for the Bloomfield letters. Changing one CSS file changes the look of all the letters or chapters in your edition, and of course they look uniform throughout your edition.

You will learn the HTML code that references a stylesheet, and more about making CSS, in Chapter 1. But for now, it is important to know how CSS, a very simple language, can dramatically change the look of all the documents in an edition that rely on that particular cascading stylesheet, that particular CSS file. To get a sense of how much a CSS file can change things, one only needs to go to CSS Zen Garden (http://www.csszengarden.com) to see how one and the same HTML document has been styled using different CSS files (Fig. 18, 19, 20):

Figure 18: csszengarden.com's splash page

Figure 19: the same HTML code calling a different stylesheet

Figure 20: A third example

There is more to a digital edition, however, than HTML pages with an accompanying CSS file. The most important part of an archival-quality digital edition is the TEI master file. When you go via the Internet to the Bloomfield edition, as it appears now, you are given access to the TEI file that generated the HTML which you are viewing (Fig. 21):

Figure 21: The Digital Edition's XML/TEI button

If you click on the XML/TEI button, which I have enlarged here, you will get to the TEI master. And, in the original web version of the edition, if you had viewed the HTML code (via your browser's View tab), you would have seen HTML encoding that looked like this (Fig. 22):

    <html>
    <!-- THIS FILE IS GENERATED FROM AN XML MASTER. DO NOT EDIT (5) -->

Figure 22: The HTML File for the Original Bloomfield letters

A comment in the HTML code says, do not edit this HTML page, because the edition itself is encoded in TEI and then transformed into HTML. In other words, if you want to edit the poem further, make changes to the TEI master file, not to this web page. An XSLT is used to transform the TEI document into an HTML document. The XSLT adds the link that the HTML document needs to use a stylesheet.

[Aside: Right now, Romantic Circles Editions no longer work this way because the site has been transferred to Drupal, a content management system (and the topic of another book in this Coding for Humanists series, Drupal for Humanists, by Quinn Dombrowski). You can put your XSLTs into a Drupal module, which then transforms the original TEI into a Drupal page (I wrote the XSLTs for Romantic Circles, and then a very talented programmer, Dave Rettenmaier, incorporated them into a Drupal module for TEI documents). Even if you later put your edition into a content management system, be it WordPress or Drupal, you will need to have XSLTs to transform it into the HTML required by that system. CSS files are also used by Drupal and WordPress, though creating and editing them in those systems is quite complicated.]

Our workflow for our archival-quality digital edition, then, looks like this (Fig. 23):

Figure 23: TEI to HTML workflow (diagram labels: TEI, XSLT, CSS, HTML code, HTML)

Using oXygen, an XSLT is run on the TEI pages of your digital edition in order to make HTML pages that link to your CSS file. I have set up a Digital Edition folder so that you can see the parts of your edition that I have just described (Fig. 24):

Figure 24: Digital Edition Folder
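To give a feel for what the transformation step does, here is a rough sketch written in Python's standard library. It is a stand-in for XSLT, not XSLT itself, and not the stylesheet used for any real edition. Everything in it is an assumption made for illustration: the simplified TEI-like input (no namespace, no header), the file name style.css, the function name tei_to_html, and the choice to turn <head> into <h2>.

```python
import xml.etree.ElementTree as ET

# Simplified TEI-like input: no namespace and no <teiHeader>, for brevity.
TEI_SAMPLE = """<TEI>
  <text>
    <body>
      <div type="poem">
        <head>To Immagination</head>
        <p>Placeholder line standing in for the poem.</p>
      </div>
    </body>
  </text>
</TEI>"""

def tei_to_html(tei_text, css_href="style.css"):
    """Build an HTML page from the TEI-like text and link it to a CSS file."""
    tei = ET.fromstring(tei_text)

    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    # The link that tells the browser which CSS file styles this page.
    ET.SubElement(head, "link", rel="stylesheet", type="text/css", href=css_href)
    body = ET.SubElement(html, "body")

    for div in tei.iter("div"):
        # Carry the TEI @type over as an HTML class so CSS can target it.
        out_div = ET.SubElement(body, "div", {"class": div.get("type", "")})
        for child in div:
            # <head> becomes <h2>; every other child becomes <p>.
            tag = "h2" if child.tag == "head" else "p"
            ET.SubElement(out_div, tag).text = child.text

    return ET.tostring(html, encoding="unicode", method="html")

page = tei_to_html(TEI_SAMPLE)
print(page)
```

The master stays untouched; the HTML page is generated from it, which is why changes belong in the TEI file and not in the output.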

Our workflow, then, consists in 1) creating TEI files for documents in the edition, 2) creating a CSS file to make HTML documents look the way we want them to, and 3) creating an XSLT to transform the TEI into HTML documents.

There are a few other parts to the digital edition. A search engine has to be created for your website, and doing so requires creating a database that contains plain text. And more can be done with a digital edition: it can be analyzed in various ways, including analyzing its networks. Thus, we will also use XSLT to transform those master TEI files into a few other things: a) plain text files for searching or text mining; b) a database (.csv) file; c) documents created by analyzing the edition in various ways, including .csv files for loading into Gephi to perform network analysis. Creating a database is the topic of another book in the Coding for Humanists series, Databases for Humanists, by Harvey Quamen; text mining is the topic of many good books coming out and available now.[1] You may find collaborators who can create a database and perform text mining, instead of learning to do those things for yourself. Either way, your edition will be ready for further development by the time you finish this book. It will give you all you need to prepare your digital edition for searching and data mining, and for use in network analysis tools such as Gephi.

[1] See for example Matthew Jockers, Text Mining with R for Students of Literature (Springer, 2014).
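The two simplest extra outputs named in this chapter, plain text and a .csv file, can be sketched in a few lines. As before, this uses Python's standard library as a stand-in for the XSLT taught in this book, and the sample document, the column names, and the function names plain_text and to_csv are all invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A made-up document using made-up tags; the content is placeholder text.
DOC = """<letter>
  <author>Robert Bloomfield</author>
  <title>To Immagination</title>
  <p>Placeholder text standing in for the poem.</p>
</letter>"""

def plain_text(xml_text):
    """Strip the tags and return the text with whitespace tidied up."""
    root = ET.fromstring(xml_text)
    # itertext() yields every piece of text, ignoring the tags around it.
    return " ".join("".join(root.itertext()).split())

def to_csv(xml_documents):
    """Write one row per document, holding its author and title."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["author", "title"])
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        writer.writerow([
            root.findtext(".//author", ""),
            root.findtext(".//title", ""),
        ])
    return buffer.getvalue()

print(plain_text(DOC))
print(to_csv([DOC]))
```

Both outputs come from the same master, so correcting the master and running the code again keeps every derived file in step.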