the most striking thing about this white-paper is not the content itself, but its _form_. specifically, that it is _conceptualized_ as existing on paper... it is deeply imbued with signals that convey this.

this comes as a bit of a shock, not just because this is a document that was "born digital", and not just because it's being distributed digitally (via cyberspace), but also because the _topic_ of the paper is _digitization_, and it was written by someone in charge of a digitization project!

yet this paper shows absolutely no _consciousness_ of itself as a digital entity. it's mired back in paper. if it was written by someone who deeply understands the issues involved in digitization, i firmly believe they would not have structured the paper the way they did.

and that's not the half of it! the paper was sponsored by an organization which has gotten a grant -- from the mellon foundation -- in the amount of $2.19 million, to provide feedback to the scanning projects on their utility to scholars...

how can this organization assess digitization efforts and provide suggestions for improving those efforts when it displays so blatantly with its very own work that it's frozen in the mentality of paper documents?

if these are the people who are in charge of things, is it any wonder everything is messed up so badly?

***

so, how does this paper-based mentality manifest?

first, the white-paper is being distributed as a .pdf. i recognize, and acknowledge, that that is common. yet still, i believe that everyone needs to recognize, and acknowledge, that this format is unacceptable.

as _one_output_format_, for reading, .pdf is tolerable. (but .html would be more in keeping with cyberspace.) as our _only_ format, .pdf is simply not acceptable. and as our _archival_ format, .pdf is totally wrong...

.pdf is a lousy format, for access _and_ preservation, and we need to start talking the talk on that matter, and -- even more importantly -- walking the walk...
to continue to use this lousy format, just because it is "adequate", and convenient, and ubiquitous, is stupid.

it is especially bad that this .pdf is formatted at 8.5*11, as that page-size smacks of business-office output. if it were 6*9, then someone could have _argued_, at least, that size is typical of books and journals, and thus was in keeping with its roots in academia...

or a flip-switch from portrait orientation to landscape would have made the .pdf readable on computer monitors, and thus lent some authority to it as a _digital_entity_. better yet, 5*8 pages would display 2-up on a monitor, with the familiar 2-up facing-pages look from books...

but no, it's 8.5*11, so it doesn't fit well on a monitor. and that is perhaps the main sign of its cluelessness... this document doesn't _think_ of itself as being digital, primarily because its _author_ didn't think of it that way.

***

the strength of .pdf is that it mirrors paper exactly. i am not denigrating that -- indeed, i believe that is an extremely important property -- but it's certainly not the _only_ important property in a digital world. in fact, it's not even the _most_ important property. unless you think of your document as a paper entity, which is the thinking that was being displayed herein.

moreover, this .pdf was prepared with microsoft word, which makes it worse. to understand _why_ that is so (it has very little to do with ms-word being proprietary, or an inferior app, or bloatware, as some may suppose, even though all of those are appropriate reservations), we need to review some of the fundamental problems with the .pdf format. pardon me for being pedagogical.

***

the _worst_ aspect of .pdf is that it's the "roach motel" of formats -- documents can go in, but can't come out.

yet one of the most important aspects of scholarly text is that it be _reusable_. might even be its sine qua non, seeing that it is a stand-in for _having_been_influential_.
so we must be able to copy-and-paste an entire article -- if need be -- and have the result be instantly usable. and .pdf fails that test, and fails it badly.

***

there are two easy ways to get the text out of a .pdf...

the first is to go into text-selection mode, do select-all, copy to the clipboard, and then paste the text elsewhere. for this white-paper .pdf, i've posted such output here:
> http://z-m-l.com/oyayr/oya-frompdf.txt

what is not clear from that plain-text web-page is that much of the text _styling_ (e.g., italics, bold) is retained when you retrieve the text in this way. that's excellent. the bad news is that not _all_ of the styling is retained.

worse, the _main_deficiency_ in this clipboard approach _is_ obvious. namely, the text has lost its paragraphing. it's vague where one paragraph ends and the next begins. all the paragraphs run together. this is not good; it's bad.

similarly, like the indent on the first line of each paragraph, indents on various indented segments of the text were lost. as the author used indentation often, this could be serious...

(it ended up that most of the indentation by this author was gratuitous, in the sense it had no semantic meaning, but that's not always the case. block-quotes, in particular, are often marked _only_ by indentation, and thus when it is lost by adobe on a copy operation, it's a serious matter.)

***

now, the second easy way to get the text out of a .pdf is to select the "save as text" item found under the "file" menu. you can see the results of that method for our .pdf here:
> http://z-m-l.com/oyayr/oya-saveastext.txt

it is encouraging to see that the paragraph breaks _are_ retained, in the form of a blank line between paragraphs.

the downside of this approach, though, is that we lost the text styling; it isn't that big of a deal in this particular .pdf, because there weren't too many places set in bold or italics.
still, that emphasis _is_ important (or the author wouldn't have gone to the trouble of doing it), so it's bad to lose it. and in many documents (books, etc.), the text styling can be very important, and much more common, so we can't lose it.

***

perhaps adobe could "mash" these two approaches together and create output that gave us both styling and paragraphing. that'd be great, and help in solving the "roach motel" problem.

but it wouldn't solve all the problems, and the remaining ones are more troublesome, in that they are _structural_ in nature...

in a nutshell, adobe copies the text as it was written to each page by the program which created the .pdf. (for this, it was ms-word.) what this means is that, in _many_ cases, the document's flow of text is severely twisted in and around the conventions of the page -- things like inserted tables, footnotes, page-numbers, and so on.

as an example, let's take some text from pages 4-5 of the .pdf:

> Access. According to answers to frequently asked questions (FAQs) issued by cultural
> institutions participating in LSDIs, the libraries' primary motivation for partnership is to
> support their core mission of advancing knowledge and to transform the ways in which
> users search and access library content.12 Several of the participating libraries also say
> that these initiatives support their vision to enhance access to information in support of
>
> collections and produce complete sets of documents. See Karen Coyle. 2006. "Mass Digitization of Books."
> Journal of Academic Librarianship 32(6) [November]: 641-645.
>
> 11 Richard K. Johnson provides a useful synopsis of implications of book-digitization projects and provides
> examples of core library interests in digitization partnerships in his article "In Google's Broad Wake: Taking
> Responsibility for Shaping the Global Digital Library." ARL: A Bimonthly Report 250: (February 2007).
> Available at http://www.arl.org/bm~doc/arlbr250digprinciples.pdf.
>
> 12 Examples of FAQs include
> Stanford: http://www-sul.stanford.edu/about_sulair/special_projects/google_sulair_project_faq.html
> Harvard: http://hul.harvard.edu/hgproject/faq.html
> University of Michigan: http://www.lib.umich.edu/staff/google/public/faq.pdf
> Cornell: http://wiki.library.cornell.edu/wiki/x/gng
>
> 4
>
>
> scholarship at local institutions and beyond. A related motivation for participation is to
> make the institutional collections visible worldwide.

here we can see how the flow of the document's actual body-text -- ..."access to information in support of scholarship at local institutions"... is disturbed by the intrusion of footnotes and the page-number (4).

the line starting with "collections and produce complete sets" is even a _continuation_ of footnote #10, which began on the previous page, so it's an especially grievous example. but the general objection on the difficulties caused by interjection of page-elements into the text is one that arises on almost every document that we want to digitize. (in the arena of books, it's the running-heads that cause the problem.)

***

it ends up that rearranging the flow of text copied out of a .pdf can be an excruciating task. and while it can be mechanized, to some degree, it will likely never become fully automatic, because the problem is based in the way a .pdf is created, i.e., from the output from other programs. so unless _ms-word_ changes the way that it prints out its documents, the problem will remain.

(some hope was raised when adobe gave users the option of creating "tagged" documents, but users largely ignored it, so the percentage of "tagged" .pdf in the wild has remained fairly tiny; the new .epub format that adobe is embracing also gives _some_ hint of solving the problem, but it remains to be seen if that solution works.)

nor are these interjections limited to page cruft.
there are many places in this white-paper where the _author_ has included other elements that disturb the text-flow, such as tables (both numbered and unnumbered). this is _especially_ true when the paper shifts into a 2-column format, so as to include one of those tables in a column. adobe doesn't "know" this extra text should be "separate", so it mixes it into the body-text...

these tables _could_ have been placed such that they did not disturb the flow of the text. but the emphasis was on design of the _page_, and not the text-flow when copied out of the .pdf. again, this shows a mentality that was focused on _print_, and not the _digital_ entity.

one sees this same mentality in the word-based "pointers" in the text:
> Table 1 on page 9
> Table 2 on the following page
> The sidebar on the following page
> The libraries that completed the survey are
> among those listed on page 9 of this report.

these references to page-numbers are fine and helpful _on_paper_, but lose much significance when the document moves to cyberspace.

my favorite of all of these, however, is the "ibid" in footnote #120. this is a little trick that printers invented to save themselves work -- as they didn't have to reset the type of the previous reference -- but is unnecessary in an environment where copy-and-paste is easy.

i would have been very heartened if this .pdf had contained _hotlinks_ in it -- at least in the table of contents if no place else -- but alas, it has none.

in the library of the future, of course, hotlinks should be a very important element. _every_document_ listed in the footnotes or a reference section, for instance, should be hotlinked... (and so should the word-based pointers to the tables and sidebars.)

i would suggest that we should radically restructure our references.
rather than formatting them roughly the same as _text-paragraphs_ -- that is, wrapping them -- i believe we should treat references as _database_items_, and format them in a manner appropriate to that.

***

i am not one who criticizes without making constructive suggestions. so i've reworked the text of this white-paper, and posted my version. (this will show how i'd restructure the references, among other stuff.)

but for the sake of completeness, i'll detail the remaining problems with the .pdf as it currently exists, before i go on to what it _should_ be...

***

first, when text is copied out of a .pdf where hyphenation was used, adobe dehyphenates. but it's not always smart about how it does it. so we have the following words that were incorrectly "dehyphenated":
> farreaching
> bestpossible
> costeffective
> datatransfer
> emonograph
> fulltext
> highcost
> imagequality
> up-todate
> winwin
> just-incase

not a big deal to go in and make these corrections on one document, not really. but when we talk about digitizing millions of documents...

and finally, a few copy-editing notes, to improve the final product, and show that i was paying close attention:

there was a typo in the text:
> investigatation.

and a date inconsistency in these references:
> RLG DigiNews 10(1) [February].
> RLG DigiNews 10(3) [June 15].
> RLG DigiNews 11(1) [April 15].

and a punctuation discrepancy between the table of contents and body:
> Why Join Forces
> Why Join Forces?

and some floating colons:
> 24 Internet Archive : http://www.archive.org/index.php.
> The Open Library : http://www.openlibrary.org/toc.htm.
> 80 ISO 14721:2003 OAIS :

and sentence-terminating punctuation should be dropped from url's:
> 24 Internet Archive : http://www.archive.org/index.php.
> The Open Library : http://www.openlibrary.org/toc.htm.
(these are just two examples; there were many more.)
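
the dehyphenation slips listed above (farreaching, costeffective, and so on) suggest one partial remedy: when rejoining a word that was split at a line-end, close it up only if the closed-up form is a known word, and otherwise keep the hyphen. here is a minimal python sketch of that idea, under stated assumptions: it works on text that _still_has_ its line-end hyphens (adobe's copy operation has already removed them, so this would have to run earlier in the pipeline), and the word list and function name are hypothetical stand-ins, not part of any real tool.

```python
import re

# hypothetical stand-in for a real dictionary or word list
KNOWN_WORDS = {"digitization", "preservation", "information"}

# a word fragment ending in a hyphen at end-of-line,
# followed by the fragment that starts the next line
LINE_END_HYPHEN = re.compile(r"(\w+)-\n(\w+)")


def rejoin(text, known_words=KNOWN_WORDS):
    """rejoin words split across lines; keep the hyphen unless the
    closed-up form is a known word. the linebreak inside the word
    is removed either way."""
    def fix(match):
        head, tail = match.group(1), match.group(2)
        closed = head + tail
        if closed.lower() in known_words:
            return closed           # a real word that was broken for wrapping
        return head + "-" + tail    # assume the hyphen belongs to the word
    return LINE_END_HYPHEN.sub(fix, text)


print(rejoin("far-\nreaching digiti-\nzation"))
```

a real word list would be far larger, and some cases stay ambiguous either way (re-creation versus recreation, say), which is one more reason this kind of cleanup resists full automation.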
adobe retains linebreaks when it copies text out of a .pdf, which is good, because _if_ we want to unwrap the paragraphs, we always _can_, but it's _informative_ to us to know what the linebreaks were in the original file.

however, when you retain the linebreaks in a file, it's important that you mark lines which _should_not_ be unwrapped in an automatic operation. the obvious example is that lines in a _table_ should not be unwrapped; those linebreaks are _significant_, and thus should be retained. however, the .pdf gives us no indication of such lines, which causes us a problem.

a smaller point is that url's are sometimes split by a linebreak when copied from a .pdf; the copy operation doesn't know to delete them:
> http://books.google.com/googlebooks/partner
> s.html

and down to the smallest point of all, you can see in the copied text that i posted on the websites, there can sometimes be problems with the upper-ascii characters that display curly quote-marks and bullets. in general, to the extent you can, you should use lower-ascii instead.

***

ok, now to the question of what we _really_ want in a digital document.

it's important to know this if you're looking at the question of how to digitize a paper-based document, because you need to know the goal.

it's also important because we have to do some training of our authors in order to get them to "think digitally" when they _create_ a document, whether it be a white-paper or a journal article or a book or whatever...

i'll simply assert that what we need is an easy-to-edit "master format" which is capable of _generating_ output in a variety of other formats...

(note that .pdf falls far short of those goals, in just about every respect. it's not easily-editable, and it's very bad at generating any other format.)

what we're generally looking for is a format that "explains itself clearly" (a) to a person who is looking at it, or (b) to a computerized analysis.
put another way, the "structure" of the document should be _obvious_.

in some ways, the author of this white-paper did a good job doing this. specifically, she used a numbered-outline format, which makes it clear what the framework of the document is, to both humans and machines. (one computerized algorithm might be to pull out numbered paragraphs, e.g., to form a table of contents, a simple routine that would work well.)

unfortunately, the author didn't stick to this numbered-outline format. in some places, she used numbered-paragraphs that aren't in the outline. and in other places, she used indentation that isn't reflected in the outline. and at times she used bulleted lists, and it's ambiguous how those related.

these kinds of ambiguities confound the machine analyses that would let us automatically move paper-based documents to the digital sphere, and for a person who is in charge of digitization efforts to seemingly be blind to those ambiguities -- by letting them creep into her document -- is something that should be seen as a very dangerous sign, to my mind. although they cannot be eliminated from documents printed ages ago, if we continue to propagate 'em in our new text, we're inviting trouble.

as another example, adobe often badly mangles the text from _tables_. there are times when that cannot be avoided, but it's also the case that a careful consideration about the way that you _format_ the tables can ease the burden considerably. unfortunately, however, the tables made by this author are formatted so that they badly exacerbate the problem. see specifically table 1, which has the additional difficulty that it spans pages 9 and 10, compounding the confusion even more, a _sad_ case...

oh, by the way, that's pages 9-10 according to the numbers on the page, not the .pdf page-numbers. according to those, since the adobe reader can't know, without coaching, to ignore the unnumbered "front-matter", table 1 is located on pages 14-15.
of course, the way to keep both sets of numbers "in sync" is to go ahead and start your numbering _with_ the front-matter. that this author didn't do this is somewhat embarrassing, in that it indicates she might possibly be totally unaware of this issue.

on the other hand, to be fair, many of these various issues might stem from the person who _prepared_ the .pdf, not from the author herself. at any rate, _someone_ was unaware of this common digitizing glitch...

back to the matter of making the structure of the document obvious, a final issue there revolves around _footnotes_ and their _pointers_. if they exist as simple _numbers_, without being marked in some way, it's difficult (in some cases) for them to be recognized by the machine.

***

so, if we were to re-work this paper in the type of format i suggest, what would it look like? i've done just that, and posted the file here:
> http://z-m-l.com/oyayr/oyayr.zml

(i will note here that it took way too many hours for me to do this.)

this file is in a format that i've invented called "zen markup language". (or z.m.l. for short.) most of the "markup" is of the "invisible" variety, in that it's embedded in white-space. (white-space -- carriage returns and spaces -- isn't really "invisible", just not indicated by ink-on-paper.)

z.m.l. is a "light markup" that is an ascii file, and thus _text-styling_ information is visible. _italicized_text_ is indicated by _underscores_, and *bold*text* is represented by being *surrounded*with*asterisks*.

as an ascii file, z.m.l. is easy to edit, so meets our first format criterion. ascii -- and its utf8 or utf16 cousins -- is also hardy for preservation...

plus z.m.l. is also facile at generating other formats, including .html for staging up on the web, and .pdf for people who like that format, whether it be for on-screen reading or for printing out to hard-copy. so it meets our second criterion for our "master file format" as well...
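
to make the styling convention concrete, here is a rough python sketch that turns the underscore and asterisk markers into .html tags. it is _not_ the actual z.m.l. converter; the function name and the matching rules (a marked run has no white-space inside it, and its inner underscores or asterisks stand for spaces) are assumptions drawn only from the examples given above.

```python
import html
import re

# a run that starts and ends with an underscore and has no white-space inside;
# the lookarounds keep underscores in the middle of words and url's from matching
ITALIC = re.compile(r"(?<!\w)_([^\s_](?:\S*[^\s_])?)_(?!\w)")

# the same idea, with asterisks
BOLD = re.compile(r"(?<![\w*])\*([^\s*](?:\S*[^\s*])?)\*(?![\w*])")


def style_to_html(line):
    """convert _italic_runs_ and *bold*runs* in one line of text to html.
    (an assumed reading of the convention, not the real z.m.l. rules.)"""
    line = html.escape(line, quote=False)   # protect <, > and & first
    line = ITALIC.sub(
        lambda m: "<i>" + m.group(1).replace("_", " ") + "</i>", line)
    line = BOLD.sub(
        lambda m: "<b>" + m.group(1).replace("*", " ") + "</b>", line)
    return line


print(style_to_html("_italicized_text_ and *bold*text*"))
```

since the markers are ordinary ascii characters, a handful of rules like these is enough for a sketch; a full converter would also need to handle paragraphs, headings, footnotes, and the lines that must not be unwrapped.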
here is the .html version of the paper:
> http://z-m-l.com/oyayr/oyayr.html

i'll also make a .pdf version soon, and it'll be here:
> http://z-m-l.com/oyayr/oyayr.pdf

touching on points i mentioned along the way above, you'll notice footnotes and their pointers are surrounded by brackets in z.m.l.

i've also reworked the reference section in the way i recommended. this is not part of z.m.l. per se, but i believe references should be formatted more like a _database_ than to look like text paragraphs. in this light, i've placed hard returns to format the references such. as just one example, you'll find no url's that are split across lines...

i believe that authors shouldn't have to spend time on references. none. we should have programs that can automatically generate the reference section, based on the pointers entered in the text, so that the author can simply point-and-click to settle ambiguity.

for instance, if an author includes this in the body of the text:
> [RLG_DigiNews_10(1)]
or, probably much better,
> [Allen, DigiNews10(1), 2006]
or even just
> [Allen,2006]
the authoring-tool should automatically generate this reference:
> [117] Christy E. Allen. 2006.
> "Foundations for a Successful Digital Preservation Program:
> Discussions from Digital Preservation in State Government:
> Best Practices Exchange 2006."
> _RLG_DigiNews_ 10(3) [June 15].
> http://www.rlg.org/en/page.php?Page_ID=20952

i've reworked the tables, sometimes considerably, so that they will still continue to function when they're represented in a flow of text. i also moved them around a bit, so they wouldn't disrupt that flow.

i've removed the indentation, because it seemed to be gratuitous. if it wasn't, then i should be informed, and i will render it correctly. what's needed, of course, is a reworking of the numbered-outline, so it accurately depicts the underlying structure of the document...
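
the earlier remark about pulling out numbered paragraphs to form a table of contents can be sketched in a few lines of python. this is only an illustration under assumptions: the heading pattern (a dotted number, white-space, then a capitalized title) and the sample lines are hypothetical, since the white-paper's exact numbering style is not reproduced here.

```python
import re

# assumption: outline headings look like "2.1 Access" or "3. Costs",
# i.e. a dotted number, white-space, then a title starting with a capital
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def table_of_contents(lines):
    """return (depth, number, title) for every line that looks like
    a numbered heading."""
    toc = []
    for line in lines:
        match = HEADING.match(line.rstrip())
        if match:
            number, title = match.group(1), match.group(2)
            depth = number.count(".") + 1   # "2.1" is one level below "2"
            toc.append((depth, number, title))
    return toc


# hypothetical sample; the last line is a bare-numbered footnote
sample = [
    "1. Introduction",
    "some body text.",
    "2. Why Join Forces?",
    "2.1 Access",
    "12 Examples of FAQs include",
]
for depth, number, title in table_of_contents(sample):
    print("  " * (depth - 1) + number + " " + title)
```

note that the bare-numbered footnote in the last sample line gets picked up as if it were a heading. that is exactly the kind of ambiguity described above, and it goes away once footnotes are marked (with brackets, for instance) and the numbered outline is used consistently.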
i've also marked the lines that should not be unwrapped by putting a space in column 1, which is the way z.m.l. makes that distinction.

and -- of course -- i moved all the footnotes into their own section.

this is _not_ to say that the footnotes must be converted to endnotes. the .zml viewer-program would indeed display these as _footnotes_, i.e., at the specific place in the document where the pointer is located.

and, in the .html, there is a link from the pointer to the footnote itself, and then a link from each footnote _back_ to the location of its pointer.

and, depending on the specification of the user who generates the .pdf, the .pdf could have the footnotes placed as such (i.e., at page-bottom), or as links similar to the .html version. the choice is up to the end-user.

but you place all the footnotes in their own section so that they do not disrupt the flow of the main text.

***

there's probably more, but i am out of time and patience in doing this. if anyone has any questions, i will be more than happy to answer them.

the most important thing is that the flow of the text has been preserved. that means that someone could come and grab a piece of it, or all of it, and repurpose it as they wished, without having to do any text-editing...

and that's the way it should be...

-bowerbird