From ralf at ark.in-berlin.de Mon Oct 1 01:19:23 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 1 Oct 2007 10:19:23 +0200 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional Message-ID: <20071001081923.GA29575@ark.in-berlin.de> I agree with bowerbird here: > jeroen said: > > Some attempts are in progress to encode texts as TEI, > > and automatically create text, html, and pdf from them. > > those attempts have been "in progress" for seven years now. > i invite people to view the .html and .pdf files that are created. The last release of the gnutenberg-press software was 2005. There is no reply by the last maintainer on my eMails. There is no place to report bugs which are plenty. The package is effectively unmaintained. I repeat here what I sent to Marcello: I would maintain the package but I don't have a pglaf account, and I know zilch about XSLT or stylesheets. I would play the maintainer part (sf or berlios, your choice) and I could try my luck with the LaTeX/PDF backend, though, if someone else does the rest of the bugfixing. This offer is up until the end of the month (2007-Oct). I will not make this offer a third time. Sincerely, ralf From marcello at perathoner.de Mon Oct 1 07:01:51 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 01 Oct 2007 16:01:51 +0200 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <20071001081923.GA29575@ark.in-berlin.de> References: <20071001081923.GA29575@ark.in-berlin.de> Message-ID: <4700FDCF.1060009@perathoner.de> Ralf Stephan wrote: > I agree with bowerbird here: >> jeroen said: >>> Some attempts are in progress to encode texts as TEI, >>> and automatically create text, html, and pdf from them. >> those attempts have been "in progress" for seven years now. >> i invite people to view the .html and .pdf files that are created. 
> > The last release of the gnutenberg-press software was 2005. > There is no reply by the last maintainer on my eMails. > There is no place to report bugs which are plenty. > The package is effectively unmaintained. > > I repeat here what I sent to Marcello: > > I would maintain the package but I don't have a pglaf account, > and I know zilch about XSLT or stylesheets. I would play the > maintainer part (sf or berlios, your choice) and I could try my luck > with the LaTeX/PDF backend, though, if someone else does the rest of the > bugfixing. Maintenance of the "gnutenberg press" is on ice for the present. 1. I don't have much time. 2. The "gnutenberg press" is GPL. So if I don't maintain, you can. Other people have already created derivative works. 3. My personal impression is that the adoption of TEI is not hampered by the few small bugs in the conversion chain, but by the complete lack of TEI authoring tools in DP. Those tools can be developed independently from the conversion chain (aren't standards nice?). I don't know much about DP internals, nor do I want to. Too little free time. Maybe if DP "adopts" TEI and oodles of TEI books start pouring in, my motivation will rise :-) 4. TEI version 5 is coming Real Soon Now. TEI 5 will have many backward incompatibilities and new features added. No good to "maintain" now and having to "re-maintain" later. 5. I'm fed up with buggy Perl XSL modules. I'm going to rewrite the next version in Python. Also 6. the next version will use a standard presentational format (with semantic hinting) as intermediate format, lets say: XSL-FO. So people who want to write backends for their favourite pet formats (eg. OpenDocument, epub, that-other-noring-format-which-was-all-the-rage-some-months-ago-but-is-forgotten-now etc.) can simply convert one open presentational standard into whatever without having to worry about the "gnutenberg press"'s internal cogs. 
The hard conversion between semantic TEI and presentational FO will be the job of the "gnutenberg press". This all makes all output look more consistent. -- Marcello Perathoner webmaster at gutenberg.org From rolsch at verizon.net Mon Oct 1 08:31:06 2007 From: rolsch at verizon.net (Roland Schlenker) Date: Mon, 01 Oct 2007 11:31:06 -0400 Subject: [gutvol-d] =?iso-8859-1?q?gnutenberg-press_maintenance_offer_=28w?= =?iso-8859-1?q?as_Re=3A_Proposal_to_add=09OpenDocument_as_an_additional?= In-Reply-To: <4700FDCF.1060009@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> Message-ID: <200710011131.06953.rolsch@verizon.net> On Monday 01 October 2007 10:01 am, Marcello Perathoner wrote: > Ralf Stephan wrote: > > I agree with bowerbird here: > >> jeroen said: > >>> Some attempts are in progress to encode texts as TEI, > >>> and automatically create text, html, and pdf from them. > >> > >> those attempts have been "in progress" for seven years now. > >> i invite people to view the .html and .pdf files that are created. > > > > The last release of the gnutenberg-press software was 2005. > > There is no reply by the last maintainer on my eMails. > > There is no place to report bugs which are plenty. > > The package is effectively unmaintained. > > > > I repeat here what I sent to Marcello: > > > > I would maintain the package but I don't have a pglaf account, > > and I know zilch about XSLT or stylesheets. I would play the > > maintainer part (sf or berlios, your choice) and I could try my luck > > with the LaTeX/PDF backend, though, if someone else does the rest of the > > bugfixing. > > Maintenance of the "gnutenberg press" is on ice for the present. > > 1. I don't have much time. > > 2. The "gnutenberg press" is GPL. So if I don't maintain, you can. Other > people have already created derivative works. If a maintained version of "gnutenberg press" is created. Will it become the official one used at PG. > 3. 
My personal impression is that the adoption of TEI is not hampered by
> the few small bugs in the conversion chain, but by the complete lack of
> TEI authoring tools in DP. Those tools can be developed independently
> from the conversion chain (aren't standards nice?). I don't know much
> about DP internals, nor do I want to. Too little free time. Maybe if DP
> "adopts" TEI and oodles of TEI books start pouring in, my motivation
> will rise :-)

IMHO, I have noted a rise of interest among a number of Post-Processors in using TEI to post-process their projects. However, their first impression of the conversion chain has been their biggest disappointment. I agree that a TEI version of Guiguts would go a long way toward helping the adoption of TEI.

> 4. TEI version 5 is coming Real Soon Now. TEI 5 will have many backward
> incompatibilities and new features added. No good to "maintain" now and
> having to "re-maintain" later.

It may be a good idea, after TEI 5 comes out, to establish a working group to discuss, design and write a new PG conversion tool and a DP-to-TEI tool.

> 5. I'm fed up with buggy Perl XSL modules. I'm going to rewrite the next
> version in Python. Also

By all means, please write it in Python.

> 6. the next version will use a standard presentational format (with
> semantic hinting) as intermediate format, lets say: XSL-FO. So people
> who want to write backends for their favourite pet formats (eg.
> OpenDocument, epub,
> that-other-noring-format-which-was-all-the-rage-some-months-ago-but-is-forgotten-now
> etc.) can simply convert one open presentational standard into
> whatever without having to worry about the "gnutenberg press"'s internal
> cogs. The hard conversion between semantic TEI and presentational FO will
> be the job of the "gnutenberg press". This all makes all output look more
> consistent.
From ralf at ark.in-berlin.de Mon Oct 1 09:59:28 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 1 Oct 2007 18:59:28 +0200 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add?OpenDocument as an additional In-Reply-To: <200710011131.06953.rolsch@verizon.net> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <200710011131.06953.rolsch@verizon.net> Message-ID: <20071001165928.GA18806@ark.in-berlin.de> Thanks Marcello for a clear statement. Roland: > If a maintained version of "gnutenberg press" is created. Will it become the > official one used at PG. As Marcello intends to rewrite the package, even collecting bug reports for 0.4 is useless. I draw back the maintenance offer for 0.4 but will offer for the next version to launch pages on sourceforge or berlios.de for collecting bugs (at least) and to have a specific forum. However, a help text for the DP Wiki for 0.4 users is still a good idea, and after finishing a TEI drama project, I think I'll have some stuff to write. Regards, ralf From jeroen.mailinglist at bohol.ph Mon Oct 1 11:34:23 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Mon, 01 Oct 2007 20:34:23 +0200 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <4700FDCF.1060009@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> Message-ID: <47013DAF.5090400@bohol.ph> Marcello Perathoner wrote: > Maintenance of the "gnutenberg press" is on ice for the present. > 1. I don't have much time. > > A common problem. Sometimes I feel I spend too much time on preparing actual texts in TEI, and too little on tool development, but then, I want my texts to be TEI. > 2. The "gnutenberg press" is GPL. So if I don't maintain, you can. Other > people have already created derivative works. > > 3. 
My personal impression is that the adoption of TEI is not hampered by
> the few small bugs in the conversion chain, but by the complete lack of
> TEI authoring tools in DP. Those tools can be developed independently
> from the conversion chain (aren't standards nice?). I don't know much
> about DP internals, nor do I want to. Too little free time. Maybe if DP
> "adopts" TEI and oodles of TEI books start pouring in, my motivation
> will rise :-)

I think the biggest barrier here is the steep learning curve of TEI (20% of the tags cover 80% of the things you encounter, but every other book you will need something from those remaining 80%, and, oh gosh, which tag can I use then), combined with the fact that it is a far stretch from TEI to WYSIWYG. Maybe somebody can help build an authoring tool, but, in my opinion, it should not even try to be WYSIWYG, as with properly tagged text you get much more than you can see in one view.... WYGIMMTWYS

> 4. TEI version 5 is coming Real Soon Now. TEI 5 will have many backward
> incompatibilities and new features added. No good to "maintain" now and
> having to "re-maintain" later.

There will be migration paths from P3 to version 5, supported by XSLT, etc. However, version 5 will mainly add much-awaited features.

> 5. I'm fed up with buggy Perl XSL modules. I'm going to rewrite the next
> version in Python. Also

Yep, they are a pain, as are those in PHP.

> 6. the next version will use a standard presentational format (with
> semantic hinting) as intermediate format, lets say: XSL-FO. So people
> who want to write backends for their favourite pet formats
> can simply convert one open presentational standard into whatever
> without having to worry about the "gnutenberg press"'s internal cogs.
> The hard conversion between semantic TEI and presentational FO will be
> the job of the "gnutenberg press". This all makes all output look more
> consistent.
> > > I found XSL-FO some kind of overkill for most projects, and have had good results using Prince (unfortunately a non-free tool, supporting many CSS3 features) with the generated HTML (or actually, somewhat modified generated HTML), to get to a printable PDF. Jeroen. From Bowerbird at aol.com Mon Oct 1 12:48:30 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 1 Oct 2007 15:48:30 EDT Subject: [gutvol-d] more good news Message-ID: there's more good news on the light-markup revolution. john macfarlane, out of uberkeley, reports: > I've put together two small web apps to demonstrate pandoc "pandoc" does conversions between various other formats. john says: > http://johnmacfarlane.net/pandoc/html2x.html > can convert most web pages to markdown, reStructuredText, > DocBook, LaTeX, ConTeXt, RTF, or groff man. he continues: > http://johnmacfarlane.net/pandoc/try > allows you to experiment with pandoc > without going to the trouble of > installing it on your system. these are great steps forward for people exploring light-markup. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071001/02211ccd/attachment.htm From julio.reis at tintazul.com.pt Mon Oct 1 13:43:00 2007 From: julio.reis at tintazul.com.pt (=?ISO-8859-1?Q?J=FAlio_Reis?=) Date: Mon, 01 Oct 2007 21:43:00 +0100 Subject: [gutvol-d] PG-E shows text, encoding jumbled In-Reply-To: References: Message-ID: <47015BD4.20809@tintazul.com.pt> I've checked today, and PG-Europe seems to be... huh... different than it was. Still not OK, for sure. It's good to see that apparently someone fiddled with it -- so there's hope that DP-E might get fixed. I used to search for texts in Portuguese and get the full list. The text pages themselves were then unreachable (following a link to any text returned an empty document.) 
Now when I search for Portuguese, I get... one record at a time. Go ahead, try it.

http://pge.rastko.net/catalog/world/search

The upshot is -- now I can /actually/ follow the link to the e-text page, and I can click to download the document. Dandy! But the encoding is wrong. For instance, in http://pge.rastko.net/etext/7384 there are utf-7 and iso-8859-1 versions of the /Carta da Companhia/ by José de Anchieta. The guy's name inside the file shows as "Jos+AOk- de Anchieta". It's the same in either encoding.

The number of documents listed in the /Language/ drop-down list box doesn't match the records found in the database. For instance, it reads "Portuguese (77)" and when I search I get the message "85 headings found" (and then I am given a single record at a time, of course. :-) That is a minor issue, though.

When I search for texts in French, after a few minutes I get: "More than 1000 records found. Please refine your query." That's minor too, because I was only checking what would happen if I searched for other languages. Now if I were /actually/ meaning to find texts in French, it would be not a minor but a major issue. The huge time lag smells like a badly programmed query, or an inefficient database.

Thanks to the people who give their time to fix stuff like this... DP-E is really important; there are huge chunks of recent European literary history which can't yet be released in the USA. So kudos to all the Euro-guys who are taking care of the Euro-stuff...

Tintazul.

From desrod at gnu-designs.com Mon Oct 1 16:29:47 2007
From: desrod at gnu-designs.com (David A.
Desrosiers) Date: Mon, 01 Oct 2007 19:29:47 -0400 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <200710011131.06953.rolsch@verizon.net> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <200710011131.06953.rolsch@verizon.net> Message-ID: <1191281387.9646.2.camel@localhost.localdomain> On Mon, 2007-10-01 at 11:31 -0400, Roland Schlenker wrote: > > 5. I'm fed up with buggy Perl XSL modules. I'm going to rewrite the > > next version in Python. > By all means please write it in Python. Great idea! Let's trade the "buggy" XSL modules in Perl with memory leaks in Python instead. Can you point me to the bug reports you've filed against these XSL modules, so I can test/follow-up on them myself? Thanks. -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com http://projects.plkr.org/ Skype...: 860-967-3820 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071001/af4f38ee/attachment.pgp From desrod at gnu-designs.com Mon Oct 1 16:51:15 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Mon, 01 Oct 2007 19:51:15 -0400 Subject: [gutvol-d] more good news In-Reply-To: References: Message-ID: <1191282675.9646.9.camel@localhost.localdomain> On Mon, 2007-10-01 at 15:48 -0400, Bowerbird at aol.com wrote: > these are great steps forward for people exploring light-markup. This reminds me of those tools that purport to convert PG texts to HTML, by slapping at the top and at the bottom, and calling it done. I tried a bunch of documents through the conversion, including DocBook and Markdown, and what it produced... was... how shall I say.. short of the mark I expected. This is a great start though.. 
and to draw a parallel, RSS feeds are causing people to think about a whole new way of reproducing, writing and sharing news/blogs/etc. with other people.

There will probably always be two camps, light and heavy markup. I'm in the heavy-markup camp, simply because I haven't yet seen the proof that all of the necessary semantics a complex document requires can be reproduced purely with light markup.

Add to this that light markup output can be produced from heavy markup, but not the reverse. If given a choice, I'd rather have the book than the CliffNotes. I can always produce my own CliffNotes from the book, but I can't create the book from the CliffNotes.

--
David A. Desrosiers
desrod at gnu-designs.com
setuid at gmail.com
http://projects.plkr.org/
Skype...: 860-967-3820

From Bowerbird at aol.com Mon Oct 1 17:19:11 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 1 Oct 2007 20:19:11 EDT
Subject: [gutvol-d] more good news
Message-ID:

david said:
> There will probably always be two camps, light and heavy markup.
> I'm in the heavy-markup camp, simply because I haven't yet seen
> the proof that all of the necessary semantics a complex document
> requires can be reproduced purely with light markup.

show me that complex document. preferably one already in p.g. or, if you're willing to o.c.r. and correct it, _any_ google scan-set.

> Add to this that light markup output
> can be produced from heavy markup,
> but not the reverse.

well, first of all, your first part hasn't been adequately demonstrated. and second of all, your second part has not been fully _disproven_...

but set all of that aside.

if the markup could be applied by _magic_, i would choose the heavy markup too!
(and then convert it to light.) the point is that it's more costly to apply and maintain heavy markup. _much_ more costly. so much more costly we actually can't afford it. and that added cost is for benefits that haven't yet proven themselves. if you're living in fairy-tale land, then sure, you take the heavy-markup. but for those of us living in the real world, light-markup is a _bargain_ that costs 15% as much and delivers 85% of the benefits. easy decision. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071001/7a92cc4b/attachment.htm From jon at noring.name Tue Oct 2 07:23:36 2007 From: jon at noring.name (Jon Noring) Date: Tue, 2 Oct 2007 08:23:36 -0600 Subject: [gutvol-d] Update on the use of captchas to "proof" digital texts Message-ID: <56977291.20071002082336@noring.name> http://news.bbc.co.uk/1/hi/technology/7023627.stm From desrod at gnu-designs.com Tue Oct 2 07:40:01 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Tue, 02 Oct 2007 10:40:01 -0400 Subject: [gutvol-d] Update on the use of captchas to "proof" digital texts In-Reply-To: <56977291.20071002082336@noring.name> References: <56977291.20071002082336@noring.name> Message-ID: <1191336001.9646.20.camel@localhost.localdomain> On Tue, 2007-10-02 at 08:23 -0600, Jon Noring wrote: > http://news.bbc.co.uk/1/hi/technology/7023627.stm I've been using reCaptcha on my Wordpress blog for quite some time now, and I highly recommend it to anyone else who has a well-trafficked blog. http://recaptcha.net/plugins/wordpress/ -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com http://projects.plkr.org/ Skype...: 860-967-3820 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071002/9c6d15b3/attachment.pgp

From Bowerbird at aol.com Tue Oct 2 14:34:53 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Oct 2007 17:34:53 EDT
Subject: [gutvol-d] if the entire population of the earth was a village of 100 people
Message-ID:

michael-

this will give you some updates you've been looking for:
> http://www.miniature-earth.com/

-bowerbird

From hart at pglaf.org Wed Oct 3 10:46:16 2007
From: hart at pglaf.org (Michael Hart)
Date: Wed, 3 Oct 2007 10:46:16 -0700 (PDT)
Subject: [gutvol-d] if the entire population of the earth was a village of 100 people
In-Reply-To:
References:
Message-ID:

Does anyone have these in plain text numbers?

mh

On Tue, 2 Oct 2007, Bowerbird at aol.com wrote:
> michael-
>
> this will give you some updates you've been looking for:
>> http://www.miniature-earth.com/
>
> -bowerbird

From Bowerbird at aol.com Wed Oct 3 11:00:51 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Oct 2007 14:00:51 EDT
Subject: [gutvol-d] sony stupidity
Message-ID:

sony says ripping your own cd to mp3 is "stealing":
> http://arstechnica.com/news.ars/post/20071002-sony-bmgs-chief-anti-piracy-lawyer-copying-music-you-own-is-stealing.html

in court, under oath, jennifer pariser said that. and she's the head of litigation for sony b.m.g.

-bowerbird

-------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071003/d4eda06f/attachment.htm From Bowerbird at aol.com Thu Oct 4 10:05:57 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Oct 2007 13:05:57 EDT Subject: [gutvol-d] give 1, get fuzzy Message-ID: you've probably heard that the o.l.p.c. program will be offering a give-1-get-1 deal next month, where $399 will purchase one o.l.p.c. machine for a needy child and one for yourself... great! the offer starts november 12th. it's scheduled to run for two weeks, or until they sell a certain number of units, and some people think that they will reach that number of units _quickly_ -- some predictions say it'll be the first day -- so if you really want one, plan to buy _early_... but if you can afford to give a $199 machine to a needy child without getting one yourself, you can do it right _now_, without any waiting: > http://www.xogiving.org you might not get a machine, but i guarantee you will get a gratifying warm fuzzy feeling... i haven't seen a more inspirational tech project than this one in recent memory, maybe never... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071004/b2e092be/attachment.htm From Bowerbird at aol.com Thu Oct 4 12:58:06 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 4 Oct 2007 15:58:06 EDT Subject: [gutvol-d] how much does it cost to print a book? Message-ID: michael- it looks like you could use some information on how much it typically costs to print a book: > http://z-m-l.com/misc/printerpricing.html -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071004/2b9da5c7/attachment.htm

From lee at novomail.net Thu Oct 4 15:12:18 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 04 Oct 2007 16:12:18 -0600
Subject: [gutvol-d] more good news
In-Reply-To:
References:
Message-ID: <47056542.4000002@novomail.net>

Bowerbird at aol.com wrote:
> the point is that it's more costly to apply and maintain heavy markup.
> _much_ more costly.

You know, I have never bought this argument. Personally, I find it much easier and cheaper to maintain XHTML than some kind of structured text which relies on subtle markup that can easily be mistaken for content.

The use of subtle distinctions in text as markup has been one of my greatest sources of annoyance (and lost productivity) in trying to "proofread" for Distributed Proofreaders. I would much rather have an "in-your-face"

tag for a chapter heading than be forced to move the cursor into a text field and count the number of blank lines before and after a line. And is it one space or two, or maybe a tab character, that signals that word-wrapping should be turned off? "In-your-face" markup may be distracting if you're trying to read around it, but if you're looking for, and trying to manipulate, the markup, having it blatant is a huge time saver and a great check against errors.

Additionally, there are all sorts of great tools to maintain XML files, and virtually nothing to parse or check Plain Text Markup Languages.

I would be interested in knowing what other people's experiences have been in the maintenance of full-featured, obvious markup languages vs. "lite", subtle markup languages.

--
Nothing of significance below this line.

From lee at novomail.net Thu Oct 4 15:20:18 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 04 Oct 2007 16:20:18 -0600
Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional
In-Reply-To: <47013DAF.5090400@bohol.ph>
References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <47013DAF.5090400@bohol.ph>
Message-ID: <47056722.9000801@novomail.net>

Jeroen Hellingman (Mailing List Account) wrote:
> I think the biggest barrier here is the steep learning curve of TEI (20%
> of the tags cover 80% of the things you encounter, but every other book
> you will need something from those remaining 80%, and, oh gosh, which
> tag can I use then) ....

I am intrigued by this comment (and not only because it mirrors my own experience). So by way of information gathering among those who use TEI on a regular basis, I would like you to tell me, perhaps simply as an ordered list, what TEI tags you believe are most used and most valuable (not necessarily the same thing).
In other words, what are the 20% of the tags that cover 80% of the need, and from the remaining 80% what seems to come up the most often? I'm thinking of writing a little script that will try to automate the collection of usage data from current Gutenberg TEI texts. -- Nothing of significance below this line. From jon at noring.name Thu Oct 4 18:14:13 2007 From: jon at noring.name (Jon Noring) Date: Thu, 4 Oct 2007 19:14:13 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <47056722.9000801@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> Message-ID: <392096647.20071004191413@noring.name> Lee wrote: > Jeroen Hellingman wrote: >> I think the biggest barrier here is the steep learning curve of TEI >> (20% of the tags cover 80% of the things you encounter, but every >> other book you will need something from those remaining 80%, and, >> oh gosh, which tag can I use then) .... > I am intrigued by this comment (and not only because it mirrors my > own experience). So by way of information gathering among those who > use TEI on a regular basis, I would you to tell me, perhaps simply > as an ordered list, what TEI tags you believe are most used and most > valuable (not necessarily the same thing). In other words, what are > the 20% of the tags that cover 80% of the need, and from the > remaining 80% what seems to come up the most often? > > I'm thinking of writing a little script that will try to automate > the collection of usage data from current Gutenberg TEI texts. Regarding the last paragraph Lee wrote, I think that's a splendid idea, to see what elements, attributes and attribute values have been used, doing a statistical analysis of their usage. 
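
As a rough illustration of the kind of usage-collection script discussed here, the following is only a sketch under stated assumptions, not anyone's actual tool. It assumes the TEI files are well-formed XML that Python's standard `xml.etree.ElementTree` can parse (files relying on DTD-defined entities would need a different parser), and the names `count_usage`, `local_name` and `SAMPLE` are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix so namespaced and plain names compare equal."""
    return tag.rsplit("}", 1)[-1]


def count_usage(documents):
    """Tally element names and (element, attribute, value) triples.

    `documents` is an iterable of XML strings (an assumption of this sketch;
    reading them from disk is left to the caller).
    """
    elements = Counter()
    attributes = Counter()
    for text in documents:
        for node in ET.fromstring(text).iter():
            name = local_name(node.tag)
            elements[name] += 1
            for attr, value in node.attrib.items():
                attributes[(name, local_name(attr), value)] += 1
    return elements, attributes


# A tiny made-up document, only to show the shape of the result.
SAMPLE = """<TEI.2><text><body>
<div type="chapter"><head>Chapter I</head>
<p>First <hi rend="italic">words</hi>.</p><p>More.</p></div>
</body></text></TEI.2>"""

elements, attributes = count_usage([SAMPLE])
for name, n in elements.most_common():
    print(name, n)
```

Sorting the tallies with `most_common()` gives the "ordered list" asked for above; the attribute triples show which attribute values would be candidates for standardization.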
I hope the analysis will also look at content models, maybe building a minimal DTD to which all the documents will validate. (I believe there are tools which will build a common DTD from a set of XML documents -- hmmm, maybe that's the approach to take first; then one can do a statistical analysis by comparing to this minimal DTD. Also, inspection of that DTD will provide insights.)

*****

Regarding the rest of what Lee wrote, a few days ago I outlined in a private email to Lee some preliminary ideas, which I'll restate here for discussion purposes:

The gist of the idea is that a group of us create a very strict subset of TEI: elements, attributes and *standardized* attribute values, and constrained element content models, along with any other markup usage rules that cannot be enforced by a DTD or schema. This subset and associated ruleset would be sufficient to mark up, consistently, uniformly, and in standardized fashion (especially attribute values), 80% or 90% of the public domain books which PG/DP works with.

A related goal (if possible to achieve) would be that when different people independently mark up the same book, and follow the rules, the marked-up documents would be, for all practical purposes, canonically identical with one another.

Furthermore, we would actually build our own DTD or schema so that those authoring to this strictly applied subset could immediately validate to it. Also, we could write a script to check any other requirements that cannot be enforced by DTD or schema -- a sort of "conformance checker."

We could also write a brief document describing how to mark up documents using this subset vocabulary, minimizing the need for people learning our simple subset to slog through the TEI manual to figure out how to do something.
And for most (but clearly not all) of the remaining texts, we could slowly build a "superset" of the basic DTD, so that at least the more complicated books follow the strict subset in uniformity of basic markup. This superset could be slowly built over time.

The benefit of this approach is that we can now involve more people in marking up books, have the validation tools, and provide a much more uniform basis on which authoring tools and conversion tools can be built more quickly.

The problem with the full-blown TEI, and even TEI-Lite (which is not "Lite"), is that it is so massive, and the manual so difficult to comprehend without spending a year studying it, that building dedicated authoring tools and conversion tools to handle all possibilities is much more difficult. And I'm not advocating against using the full-blown TEI for the extremely funky texts, but let's at least standardize on something simple and uniform for the vast majority of the books.

Anyway, that's the core of the idea. It is not actually new, as Josh has mentioned it in some fashion, but maybe what is proposed here has a few twists on what previously has been proposed.

And I'm willing to help build the DTD (I prefer DTDs since they should be sufficient for this purpose and there are other advantages to DTD over schema which I won't get into here.) I am quite experienced in building DTDs, having built by hand the OEBPS 1.2 Document and Package DTDs, the OpenReader Binder DTD, and the BookX DTD (which has a similar philosophy to that described above but focused on new book publishers, clearly not for use by PG/DP for reasons I won't delve into.)

Jon Noring

(p.s., since TEI P5 is soon to be released -- it is currently at version 0.9 and may be elevated to 1.0 at this year's TEI Annual Meeting in November -- our subset should definitely be built on P5.)
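
To make the "conformance checker" idea above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `SUBSET` table is a made-up vocabulary, not a proposed standard, and `check_conformance` is a hypothetical name. It only checks which elements, attributes and attribute values occur; content models would still be the job of the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset: element -> {attribute: allowed values (None = any value)}.
SUBSET = {
    "text": {},
    "body": {},
    "div": {"type": {"chapter", "preface"}},
    "head": {},
    "p": {},
    "hi": {"rend": {"italic", "bold"}},
}


def check_conformance(xml_text, subset=SUBSET):
    """Return a list of human-readable problems; an empty list means conformant."""
    problems = []
    for node in ET.fromstring(xml_text).iter():
        name = node.tag.rsplit("}", 1)[-1]  # ignore any namespace prefix
        if name not in subset:
            problems.append(f"element <{name}> is not in the subset")
            continue
        allowed = subset[name]
        for attr, value in node.attrib.items():
            if attr not in allowed:
                problems.append(f"attribute {attr!r} not allowed on <{name}>")
            elif allowed[attr] is not None and value not in allowed[attr]:
                problems.append(
                    f"value {value!r} not allowed for {attr!r} on <{name}>"
                )
    return problems


# Made-up input with one non-standard attribute value and one unknown element.
BAD = '<text><body><div type="canto"><p>x<foo/></p></div></body></text>'
for problem in check_conformance(BAD):
    print(problem)
```

Keeping the vocabulary in a plain table means the same data could, in principle, drive both the checker and the brief how-to document, so the two would be less likely to drift apart.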
From rolsch at verizon.net Thu Oct 4 18:33:42 2007
From: rolsch at verizon.net (Roland Schlenker)
Date: Thu, 04 Oct 2007 21:33:42 -0400
Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional
In-Reply-To: <47056722.9000801@novomail.net>
References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net>
Message-ID: <200710042133.43012.rolsch@verizon.net>

On Thursday 04 October 2007 6:20 pm, Lee Passey wrote:
> Jeroen Hellingman (Mailing List Account) wrote:
> > I think the biggest barrier here is the steep learning curve of TEI (20%
> > of the tags cover 80% of the things you encounter, but every other book
> > you will need something from those remaining 80%, and, oh gosh, which
> > tag can I use then) ....
>
> I am intrigued by this comment (and not only because it mirrors my own
> experience). So by way of information gathering among those who use TEI
> on a regular basis, I would like you to tell me, perhaps simply as an ordered
> list, what TEI tags you believe are most used and most valuable (not
> necessarily the same thing). In other words, what are the 20% of the
> tags that cover 80% of the need, and from the remaining 80% what seems
> to come up the most often?

In my experience, it is not simply a matter of which elements exist, but of which element to use for a particular piece of text. I use a Python program which marks up about 90% of a DP-formatted text, so my time is spent marking up the non-general use cases. I use a minimal TEI file as a test file to check how a particular piece of text will render in the three current formats. Then, when the TEI markup is correct, I copy it into the master TEI file.

The following are some examples from a condensed test file:

Marcia Schuyler

SIXTH EDITION

Illustration: Copyright by C. Klackner. Oh, You Naughty Man! She Exclaimed Prettily, How Dare You!

Marcia Schuyler

by
Grace Livingston Hill Lutz
Author of The Story of a Whim, According to the Pattern, An Unwilling Guest, etc.

Illustrations by
E. L. HENRY, N.A.

GROSSET & DUNLAP
PUBLISHERS · NEW YORK

Copyright, 1908
By J. B. Lippincott Company

Published February, 1908

Electrotyped and printed by J. B. Lippincott Company
The Washington Square Press, Philadelphia, U. S. A.

TO
THE DEAR MEMORY OF
MY FATHER
The Rev. CHARLES MONTGOMERY LIVINGSTON
WHOSE COMPANIONSHIP AND ENCOURAGEMENT
HAVE BEEN MY HELP THROUGH
THE YEARS

The Squire with deepening frown was studying his elder senses that a girl of his could be so heartless.

Dear David, the letter ran,—written as though in a hurry, done at the last moment,—which indeed it was:—

I want you to forgive me for what I am doing. I know you will feel bad about it, but really I never was the right one for you. I’m sure you thought me all too good, and I never could have stayed in a strait-jacket, it would have killed me. I shall always consider you the best man in the world, and I like you better than anyone else except Captain Leavenworth. I can’t help it, you know, that I care more for him than anyone else, though I’ve tried. So I am going away to-night and when you read this we shall have been married. You are so very good that I know you will forgive me, and be glad I am happy. Don’t think hardly of me for I always did care a great deal for you.

Your loving

Kate.

It was characteristic of Kate that she demanded the love the mantel-piece.

They waxed a trifle sentimental at the parting, but when spirit without a guiding star.

Dear Lemuel: she wrote:—

I am coming home. I wonder if you will be glad?

(Artful Hannah, as if she did not know!)

It is very delightful in New York and I have been having a gay time since I came, and everybody has been most pleasant, but—

“’Mid pleasures and palaces though we may roam, Still, be it ever so humble, there’s no place like home. A charm from the skies seems to hallow it there, Which, go through the world, you’ll not meet with elsewhere. Home, home, sweet home! There’s no place like home.’[**PM typo: no ’]

That is a new song, Lemuel, that everybody here is singing. It is written by a young American named John Howard Payne who is in London now acting in a great playhouse. Everybody is wild over this song. I’ll sing it for you when I come home.

I shall be at home in time for singing school next week, Lemuel. I wonder if you’ll come to see me at once and welcome me. You cannot think how glad I shall be to get home again. It seems as though I had been gone a year at least. Hoping to see you soon, I remain

Always your sincere friend,

Hannah Heath.

And thus did Hannah make smooth her path before her, further time in chasing will-o-the-wisps.

It did not take her long to reduce the dinner table to one they sang in school,

“Sister, thou wast mild and lovely, Gentle as the summer breeze, Pleasant as the air of evening When it floats among the trees.”

But the first words set her to thinking of her own sister, and girl for whom that song was written.
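For readers who have not seen such a converter, the kind of first-pass markup described above (a Python program handling the general cases of a DP-formatted text) might look roughly like the sketch below. This is not the actual program, which was not posted; it is a guess under stated assumptions. It handles only three DP conventions (blank-line separated paragraphs, `<i>`/`<b>` inline tags, and `/* ... */` no-wrap blocks treated as verse), it assumes a no-wrap block contains no blank lines, and the TEI elements chosen (`p`, `hi`, `lg`, `l`) are illustrative rather than confirmed PGTEI practice. The function names and the sample paragraph are invented.

```python
import re
from xml.sax.saxutils import escape

# DP inline tag -> value of the TEI rend attribute (an assumed mapping).
INLINE = {"i": "italic", "b": "bold"}


def convert_inline(text):
    """Escape XML characters, then turn DP <i>/<b> into <hi rend="...">."""
    text = escape(text)
    for tag, rend in INLINE.items():
        text = text.replace(f"&lt;{tag}&gt;", f'<hi rend="{rend}">')
        text = text.replace(f"&lt;/{tag}&gt;", "</hi>")
    return text


def dp_to_tei(source):
    """Convert a DP-formatted string into a list of TEI fragments."""
    fragments = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", source.strip()):
        lines = block.strip().splitlines()
        if lines[0].strip() == "/*" and lines[-1].strip() == "*/":
            # No-wrap block: treat each inner line as one verse line.
            verse = "".join(
                f"<l>{convert_inline(line.strip())}</l>" for line in lines[1:-1]
            )
            fragments.append(f"<lg>{verse}</lg>")
        else:
            # Ordinary paragraph: rejoin the wrapped lines.
            joined = " ".join(line.strip() for line in lines)
            fragments.append(f"<p>{convert_inline(joined)}</p>")
    return fragments


# The paragraph is invented; the two verse lines come from the example above.
SAMPLE = """She sang it <i>softly</i>.

/*
Sister, thou wast mild and lovely,
Gentle as the summer breeze,
*/
"""

for fragment in dp_to_tei(SAMPLE):
    print(fragment)
```

Everything such a script cannot classify (letters, signatures, dedications, captions) is exactly the non-general material that still has to be marked up by hand.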

Roland Schlenker From rolsch at verizon.net Thu Oct 4 19:29:39 2007 From: rolsch at verizon.net (Roland Schlenker) Date: Thu, 04 Oct 2007 22:29:39 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <392096647.20071004191413@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47056722.9000801@novomail.net> <392096647.20071004191413@noring.name> Message-ID: <200710042229.39180.rolsch@verizon.net> On Thursday 04 October 2007 9:14 pm, Jon Noring wrote: > Lee wrote: > > Jeroen Hellingman wrote: > >> I think the biggest barrier here is the steep learning curve of TEI > >> (20% of the tags cover 80% of the things you encounter, but every > >> other book you will need something from those remaining 80%, and, > >> oh gosh, which tag can I use then) .... > > > > I am intrigued by this comment (and not only because it mirrors my > > own experience). So by way of information gathering among those who > > use TEI on a regular basis, I would you to tell me, perhaps simply > > as an ordered list, what TEI tags you believe are most used and most > > valuable (not necessarily the same thing). In other words, what are > > the 20% of the tags that cover 80% of the need, and from the > > remaining 80% what seems to come up the most often? > > > > I'm thinking of writing a little script that will try to automate > > the collection of usage data from current Gutenberg TEI texts. > > Regarding the last paragraph Lee wrote, I think that's a splendid > idea, to see what elements, attributes and attribute values have been > used, doing a statistical analysis of their usage. I hope the analysis > will also look at content models, maybe building a minimal DTD to which > all the documents will validate, (I believe there are tools which will > build a common DTD from a set of XML documents -- hmmm, maybe that's > the approach to take first, then one can do a statistical analysis by > comparing to this minimal DTD. 
Also, inspection of that DTD will
> provide insights.)
>
> *****
>
> Regarding the rest of what Lee wrote, a few days ago I outlined in a
> private email to Lee some preliminary ideas, which I'll restate here
> for discussion purposes:
>
> The gist of the idea is that a group of us create a very strict subset
> of TEI: elements, attributes and *standardized* attribute values, and
> constrained element content models, along with any other markup usage
> rules that cannot be enforced by a DTD or schema. This subset and
> associated ruleset would be sufficient to consistently, uniformly, and
> in standardized fashion (especially attribute values), markup 80% or
> 90% of the public domain books which PG/DP works with. A related goal
> (if possible to achieve) would be that when different people
> independently markup the same book, and follow the rules, the marked
> up documents would be, for all practical purposes, canonically
> identical with one another.

I have been following my own rule: "make it look like the original book". Though I have thought at times that what I am producing is a PG edition of the original book. That is, it would be nice if all PG books had the same look and feel, like a book from O'Reilly Publishers.

>
> Furthermore, we would actually build our own DTD or schema so that
> those authoring to this strictly applied subset could immediately
> validate to it. Also, we could write a script to do conformance
> checking to check any other requirements that cannot be enforced by
> DTD or schema -- a sort of "conformance checker." We could also write
> a brief document describing how to markup documents using this subset
> vocabulary, and minimizing the need for people learning to markup
> using our simple subset to have to slog through the TEI manual to
> figure out how to do something.

IMO, is this not what we have at the present time? PGTEI is a minimal subset of TEI, and the "Guide to PGTEI" is a document describing PGTEI.
In my short time using TEI, having marked up five books of simple fiction, I have encountered enough non-general use cases not covered by PGTEI and the "Guide to PGTEI" that I have had to seek more information in TEI-Lite and TEI4 to mark them up.

>
> And for most (but clearly not all) of the remaining texts, we could
> slowly build a "superset" of the basic DTD, so that at least the more
> complicated books follow the strict subset in uniformity of basic
> markup. This superset could be slowly built over time.
>
> The benefit of this approach is that we can now involve more people in
> marking up books, have the validation tools, and provide a much more
> uniform basis by which authoring tool and conversion tools can be
> built more quickly. The problem with the full-blown TEI, and even
> TEI-Lite (which is not "Lite"), is that it is so massive, and the
> manual so difficult to comprehend without spending a year studying it,
> that those trying to build dedicated authoring tools and conversion
> tools to handle all possibilities is much more difficult. And I'm not
> advocating not using the full blown TEI for the extremely funky texts,
> but let's at least standardize on something simple and uniform for the
> vast majority of the books.
>
> Anyway, that's the core of the idea. And not actually new as Josh has
> mentioned it in some fashion, but maybe what is proposed here has a
> few twists on what previously has been proposed.
>
> And I'm willing to help build the DTD (I prefer DTDs since they should
> be sufficient for this purpose and there are other advantages to DTD
> over schema which I won't get into here.) I am quite experienced in
> building DTDs, having built by hand the OEBPS 1.2 Document and
> Package DTDs, the OpenReader Binder DTD, and the BookX DTD (which has
> a similar philosophy to that described above but focused on new book
> publishers, clearly not for use by PG/DP for reasons I won't delve
> into.)
>
> Jon Noring
>
>
> (p.s., since TEI P5 is soon to be released -- it is currently at
> version 0.9 and may be elevated to 1.0 at this years TEI Annual
> Meeting in November -- our subset should definitely be built on P5.)

IMO, there are two obstacles to the adoption of TEI. One, there is no conversion tool like Guiguts for converting a DP-formatted text to TEI. Two, DP has a very strong community of PPers who know how to mark up a DP-formatted text into HTML. I believe a conversion tool to TEI and a very helpful group of DPers, well versed in TEI, are needed to further the adoption of TEI.

Roland Schlenker

From vze3rknp at verizon.net Thu Oct 4 19:40:31 2007
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Thu, 04 Oct 2007 22:40:31 -0400
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
In-Reply-To: <392096647.20071004191413@noring.name>
References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <392096647.20071004191413@noring.name>
Message-ID: <4705A41F.2060708@verizon.net>

In thinking about persuading DP volunteers to use TEI, one area to consider is the periodicals. They raise interesting formatting problems, and have lots and lots of instances, so that time put into work on one will be useful over and over. Punch is a good sample case. It has some funky formatting issues that we've come to deal with in a way that is common across the final document producers (post-processors). Any proposed TEI DTD or schema or whatever will have to be able to produce something that looks very similar to the issues that we've produced so far. I know that Josh has been using TEI for American Missionary. It would be good to have some nice examples of other periodicals as well.
Another place where the investment of some time by someone skilled in such things would really pay off is in making different style sheets (forgive me if I have the term wrong) for drama. Having one that lays out the material for reading, and another that lays it out as a script for actors to work from (with lots more white space, for example) would provide a clear benefit for semantic markup of the basic elements (who's talking, stage directions, acts and scenes, list of characters, etc). Metrical verse plays can introduce some interesting combinations as can ones that use both prose and verse. But the basic idea is the same. I'm not a current DP post-processor, but my impression from interactions with many of them is that in addition to the lack of authoring tools for TEI, another major stumbling block is that the currently available transforms (or whatever they are properly called) produce ugly looking results. Post-processors put a lot of time into their projects and they want them to look nice. It's all very well to say that only semantic markup should be used, but that's not going to win converts if the results are dramatically different in character and feel from the original book and just don't look good. This applies mostly to the html versions, since there's not a lot that can be done to make plain text worse. Something that comes to mind immediately is dealing with decorative capital letters at the beginning of a chapter or paragraph. It's a presentation matter, that has nothing to do with the semantics of the book, but can be important to the look and feel. I guess my basic message is that two things are needed in order to persuade the majority of book producers for PG to use TEI (or any other master format). The first is the need to produce a result that the post-processors can feel proud of when they see it, and the second is good authoring tools. 
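The drama suggestion above (one semantic source, two layouts) can be illustrated with a small Python sketch. It does not use TEI or real stylesheets: the semantic markup is stood in for by two invented classes, `Speech` and `StageDirection`, which correspond loosely to what TEI marks up as speeches and stage directions, and the two functions play the role of the two stylesheets. The scene itself is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Speech:
    speaker: str
    lines: list  # one string per line of the speech


@dataclass
class StageDirection:
    text: str


def render_reading(items):
    """Compact layout for reading: speaker and speech run together."""
    out = []
    for item in items:
        if isinstance(item, Speech):
            out.append(f"{item.speaker}. " + " ".join(item.lines))
        else:
            out.append(f"[{item.text}]")
    return "\n".join(out)


def render_script(items):
    """Actor's script: speaker on its own line, indented lines, extra white space."""
    out = []
    for item in items:
        if isinstance(item, Speech):
            out.append(item.speaker.upper())
            out.extend("    " + line for line in item.lines)
        else:
            out.append(f"    ({item.text})")
        out.append("")  # blank line after every speech or direction
    return "\n".join(out)


# An invented two-item scene; the same data feeds both layouts.
SCENE = [
    StageDirection("Enter Anna."),
    Speech("Anna", ["Who is there?", "Speak!"]),
]

print(render_reading(SCENE))
print("-----")
print(render_script(SCENE))
```

The point of the sketch is only that neither layout requires touching the source: everything that differs between them lives in the rendering functions.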
JulietS

From joshua at hutchinson.net Fri Oct 5 06:45:51 2007
From: joshua at hutchinson.net (joshua at hutchinson.net)
Date: Fri, 5 Oct 2007 13:45:51 +0000 (UTC)
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
Message-ID: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net>

>----Original Message----
>From: vze3rknp at verizon.net
>
>... my impression from interactions
>with many of them is that in addition to the lack of authoring tools for
>TEI, another major stumbling block is that the currently available
>transforms (or whatever they are properly called) produce ugly looking
>results.

This is a huge issue, believe it or not.

Basically, the problem lies in that there are two camps of thought.
Camp 1 - The layout of the original book is important and the final product (in the HTML and PDF at least) should resemble the original look and feel of the book as closely as possible. Things like drop caps at the beginning of chapters, poems that are centered on the page as opposed to left justified, etc. are EXTREMELY important to folks in this camp.

Camp 2 - The content is king. This group doesn't care so much about the original layout of the book, but rather what that layout is meant to convey. Is a chapter heading BOLD, CENTER and ITALICS in the original? Great. But in the PG version, it should be a chapter heading and use whatever standard PG stylesheets use. Things like drop caps or matching the idiosyncratic layout of the original table of contents are "fluff" to folks in this camp.

***

You will find loud proponents of both camps. (You'll also find quiet and reasonable proponents, too.)

The problem is that TEI is better suited to Camp 2 (though Camp 1 can be accommodated ... it's just much more work).

***

Is one camp right and the other wrong? Is it necessary to have one camp or the other "win"? Can both be adequately served? Is it worth the effort to TRY to serve both camps?

These are some questions that need to be answered.

Josh

PS I don't believe that the current output of the PGTEI process is "ugly". Rather, it is uniform and loses the "charm" of the original book's layout.

From rolsch at verizon.net Fri Oct 5 07:27:32 2007
From: rolsch at verizon.net (Roland Schlenker)
Date: Fri, 05 Oct 2007 10:27:32 -0400
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
In-Reply-To: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net>
References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net>
Message-ID: <200710051027.32646.rolsch@verizon.net>

On Friday 05 October 2007 9:45 am, joshua at hutchinson.net wrote:
> >----Original Message----
> >From: vze3rknp at verizon.net
> >
> >...
my impression from interactions > >with many of them is that in addition to the lack of authoring tools > > for > > >TEI, another major stumbling block is that the currently available > >transforms (or whatever they are properly called) produce ugly > > looking > > >results. > > This is a huge issue, believe it or not. > > Basically, the problem lies in that there are two camps of thought. > > Camp 1 - The layout of the original book is important and the final > product (in the HTML and PDF at least) should as closely resemble the > original look and feel of the book as is possible. Things like drop > caps at the beginning of chapters, poems that are centered on the page > as opposed to left justified, etc. are EXTREMELY important to folks in > this camp. > > Camp 2 - The content is king. This group doesn't care so much about > the original layout of the book, but rather what that layout is meant > to convey. Is a chapter heading BOLD, CENTER and ITALICS in the > original? Great. But in the PG version, it should be a chapter > heading and use whatever standard PG stylesheets use. Things like drop > caps or matching the idiosyncratic layout of the original table of > contents are "fluff" to folks in this camp. > > *** > > You will find loud proponents of both camps. (You'll also find quiet > and reasonable proponents, too.) > > The problem is that TEI is better suited to Camp 2 (though Camp 1 can > be accomodated ... it's just much more work). > > *** > > Is one camp right and the other wrong? Is it necessary to have one > camp or the other "win"? Can both be adequately served? Is it worth > the effort to TRY to serve both camps? I do not think it is a matter of which camp is right or wrong but, how many of the PPers at DP, which are followers of Camp 1, can be convinced to adopt TEI. Since, I believe at present, the PPers of Camp 1 greatly out number the PPers of Camp 2. So, I think it very much worth the effort to accommodate the followers of Camp 1. 
Otherwise, the number of PPers who use TEI will continue to expand at the present slow rate of growth. Thereby, earning us very little on our time and effort. > These are some questions that need to be answered. > > Josh > > PS I don't believe that the current output of PGTEI process is > "ugly". Rather, it is uniform and loses the original "charm" of the > original book's layout. Roland Schlenker From jon at noring.name Fri Oct 5 09:05:15 2007 From: jon at noring.name (Jon Noring) Date: Fri, 5 Oct 2007 10:05:15 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> Message-ID: <724651335.20071005100515@noring.name> Joshua wrote: > Basically, the problem lies in that there are two camps of thought. > > Camp 1 - The layout of the original book is important and the final > product (in the HTML and PDF at least) should as closely resemble the > original look and feel of the book as is possible. Things like drop > caps at the beginning of chapters, poems that are centered on the page > as opposed to left justified, etc. are EXTREMELY important to folks in > this camp. > > Camp 2 - The content is king. This group doesn't care so much about > the original layout of the book, but rather what that layout is meant > to convey. Is a chapter heading BOLD, CENTER and ITALICS in the > original? Great. But in the PG version, it should be a chapter > heading and use whatever standard PG stylesheets use. Things like drop > caps or matching the idiosyncratic layout of the original table of > contents are "fluff" to folks in this camp. > > *** > > The problem is that TEI is better suited to Camp 2 (though Camp 1 can > be accomodated ... it's just much more work). Josh, this is an excellent summary of the two camps! 
Most who have followed my messages over the years know that I am firmly in Camp 2, but I have no difficulty with interested individuals in taking the digital content and producing "facsimile" digital renditions. (Juliet's message seems to indicate there are many at DP who fall into Camp 1.) One thing I do know, that with a properly proofed and marked up digital content master which focuses on identifying the universally-important document structures and inline text semantics, it is possible to repurpose the content almost anyway one wants, and this includes producing facsimile renditions (sometimes this can be done automatically, other times does require human intervention.) However, if the markup is mostly presentationally-oriented so as to focus on only on facsimile production, then the content is much less reusable. I do believe most of the Camp 1 people at DP understand this and strive in their work product to markup document structures and inline text semantics as much as possible. (There are a number of older PG texts, though, where the HTML markup is wholly presentational, such as what happened to the 1001 Nights -- the markup of that is *so bad* that I gave up trying to elevate it to proper markup. Harumph. Also, the source text used was NOT the right one to use, unfortunately, but that's not germane to this discussion.) Two other factors to consider: 1) Most now recognize the importance of having the original page scans available alongside the digitized text master. For some who wish to see how the original book was typeset, this is more than sufficient and they would not need to see a facsimile digital rendition. (Aside, I've always believed that if we are to scan the public domain books, we should do so at sufficient scan quality so the scans will be useful for ludic reading, and not just as feed for OCR, and thus have always advocated higher quality master scans than has been done. 
This zeal for speed has troubled me, especially in that the bottleneck at DP is not scans, but proofing -- I would hope that DP will begin to encourage book scanners to focus on archival quality -- even presentation quality -- rather than the current "just scan 'em good enough for OCR." Make the scans themselves a valuable work product that the scanner can be proud to distribute to the world, and not only to the inner workings of DP. Let's have a "proudly scanned by ..." added to the information (metadata) provided along-side by the scansets.) 2) When one analyzes the total work to get from a paper book to an accurate digital facsimile rendition, most of the work is still in the scanning, proofing, and structure/semantic markup stages (with most in the first two). I can only guesstimate the percentage of the total person-hours to accomplish these three items, but I believe it's well into the upper 90% range. The person-hours to take the digital master and then produce a facsimile, even if done mostly manually, is relatively minor -- those skilled at it actually enjoy doing this, and will do it mostly manually anyway, and can work quite fast, sometimes just a few hours. (True facsimiles probably require InDesign to layout the text.) Juliet mentioned the difficulties with "Punch", and definitely Punch stretches things with it's "stream of consciousness", visually-oriented approach. I've only looked at one so far, #22698: http://www.gutenberg.org/files/22698/22698-h/22698-h.htm But as I look at the HTML example (not sure where the original page scans are), it still appears the content is pretty linear and so a digital text master can still be produced which focuses on structural/ semantic markup (just a stream of section after section, each section autonomous from the others.) So long as the page scans are available, someone can then take that digital text master and produce a facsimile version. 
In some ways, the complications DP is facing with Punch is *because* many there want to directly *autoconvert* from the Master to the Facsimile: Digital Master --> Digital Facsimile What I propose is the following: Digital Master --> "Facsimile Master" --> Digital Facsimile The focus would be on producing the Digital Master, but if someone wants to produce a "Facsimile Master" from the Digital Master (even if it necessitates redoing the document markup from scratch or putting in a lot of hand labor), then all the power to them. I see a separation into different groups: 1) Distributed Scanning 2) Distributed Proofing 3) Digital Text Mastering (sufficient to autoconvert into ebook formats optimized for various target platforms -- focus on content.) 4) Digital Facsimile Production I see getting to #3 as the most important for fully usable texts. I do not see #4 as being that important for the vast majority of texts, especially now that we will have the original scan sets available in sufficiently high quality for reference or direct reading. However, all the power to those who want to produce Digital Facsimiles of *selected* Digital Text Masters. (Do all public domain texts require we produce digital facsimiles of them, at least right away, especially when we now will have scansets available? I'd say only for certain texts do digital facsimiles make any sense -- that's why I have trouble with the zeal to make digital facsimiles of all books.) To be frank, I see Juliet in a sort of bind since a lot of the DP volunteers are so enamored with #4 that they don't properly focus on #3 in a separate manner. A separation into different "projects" makes sense. Of course, those who love digital facsimiles can continue to advise the digital text mastering folk how to better markup the digital masters to make facsimile production a little easier, but there's a point when the digital mastering folk have to say "enough" and tell the facsimile folk to do it themselves. 
The digital text mastering folk *have* to focus on the repurposeability and accessibility of their work product to the world at large, to focus on the content and not on original presentation (most of which is arbitrary anyway.) Well, obviously I've opened up a lot of contentious issues here, but they are my opinions, and hope others will respond in an objective, unemotional way. Jon Noring From joshua at hutchinson.net Fri Oct 5 09:29:35 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Fri, 5 Oct 2007 16:29:35 +0000 (UTC) Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? Message-ID: <28140207.1191601775105.JavaMail.?@fh1064.dia.cp.net> >----Original Message---- >From: jon at noring.name > >Juliet mentioned the difficulties with "Punch", and definitely Punch >stretches things with it's "stream of consciousness", >visually-oriented approach. I've only looked at one so far, #22698: > > http://www.gutenberg.org/files/22698/22698-h/22698-h.htm > Just a quick FYI: The page images can currently be found here: http://pageimages.pglaf.org/Loewenstein/22698/22698-page-images/ and will be moved to the archives once the WWer (in this case, Joe) okays them. We're about a month behind posting images due to September being a pretty heavy month. Josh From jon at noring.name Fri Oct 5 10:03:01 2007 From: jon at noring.name (Jon Noring) Date: Fri, 5 Oct 2007 11:03:01 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <28140207.1191601775105.JavaMail.?@fh1064.dia.cp.net> References: <28140207.1191601775105.JavaMail.?@fh1064.dia.cp.net> Message-ID: <422003665.20071005110301@noring.name> Josh wrote: > Just a quick FYI: > > The page images can currently be found here: > > http://pageimages.pglaf.org/Loewenstein/22698/22698-page-images/ > > and will be moved to the archives once the WWer (in this case, Joe) > okays them. We're about a month behind posting images due to September > being a pretty heavy month. 
Thanks! Hmmm, from a Digital Text Mastering approach, that version of Punch would be relatively easy. It is actually a quite linear text divided up into autonomous sections/divisions. The only complication is that it is a mix of Prose, Verse, Drama, and apparently stand-alone images, but TEI has markup for all that which can be applied in standardized manner. So my view would be to mark this up using TEI (and where necessary bring in both the Verse and Drama modules for their added "stuff"), in linear fashion and not be concerned at all about layout since in the example I looked at I did not see any "layout is content" stuff (and when there is the rare "typographic layout is itself content", use SVG to mark that up.) Issue that TEI as a Digital Text Master, then let the Digital Facsimile enthusiasts, if they so choose, to take that and produce a facsimile edition by creating their own derivative master(s), using InDesign, or whatever path(s) makes sense for their purposes. But trying to embed the digital text master with special markup so it will autoconvert to the desired facsimile result is, in my opinion, not the way to go to produce a near-facsimile (such as HTML) or a perfect-facsimile (e.g., PDF or SVG). This might be where some bottlenecks in the DP process is occurring and making adoption of TEI for digital text mastering difficult: a focus on facsimile reproduction *directly* from the Master. If the digital text mastering is separated from the digital facsimile production, such as by separating into different groups, this may free up one bottleneck. Just an outside observation. 
To be clear, I am not hostile to creating digital facsimiles, but trying to produce a single Digital Text Master, along with a universal conversion system, which will pushbutton convert to both 1) optimized digital format versions focusing on target platforms and content, and 2) quite faithful facsimile reproductions of the original typesetting, is a near impossible proposition in a practical sense -- it is doable, but so damn complicated it will never be implemented (I am aware of commercial systems that do this, such as Rosetta Solutions, and they are complicated, meant to be used commercially.) This probably explains why DP and PG are still struggling with TEI -- to try to be all things to all people. The separation of Digital Text Mastering with Digital Facsimile production makes sense, then, as I previously noted. DTM would focus on the content and repurposeability to all target digital platforms; DFP would focus on taking the DTM of selected texts and producing various levels of facsimiles using a variety of tools and output formats optimized for that purpose. DFP will probably create its own "internal master" if they wish, but this would not replace the DTM master. Jon Noring From grythumn at gmail.com Fri Oct 5 10:17:58 2007 From: grythumn at gmail.com (Robert Cicconetti) Date: Fri, 5 Oct 2007 13:17:58 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <724651335.20071005100515@noring.name> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <724651335.20071005100515@noring.name> Message-ID: <15cfa2a50710051017i4b4476acv19a8aa63037c3089@mail.gmail.com> On 10/5/07, Jon Noring wrote: > (Aside, I've always believed that if we are to scan the public > domain books, we should do so at sufficient scan quality so the > scans will be useful for ludic reading, and not just as feed for > OCR, and thus have always advocated higher quality master scans > than has been done. 
This zeal for speed has troubled me, especially > in that the bottleneck at DP is not scans, but proofing -- I would > hope that DP will begin to encourage book scanners to focus on > archival quality -- even presentation quality -- rather than the > current "just scan 'em good enough for OCR." Make the scans You're confusing speed of scanning with bandwidth restrictions[0].. CPers are encouraged to provide sub-100k page images[1] to make sure that page load times are not prohibitive for people on modems, (and, incidently, to stay under the monthly transfer limits on the server..) I generally scan in grayscale, and later convert to B/W before uploading. In addition, most of the page images actually look pretty good scaled down if you use something better than nearest neighbor.. try using Opera, or IE7 with a line of CSS to enable bicubic scaling. R C [0] Although IIRC some of the high-speed scanners are black and white only.. [1] But not required.. I've provided several projects with 4 or 8 color grayscale pages weighing in at 150-200k.. when the condition of the text required it. From jon at noring.name Fri Oct 5 10:43:02 2007 From: jon at noring.name (Jon Noring) Date: Fri, 5 Oct 2007 11:43:02 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <15cfa2a50710051017i4b4476acv19a8aa63037c3089@mail.gmail.com> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <724651335.20071005100515@noring.name> <15cfa2a50710051017i4b4476acv19a8aa63037c3089@mail.gmail.com> Message-ID: <1941308865.20071005114302@noring.name> Robert > Jon Noring wrote: >> (Aside, I've always believed that if we are to scan the public >> domain books, we should do so at sufficient scan quality so the >> scans will be useful for ludic reading, and not just as feed for >> OCR, and thus have always advocated higher quality master scans >> than has been done. 
This zeal for speed has troubled me, especially >> in that the bottleneck at DP is not scans, but proofing -- I would >> hope that DP will begin to encourage book scanners to focus on >> archival quality -- even presentation quality -- rather than the >> current "just scan 'em good enough for OCR." Make the scans > You're confusing speed of scanning with bandwidth restrictions[0].. > CPers are encouraged to provide sub-100k page images[1] to make sure > that page load times are not prohibitive for people on modems, (and, > incidently, to stay under the monthly transfer limits on the server..) > I generally scan in grayscale, and later convert to B/W before > uploading. > > In addition, most of the page images actually look pretty good scaled > down if you use something better than nearest neighbor.. try using > Opera, or IE7 with a line of CSS to enable bicubic scaling. Well, I am aware of "bandwidth restrictions", but I still think scanning should be an autonomous activity whose work product is itself publishable, such as donating to IA. One can certainly burn DVDs containing a couple gigs of scans of a book, then mail the DVD to IA and/or elsewhere, including DP folk. And certainly once one has hi-rez versions, they can be downsampled before uploading to DP for OCR purposes. I think the confusion lies in that DP still considers the sole purpose of scans to be input to its process, so are not concerned as much by resolution and color depth (other than for images in the books). So long as DP does not make any effort to encourage those who scan books to do so at archival or even presentational quality, most won't. But if it is encouraged, I think most will take the time to do it. If volunteers are given reasons why to do something to a certain higher level of quality, most will gladly do so. I reject the notion that *asking* them to take the effort to produce archival quality will turn them away. 
The end result is that a lot of high quality scan sets will result and be made available to the world. Good enough for DP should NOT be considered good enough. Jon Noring From editor at pg-news.org Fri Oct 5 11:19:22 2007 From: editor at pg-news.org (Mike Cook) Date: Fri, 5 Oct 2007 20:19:22 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <392096647.20071004191413@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <392096647.20071004191413@noring.name> Message-ID: <005901c8077c$42e25f10$c8a71d30$@org> Of the 70 PG texts I've made into TEI, this is my list of current tags. Just of the section. (I don't think I've missed anything out ;-)

And also

&backslash; &braceleft; &braceright; &deg; &gt; &hellip; &lt; &mdash; &nbsp; &ndash; &qdash;

That's all I have :)

In some of my previous messages I talked about making very simple TEI
files. Once we have all the PG texts in this type of minimal markup,
the TEI gurus can start adding the more interesting markup options.

Mike

-----Original Message-----
From: Jon Noring [mailto:jon at noring.name]
Sent: 05 October 2007 03:14
To: Project Gutenberg Volunteer Discussion
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?

Lee wrote:
> Jeroen Hellingman wrote:

>> I think the biggest barrier here is the steep learning curve of TEI
>> (20% of the tags cover 80% of the things you encounter, but every
>> other book you will need something from those remaining 80%, and,
>> oh gosh, which tag can I use then) ....

> I am intrigued by this comment (and not only because it mirrors my
> own experience). So by way of information gathering among those who
> use TEI on a regular basis, I would you to tell me, perhaps simply
> as an ordered list, what TEI tags you believe are most used and most
> valuable (not necessarily the same thing). In other words, what are
> the 20% of the tags that cover 80% of the need, and from the
> remaining 80% what seems to come up the most often?
>
> I'm thinking of writing a little script that will try to automate
> the collection of usage data from current Gutenberg TEI texts.

Regarding the last paragraph Lee wrote, I think that's a splendid
idea: to see which elements, attributes and attribute values have been
used, and to do a statistical analysis of their usage. I hope the
analysis will also look at content models, maybe building a minimal
DTD against which all the documents will validate. (I believe there
are tools which will build a common DTD from a set of XML documents --
hmmm, maybe that's the approach to take first; then one can do a
statistical analysis by comparing to this minimal DTD. Also,
inspection of that DTD will provide insights.)
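The inventory that such a minimal DTD would capture can be approximated without a DTD generator. What follows is only a minimal sketch in Python 3, not an existing tool: the name `summarize_vocabulary` and the sample fragments are invented for illustration, and it assumes every document parses as standalone XML (files that rely on DTD-declared entities such as &mdash; would need those resolved first). It records which elements, attributes, attribute values and child elements were seen, but it ignores the order and repetition of children, so it falls short of a real content model.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_vocabulary(documents):
    """Collect elements, attributes, attribute values and observed children.

    `documents` is an iterable of XML strings. The result is a rough
    inventory, not a DTD: order and repetition of children are ignored.
    """
    attributes = defaultdict(set)   # element -> attribute names seen on it
    values = defaultdict(set)       # (element, attribute) -> values seen
    children = defaultdict(set)     # element -> child element names seen

    for source in documents:
        root = ET.fromstring(source)
        for node in root.iter():
            # Register the element even if it never has children.
            children.setdefault(node.tag, set())
            for name, value in node.attrib.items():
                attributes[node.tag].add(name)
                values[(node.tag, name)].add(value)
            for child in node:
                children[node.tag].add(child.tag)

    return attributes, values, children


# Hypothetical sample fragments, only to show the call.
samples = [
    '<text><body><div type="chapter"><head>I</head>'
    '<p>One <hi rend="italic">two</hi></p></div></body></text>',
    '<text><body><div type="poem"><lg><l>line</l></lg></div></body></text>',
]
attributes, values, children = summarize_vocabulary(samples)
for element in sorted(children):
    print(element, sorted(children[element]), sorted(attributes[element]))
```

Reading the `children` sets by eye is a cheap stand-in for "inspection of that DTD": an element that accepts a surprising child in one file stands out immediately.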
*****

Regarding the rest of what Lee wrote, a few days ago I outlined in a
private email to Lee some preliminary ideas, which I'll restate here
for discussion purposes:

The gist of the idea is that a group of us create a very strict subset
of TEI: elements, attributes and *standardized* attribute values, and
constrained element content models, along with any other markup usage
rules that cannot be enforced by a DTD or schema.

This subset and associated ruleset would be sufficient to mark up,
consistently, uniformly, and in standardized fashion (especially
attribute values), 80% or 90% of the public domain books which PG/DP
works with.

A related goal (if possible to achieve) would be that when different
people independently mark up the same book, and follow the rules, the
marked-up documents would be, for all practical purposes, canonically
identical with one another.

Furthermore, we would actually build our own DTD or schema so that
those authoring to this strictly applied subset could immediately
validate to it. Also, we could write a script to check any other
requirements that cannot be enforced by DTD or schema -- a sort of
"conformance checker."

We could also write a brief document describing how to mark up
documents using this subset vocabulary, minimizing the need for people
learning our simple subset to slog through the TEI manual to figure
out how to do something.

And for most (but clearly not all) of the remaining texts, we could
slowly build a "superset" of the basic DTD, so that at least the more
complicated books follow the strict subset in uniformity of basic
markup. This superset could be slowly built over time.

The benefit of this approach is that we can involve more people in
marking up books, have the validation tools, and provide a much more
uniform basis on which authoring tools and conversion tools can be
built more quickly.
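The "conformance checker" idea can be sketched in a few lines of Python 3. Everything specific here is an assumption made for illustration: the `ALLOWED` table is a made-up subset, not a proposed one, and the rule about `<q>` is just one example of a usage rule that a DTD cannot express. The element and attribute checks merely stand in for DTD validation when no DTD is at hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset: element name -> attribute names allowed on it.
ALLOWED = {
    "text": set(), "body": set(), "div": {"type"}, "head": set(),
    "p": set(), "q": set(), "hi": {"rend"}, "lg": set(), "l": set(),
    "pb": {"n"},
}

# Straight and curly opening quotation marks.
QUOTE_CHARS = ('"', "'", "\u201c", "\u2018")


def check_conformance(source):
    """Return a list of problems found in one XML document (a string)."""
    problems = []
    for node in ET.fromstring(source).iter():
        if node.tag not in ALLOWED:
            problems.append("<%s> is not in the subset" % node.tag)
            continue
        for name in node.attrib:
            if name not in ALLOWED[node.tag]:
                problems.append(
                    "attribute %s is not allowed on <%s>" % (name, node.tag))
        # A usage rule no DTD can express (hypothetical): <q> already
        # implies quotation marks, so its text should not start with one.
        text = (node.text or "").lstrip()
        if node.tag == "q" and text.startswith(QUOTE_CHARS):
            problems.append("<q> text starts with a literal quotation mark")
    return problems


for problem in check_conformance(
        '<text><body><p rend="x"><q>"Hello"</q><ab/></p></body></text>'):
    print(problem)
```

Returning a list of messages rather than stopping at the first failure suits volunteer markup, where one pass should report everything that needs fixing.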
The problem with the full-blown TEI, and even TEI-Lite (which is not
"Lite"), is that it is so massive, and the manual so difficult to
comprehend without spending a year studying it, that building
dedicated authoring tools and conversion tools to handle all
possibilities is much more difficult.

And I'm not advocating against using the full-blown TEI for the
extremely funky texts, but let's at least standardize on something
simple and uniform for the vast majority of the books.

Anyway, that's the core of the idea. It is not actually new, as Josh
has mentioned it in some fashion, but maybe what is proposed here has
a few twists on what has previously been proposed.

And I'm willing to help build the DTD (I prefer DTDs since they should
be sufficient for this purpose, and there are other advantages to DTD
over schema which I won't get into here.) I am quite experienced in
building DTDs, having built by hand the OEBPS 1.2 Document and Package
DTDs, the OpenReader Binder DTD, and the BookX DTD (which has a
similar philosophy to that described above but is focused on new book
publishers, and is clearly not for use by PG/DP for reasons I won't
delve into.)

Jon Noring

(p.s., since TEI P5 is soon to be released -- it is currently at
version 0.9 and may be elevated to 1.0 at this year's TEI Annual
Meeting in November -- our subset should definitely be built on P5.)
From lee at novomail.net Fri Oct 5 11:33:36 2007 From: lee at novomail.net (Lee Passey) Date: Fri, 05 Oct 2007 12:33:36 -0600 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <200710042133.43012.rolsch@verizon.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710042133.43012.rolsch@verizon.net> Message-ID: <47068380.1060802@novomail.net> Roland Schlenker wrote: > On Thursday 04 October 2007 6:20 pm, Lee Passey wrote: [snip] >> So by way of information gathering among those who use TEI >> on a regular basis, I would you to tell me, perhaps simply as an ordered >> list, what TEI tags you believe are most used and most valuable (not >> necessarily the same thing). In other words, what are the 20% of the >> tags that cover 80% of the need, and from the remaining 80% what seems >> to come up the most often? [snip] > I use a minimal TEI file as a test file to check how a peculiar piece of text > will render in the three current formats. Then, when the TEI markup is > correct, I copy it into the master TEI file. > > The following are some examples from a condensed test file: This post wasn't really a response to the question I asked, which I suppose is not surprising. I have taken the liberty of editing the examples, removing purely presentational attributes, and attempting to distinguish between ambiguous uses of different elements. This is what I conclude are the elements most important to Mr. Schlenker:

// used as

// Used as
<p> // Used as <docEdition>
<p> // used as <ab>
<p> // used as <byline>
<p> // used as <respStmt>
<p> // used as <imprint>
<p> // used as <imprint><publisher>
<p> // used as <imprint><date>
<p> // used as <availability>
<p> // used as <div type="dedication">
<p> // used as <closer><salute>
<p> // used as <closer><signed>
<figure>
<figDesc>
<q> // used as <said>
<q> // used for quotation marks in a context other than a quote
<quote>
<lg>
<l>
----------------------
<hi> // Used purely for presentation
<pb>
<anchor>
<lb>

Interestingly, this is only 13 unique elements (although it would be
at least double that if the <p> element were used correctly). All in
all, not an unreasonably large number of elements to learn.

I have also segregated the <pb>, <lb> and <anchor> tags from the rest
because I have a gut feeling that, while they are useful elements,
they are not crucial. I'm trying to categorize elements as "crucial",
"useful", and "esoteric." No doubt there will be a fair amount of
controversy when it comes to the border cases, but maybe there will be
some consensus for a significant number of elements.

I also think I'm going to ignore (for now) those elements unique to
the <teiHeader> (which encode metadata) and focus on those elements
used in the <text> element (which encode the actual content).
Personally, I think encoding of metadata /is/ crucial, but it feels
different enough from the encoding of the text that I think it can be
dealt with separately.

Hopefully, I can get some more feedback in this type of summary
format, and I'll try to post my own list (perhaps with some
justifications) later this weekend.

--
Nothing of significance below this line.
From joshua at hutchinson.net Fri Oct 5 11:38:38 2007
From: joshua at hutchinson.net (joshua at hutchinson.net)
Date: Fri, 5 Oct 2007 18:38:38 +0000 (UTC)
Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional
Message-ID: <19938225.1191609518323.JavaMail.?@fh1064.dia.cp.net>

Work from Mike's list. Quite comprehensive (the only other tag I can
think of that I use regularly is the <divGen> tag that generates
various automated sections of text, like a title page or a table of
contents).

Josh

>----Original Message----
>From: lee at novomail.net
>
>Hopefully, I can get some more feedback in this type of summary format,
>and I'll try to post my own list (perhaps with some justifications)
>later this weekend.
>

From Bowerbird at aol.com Fri Oct 5 12:46:12 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 5 Oct 2007 15:46:12 EDT
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
Message-ID: <d31.17e946be.3437ee84@aol.com>

robert said:
> You're confusing speed of scanning with bandwidth restrictions[0]..

is noring back on his "scan at high resolution" merry-go-round?
why not use the "distributed scanners" yahoogroup to discuss it?

and wasn't there a separate listserve set up to discuss t.e.i. issues?
or if you're discussing d.p. doing .tei, why not discuss it over there?

-bowerbird

From jon at noring.name Fri Oct 5 14:22:10 2007
From: jon at noring.name (Jon Noring)
Date: Fri, 5 Oct 2007 15:22:10 -0600
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
In-Reply-To: <d31.17e946be.3437ee84@aol.com> References: <d31.17e946be.3437ee84@aol.com> Message-ID: <1491082365.20071005152210@noring.name> Bowerbird wrote: > is noring back on his "scan at high resolution" merry-go-round? > why not use the "distributed scanners" yahoogroup to discuss it? > > and wasn't there a separate listserve set up to discuss t.e.i. issues? > or if you're discussing d.p. doing .tei, why not discuss it over there? Hmmm, are you saying that discussion about books scans is off-topic here? And has Greg and Michael given you the authority to decide what is on- and off-topic in this group? Jon Noring From rolsch at verizon.net Fri Oct 5 14:27:42 2007 From: rolsch at verizon.net (Roland Schlenker) Date: Fri, 05 Oct 2007 17:27:42 -0400 Subject: [gutvol-d] =?iso-8859-1?q?gnutenberg-press_maintenance_offer_=28w?= =?iso-8859-1?q?as_Re=3A_Proposal_to_add=09OpenDocument_as_an_additional?= In-Reply-To: <47056722.9000801@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> Message-ID: <200710051727.42671.rolsch@verizon.net> On Thursday 04 October 2007 6:20 pm, Lee Passey wrote: > Jeroen Hellingman (Mailing List Account) wrote: > > I think the biggest barrier here is the steep learning curve of TEI (20% > > of the tags cover 80% of the things you encounter, but every other book > > you will need something from those remaining 80%, and, oh gosh, which > > tag can I use then) .... > > I am intrigued by this comment (and not only because it mirrors my own > experience). So by way of information gathering among those who use TEI > on a regular basis, I would you to tell me, perhaps simply as an ordered > list, what TEI tags you believe are most used and most valuable (not > necessarily the same thing). In other words, what are the 20% of the > tags that cover 80% of the need, and from the remaining 80% what seems > to come up the most often? 
>
> I'm thinking of writing a little script that will try to automate the
> collection of usage data from current Gutenberg TEI texts.

From my latest project, Marcia Schuyler, by Grace Livingston Hill Lutz:

<p>                  - 1687
<q>                  - 934
<anchor>             - 434
<pb>                 - 358
<hi>                 - 204
<item>               - 118
<ref>                - 76
<lb/>                - 71
<index>              - 63
<div>                - 41
<list>               - 39
<corr>               - 38
<head>               - 37
<milestone>          - 27
<l>                  - 25
<quote>              - 8
<lg>                 - 5
<divGen>             - 5
<figure>             - 4
<figDesc>            - 4
<title>              - 3
<name>               - 3
<date>               - 3
<publisher>          - 2
<idno>               - 2
<classCode>          - 2
<bibl>               - 2
<author>             - 2
<titleStmt>          - 1
<textClass>          - 1
<text>               - 1
<teiHeader>          - 1
<taxonomy>           - 1
<sourceDesc>         - 1
<revisionDesc>       - 1
<respStmt>           - 1
<publicationStmt>    - 1
<pubPlace>           - 1
<projectDesc>        - 1
<profileDesc>        - 1
<language>           - 1
<langUsage>          - 1
<keywords>           - 1
<imprint>            - 1
<front>              - 1
<fileDesc>           - 1
<encodingDesc>       - 1
<editorialDecl>      - 1
<editionStmt>        - 1
<edition>            - 1
<classDecl>          - 1
<change>             - 1
<body>               - 1
<back>               - 1
<availability>       - 1
<TEI.2>              - 1

The python script used to create the above:

#!/usr/bin/env python
import sys

# Usage: filename.py filename
xml = open(sys.argv[1]).read()

# Create a list of elements: split on '<', drop comments, the DOCTYPE
# and the XML declaration, then keep only the tag name of opening tags.
xml = xml.split('<')
xml = [x for x in xml if not x.startswith('!--')]
xml = [x for x in xml if not x.startswith('-->')]
xml = [x for x in xml if not x.startswith('!DOCTYPE')]
xml = [x for x in xml if not x.startswith('?xml')]
xml = [x.split()[0] for x in xml if len(x.split()) > 0]
xml = [x.split('>')[0] for x in xml if len(x.split('>')) > 0]
xml = [x for x in xml if not x.startswith('/')]

# Count number of elements
elements = {}
for element in xml:
    if element in elements:
        elements[element] += 1
    else:
        elements[element] = 1

# Create a sorted list of elements and counts (highest count first)
sorted_list = [[value, '<' + element + '>', value]
               for element, value in elements.items()]
sorted_list.sort()
sorted_list.reverse()
sorted_list = [r[1:3] for r in sorted_list]

# Output to standard output a formatted sorted list and counts
result = ['%-20s - %s\n' % (e, c) for e, c in sorted_list]
result = ''.join(result)
print(result)

Roland Schlenker

PS Best viewed using a fixed font.

From lee at novomail.net Fri Oct 5 16:08:55 2007
From: lee at novomail.net (Lee Passey)
Date: Fri, 05 Oct 2007 17:08:55 -0600
Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional
In-Reply-To: <200710051727.42671.rolsch@verizon.net>
References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net>
Message-ID: <4706C407.7060706@novomail.net>

Roland Schlenker wrote:

[snip]

> From my lastest project, Marcia Schuyler, by Grace Livingston Hill Lutz:

Thanks, that's excellent data! Of course, the numbers have to be
looked at in context. For example, even though there's only one <body>
tag, /every/ TEI file is going to have to have one, so it's pretty
important. On the other hand, if we find elements that are used only
rarely /in/ a document, and are used only rarely /across/ documents,
those are good candidates for "esoteric" status. And if those esoteric
elements can be modeled with other more generic tags (you could
probably mark up an entire text using only <div>, <ab> and <seg>) then
maybe they aren't necessary to include in a TEI tutorial.

Thanks again for the data.

--
Nothing of significance below this line.
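The distinction drawn above, between how often an element appears /in/ a document and how many documents use it at all, can be computed over a batch of files. This is a minimal sketch in Python 3 under stated assumptions: `usage_report`, `rare_elements` and the thresholds are invented names and values, and the regular expression is a shortcut rather than a parser, so a tag that appears inside a comment would be counted as well.

```python
import re
from collections import Counter

# Matches the name of an opening or empty tag. Closing tags, comments,
# processing instructions and DOCTYPE lines do not match.
OPEN_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)")


def usage_report(documents):
    """Return {element: (total occurrences, number of documents using it)}."""
    totals = Counter()
    doc_freq = Counter()
    for source in documents:
        counts = Counter(OPEN_TAG.findall(source))
        totals.update(counts)            # add per-document counts
        doc_freq.update(counts.keys())   # one vote per document
    return {name: (totals[name], doc_freq[name]) for name in totals}


def rare_elements(report, max_total, max_docs):
    """Elements used seldom overall and in few documents."""
    return sorted(
        name for name, (total, docs) in report.items()
        if total <= max_total and docs <= max_docs
    )


# Hypothetical fragments, only to show the call.
docs = [
    '<text><body><p>a</p><p>b</p><lb/></body></text>',
    '<text><body><p>c</p><milestone unit="x"/></body></text>',
]
report = usage_report(docs)
print(rare_elements(report, max_total=1, max_docs=1))
```

An element like <body> scores low on total count but appears in every document, so the second number keeps it out of the "esoteric" pile.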
From jon at noring.name Fri Oct 5 17:26:30 2007
From: jon at noring.name (Jon Noring)
Date: Fri, 5 Oct 2007 18:26:30 -0600
Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional
In-Reply-To: <4706C407.7060706@novomail.net>
References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <4706C407.7060706@novomail.net>
Message-ID: <463976383.20071005182630@noring.name>

Lee wrote:
> Roland Schlenker wrote:

>> From my lastest project, Marcia Schuyler, by Grace Livingston Hill Lutz:

> Thanks, that's excellent data! Of course, the number have to be looked
> at in context. For example, even though there's only one <body> tag,
> /every/ TEI file is going to have to have one, so it's pretty important.
> On the other hand, if we find elements that are used only rarely /in/ a
> document, and are used only rarely /across/ documents, that's a good
> candidate for "esoteric" status. And if those esoteric elements can be
> modeled with other more generic tags (you could probably mark up an
> entire text using only <div>, <ab> and <seg>) then maybe they aren't
> necessary to include in a TEI tutorial.
>
> Thanks again for the data.

As I previously mentioned, another thing that could be done would be
to gather some or all of the TEI documents done for DP/PG and from
them generate a "minimum DTD" that covers all of them.

I think, but I'm not certain, that tools exist which will do this. I
know there exist tools which build a DTD from a single XML document,
but I'm not certain a tool exists which will generate a DTD for a
batch of XML documents using a common vocabulary (and, of course, the
same root element). I'll be happy to look into this if no one here
knows of such a tool and people think a minimal DTD would be useful
for analysis.
I believe a minimal DTD would be useful in that it gathers in a single place all the important things about an XML vocabulary: 1) Elements used 2) Attributes used 3) Attribute values used 4) Element content models (often overlooked but important!) Of course, one still needs to do a statistical analysis to determine, as Lee describes above, how often and across how many documents certain tags are used in DP/PG TEI texts. Another value of such a DTD is that it could be the starting point, if we deem it useful, to build a "Basic DP/PG" TEI-subset DTD which covers a reasonable fraction of the PG texts (say 80 or 90%). Of course, we'd have to adapt the DTD to the changes (and new elements) in the upcoming TEI P5 vocabulary, as well as do other changes as deemed necessary to streamline the vocabulary (such as trying to constrain it so there's just one way to markup documents that will be fully characterized by the Basic DTD.) I think it is a way that maybe we can finally converge on a TEI-based strategy for mastering PG/DP texts which is easy-to-use for the vast majority of the texts and allows for better standardization of authoring and conversion tools. Jon Noring From prosfilaes at gmail.com Fri Oct 5 17:46:11 2007 From: prosfilaes at gmail.com (David Starner) Date: Fri, 5 Oct 2007 20:46:11 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> Message-ID: <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> On 10/5/07, joshua at hutchinson.net <joshua at hutchinson.net> wrote: > Things like drop > caps or matching the idiosyncratic layout of the original table of > contents are "fluff" to folks in this camp. The frustrating thing about this is that the distinctive features of the original were frequently added carefully and with great care, and you lose a lot by folding it into a regular pattern. 
> PS I don't believe that the current output of PGTEI process is > "ugly". Rather, it is uniform and loses the original "charm" of the > original book's layout. I disagree; in particular, my stopping point has always been the title page. The title page of <http://www.gutenberg.org/files/19487/19487-h/19487-h.html> doesn't look like any title page I've ever seen. Looking "Money, Money, Money" by Ed McBain, "Outwitting History", by Aaron Lansky, and Dover's edition of Fantomas, all three of them have the text centered, the title in the largest font, the subtitle, then the author's name, which is also in a font that is larger and more dominant than the font for the body of the text. Then, on obverse side, is all the fine detail about editions and copyrights and years and publication history, in a small font. The TEI-Lite one just looks like some text dumped onto a page in a way that doesn't stand out in the least; in fact, in a typesetting atrocity, "GENERAL PREFACE" and other section headings are larger than the title. <http://www.gutenberg.org/files/21195/21195-h/21195-h.html> actually manages to be worse; the title page is a hideous atrocity of Lovecraftian proportions. Once again, the text isn't centered. In an incredibly tacky, Anglo-centric manner, we introduce English onto the title page of a book written completely in Esperanto. Bad English, mind you, since English speakers always say First Edition, never Edition 1. I've never seen anything like "Edition 1, (April 20, 2007)", either; you don't separate off a parenthetic clause with a comma. Nor have I ever seen the full date written out on a title page. Nor is this the first edition; that's been justified to me on the claims that it's the first PG edition, but that's still very misleading, especially as Project Gutenberg isn't mentioned on the title page. We also fail to credit the translator on the title page, which is usually done in more respectable books. 
And don't give me for one second that all we need to do is translated the appropriate files for TEI-Lite. That's not an excuse for publishing the book without translating those files. Furthermore, I know there are languages where those files can't be simply translated, where the gender of the author matters in which word-forms you need to use. And we may want to do books in languages that no one--or at least we--aren't entirely fluent in. Beyond, it mostly looks acceptable, with the exception of the continued Anglo-centric use of pg to indicate page numbers in an Esperanto text. I object to how the page numbers are handled; even in an English text, pg instead of page looks informal and unprofessional. Any repeating word should be unnecessary. It also ignores best current practices in how to display the page number; most modern DP HTML texts don't display it so large and loud. In a lot of cases, what makes a DP HTML text look good is not the original "charm"; it's the work put into making the HTML look good. There doesn't seem to have been much if any such work on making the output of TEI-Lite look good. Nobody seems to have looked at the ways that HTML writers have produced page numbers and made it look right, or looked at title pages in etexts and in real life, or even bothered to listen to what I said last time I complained about the title pages. Listening to people who don't already think everything about TEI-Lite is the bee's knees might help you reach the 90% that can be swayed. 
From traverso at posso.dm.unipi.it Fri Oct 5 22:40:38 2007 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Sat, 6 Oct 2007 07:40:38 +0200 (CEST) Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <463976383.20071005182630@noring.name> (message from Jon Noring on Fri, 5 Oct 2007 18:26:30 -0600) References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <4706C407.7060706@novomail.net> <463976383.20071005182630@noring.name> Message-ID: <20071006054038.34D0C101FD@posso.dm.unipi.it> >>>>> "Jon" == Jon Noring <jon at noring.name> writes: Jon> As I previously mentioned, another thing that could be done Jon> would be to gather some or all the TEI documents done for Jon> DP/PG and from them generate a "minimum DTD" that covers all Jon> of them. I think that it would be important not only to gather what has been used, but also what should have been used. For example, I have never done TEI PG books, but I am planning to do them, and in the tags that I plan to use, and one of the main reasons to me to use TEI, are the <corr> and <sic> tags, to document the errors in the original. The support in the conversion is not important, (as long as the tag does not make the conversion fail, but the tag is just discarded), since the feature may be added later, but it is a kind of information that is preserved in DP proofreading and should not be discarded. A different consideration, of more general type. It would be useful to have a network of formats and conversion tools; of these formats some should be considered "essential" (currently, it is only the txt format), and an ebook coded in any format could be considered a "master text" if from it every essential format can be obtained with some combination of the tools. 
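The "master text" test described above amounts to a reachability check over a graph of formats and conversion tools. A minimal sketch in Python, assuming a made-up converter table: the format names and the edges in CONVERTERS are illustrative only, not an inventory of tools that PG actually has or accepts.

```python
from collections import deque

# Hypothetical converter table: each key is a source format, each value the
# set of formats that some accepted tool can produce from it directly.
CONVERTERS = {
    "opendocument": {"pgtei"},
    "pgtei": {"txt", "html", "pdf"},
    "html": set(),
    "txt": set(),
}

# Per the message above, txt is currently the only "essential" format.
ESSENTIAL = {"txt"}


def reachable(start, converters):
    """Return every format obtainable from `start` by chaining converters."""
    seen = {start}
    queue = deque([start])
    while queue:
        fmt = queue.popleft()
        for target in converters.get(fmt, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def can_be_master(fmt, converters=CONVERTERS, essential=ESSENTIAL):
    """A format qualifies if every essential format is reachable from it."""
    return essential <= reachable(fmt, converters)


if __name__ == "__main__":
    for fmt in sorted(CONVERTERS):
        print(fmt, can_be_master(fmt))
```

With this table, adding a single accepted html-to-pgtei tool would be enough to turn HTML into a candidate master format, which is the appeal of treating the tools as a network rather than a fixed chain.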
For example, to come back to the original thread, an OpenDocument book might be accepted as master, if an accepted tool exists to convert (a subset of) OpenDocument to PGTEI, (and the book submitted can be converted with this tool) since from PGTEI the other formats may be obtained. Another example, a carefully hand-made HTML could be accepted as master if a tool to convert to PGTEI exists. Of course, even a (regularized) txt format might be a master, if every essential format can be reached from it. A "master format" should be a format in which a very large part of books can be represented faithfully as master text. Carlo From rolsch at verizon.net Sat Oct 6 06:03:32 2007 From: rolsch at verizon.net (Roland Schlenker) Date: Sat, 06 Oct 2007 09:03:32 -0400 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <4706C407.7060706@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <4706C407.7060706@novomail.net> Message-ID: <200710060903.32265.rolsch@verizon.net> On Friday 05 October 2007 7:08 pm, Lee Passey wrote: > Roland Schlenker wrote: > > [snip] > > >>From my lastest project, Marcia Schuyler, by Grace Livingston Hill Lutz: > > Thanks, that's excellent data! Of course, the number have to be looked > at in context. For example, even though there's only one <body> tag, > /every/ TEI file is going to have to have one, so it's pretty important. > On the other hand, if we find elements that are used only rarely /in/ a > document, and are used only rarely /across/ documents, that's a good > candidate for "esoteric" status. And if those esoteric elements can be > modeled with other more generic tags (you could probably mark up an > entire text using only <div>, <ab> and <seg>) then maybe they aren't > necessary to include in a TEI tutorial. > > Thanks again for the data. 
I would be willing to process a random sample of TEI books available at PG for more data. How would this be for starters:
   english only
   etext #
   elements > 3
   formatted the same
Posted to the list? Any suggestions? Roland Schlenker From marcello at perathoner.de Sat Oct 6 07:51:53 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 06 Oct 2007 16:51:53 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> Message-ID: <4707A109.5070000@perathoner.de> David Starner wrote: >> PS I don't believe that the current output of PGTEI process is >> "ugly". Rather, it is uniform and loses the original "charm" of the >> original book's layout. > > I disagree; in particular, my stopping point has always been the title > page. The title page of > <http://www.gutenberg.org/files/19487/19487-h/19487-h.html> doesn't > look like any title page I've ever seen. You don't realize that a TEI title page is in no way different from any other page in the book and you can make it look any way you want. See: http://www.gnutenberg.de/download/candide/candi.html http://www.gnutenberg.de/download/candide/candi.pdf http://www.gnutenberg.de/download/candide/candi.txt http://www.gnutenberg.de/download/candide/candi.tei As you can see "Candide" was done with the TEI converter of 2003. The current converter can do much more than this. The reason why most TEI title pages in the archive follow the same template is that there is a labour-saving macro <divGen type="titlepage" /> that generates a stock title page from the data in the TEI header. Of course, you don't have to use that macro. You may use it if you think the time it saves you is better spent making more books than dinking with the title page yourself. 
> <http://www.gutenberg.org/files/21195/21195-h/21195-h.html> actually > manages to be worse; the title page is a hideous atrocity of > Lovecraftian proportions. Once again, the text isn't centered. In an > incredibly tacky, Anglo-centric manner, we introduce English onto the > title page of a book written completely in Esperanto. Again, see the Candide example for a book done entirely in the French manner down to the French spaces before punctuation. > And don't give me for one second that all we need to do is translated > the appropriate files for TEI-Lite. You don't need even that. Just build your own title page and use all the words you like. Full unicode support comes included. > Beyond, it mostly looks acceptable, with the exception of the > continued Anglo-centric use of pg to indicate page numbers in an > Esperanto text. The "[pg ]" was an explicit request from DP PPers. I personally would dump page numbers from all formats except the TEI master. > There doesn't seem to have been much if any such work on making the > output of TEI-Lite look good. Of course not. Because every PPer has a different idea of "look good". What for one PPer is a "conditio sine qua non TEI" will make the other PPer go berserk for hours about your "ugly output". The PGTEI output is intentionally left as neutral as possible. If you don't like that, you can use a style sheets inside your TEI master. > Nobody seems to have looked at the ways > that HTML writers have produced page numbers and made it look right, > or looked at title pages in etexts and in real life, or even bothered > to listen to what I said last time I complained about the title pages. If you -- again -- criticize things that have been solved since at least four years, nobody will listen to you -- again. 
-- Marcello Perathoner webmaster at gutenberg.org From joshua at hutchinson.net Sat Oct 6 09:11:07 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Sat, 6 Oct 2007 16:11:07 +0000 (UTC) Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? Message-ID: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> >----Original Message---- >From: prosfilaes at gmail.com > >I disagree; in particular, my stopping point has always been the title >page. > ><http://www.gutenberg.org/files/21195/21195-h/21195-h.html> actually >manages to be worse; the title page is a hideous atrocity of >Lovecraftian proportions. Yep, it's basic and quick and dirty. But it's my fault, not PGTEI. I, as the PPer, don't care about fancy title page layout (though you do have a point about the translator's name being missing. That is my bad.) I just wanted a bare minimum that let the reader know what he/she was getting. However, this is not the standard. You can create a manual title page (I used the built-in title page macro). It can be as elaborate and fancy as you want. *Almost* anything you can do in HTML can be done in a manual title page. >Beyond, it mostly looks acceptable, with the exception of the >continued Anglo-centric use of pg to indicate page numbers in an >Esperanto text. The [Pg xxx] format was chosen at the behest of DP. Since there are always English parts of a PG text (see header/footer), it doesn't seem like a big deal to me. I honestly don't know if it can be changed in the master or if it is hardwired into the conversion right now. >most modern DP HTML texts don't display it so large and loud. That can be changed by adding a style-sheet to the TEI master. > >Listening to people who don't already think everything about TEI-Lite >is the bee's knees might help you reach the 90% that can be swayed. > Honestly, most of the things you don't like are my fault, not the TEI's. 
I'll see about marking up some better examples (maybe some of the bajillion Punch issues waiting over at DP like Juliet suggested). The output should look just like Punch issues done in HTML. Josh From prosfilaes at gmail.com Sat Oct 6 11:19:35 2007 From: prosfilaes at gmail.com (David Starner) Date: Sat, 6 Oct 2007 14:19:35 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4707A109.5070000@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> Message-ID: <6d99d1fd0710061119o7bd0e56ax7a96c679d6dc3a33@mail.gmail.com> On 10/6/07, Marcello Perathoner <marcello at perathoner.de> wrote: > You don't realize that a TEI title page is in no way different from any > other page in the book and you can make it look any way you want. See: No, of course I realize that. Of course that's true for every format in PG; if I don't like the way that the HTML file looks, I can open it up in a text editor and change how it looks. The issue is, the default looks like crap. This is what I saw when I clicked on the HTML versions of TEI-Lite in the archive. This is what most of our users are going to see, and very few of them are going to realize that it's supposed to be flexible, and a small percent of that group is going to be willing to put in the work to change it. It has to be good out of the box. > The reason why most TEI title pages in the archive follow the same > template is that there is a labour-saving macro > > <divGen type="titlepage" /> > > that generates a stock title page from the data in the TEI header. > > Of course, you don't have to use that macro. You may use it if you think > the time it makes you save is better spent making more books than > dinking with the title page yourself. 
So basically the default is "look like crap" and if you actually want it to look decent, the people who do TEI are going to mock you for putting the extra work in. Wow, that's a motivation to use TEI. > > There doesn't seem to have been much if any such work on making the > > output of TEI-Lite look good. > > Of course not. Because every PPer has a different idea of "look good". > What for one PPer is a "conditio sine qua non TEI" will make the other > PPer go berserk for hours about your "ugly output". Once again you dismiss any concepts that it could be done better by attacking the very concept of better. There's lots of agreement among PPers about what you need for a high quality HTML version, agreement that you completely ignore. > The PGTEI output is intentionally left as neutral as possible. This is about as neutral as bowerbird's writing style. As I said before, it blatantly violates the standards that virtually all books I've seen printed after 1700 adhere to. I've never seen Edition 1 or the full date on a title page, ever. How is that neutral? > If you -- again -- criticize things that have been solved since at least > four years, nobody will listen to you -- again. If you want to ignore your critics, that's your right. But very few people are doing TEI-Lite, and that's not going to change if you ignore the reasons people aren't using it. From jon at noring.name Sat Oct 6 11:45:54 2007 From: jon at noring.name (Jon Noring) Date: Sat, 6 Oct 2007 12:45:54 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> Message-ID: <1977299086.20071006124554@noring.name> Josh wrote: > I, as the PPer, don't care about fancy title page layout (though you do > have a point about the translator's name being missing. That is my > bad.) I just wanted a bare minimum that let the reader know what > he/she was getting. 
I view title pages in books to be essentially metadata, and in some cases a work of art to be treated as a graphic. But, in toto, not part of the book's textual content. Sometimes title pages include textual content (almost always epigraphs and dedications), but these can be considered independent of the title page and marked up in their own special sections (usually placed in the frontmatter division.) So for digital text mastering I wouldn't even bother with trying to mark up title pages so long as: 1) the metadata contained is recorded in a special metadata section so a "title page" optimized for every target reading platform/format can be built (e.g., like what MS Reader does for LIT -- it uses the Dublin Core data in the LIT's OEBPS Publication to auto-generate a title page.) (I think Dublin Core is sufficient, particularly if we use the IDPF OPS extension for Creator/Contributor. How can we incorporate Dublin Core in TEI digital masters?) 2) A high-quality scan of the original title page (and any related page or pages) is available to the reader. For creating a digital facsimile rendition, the facsimile creator can certainly re-create the title page using the original page scan as a visual template. The facsimile title page can either be mastered in SVG, or in XHTML (or a similar vocabulary) using CSS or XSL-FO for exact styling, or whatever format the facsimile creator prefers. ***** The same goes for a table of contents and other types of navigational lists we see in books (I'll ignore back of book indexes in this discussion -- that's a whole different animal.) These nav-lists are not part of a book's content, but are a sort of navigational metadata. Thus, for the digital text master we could use one of at least two methods to record the navigational data: 1) Use Digital Talking Book's NCX. The NCX can be embedded in the digital text master document (using either prefixed namespaces which is messy, or within a CDATA section), or kept external. 
TEI might offer other mechanisms. Note that NCX is now legally mandatory for accessible educational materials and is required in all OPS Publications (which are inside "EPub".) 2) Markup the navigation target points along with sufficient metadata for each target describing what nav-list it belongs to, some sort of notemark or index number, some sort of title, and its hierarchical level which is important to include for accessibility. With either method, a table of contents and other nav-lists (as needed) can be machine built fully optimized for each target platform/format. Adobe Digital Editions, for example, does this using the NCX in EPub. Again, for those who want to create a facsimile version, they will refer to the original page scans and build their facsimile Table of Contents, etc., exactly the way they want and without worry of repurposability for non-facsimile formats and platforms. Jon Noring From marcello at perathoner.de Sat Oct 6 14:47:18 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 06 Oct 2007 23:47:18 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6d99d1fd0710061119o7bd0e56ax7a96c679d6dc3a33@mail.gmail.com> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <6d99d1fd0710061119o7bd0e56ax7a96c679d6dc3a33@mail.gmail.com> Message-ID: <47080266.20403@perathoner.de> David Starner wrote: > On 10/6/07, Marcello Perathoner <marcello at perathoner.de> wrote: >> You don't realize that a TEI title page is in no way different from any >> other page in the book and you can make it look any way you want. See: > > No, of course I realize that. Of course that's true for every format > in PG; if I don't like the way that the HTML file looks, I can open it > up in a text editor and change how it looks. The issue is, the default > looks like crap. 
If the PPers thought so, they wouldn't have used the divGen macro. If they use it, they probably don't think it "looks like crap". As for everything else, crap is in the eyes of the beholder. > It has to be good out of the box. Well. It cannot, because the divGen macro can only use the metadata it finds in the TEI header. If you want more information or if you want prettier formatting, you have to roll your own title page. To roll your own title page will take you about half an hour. Much less than the time you spent griping here. -- Marcello Perathoner webmaster at gutenberg.org From jeroen.mailinglist at bohol.ph Sat Oct 6 15:23:34 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sun, 07 Oct 2007 00:23:34 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <005901c8077c$42e25f10$c8a71d30$@org> References: <20071001081923.GA29575@ark.in-berlin.de> <4700FDCF.1060009@perathoner.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <392096647.20071004191413@noring.name> <005901c8077c$42e25f10$c8a71d30$@org> Message-ID: <47080AE6.8060906@bohol.ph> Mike Cook wrote: > Of the 70 PG texts I've made into TEI, this is my list of current tags. Just of > the <body> section. (I don't think I've missed anything out ;-) > > > Just the tag count from my most recent TEI file (Expedition to Borneo of H.M.S. Dido, by Henry Keppel and James Brooke, only HTML and Text got posted). This shows that most tags are used only a few times, and the title page and teiHeader account for half of these. Both could be provided as templates to fill in. 
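A tag count of this kind can be produced with a few lines of Python. This is only a sketch under stated assumptions: it uses the standard-library ElementTree parser, the function name tag_frequencies and the SAMPLE document are invented for illustration, and a real PG TEI file that relies on DTD-declared entities would need a DTD-aware parser instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up TEI-like document, used only to exercise the function.
SAMPLE = (
    "<TEI.2><teiHeader><fileDesc/></teiHeader>"
    "<text><body><div1><head>One</head>"
    "<p>First <hi>word</hi>.</p><p>Second.</p>"
    "</div1></body></text></TEI.2>"
)


def tag_frequencies(xml_text):
    """Count how often each element name occurs in a well-formed XML string."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


if __name__ == "__main__":
    counts = tag_frequencies(SAMPLE)
    for tag in sorted(counts, key=str.lower):
        print(f"{tag:<17}{counts[tag]:>5}")
```

Running the same function over several files and summing the Counters would give the across-document view that was asked for earlier in the thread.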
XML-Tag Frequencies

Tag              Count
argument            24
author               2
availability         1
back                 1
bibl                 1
body                 4
byline               1
cell               291
corr                25
date                 2
dateline             2
div1                34
div2                 7
divGen               1
docAuthor            1
docDate              1
docImprint           1
docTitle             1
encodingDesc         1
fileDesc             1
front                1
head                41
hi                 759
idno                 3
item                25
language             7
langUsage            1
lb                  16
list                 2
milestone            2
name                 1
note                40
opener               3
p                 1439
pb                 436
profileDesc          1
publicationStmt      1
publisher            1
pubPlace             1
q                   10
ref                 36
resp                 1
respStmt             1
revisionDesc         1
row                 71
salute               2
sic                  1
signed               3
sourceDesc           1
table               11
TEI.2                1
teiHeader            1
text                 4
title                2
titlePage            1
titlePart            2
titleStmt            1
xref                 4

From prosfilaes at gmail.com Sat Oct 6 16:12:13 2007 From: prosfilaes at gmail.com (David Starner) Date: Sat, 6 Oct 2007 19:12:13 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> Message-ID: <6d99d1fd0710061612r529723fftdc41b7d0c05435df@mail.gmail.com> On 10/6/07, joshua at hutchinson.net <joshua at hutchinson.net> wrote: > Yep, it's basic and quick and dirty. But it's my fault, not PGTEI. > I, as the PPer, don't care about fancy title page layout (though you do > have a point about the translator's name being missing. That is my > bad.) I just wanted a bare minimum that let the reader know what > he/she was getting. I don't necessarily care about fancy title page layout; but if you look at almost any book, almost any webpage, heck almost any movie or TV show, the biggest, most prominent text is the title, followed by the author/main artists. We lay out poetry so it looks like poetry, we lay out prose so it looks like prose, we need to lay out title pages so they look like title pages. 
I also disagree with the text/metadata dichotomy that Jon Noring brings up here. The title page is text, and as long as we are precisely recording the rest of the book, I don't see any reason not to precisely record the title page. If you want to make a pretty modernized edition, then you can do whatever you want to the title page, but that's not what we do here. > However, this is not the standard. You can create a manual title page > (I used the built-in title page macro). It can be as elaborate and > fancy as you want. *Almost* anything you can do in HTML can be done > in a manual title page. You never want to have a macro that shouldn't be used. Either it should be fixed to look like a title page and use proper English (not Edition 1!), or it should be deleted so nobody uses it. > The [Pg xxx] format was chosen at the behest of DP. Since > there are always English parts of a PG text (see header/footer), it > doesn't seem like a big deal to me. The header and footer are more of an inevitable fact of nature, and they are header and footer, that is, outside the text itself. > >most modern DP HTML texts don't display it so large and loud. > > That can be changed by adding a style-sheet to the TEI master. But the default needs to look sharp, because that is the only thing 95% of our readers will ever see, and I doubt even most of the other 5% will regularly regenerate HTML from TEI masters. > Honestly, most of the things you don't like are my fault, not the > TEI's. To the extent that that is true, and I'm not sure that it entirely is, even then, if TEI is going to be a success, the books that people look at in the archives that are in TEI need to look good. Excuses and blame don't matter. Also, what people do most with our books is look at them and read them, so they have to continue to look good. 
All the cool and fancy stuff you can do with TEI isn't going to win anything if it makes people who pick up an HTML version (and that means the default style-sheet) wish it hadn't been done with TEI. From prosfilaes at gmail.com Sat Oct 6 16:19:30 2007 From: prosfilaes at gmail.com (David Starner) Date: Sat, 6 Oct 2007 19:19:30 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <1977299086.20071006124554@noring.name> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> Message-ID: <6d99d1fd0710061619q17716999ia749ba4447ebb40b@mail.gmail.com> On 10/6/07, Jon Noring <jon at noring.name> wrote: > I view title pages in books to be essentially metadata, and in some > cases a work of art to be treated as a graphic. But, in toto, not > part of the book's textual content. It may be metadata, but that doesn't stop it from being text. If we were making new editions, then that might be another thing, but as long as we're making faithful copies of old editions, we must preserve that information, all of that information, in a way that is accessible to the end user in HTML, not just the TEI geek. Table of Contents and such are different, mainly because they can be automatically regenerated and they aren't so diverse in style and content. From prosfilaes at gmail.com Sat Oct 6 16:26:15 2007 From: prosfilaes at gmail.com (David Starner) Date: Sat, 6 Oct 2007 19:26:15 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? 
In-Reply-To: <47080266.20403@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <6d99d1fd0710061119o7bd0e56ax7a96c679d6dc3a33@mail.gmail.com> <47080266.20403@perathoner.de> Message-ID: <6d99d1fd0710061626h3791b588h464a1eac38eaf6c2@mail.gmail.com> On 10/6/07, Marcello Perathoner <marcello at perathoner.de> wrote: > If the PPers thought so, they wouldn't have used the divGen macro. If > they use it, they probably don't think it "looks like crap". As for > everything else, crap is in the eyes of the beholder. Which format should be used for Project Gutenberg texts is also in the eye of the beholder. But somehow you manage to hold and loudly express your opinion on that. Perhaps PPers should start sending in their texts in their favorite wordprocessing format; after all, it's all in the eye of the beholder and what people complain about really doesn't matter. From Bowerbird at aol.com Sat Oct 6 17:06:17 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 6 Oct 2007 20:06:17 EDT Subject: [gutvol-d] introducing zandbox Message-ID: <c6d.1910ae02.34397cf9@aol.com> zandbox -- the z.m.l. sandbox -- is a free authoring tool that helps authors conveniently format their book in z.m.l. -- "zen markup language" -- a revolutionary light markup. zandbox uses a convenient 2-up interface, where you can edit text on one half of the window, and have it displayed in its fully-formatted form on the other half of the window, displaying very much like it'll be shown to the people who will eventually read your book using the free zml-viewer... because it helps ensure that your z.m.l. is how you want it, you can think of zandbox as a "validity-checker" for z.m.l. if the display-side doesn't look and act the way you want, you'll know you need to change the text on the edit-side. the "rules" of z.m.l. 
are simple, so it's usually obvious from the display what went wrong, and how to correct the text. and even before all that, zandbox is a way to _learn_ z.m.l. from the immediate feedback via the display-side, you will rapidly learn the rules for formatting text on the edit-side, and before long you'll be able to create z.m.l. in any editor. of course, since other editors don't display z.m.l. correctly, you'll probably want to stick with zandbox to do your z.m.l. but it's comforting to know that you could use any editor... to get your preview copy of zandbox, just backchannel me. the zandbox manual, in progress: > http://z-m-l.com/go/zandbox_manual.zml a z.m.l. "skeleton document", for use with zandbox: > http://z-m-l.com/go/zml_skeleton_book.zml -bowerbird From joshua at hutchinson.net Sun Oct 7 17:43:05 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Mon, 8 Oct 2007 00:43:05 +0000 (UTC) Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? Message-ID: <22751892.1191804185436.JavaMail.?@fh1036.dia.cp.net> >----Original Message---- >From: prosfilaes at gmail.com > >On 10/6/07, joshua at hutchinson.net <joshua at hutchinson.net> wrote: > >> That can be changed by adding a style-sheet to the TEI master. > >But the default needs to look sharp, because that is the only thing >95% of our readers will ever see, and I doubt even most of that 5% >will regularly regenerate HTML from TEI masters. > I was unclear. I mean the style-sheet embedded in the TEI master file. If the PPer changes the "setting" in the master, the resultant files that are automatically generated and posted to the archive will have that change. 
Josh From ralf at ark.in-berlin.de Mon Oct 8 00:08:35 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 8 Oct 2007 09:08:35 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4707A109.5070000@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> Message-ID: <20071008070835.GA27881@ark.in-berlin.de> > http://www.gnutenberg.de/download/candide/candi.pdf > [...] > > > And don't give me for one second that all we need to do is translated > > the appropriate files for TEI-Lite. > > You don't need even that. Just build your own title page and use all the > words you like. Full unicode support comes included. That example isn't representative because 'Chapitre' and 'Candide' which are the only words in the running header don't have accented characters. ralf From ralf at ark.in-berlin.de Mon Oct 8 00:25:24 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 8 Oct 2007 09:25:24 +0200 Subject: [gutvol-d] full TEI on PG? Message-ID: <20071008072523.GB27881@ark.in-berlin.de> This may be a FAQ. Would PG accept files that are a superset of PGTEI and a subset of TEI? If so, which ending should the file have to not confuse it with a possible PGTEI file? Maybe like others, I'm thinking about a Plan B in case there is no PGTEI 0.5 version. Also, without doubt, any file marked up with such a set is both worth preserving and publishing. ralf From tb at baechler.net Mon Oct 8 00:55:55 2007 From: tb at baechler.net (Tony Baechler) Date: Mon, 08 Oct 2007 00:55:55 -0700 Subject: [gutvol-d] PG-E shows text, encoding jumbled In-Reply-To: <47015BD4.20809@tintazul.com.pt> References: <mailman.2.1191265202.24007.gutvol-d@lists.pglaf.org> <47015BD4.20809@tintazul.com.pt> Message-ID: <20071008080125.6112B352602@mail1.pglaf.org> Hello, sorry if this post is irrelevant and/or out of date. 
I think there's still a problem with PGE as outlined in this email. Etext 7384 is from PG US. Check GUTINDEX.ALL. There are 7-bit and 8-bit files apparently. I only know English so I know nothing about encoding but I see the 8-bit file in the PG US index. At 09:43 PM 10/1/07 +0100, you wrote: >The upshot is -- now I can actually follow the link to the e-text >page, and I can click to download the document. Dandy! But the >encoding is wrong. For instance, in >http://pge.rastko.net/etext/7384 >there are utf-7 and iso-8859-1 versions of the Carta da Companhia by >José de Anchieta. The guy's name inside the file shows as "Jos+AOk- >de Anchieta". It's the same in either encoding. From walter.van.holst at xs4all.nl Mon Oct 8 01:13:07 2007 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 08 Oct 2007 10:13:07 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <1977299086.20071006124554@noring.name> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> Message-ID: <4709E693.9040900@xs4all.nl> Jon Noring wrote: > I view title pages in books to be essentially metadata, and in some > cases a work of art to be treated as a graphic. But, in toto, not > part of the book's textual content. As mostly a consumer of Gutenberg etexts (and occasionally proofing a page or two on DP), I have to say that the title pages are often an indication of the effort that has been put in to make the etext's reading a pleasant experience. An ugly title page will put readers off. But maybe fewer readers will be put off by ugly title pages than by etexts that are only available as ASCII files. Regards, Walter From marcello at perathoner.de Mon Oct 8 02:23:20 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 11:23:20 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? 
In-Reply-To: <20071008070835.GA27881@ark.in-berlin.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <20071008070835.GA27881@ark.in-berlin.de> Message-ID: <4709F708.20604@perathoner.de> Ralf Stephan wrote: >> http://www.gnutenberg.de/download/candide/candi.pdf >> [...] >> >>> And don't give me for one second that all we need to do is translated >>> the appropriate files for TEI-Lite. >> You don't need even that. Just build your own title page and use all the >> words you like. Full unicode support comes included. > > That example isn't representative because 'Chapitre' and 'Candide' > which are the only words in the running header don't have accented > characters. So what? "Accented characters" are no different from other ones. -- Marcello Perathoner webmaster at gutenberg.org From marcello at perathoner.de Mon Oct 8 02:28:08 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 11:28:08 +0200 Subject: [gutvol-d] full TEI on PG? In-Reply-To: <20071008072523.GB27881@ark.in-berlin.de> References: <20071008072523.GB27881@ark.in-berlin.de> Message-ID: <4709F828.9000207@perathoner.de> Ralf Stephan wrote: > This may be a FAQ. Would PG accept files that are a superset > of PGTEI and a subset of TEI? If so, which ending should the > file have to not confuse it with a possible PGTEI file? .tei You cannot confuse them files because PGTEI requires a pointer to the used DTD in the DOCTYPE. > Maybe like others, I'm thinking about a Plan B in case there > is no PGTEI 0.5 version. Also, without doubt, any file marked > up with such a set is both worth preserving and publishing. There should be no reservations if the file validates against the full TEI DTD and you provide at least a plain vanilla TXT version to go along with it. 
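A minimal sketch of the DOCTYPE point above, assuming the two flavours can be told apart by a substring of the public or system identifier. The marker string "pgtei" and the identifiers in the usage note are hypothetical; the actual PGTEI identifiers are not given in this thread. The sketch only reads the DOCTYPE declaration and does not validate anything against a DTD.

```python
from xml.dom import minidom

# Hypothetical marker: assumed to appear in the PGTEI public or system ID.
PGTEI_MARKER = "pgtei"


def tei_flavour(xml_text):
    """Classify a document as 'pgtei', 'tei' or 'unknown' from its DOCTYPE only."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        # No DOCTYPE declaration at all, so nothing to go on.
        return "unknown"
    # Either identifier may be missing (None); join whatever is present.
    ids = " ".join(filter(None, [doctype.publicId, doctype.systemId])).lower()
    if PGTEI_MARKER in ids:
        return "pgtei"
    if "tei" in ids:
        return "tei"
    return "unknown"
```

With made-up identifiers, a file declaring `<!DOCTYPE TEI.2 PUBLIC "-//Example//DTD PGTEI 0.4//EN" "http://example.org/pgtei.dtd">` would be reported as 'pgtei', and one pointing at a plain TEI DTD as 'tei', even if both carry the .tei extension.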
-- Marcello Perathoner webmaster at gutenberg.org From marcello at perathoner.de Mon Oct 8 02:46:44 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 11:46:44 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4709E693.9040900@xs4all.nl> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> Message-ID: <4709FC84.1090709@perathoner.de> Walter van Holst wrote: > As mostly a consumer of Gutenberg etexts (and occassionally proofing a > page or two on DP), I have to say that the title pages are often an > indication of the effort that has been put in to make the etext's > reading a pleasant experience. An ugly title pages will put readers off. Did you do any research to prove these claims? Google is the most popular page on the web and look at their "title page". But maybe they are successful because they put their efforts into search engine programming and not into cute title page design. -- Marcello Perathoner webmaster at gutenberg.org From walter.van.holst at xs4all.nl Mon Oct 8 02:52:49 2007 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 08 Oct 2007 11:52:49 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4709FC84.1090709@perathoner.de> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> Message-ID: <4709FDF1.7080406@xs4all.nl> Marcello Perathoner wrote: >> As mostly a consumer of Gutenberg etexts (and occassionally proofing a >> page or two on DP), I have to say that the title pages are often an >> indication of the effort that has been put in to make the etext's >> reading a pleasant experience. An ugly title pages will put readers off. > > Did you do any research to prove these claims? 
> > > Google is the most popular page on the web and look at their "title > page". But maybe they are successful because they put their efforts into > search engine programming and not into cute title page design. Ugly versus beautiful is always subjective; however, Google's title page is too minimalistic to be ugly by most standards. It is an apples-and-oranges comparison anyway: people do judge books by their covers. Every time you see the cover or the title page of a new book, you have to make a new decision about whether it is worth your time. In Google's case it is almost always repeat business. After a first experience, Google's users _know_ how good it is, so that minimalistic title page (which in the end contributes to its usability as a _search engine_, mind you, not a _book_) won't put many users off anymore. Regards, Walter From ralf at ark.in-berlin.de Mon Oct 8 02:36:52 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 8 Oct 2007 11:36:52 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4709F708.20604@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <20071008070835.GA27881@ark.in-berlin.de> <4709F708.20604@perathoner.de> Message-ID: <20071008093652.GA24464@ark.in-berlin.de> Marcello: > > That example isn't representative because 'Chapitre' and 'Candide' > > which are the only words in the running header don't have accented > > characters. > > So what? "Accented characters" are no different from other ones. You don't even read those bug reports? A shame. Plan B it will be. To repeat: Latin-1 characters in the <title>, even coded as HTML entities like &auml;, garble PDF output in the running header.
ralf From ralf at ark.in-berlin.de Mon Oct 8 02:39:12 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 8 Oct 2007 11:39:12 +0200 Subject: [gutvol-d] full TEI on PG? In-Reply-To: <4709F828.9000207@perathoner.de> References: <20071008072523.GB27881@ark.in-berlin.de> <4709F828.9000207@perathoner.de> Message-ID: <20071008093912.GB24464@ark.in-berlin.de> > > This may be a FAQ. Would PG accept files that are a superset > > of PGTEI and a subset of TEI? If so, which ending should the > > file have to not confuse it with a possible PGTEI file? > > .tei > > You cannot confuse them files because PGTEI requires a pointer to the > used DTD in the DOCTYPE. And what if they are both in the PG directory? > > Maybe like others, I'm thinking about a Plan B in case there > > is no PGTEI 0.5 version. Also, without doubt, any file marked > > up with such a set is both worth preserving and publishing. > > There should be no reservations if the file validates against the full > TEI DTD and you provide at least a plain vanilla TXT version to go along > with it. Of course. ralf From joshua at hutchinson.net Mon Oct 8 05:31:25 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Mon, 8 Oct 2007 12:31:25 +0000 (UTC) Subject: [gutvol-d] full TEI on PG? Message-ID: <15854003.1191846685163.JavaMail.?@fh1036.dia.cp.net> The point is that the tei file itself defines what version of TEI it was written to, so there is no need to change the file extension. Josh
From traverso at posso.dm.unipi.it Mon Oct 8 05:50:31 2007 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Mon, 8 Oct 2007 14:50:31 +0200 (CEST) Subject: [gutvol-d] PG-E shows text, encoding jumbled In-Reply-To: <20071008080125.6112B352602@mail1.pglaf.org> (message from Tony Baechler on Mon, 08 Oct 2007 00:55:55 -0700) References: <mailman.2.1191265202.24007.gutvol-d@lists.pglaf.org> <47015BD4.20809@tintazul.com.pt> <20071008080125.6112B352602@mail1.pglaf.org> Message-ID: <20071008125031.58D5E93B62@posso.dm.unipi.it> >>>>> "Tony" == Tony Baechler <tb at baechler.net> writes: Tony> Hello, sorry if this post is irrelevant and/or out of date. Tony> I think there's still a problem with PGE as outlined in this Tony> email. Etext 7384 is from PG US. Check GUTINDEX.ALL. Tony> There are 7-bit and 8-bit files apparently. I only know Tony> English so I know nothing about encoding but I see the 8-bit Tony> file in the PG US index. The problem is at PG US, the file is encoded in UTF-7, that is obsolete, but the zip file is labeled iso-8859-1, that is wrong. There is (of course) no 7-bit txt file, that does not make sense in portuguese.
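The mislabelling described above can be reproduced in a few lines. This is only a sketch using the byte string quoted earlier in the thread ("Jos+AOk- de Anchieta"); the function names are made up for illustration, and nothing here is taken from the actual PG posting tools.

```python
# The bytes as they appear inside the file, per the report quoted in this thread.
RAW = b"Jos+AOk- de Anchieta"


def decode_both(data):
    """Decode the same bytes under the declared label and under UTF-7."""
    as_latin1 = data.decode("iso-8859-1")  # what the zip label promises
    as_utf7 = data.decode("utf-7")         # what the file really contains
    return as_latin1, as_utf7


def reencode_latin1(data):
    """Turn UTF-7 bytes into genuine ISO-8859-1 bytes (hypothetical repair step)."""
    return data.decode("utf-7").encode("iso-8859-1")
```

Read as ISO-8859-1 the name stays "Jos+AOk- de Anchieta", because every UTF-7 byte is plain ASCII; read as UTF-7 the sequence +AOk- becomes the single character é. That is why the text looks the same "in either encoding" to a reader who trusts the label.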
Carlo From marcello at perathoner.de Mon Oct 8 06:07:45 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 15:07:45 +0200 Subject: [gutvol-d] full TEI on PG? In-Reply-To: <20071008093912.GB24464@ark.in-berlin.de> References: <20071008072523.GB27881@ark.in-berlin.de> <4709F828.9000207@perathoner.de> <20071008093912.GB24464@ark.in-berlin.de> Message-ID: <470A2BA1.20507@perathoner.de> Ralf Stephan wrote: >>> This may be a FAQ. Would PG accept files that are a superset >>> of PGTEI and a subset of TEI? If so, which ending should the >>> file have to not confuse it with a possible PGTEI file? >> .tei >> >> You cannot confuse them files because PGTEI requires a pointer to the >> used DTD in the DOCTYPE. > > And what if they are both in the PG directory? You mean doing the same book both in PGTEI *and* in "full" TEI? We have posted books with two different PDF versions, so posting a book with two different TEI versions should not be impossible. Before doing that you should make sure you absolutely need those extra tags and check if the PGTEI converter actually chokes on them. The converter quietly drops unknown tags and this should output the Right Thing in most cases. Also, to save work, you should use a transform to go from full TEI to PGTEI just tweaking the extra tags. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Mon Oct 8 07:32:46 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 08:32:46 -0600 Subject: [gutvol-d] Separation of the "master" from the "reader" versions In-Reply-To: <4709E693.9040900@xs4all.nl> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> Message-ID: <1868761478.20071008083246@noring.name> Walter wrote: > Jon Noring wrote: >> I view title pages in books to be essentially metadata, and in some >> cases a work of art to be treated as a graphic. 
But, in toto, not >> part of the book's textual content. > As mostly a consumer of Gutenberg etexts (and occassionally proofing a > page or two on DP), I have to say that the title pages are often an > indication of the effort that has been put in to make the etext's > reading a pleasant experience. An ugly title pages will put readers off. > But maybe less readers will be put off by ugly title pages than by > etexts that are only available as ASCII files. Hmmmm. This answer sort of represents the older paradigm thinking where the "master" is also the "display" copy. Back when Michael started PG there were various technical reasons why this made sense. But not in 2007. Yet that mode of thinking still permeates our thinking today. So, we must not confuse mastering with end-user formatting. The purpose of this discussion is to explore the concept of a PG "master", a single format, or a small number of compatible formats, from which all the readable versions are derived by standardized conversion processes (preferably all automated but for a few end formats may require some human intervention.) That is, we don't care if the "master" is itself in a form ready for the end-user to *directly* enjoy [endnote]. Rather, the focus is on requirements so that the "master" contains the most important stuff (accurate content, document structure, metadata, etc.) so as to allow conversion to virtually any format, both digital and material. When we look at it this way, the only purpose of carefully trying to reproduce not only the content of a title page (which is NOT the content of the Work's Expression itself -- a Title Page contains mostly to completely metadata which the *publisher*, not the *author* produced) is to aid those here who produce reading versions which are semi-exact and exact facsimiles of the original source book. 
(It should be clear that I am not hostile to those who wish to produce digital versions which are semi- to full-facsimiles of the original -- It's just that such facsimile versions are *derivative* end-user formats, not masters.) So long as we have the original page scans sitting along-side the "master" format, those who want to produce a semi-facsimile (usually HTML) or facsimile copy (e.g. PDF) should put in the work to format the title page so it looks like it originally did. This is work that is unnecessary for the mastering process since for most digital formats that facsimile title page markup can't be used anyway (after all, it is NOT part of the authored content) and has to be tossed aside. Should the "mastering team" put in that extra effort, especially for books where those who build facsimile editions may decide to pass on the book? This separation of mastering from producing facsimiles now frees up those creating facsimile versions to not try to thread the needle of creating title pages which are both "repurposeable" (which as just noted is a losing proposition -- many formats require special Title Pages built from the metadata, e.g. MS Reader LIT, so all that hard work will be simply get tossed aside) and which are formatted to render very much like the original. (Note, too, that so long as we have original page scans, the title page scan can be viewed by the end-user as a graphic, and this is about as original as one can get.) Now, I am intrigued with the "Facsimile Team" taking the Digital Master and producing an SVG document which reproduces the original title page. This SVG document would not be part of the master, but could go into the repository to sit along side the master, graphics images, page scans, and various end-user derivatives. SVG support is slowly becoming ubiquitous in web browsers. Jon Noring [endnote: Of course, Bowerbird would say that we can have both mastering and direct readability of the master -- his ZML. And that is true. 
The issue is not whether ZML can be used as a mastering format -- it can -- but whether it sufficiently meets the various requirements PG/DP needs in a mastering format. I don't believe it does.] From jon at noring.name Mon Oct 8 08:11:57 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 09:11:57 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4709FC84.1090709@perathoner.de> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> Message-ID: <1659264922.20071008091157@noring.name> Marcello wrote: > Walter van Holst wrote: >> As mostly a consumer of Gutenberg etexts (and occassionally proofing a >> page or two on DP), I have to say that the title pages are often an >> indication of the effort that has been put in to make the etext's >> reading a pleasant experience. An ugly title pages will put readers off. > Did you do any research to prove these claims? > > Google is the most popular page on the web and look at their "title > page". But maybe they are successful because they put their efforts into > search engine programming and not into cute title page design. One can certainly take a book's title page metadata (title, creators, contributors, etc.) and build a beautiful, standardized "title page", such as to use for the web presentation version. Now, this title page won't be a facsimile, but it will be attractive and appealing. If one wants to show the reader how the original title page looked, simply provide a link to the original page scan image (or embed it directly). One can't get more exact than that. And the "Facsimile Edition Team" (FET) can of course produce a true facsimile of the original title page, either in SVG or an elaborately styled XHTML+CSS. But re-creating the original title page down to the last serif in the TEI master should not be included as part of the mastering process. 
Rather, focus on the metadata to ensure it is complete so that platform-optimized title pages can be auto-generated. Jon Noring From lee at novomail.net Mon Oct 8 09:16:57 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 08 Oct 2007 10:16:57 -0600 Subject: [gutvol-d] full TEI on PG? In-Reply-To: <4709F828.9000207@perathoner.de> References: <20071008072523.GB27881@ark.in-berlin.de> <4709F828.9000207@perathoner.de> Message-ID: <470A57F9.7090806@novomail.net> Marcello Perathoner wrote: > Ralf Stephan wrote: [snip] >> Maybe like others, I'm thinking about a Plan B in case there is no >> PGTEI 0.5 version. Also, without doubt, any file marked up with >> such a set is both worth preserving and publishing. Ostensibly, PGTEI is not a superset of TEI, but simply a refinement of it (some attribute values which are left undefined in TEI are defined in PGTEI). Thus, every valid TEI file is also a valid PGTEI file (it simply doesn't take advantage of the attribute set defined by PGTEI) and vice versa. > There should be no reservations if the file validates against the > full TEI DTD and you provide at least a plain vanilla TXT version to > go along with it. It should, by now, be well established that Mr. Hart and the PGPTB are strongly opposed to the establishment of any file format as the "preferred" format, regardless of its capabilities. If you look carefully at the PG FAQ you will note that while an ASCII text version is requested, it is not required. Thus, you should be able to submit a valid TEI file to PG, and no other format. Those people who want a degraded text version can derive it from the TEI file just as those people who want an RTF version can do so. -- Nothing of significance below this line. From prosfilaes at gmail.com Mon Oct 8 10:59:18 2007 From: prosfilaes at gmail.com (David Starner) Date: Mon, 8 Oct 2007 13:59:18 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
In-Reply-To: <4709FC84.1090709@perathoner.de> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> Message-ID: <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com> On 10/8/07, Marcello Perathoner <marcello at perathoner.de> wrote: > Walter van Holst wrote: > > > As mostly a consumer of Gutenberg etexts (and occassionally proofing a > > page or two on DP), I have to say that the title pages are often an > > indication of the effort that has been put in to make the etext's > > reading a pleasant experience. An ugly title pages will put readers off. > > Did you do any research to prove these claims? Did you do any research to disprove these claims? Maybe just about every serious publisher in the world is wasting a lot of time making title pages, but I'd be willing to bet they have done research. > Google is the most popular page on the web and look at their "title > page". But maybe they are successful because they put their efforts into > search engine programming and not into cute title page design. Look at <http://www.msn.com> and <http://www.yahoo.com> and <http://www.weather.com>. You think it's a coincidence that <http://www.google.com> looks very little like them, that it's laziness? I use Google because they have cute title page design, because they have title page design that's attractive to the eye (uses colors, appropriately centered) and doesn't clutter up the page with a lot of junk. That's good title page design, and it didn't come by magic or luck. From joshua at hutchinson.net Mon Oct 8 11:06:57 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Mon, 8 Oct 2007 18:06:57 +0000 (UTC) Subject: [gutvol-d] full TEI on PG? Message-ID: <29931807.1191866817909.JavaMail.?@fh1036.dia.cp.net> Well, yes and no. The FAQ does not say it is required ...
but none of the whitewashers will post it without a text file. You'd have to go through Greg Newby and get a special dispensation from on high. :) And there has to be a "read good reason" to not post a text version. That being said, the part about any TEI file working is pretty much correct. Josh
From prosfilaes at gmail.com Mon Oct 8 11:10:45 2007 From: prosfilaes at gmail.com (David Starner) Date: Mon, 8 Oct 2007 14:10:45 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <1659264922.20071008091157@noring.name> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <1659264922.20071008091157@noring.name> Message-ID: <6d99d1fd0710081110l29ab6d60q66d06fd5cc5201b7@mail.gmail.com> On 10/8/07, Jon Noring <jon at noring.name> wrote: > But re-creating the original title page down to the > last serif in the TEI master should not be included as part of the > mastering process. I think this is a strawman here. I'm not, nor is anyone else here I've read, asking for the original title page being reproduced down to the last serif. I would like for the original text of the title page to be preserved just like the original text anywhere else in the book. I would happily settle for a decent looking title page that preserves and displays the basic information every title page has, combined with the original edition information that good scholarly editions (for Dover to the Library of America) never forget to include, all in a nice sharp neutral package. (And once again, neutral means it looks like everything else out there, not just throw it on the page.) From Bowerbird at aol.com Mon Oct 8 11:25:57 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 8 Oct 2007 14:25:57 EDT Subject: [gutvol-d] what's all the fuss about title pages? Message-ID: <d4d.fd415c7.343bd035@aol.com> what's all the fuss about title pages? just _center_ the text on the title page, and do a few other small adjustments, and david starner will be happy with it... -bowerbird p.s.
i've never seen a professionally-made book that didn't have its title page centered. From jon at noring.name Mon Oct 8 11:30:35 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 12:30:35 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com> Message-ID: <294960705.20071008123035@noring.name> David Starner wrote: > Marcello Perathoner wrote: >> Did you do any research to prove these claims? > Did you do any research to disprove these claims? Maybe just about > every serious publisher in the world is wasting time spending a lot of > time making title pages, but I'd be willing to be bet they have done > research. For final end-user viewing of a PG/DP book, certainly a nice title page improves the reading experience. But such a title page can be auto-generated from the book's title-page-related metadata, and presented to the end-user in a form which is beautiful, standardized, and optimized for the target platform. It will also "brand" the book. And now with swappable style sheets, there's even the intriguing ability to let the end-user choose the CSS formatting for the whole book, and not be constrained by the styling someone chose for them. Most of the modern paperbook reissues of public domain books that I've seen have redesigned title pages that don't look anything like the title pages in the "original" (pre-1923) books. Why?
Because the title page does not contain (other than the occasional dedication or epigraph) authorial content -- it is a publisher device to present the Work to the reader -- the content (the "text") in a title page is NOT part of the Work itself -- it is not part of the Work's content. Thus, I see no reason for digital text mastering to laboriously try to exactly format the title page. It's a waste of the digital text master's time and makes the markup more complex, and produces something that is of interest *only* to those who wish to produce some sort of facsimile edition. (And again, make sure the end-user has access to the original title page scan.) Now, if the "Facsimile Editions Team" wishes to take the digital text master and create semi- to full-facsimile reader versions, all the power to them! That's great! And now that team has free rein to produce the title page anyway they want for a particular book: SVG, LaTeX, PDF, RTF, XHTML+CSS -- whatever works best for them without having to worry about what they produce being repurposeable or accessible or whatever (since facsimile versions are for visual presentation only.) Jon Noring From marcello at perathoner.de Mon Oct 8 11:33:10 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 20:33:10 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <4709FDF1.7080406@xs4all.nl> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <4709FDF1.7080406@xs4all.nl> Message-ID: <470A77E6.5020301@perathoner.de> Walter van Holst wrote: > It is a apples and oranges comparison anyways, people do judge books by > their covers. Some people surely do. But by different standards. I personally prefer "ugly" web sites because experience has shown their contents to be more informative. 
Every author has only a finite amount of time, if more time goes into the formatting, less time will go into the contents, and vice versa. -- Marcello Perathoner webmaster at gutenberg.org From prosfilaes at gmail.com Mon Oct 8 11:35:43 2007 From: prosfilaes at gmail.com (David Starner) Date: Mon, 8 Oct 2007 14:35:43 -0400 Subject: [gutvol-d] Separation of the "master" from the "reader" versions In-Reply-To: <1868761478.20071008083246@noring.name> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <1868761478.20071008083246@noring.name> Message-ID: <6d99d1fd0710081135h4275354bpe9a079ed79d4fec7@mail.gmail.com> On 10/8/07, Jon Noring <jon at noring.name> wrote: > Walter wrote: > > Jon Noring wrote: > > >> I view title pages in books to be essentially metadata, and in some > >> cases a work of art to be treated as a graphic. But, in toto, not > >> part of the book's textual content. > > > As mostly a consumer of Gutenberg etexts (and occassionally proofing a > > page or two on DP), I have to say that the title pages are often an > > indication of the effort that has been put in to make the etext's > > reading a pleasant experience. An ugly title pages will put readers off. > > But maybe less readers will be put off by ugly title pages than by > > etexts that are only available as ASCII files. > > Hmmmm. > > This answer sort of represents the older paradigm thinking where the > "master" is also the "display" copy. Jon, are you listening? The issue is, the title page of the display copy, the HTML edition, looks like crap, and that will turn off readers. > The purpose of this discussion is to explore the concept of a PG > "master", a single format, The point of my comments in this discussion is that all the pie in the sky winnings aren't going to get people to work with if what they see looks terrible. 
> When we look at it this way, the only purpose of carefully trying to > reproduce not only the content of a title page (which is NOT the > content of the Work's Expression itself -- a Title Page contains > mostly to completely metadata which the *publisher*, not the *author* > produced) is to aid those here who produce reading versions which are > semi-exact and exact facsimiles of the original source book. I don't make this distinction between author and publisher that you do. I want to capture books that drove generations wild. The generations didn't read the original manuscript, they read the book. That's what I want to capture, that book, no matter what the publisher did. Furthermore, your line between publisher and author doesn't make a whole lot sense for a lot of material, material that has passed through many hands beside the author, be it committee-produced or anthologies. or on the flip side material that was produced entirely by the author, title page included. Not only that, changing and reformatting this information is inherently a lossy process; if we record it exactly as it was in the original book, we preserve that information in a lossless manner. And more to the point, I believe we should be doing as little editing as possible. Creating a new title page is like updating the spelling; it's changing the original volume into a more modern form. Updating and modernizing is not our job; we are merely ideal scribes, copying the original works. From marcello at perathoner.de Mon Oct 8 11:41:52 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 20:41:52 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? 
In-Reply-To: <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com>
References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com>
Message-ID: <470A79F0.3020709@perathoner.de>

David Starner wrote:
> On 10/8/07, Marcello Perathoner <marcello at perathoner.de> wrote:
>> Walter van Holst wrote:
>>
>>> As mostly a consumer of Gutenberg etexts (and occasionally proofing a
>>> page or two on DP), I have to say that the title pages are often an
>>> indication of the effort that has been put in to make the etext's
>>> reading a pleasant experience. An ugly title page will put readers off.
>> Did you do any research to prove these claims?
>
> Did you do any research to disprove these claims?

In my world the person who makes a claim has to prove it.

Also, there are two claims made. The one you didn't spot is that the amount of work gone into the text is directly proportional to the amount of work gone into the title page. This is speculative at best.

--
Marcello Perathoner
webmaster at gutenberg.org

From prosfilaes at gmail.com Mon Oct 8 11:45:05 2007
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 8 Oct 2007 14:45:05 -0400
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
In-Reply-To: <294960705.20071008123035@noring.name>
References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com> <294960705.20071008123035@noring.name>
Message-ID: <6d99d1fd0710081145y38e6fa9dq5024ea84f32b58f6@mail.gmail.com>

On 10/8/07, Jon Noring <jon at noring.name> wrote:
> But such a title page can be auto-generated from the book's
> title-page-related metadata, and presented to the end-user in a
> form which is beautiful, standardized, and optimized for the target
> platform.

Great. Then let's see it done. I'm tired of hearing about theory, Jon. I wouldn't have brought it up if we were looking at beautiful title pages. But we've all heard huge promises made for things that never worked out. I'm not going to work on TEI, nor will I encourage others to, until it actually works right.

From joshua at hutchinson.net Mon Oct 8 11:50:09 2007
From: joshua at hutchinson.net (joshua at hutchinson.net)
Date: Mon, 8 Oct 2007 18:50:09 +0000 (UTC)
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
Message-ID: <6947627.1191869409300.JavaMail.?@fh1036.dia.cp.net>

Ok, everyone, David has a point. A very GOOD point.

Master formats are great and all, but the final output has to look nice or no one will want to use it.

Let's table this part of the discussion for a few days. I'll try to fix up some examples of nice-looking title pages and post them up. Hopefully, we can get some consensus on whether TEI can do the job (and some feedback from me on how hard it was or wasn't).

Josh

>----Original Message----
>From: prosfilaes at gmail.com
>
>Great. Then let's see it done. I'm tired of hearing about theory, Jon.
>I wouldn't have brought it up if we were looking at beautiful title
>pages. But we've all heard huge promises made for things that never
>worked out.
I'm not going to work on TEI, nor will I encourage others >to, until it actually works right. From jon at noring.name Mon Oct 8 12:18:08 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 13:18:08 -0600 Subject: [gutvol-d] Separation of the "master" from the "reader" versions In-Reply-To: <6d99d1fd0710081135h4275354bpe9a079ed79d4fec7@mail.gmail.com> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <1868761478.20071008083246@noring.name> <6d99d1fd0710081135h4275354bpe9a079ed79d4fec7@mail.gmail.com> Message-ID: <44282966.20071008131808@noring.name> David Starner wrote: > Updating and modernizing is not our job; we are merely ideal scribes, > copying the original works. Well, we do agree on some aspects of this last point David made. The digital text master (DTM) should reproduce the original textual content in the source book as much as possible, including authors' and printers' errors. The nice thing about XML is that we can add markup in the DTM pointing to corrections to clearly known errors (which over time can expand), then let those who create reader editions to decide how they want to tweak the text. And I also agree 120% with Bowerbird on the need for the DTM to include the exact point of line breaks (including mid-word) and page breaks in the source book. In XML this would be done by markup specific for this purpose. This is one of the few "original typographic presentation" items I would include in the DTM markup. Why? For at least three reasons (if you think of other reasons, let us know!): 1) Alignment of the text with the source book for future proofing and other unforeseen needs, and 2) For page breaks to know the original page number associated with a piece of content so existing references by page number can be perfectly pointed to that piece of content, and 3) To aid in the production of *perfect* facsimiles for those who wish to do so. 
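
The line-break and page-break markup argued for above can be illustrated with a small script. This is only a sketch under stated assumptions: it uses TEI-style empty `<lb/>` and `<pb n="..."/>` milestone elements, treats `break="no"` as the marker for a break that falls inside a word, and runs on an invented sample paragraph. The function names `original_lines` and `reflowed` are hypothetical and belong to no existing PG or DP tool.

```python
import xml.etree.ElementTree as ET

# Invented sample text. <lb/> and <pb n="..."/> are TEI-style milestones;
# break="no" for a mid-word line break is an assumed convention here.
SAMPLE = (
    '<p><pb n="12"/>The master keeps every line'
    '<lb/>break, even one that falls some'
    '<lb break="no"/>where inside a word, and'
    '<pb n="13"/>every page break.</p>'
)


def original_lines(p):
    """Return (page, text) pairs, one per printed line of the source book."""
    page = None
    lines = []
    current = p.text or ""
    for child in p:
        if child.tag in ("lb", "pb"):
            # Either milestone ends the printed line collected so far.
            if current:
                lines.append((page, current))
            current = ""
            if child.tag == "pb":
                page = child.get("n")
        current += child.tail or ""
    if current:
        lines.append((page, current))
    return lines


def reflowed(p):
    """Return the paragraph as running text for a reader edition."""
    parts = [p.text or ""]
    for child in p:
        # A normal break becomes a space; a mid-word break joins the halves.
        if child.tag in ("lb", "pb") and child.get("break") != "no":
            parts.append(" ")
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


paragraph = ET.fromstring(SAMPLE)
for page, line in original_lines(paragraph):
    print(page, line)
print(reflowed(paragraph))
```

The same marked-up paragraph serves both uses: `original_lines` gives the text aligned with the page scans for proofing and page-number references, while `reflowed` discards the breaks for a reading version.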
Perfect facsimiles (PF) require access to the original page scans *anyway*, and there's a lot of items we need not record in the DTM since they can be seen in the PF and with relatively little work implemented in the PF. However, it would take a whole lot of work to put back the exact points of line and page breaks if those were stripped away in the DTM XML version. I mean, major big time work. (To cite an example: Putting the first word of every chapter in small caps is pretty trivial to do, but reinserting many thousands of line breaks, including breaking words, and hundreds of page breaks, is downright a lot of work.) The bottom line is that I see the DTM embodying the most critical information in the book which requires the most work to capture correctly (perfect text to the original, preserving line and page breaks), and information needed for general repurposability and accessibility. In addition, I see that the DTM *must* have associated with it the original page scans, which is a huge problem with the PG collection as it now stands. For those Works where a perfect facsimile (PF) is called for (and many works simply don't need this level of digital reproduction), those interested can take the DTM+scans and with relatively little work create a PF in a format optimized for that purpose, such as PDF, LaTeX, SVG, or whatever. The same argument applies for those wanting to create a "semi-facsimile", which would be, for example, an XHTML document that significantly captures the typographic flavor of the original but is not concerned with exact line breaks. These are my thoughts. What do the others think? Jon Noring From jon at noring.name Mon Oct 8 12:35:23 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 13:35:23 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? 
In-Reply-To: <6947627.1191869409300.JavaMail.?@fh1036.dia.cp.net> References: <6947627.1191869409300.JavaMail.?@fh1036.dia.cp.net> Message-ID: <49101967.20071008133523@noring.name> Josh wrote: > Ok, everyone, David has a point. A very GOOD point. > > Master formats are great and all, but the final output has to look > nice or no one will want to use it. > > Let's table this part of the discussion for a few days. I'll try to > fix up some examples of nice looking title pages and post them up. > Hopefully, we can get some consensus on whether TEI can do the job (and > some feedback on how hard it was/wasn't from me). Definitely cobble up some XHTML title pages! It is a useful exercise. But do keep in mind the alternative to simply not encode the title page as markup in the TEI "master" but move the information to the metadata section. Then when the XHTML derivative is created move the information from the metadata tags to the XHTML markup of the "title page". (Do note that if one builds XHTML markup for the title page, it should be optimized so it will look nice on all browsers for all platforms, including handhelds! It would not surprise me if some of the "facsimile" XHTML produced over at DP does not render well on smaller devices. It is also important that the title page is "readable" when all CSS styling is removed, so this requires the use of the h1-to-h6 tags, etc. This is also useful for accessibility.) Btw, this is what is being planned for BookX, and I've talked with a couple script experts (considering hiring them), and it is pretty trivial to do. In some cases such a transformation can be done with XSLT. This is in reply to David Starner who asked this be demonstrated, which is a reasonable request. But in this case I think the script people here will agree with me in that the markup for a standardized XHTML "title page" can be autogenerated from metadata information in TEI. 
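
As a rough illustration of generating a "title page" from header metadata, here is a minimal sketch. It assumes a namespace-free document with the usual `teiHeader/fileDesc/titleStmt` layout, uses placeholder title and author values, and emits plain `h1`/`h2` headings so the result stays readable with all CSS removed. The function name `title_page_xhtml` and the `titlepage` class are hypothetical; this is not the BookX or gnutenberg press code.

```python
import xml.etree.ElementTree as ET
from html import escape

# Placeholder document; title and author values are invented.
TEI = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Title</title>
        <author>A. N. Author</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body text.</p></body></text>
</TEI.2>"""


def title_page_xhtml(tei_source):
    """Build an XHTML title-page fragment from the header metadata."""
    root = ET.fromstring(tei_source)
    stmt = root.find("teiHeader/fileDesc/titleStmt")
    if stmt is None:
        raise ValueError("no titleStmt found in teiHeader")
    title = stmt.findtext("title", default="")
    authors = [a.text or "" for a in stmt.findall("author")]
    # Headings carry the structure, so the page degrades gracefully
    # when a device ignores the stylesheet.
    lines = ['<div class="titlepage">', f"  <h1>{escape(title)}</h1>"]
    lines += [f"  <h2>{escape(name)}</h2>" for name in authors]
    lines.append("</div>")
    return "\n".join(lines)


print(title_page_xhtml(TEI))
```

Because the script looks elements up by path, the order of the metadata elements inside the header does not matter; whether the generated page looks good is then a question for the stylesheet applied to the `titlepage` block.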
There are times when demonstration is needed to prove something, and times when it is not. This is one time it is not needed -- it can be done, it's just a matter whether or not this is the best way to do it. Jon Noring From walter.van.holst at xs4all.nl Mon Oct 8 12:58:31 2007 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 08 Oct 2007 21:58:31 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6d99d1fd0710081145y38e6fa9dq5024ea84f32b58f6@mail.gmail.com> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com> <294960705.20071008123035@noring.name> <6d99d1fd0710081145y38e6fa9dq5024ea84f32b58f6@mail.gmail.com> Message-ID: <470A8BE7.7090902@xs4all.nl> David Starner wrote: > Great. Then let's see it done. I'm tired of hearing about theory, Jon. > I wouldn't have brought it up if we were looking beautiful title > pages. But we've all heard huge promise made for things that never > worked out. I'm not going to work on TEI, nor will I encourage others > to, until it actually works right. I'd be willing to put effort in adding proper CSS formatting to TEI texts or even adding TEI mark-up if there was a (preferably web-based) proper tool for that. The current crop of XML editors I've tried so far were way too arcane for a casual user like me. Regards, Walter From joshua at hutchinson.net Mon Oct 8 13:05:04 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Mon, 8 Oct 2007 20:05:04 +0000 (UTC) Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? 
Message-ID: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> >----Original Message---- >From: jon at noring.name > >But do keep in mind the alternative to simply not encode the title page >as markup in the TEI "master" but move the information to the metadata >section. Then when the XHTML derivative is created move the information >from the metadata tags to the XHTML markup of the "title page". > No, that is the result that is causing the complaint (the title page macro just populates from the meta data). Granted I might be able to fix some of the complaints directly in the meta data (ie, Change "Edition 1" to something like "First PG Edition" or some such) but some of it needs to be handled at the level of manually controlling the title page. Josh From jon at noring.name Mon Oct 8 13:19:38 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 14:19:38 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470A8BE7.7090902@xs4all.nl> References: <32420778.1191687067584.JavaMail.?@fh1035.dia.cp.net> <1977299086.20071006124554@noring.name> <4709E693.9040900@xs4all.nl> <4709FC84.1090709@perathoner.de> <6d99d1fd0710081059s6b296329n9a5cf7245cb6585b@mail.gmail.com> <294960705.20071008123035@noring.name> <6d99d1fd0710081145y38e6fa9dq5024ea84f32b58f6@mail.gmail.com> <470A8BE7.7090902@xs4all.nl> Message-ID: <597714017.20071008141938@noring.name> Walter wrote: > David Starner wrote: >> Great. Then let's see it done. I'm tired of hearing about theory, Jon. >> I wouldn't have brought it up if we were looking beautiful title >> pages. But we've all heard huge promise made for things that never >> worked out. I'm not going to work on TEI, nor will I encourage others >> to, until it actually works right. > I'd be willing to put effort in adding proper CSS formatting to TEI > texts or even adding TEI mark-up if there was a (preferably web-based) > proper tool for that. 
The current crop of XML editors I've tried so far > were way too arcane for a casual user like me. I've always been intrigued with using CSS to directly view TEI documents in browsers, if for nothing else as a means of markup visualization for editing purposes. Of course, the problem with TEI is that it is not wholly compatible with CSS. For example, TEI may include an inline <note> element which itself can contain a whole document. In the absence of CSS, browsers will simply leave it inline and the main flow of text becomes jumbled (the web paradigm never included an inline <note> tag intended to be yanked out of the flow and presented elsewhere.) CSS can float such notes to the side, but that CSS only works in Opera and Firefox, not IE -- and it is "messy". Then there are differences between the TEI table model and the HTML-CSS table model. What is intriguing, though, and appears workable, is to NOT include markup in the body for a title page, but to use the metadata section at the front of TEI documents, combined with CSS, to make a "title page". I did this very thing for the BookX vocabulary. Here's a BookX example using silly (and not very pretty) CSS strictly for document visualization purposes -- unfortunately it only really works for Opera and Firefox -- IE barfs on the CSS (it might be possible to tweak the CSS to make it work in IE, but not sure): http://www.openreader.org/myantonia/BookX/myantonia-bookx.xml Look at the source of this XML document -- the metadata used for generating the "title page" is located in the <bookinfo> section. Note that the "title page" is created from the "metadata" using CSS. It might be possible to do likewise from the TEI metadata provided it is properly ordered (a restriction). With a script approach to generate XHTML from TEI, the metadata order is not important. 
Jon Noring From jon at noring.name Mon Oct 8 13:24:19 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 14:24:19 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> Message-ID: <589228901.20071008142419@noring.name> Josh wrote: > No, that is the result that is causing the complaint (the title page > macro just populates from the meta data). Granted I might be able to > fix some of the complaints directly in the meta data (ie, Change > "Edition 1" to something like "First PG Edition" or some such) but some > of it needs to be handled at the level of manually controlling the > title page. What kind of complaints? By taking the metadata, one should be able to do just about anything with it, including building standardized title pages. Can you give us one or two examples of TEI metadata that leads to title pages leading to complaints? Thanks. Jon Noring From marcello at perathoner.de Mon Oct 8 14:34:28 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 08 Oct 2007 23:34:28 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <20071008093652.GA24464@ark.in-berlin.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <20071008070835.GA27881@ark.in-berlin.de> <4709F708.20604@perathoner.de> <20071008093652.GA24464@ark.in-berlin.de> Message-ID: <470AA264.2070908@perathoner.de> Ralf Stephan wrote: > Marcello: >>> That example isn't representative because 'Chapitre' and 'Candide' >>> which are the only words in the running header don't have accented >>> characters. >> So what? "Accented characters" are no different from other ones. > > You don't even read those bug reports? A shame. Plan B it will be. 
>
> To repeat: Latin-1 characters in the <title>, even coded as HTML entities
> like ä, garble PDF output in the running header.

We were talking about title pages, which do "accented characters" quite well. Candide is an example of a formatted title page.

You are talking about a bug you reported in the page headers in the PDF output. I'm trying to fix it, but I have to work around some LaTeX internals. (TeX is not Unicode compatible. Version 0.5 will probably not use TeX any more if they don't implement full Unicode support by then.)

I can put in a quick workaround so the book title doesn't automatically get assigned to the PDF left page headers. But that will change existing PDFs.

--
Marcello Perathoner
webmaster at gutenberg.org

From marcello at perathoner.de Mon Oct 8 14:44:07 2007
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 08 Oct 2007 23:44:07 +0200
Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books?
In-Reply-To: <589228901.20071008142419@noring.name>
References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name>
Message-ID: <470AA4A7.4010908@perathoner.de>

Jon Noring wrote:
> By taking the metadata, one should be able to do just about anything
> with it, including building standardized title pages.

Bonjour. That's what we *are* doing.

> Can you give us one or two examples of TEI metadata that leads to
> title pages leading to complaints?

The complaint so far is that the title page doesn't use the font sizes and text alignment the plaintiff likes best.

Also there is confusion between the ebook title page, which contains PG metadata and PG edition and publication date, and the title page of the paper book. IMO they are different entities. You can provide one or both of them, but at the discretion of the PPer.
-- Marcello Perathoner webmaster at gutenberg.org From prosfilaes at gmail.com Mon Oct 8 15:23:27 2007 From: prosfilaes at gmail.com (David Starner) Date: Mon, 8 Oct 2007 18:23:27 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AA4A7.4010908@perathoner.de> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name> <470AA4A7.4010908@perathoner.de> Message-ID: <6d99d1fd0710081523l5776b6dcvb4080599b1cac94e@mail.gmail.com> On 10/8/07, Marcello Perathoner <marcello at perathoner.de> wrote: > The complaint so far is that the title page doesn't use the font sizes > and text alignment the plaintiff likes best. The complaint is that the title page doesn't look anything like any title page I have ever seen, be it paper book, web page, or hand-created ebook. I can pull books off my shelves, books printed in Japanese, in Russian, in Esperanto, in English, in Romanian, books printed in Japan, Brazil, England, the Soviet Union, the US, books printed in 2007, books printed in 1831, mass-market paperbacks, expensive math books. The title pages look pretty similar, and nothing like the one your macro generates. From lee at novomail.net Mon Oct 8 16:44:39 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 08 Oct 2007 17:44:39 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> Message-ID: <470AC0E7.1010208@novomail.net> joshua at hutchinson.net wrote: > Is one camp right and the other wrong? Sure ;-). The problem is deciding which is which. You see, the whole meaning of "right" and "wrong" requires a standard. In the case of morality, right and wrong are defined in relationship to the word of God (however you conceive of Her). 
In the case of legality, right and wrong are defined in relationship to the law of the land (English common law is an interesting hybrid that sort of straddles legality and morality). In the case of democracy, right and wrong are defined in relationship to the opinion of the majority. In the case of logical positivism right and wrong are defined in relationship to whatever provides the greatest good for the greatest number. In the case of textual markup, you can't decide which camp is "right" and which is "wrong" until you have determined what the standard is against which these competing philosophies are to be judged. > Is it necessary to have one camp or the other "win"? When you ask if one camp must "win" you imply that the other must "lose." As you pointed out TEI can serve both camps at least as well as any other solution, with the possible exception of page images for the WYSIWYG camp. (You suggest that camp 2 can accommodate camp 1, but that "it's just much more work." I disagree. It's only slightly more work.) Thus, if the "content" camp wins, the WYSIWYG camp wins too; there is no loser. On the other hand, purely presentational markup is patently unsuited for virtually any purpose other than use by a fully-functional human (the visually impaired, even the myopic, need not apply). Thus, if the WYSIWYG camp wins, the "content" camp loses. I don't think it's necessary to have one camp "win." Rather I think it's important that neither camp lose. > Can both be adequately served? Only by the adoption of structural/semantic markup. > Is it worth the effort to TRY to serve both camps? It depends on how much of an altruist you are. In psychology there is the concept of "projection." In the classic formulation, projection is a defense mechanism whereby one "projects" one's own undesirable thoughts, motivations, desires, and feelings onto someone else (http://en.wikipedia.org/wiki/Psychological_projection). 
I believe that projection extends beyond /undesirable/ thoughts and motivations, and usually includes /all/ thoughts, motivations, etc. This more inclusive formulation of projection can still be considered a defense mechanism due to the common, pervasive desire of humans to be considered part of the norm.

If I believe something as innocuous as "blue is a cool color" (in the temperature sense, not in the social acceptability sense) I will tend to believe that everyone else thinks blue is a cool color as well. The more controversial the belief, and the more threatening its denial is to our fundamental belief system, the stronger our hold on projection will be (thus the Freudian formulation of /undesirable/ thoughts).

If you express the opinion that blue is a hot color (again in the temperature sense, not in the social acceptability sense) I would tend to accept that you hold a differing opinion, as your opinion does not challenge my fundamental beliefs. On the other hand, if you were to express the opinion that most people believe that war is a good thing (in either the moral or the legal sense) I would probably reject that opinion, as I am strongly opposed to war and therefore believe that most people would share my belief (despite all the evidence to the contrary).

Likewise, those people who believe that posting unencrypted, unprotected e-books to the internet will cause those books to be widely pirated and shared probably believe this because in similar circumstances it is what they, themselves, would do.

Thus, Mr. Hart, who believes that degraded ASCII text is sufficient for his purposes, projects that belief onto others, and believes that it is sufficient for /everyone's/ purposes. He has expressed his opinion so frequently, and in so many forums, that it has been deeply incorporated into his fundamental belief system.
When presented with contrary opinions, even when such opinions seem to be in the majority, he will continue to believe that those people expressing the contrary opinions are the aberrations, not the norm; otherwise, his own belief system would be the aberration, and this is too great a challenge to his fundamental beliefs. Likewise, Bowerbird has invested a great deal of emotional capital in the creation and promotion of his proprietary Zen Markup Language. He has a strong, internalized belief that it is capable of completely representing everything required for the electronic version of a book. Thus, not only are those constructs it is not capable of representing by definition unimportant, the vast majority of people when exposed to ZML will recognize its superiority. The critics of ZML are the aberration, not Bowerbird. To put it in simple syllogistic terms: 1. All right-minded people will recognize that ZML is the best markup language possible (psychological projection). 2. Lee Passey does not recognize that ZML is a good markup language. 3. QED, Lee Passey is not in his right mind. A completely logical conclusion, but completely out of touch with reality. Those people who seem to believe that a certain presentation is "ugly" also seem to me to be those people who are most likely to project this belief on to others, as though questions of aesthetics could have any kind of universal standard, and are least likely to accept any presentation which is not the same as what they would do in similar circumstances. They are also less likely to see the value in someone else's preferences or needs (which are, obviously, aberrations). Thus, Mr. Starner needn't explain why the PG TEI-to-HTML XSL script generates a title page that "looks like crap," nor need he explain what needs to be done to fix it; virtually everyone in the world shares his artistic sensibilities, so all you should have to do is look at it to understand. 
Personally, I have spoken with many people (mostly programmers) about marking up text semantically or structurally instead of presentationally. My experience is that a certain percentage of them (probably less than half) "get it" almost immediately. No arguments or justifications are needed; you just say, "if we mark up the text according to the document's semantic, we can produce any presentation, and consume it like a database as well", and they reply "well, of course, that's the way it should be done." Those people who don't "get it" almost immediately rarely, if ever, "get it." They can't seem to imagine that anyone could use the e-book in any way other than how /they/ want to use it, or to see it in any other way than the way they want to see it. The reaction has its source in sub-conscious emotions and motivations, and no amount of logic can alter an emotional response. So, is it worth the effort to try to serve both camps? Is it even /possible/ to serve both camps? Because semantic markup is much more powerful than presentational markup, and because those who favor semantic markup tend to be more far-sighted and less ego-centric than those caught up in presentation, I believe it is possible, with little additional effort, for those in camp #2 to produce a product that will satisfy those in camp #1. But I do not believe that there is any way to create a work flow that would allow the adherents of camp #1 to produce a product that would be acceptable to those in camp #2; at least nothing that provides any significant advantage to just going to the scan set and starting over. So yes, I think it is worth the effort for those in camp #2 to try and satisfy the needs of camp #1, so long as they understand that there will be no reciprocity, and the "good" will have to co-exist in the same database with the "bad" and the "ugly." 
At the same time, I don't think there is any process possible that would permit those who are presentationally oriented to create something that the "content" camp will find useful, and I don't think it is worth the effort for those in camp #2 to try and persuade those in camp #1 that semantic markup is a better way. They just can't see it. As the saying goes, "Never try to teach a pig to sing -- it wastes your time and annoys the pig." From jon at noring.name Mon Oct 8 16:56:57 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 17:56:57 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AA4A7.4010908@perathoner.de> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name> <470AA4A7.4010908@perathoner.de> Message-ID: <1414923511.20071008175657@noring.name> Marcello wrote: > Jon Noring wrote: >> By taking the metadata, one should be able to do just about anything >> with it, including building standardized title pages. > Bonjour. That's what we *are* doing. >> Can you give us one or two examples of TEI metadata that leads to >> title pages leading to complaints? > The complaint so far is that the title page doesn't use the font sizes > and text alignment the plaintiff likes best. Hmmm, that doesn't seem to be related to the issue of drawing the title page information from the metadata. > Also there is confusion between the ebook title page, which contains PG > metadata and PG edition and publication date, and the title page of the > paper book. IMO they are different entites. You can provide one or both > of them, but at the discretion of the PPer. Hmmm, isn't there a way to provide both in the same TEI document? Maybe one way is to insert the PG-related metadata as some Dublin Core markup, and/or in some RDF, and embed that in a CDATA section. That info will thus be available to scripts, and will pass XML validation to the TEI DTD. 
Jon Noring From lee at novomail.net Mon Oct 8 18:23:39 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 08 Oct 2007 19:23:39 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <1414923511.20071008175657@noring.name> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name> <470AA4A7.4010908@perathoner.de> <1414923511.20071008175657@noring.name> Message-ID: <470AD81B.6060301@novomail.net> Jon Noring wrote: > Marcello wrote: > > > Jon Noring wrote: >> > >> By taking the metadata, one should be able to do just about > >> anything with it, including building standardized title pages. > > > Bonjour. That's what we *are* doing. > > >> Can you give us one or two examples of TEI metadata that leads to > >> title pages leading to complaints? > > > The complaint so far is that the title page doesn't use the font > > sizes and text alignment the plaintiff likes best. > > Hmmm, that doesn't seem to be related to the issue of drawing the > title page information from the metadata. But it is. The <teiHeader> element is used for all the metadata about a book. You should never see /any/ of the data from the <teiHeader> when displaying the book, which is why in my CSS for TEI I have marked the <teiHeader> as "display:none". If you want a displayable title page in your e-book (particularly if you want to control its presentation), one way is to create a <titlePage> element in the <front> and construct your title page there. This is not the common way it is done in PGTEI, however. In TEI there is an element called <divGen>. In essence, it is a function call instruction; it says, "at this point in the text, generate this 'type' of a <div> element." 
According to the draft P5 specification, "This element is intended primarily for use in document production or manipulation, rather than in the transcription of pre-existing materials; it makes it easier to specify the location of indices, tables of contents, etc., to be generated by text preparation or word processing software."

Most PGTEI texts do not contain a transcribed <titlePage>; instead they contain a <divGen type="titlepage">, which the PG XSL script interprets as an instruction to generate a standard title page from pieces of the <teiHeader> data.

If you haven't realized it yet, PGTEI is inextricably linked to the PG XSL conversion scripts. If you're not happy with the output of those scripts, there is no purpose in using the PGTEI extensions to TEI.

Thus, what Mr. Starner is complaining about is /not/ TEI, or the way TEI has been used to encode a particular e-book. He's complaining about the way Mr. Perathoner's script generates the title page from the given metadata.

> > Also there is confusion between the ebook title page, which
> > contains PG metadata and PG edition and publication date, and the
> > title page of the paper book. IMO they are different entities. You
> > can provide one or both of them, but at the discretion of the PPer.
> >
>
> Hmmm, isn't there a way to provide both in the same TEI document?

Sure. Add complete metadata to the <teiHeader> and create a title page using <titlePage>. If you're displaying the document using CSS, nothing in the <teiHeader> will be displayed (presumably, if you're using a good style sheet), nor will the <divGen> element (although its contents, if any, will be, and the only content allowed in a <divGen> is a <head>).
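
With metadata in the <teiHeader>, an optional transcribed <titlePage> in the <front>, and a <divGen type="titlepage"> placeholder, a conversion script has to pick one source for the title page it outputs. The sketch below shows one possible policy, preferring a transcribed page and generating one only when a placeholder is all there is. The function name `title_page_source` and its return values are hypothetical, the element paths assume a namespace-free document, and this is not how the PG XSL scripts are known to behave.

```python
import xml.etree.ElementTree as ET


def title_page_source(tei_root):
    """Say where a converter should take the title page from.

    Returns 'transcribed', 'generated', or 'none'.
    """
    front = tei_root.find("text/front")
    if front is None:
        return "none"
    # Assumed policy: a transcribed <titlePage> wins over generation.
    if front.find("titlePage") is not None:
        return "transcribed"
    for div_gen in front.iter("divGen"):
        if div_gen.get("type") == "titlepage":
            return "generated"
    return "none"


BOTH = ET.fromstring(
    '<TEI.2><teiHeader/><text><front>'
    '<divGen type="titlepage"/><titlePage/>'
    '</front><body/></text></TEI.2>'
)
ONLY_DIVGEN = ET.fromstring(
    '<TEI.2><teiHeader/><text><front>'
    '<divGen type="titlepage"/>'
    '</front><body/></text></TEI.2>'
)
NEITHER = ET.fromstring('<TEI.2><teiHeader/><text><body/></text></TEI.2>')

print(title_page_source(BOTH))
print(title_page_source(ONLY_DIVGEN))
print(title_page_source(NEITHER))
```

Swapping the two checks gives the opposite policy, where a placeholder suppresses the transcribed page; either way the choice lives in the script and not in the encoded text.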
Then, when you write a script to do some other transformation (which is about the only way you can get anything useful out of a <divGen> element), you add a conditional to suppress the <titlePage> element if a <divGen type="titlepage"> is present, or alternatively to only generate a title page if a <titlePage> is /not/ present. You know, in XML order of child elements is (typically) not important. Just to make things a little more forgiving for those who /don't/ have a good CSS style sheet for TEI, I think the standard ought to be to put the <teiHeader> /after/ the <text> element instead of before. From prosfilaes at gmail.com Mon Oct 8 20:10:07 2007 From: prosfilaes at gmail.com (David Starner) Date: Mon, 8 Oct 2007 23:10:07 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AC0E7.1010208@novomail.net> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <470AC0E7.1010208@novomail.net> Message-ID: <6d99d1fd0710082010o31dd45e6xf48306ba89e4a9ea@mail.gmail.com> On 10/8/07, Lee Passey <lee at novomail.net> wrote: > Thus, > Mr. Starner needn't explain why the PG TEI-to-HTML XSL script generates > a title page that "looks like crap," nor need he explain what needs to > be done to fix it; virtually everyone in the world shares his artistic > sensibilities, so all you should have to do is look at it to understand. Or, rather, Lee Passey doesn't need to read Mr. Starner's messages, the one's where he points out that the format of title pages is virtually unanimous across time and space, and where Mr. Starner complains that titles are usually larger than anything else in the book, that the author is correspondingly large, and that the whole bulk of material is generally centered. > those who favor semantic markup tend to be more > far-sighted and less ego-centric than those caught up in presentation, I.E. "people who agree with me tend to be morally superior to those who don't." 
Besides the fact that that is the most questionable type of conclusion to be drawn, given that it's completely self-serving and blinding to any positive aspects of the other side, most people feel it's a bit of a personal attack to be called short-sighted and ego-centric and don't really want to work with you when you toss that type of phrasing around. From jon at noring.name Mon Oct 8 21:56:55 2007 From: jon at noring.name (Jon Noring) Date: Mon, 8 Oct 2007 22:56:55 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AD81B.6060301@novomail.net> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name> <470AA4A7.4010908@perathoner.de> <1414923511.20071008175657@noring.name> <470AD81B.6060301@novomail.net> Message-ID: <1649568367.20071008225655@noring.name> Lee Passey wrote: > Jon Noring wrote: >> Marcello wrote: >>> Jon Noring wrote: >>>> Can you give us one or two examples of TEI metadata that leads to >>>> title pages leading to complaints? >>> The complaint so far is that the title page doesn't use the font >>> sizes and text alignment the plaintiff likes best. >> Hmmm, that doesn't seem to be related to the issue of drawing the >> title page information from the metadata. > But it is. Er, ok. But before commenting further, I just had to keep all the prior comments in this thread to get five levels of comments. <lol/> > The <teiHeader> element is used for all the metadata about a book. You > should never see /any/ of the data from the <teiHeader> when displaying > the book, which is why in my CSS for TEI I have marked the <teiHeader> > as "display:none". If you want a displayable title page in your e-book > (particularly if you want to control its presentation), one way is to > create a <titlePage> element in the <front> and construct your title > page there. This is not the common way it is done in PGTEI, however. O.k. 
If the purpose of the TEI document is to actually be read by end-users in a web browser with no "plugin", then I agree with Lee that the title page must be built into the document in the <titlePage>. (Of course, if the TEI document is to be used in this end-user manner, then other restrictions probably have to also be established, such as no inline <note>, limitations on the TEI table elements used, and possibly a couple others to make the TEI as XHTML/CSS-compatible as possible. And note the problems with embedding images and enabling hypertext links, too.) But if the purpose of the TEI document is solely as a "master" for script conversion to readable formats, then my current thinking is that <titlePage> is redundant, and derivative formats would use the data in the <teiHeader> to build optimized title pages for the target platform, as Marcello says he does. And certainly for CSS "visualization" of the TEI master during the document authoring process, the metadata in <teiHeader> may certainly be displayed -- unless God Almighty herself disallows it, it is sort of arbitrary when it comes to visualization during the document editing process. > In TEI there is an element called <divGen>. In essence, it is a function > call instruction; it says, "at this point in the text, generate this > 'type' of a <div> element." According to the draft P5 specification, > "This element is intended primarily for use in document production or > manipulation, rather than in the transcription of pre-existing > materials; it makes it easier to specify the location of indices, tables > of contents, etc., to be generated by text preparation or word > processing software." Most PGTEI texts do not contain a transcribed > <titlePage>, instead they contain a <divGen type="titlepage"> which the > PG XSL script interprets as an instruction to generate a standard title > page from pieces of the <teiHeader> data. 
It still seems to me that for using TEI solely as a master, <divGen> is not needed for generating title pages and navigational lists, since the best "locations" for these are heavily platform/format dependent. I'm still intrigued by using Digital Talking Book's NCX for describing the navigational lists of each work, but I can see the alternative where the necessary nav-list information is encoded at each target point location, and optimized nav-lists built for each target platform using that information. (For OPS Publications, an NCX would thus be built -- note that specifying hierarchical level is important in NCX Table of Contents, the reasons for which I won't go into here.) > Thus, what Mr. Starner is complaining about is /not/ TEI, or the way TEI > has been used to encode a particular e-book. He's complaining about the > way Mr. Perathoner's script generates the title page from the given > metadata. Hmmm, are you saying the complaints are because: 1) the title page is generated from the metadata. Period. Or 2) the title page markup generated by Marcello's script is somehow not right? If the first, then I'd like to know why. If the second, what is the fix to Marcello's script to make "better" title pages? >>> Also there is confusion between the ebook title page, which >>> contains PG metadata and PG edition and publication date, and the >>> title page of the paper book. IMO they are different entites. You >>> can provide one or both of them, but at the discretion of the PPer. >> Hmmm, isn't there a way to provide both in the same TEI document? > Sure. Add complete metadata to the <teiHeader> and create a title page > using <titlePage>. If you're displaying the document using CSS, nothing > in the <teiHeader> will be displayed (presumably, if you're using a good > style sheet), nor will the <divGen> element (although it's contents, if > any, will be - although the only contents allowed in a <divGen> is a > <head>). 
Then, when you write a script to do some other transformation > (which is about the only way you can get anything useful out of a > <divGen> element), you add a conditional to suppress the <titlePage> > element if a <divGen type="titlepage"> is present, or alternatively to > only generate a title page if a <titlePage> is /not/ present. O.k. I had assumed from the prior comment that the metadata in the <teiHeader> cannot easily simultaneously contain: 1) Work/Expression metadata, 2) Source book (Manifestation) metadata, and 3) PG-related metadata. But I assume by your comment, Lee, that indeed all three types of metadata can coexist in <teiHeader> and be unambiguously identified for what they are by scripts. I've not yet closely studied the metadata facility in TEI. Btw, I *love* the OEBPS/OPS method for identifying the role played by both creators and contributors of a given work. This provides very useful information when generating title pages from the metadata since we now know what role each creator played in producing the book (e.g., we can differentiate between author, illustrator, translator, etc.) Can the OEBPS/OPS "role" be implemented in TEI? > You know, in XML order of child elements is (typically) not important. It depends upon the content model in the DTD. For example, if wanted, one could build a TEI-subset DTD where the order of sibling elements is important. There are times when flexibility of order is important, but there are other times when fixing order is important. It is a case-by-case sort of thing so as to meet specific requirements. > Just to make things a little more forgiving for those who /don't/ have a > good CSS style sheet for TEI, I think the standard ought to be to put > the <teiHeader> /after/ the <text> element instead of before. I assume in the full-blown TEI DTD that <teiHeader> may appear after <text>? Thanks, Lee, for clarifying several things. 
Jon From ralf at ark.in-berlin.de Tue Oct 9 01:09:50 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Tue, 9 Oct 2007 10:09:50 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AA264.2070908@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <20071008070835.GA27881@ark.in-berlin.de> <4709F708.20604@perathoner.de> <20071008093652.GA24464@ark.in-berlin.de> <470AA264.2070908@perathoner.de> Message-ID: <20071009080950.GC27456@ark.in-berlin.de> You wrote > internals. (TeX is not unicode compatible. Version 0.5 will probably not > use TeX any more if they don't implement full unicode support by then.) pango/cairo has a Unicode PS backend which should be mature enough. ralf From ralf at ark.in-berlin.de Tue Oct 9 00:55:31 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Tue, 9 Oct 2007 09:55:31 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AA264.2070908@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <20071008070835.GA27881@ark.in-berlin.de> <4709F708.20604@perathoner.de> <20071008093652.GA24464@ark.in-berlin.de> <470AA264.2070908@perathoner.de> Message-ID: <20071009075531.GA27456@ark.in-berlin.de> > You are talking about a bug you reported in the page headers in the PDF > output. I'm trying to fix it, but I have to work around some LaTeX > internals. (TeX is not unicode compatible. Version 0.5 will probably not > use TeX any more if they don't implement full unicode support by then.) > > I can put in a quick work around so the book title doesn't automatically > get assigned to the PDF left page headers. But that will change existing > PDFs. Your decision. 
It would remove garbled running headers like in http://www.gutenberg.org/files/19239/19239-pdf.zip ralf From marcello at perathoner.de Tue Oct 9 03:11:05 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 09 Oct 2007 12:11:05 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470AC0E7.1010208@novomail.net> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <470AC0E7.1010208@novomail.net> Message-ID: <470B53B9.9050804@perathoner.de> Lee Passey wrote: > Because semantic markup is much more powerful than presentational > markup, and because those who favor semantic markup tend to be more > far-sighted and less ego-centric than those caught up in presentation, I > believe it is possible, with little additional effort, for those in camp > #2 to produce a product that will satisfy those in camp #1. If the "semantic camp" has done a book in TEI, the "presentational camp" can easily augment the markup to make it "look" much like the original paper copy. In TEI all presentational markup is confined to the "rend" attribute, which can be attached to any element without changing the semantic structure of the text. I suggest introducing "semantic" and "presentational" rounds at DP. They are nearly orthogonal and it should very seldom be necessary to change the tag structure to accommodate presentation-oriented refinements. -- Marcello Perathoner webmaster at gutenberg.org From marcello at perathoner.de Tue Oct 9 03:20:16 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 09 Oct 2007 12:20:16 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? 
In-Reply-To: <470AD81B.6060301@novomail.net> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name> <470AA4A7.4010908@perathoner.de> <1414923511.20071008175657@noring.name> <470AD81B.6060301@novomail.net> Message-ID: <470B55E0.3010409@perathoner.de> Lee Passey wrote: > You know, in XML order of child elements is (typically) not important. > Just to make things a little more forgiving for those who /don't/ have a > good CSS style sheet for TEI, I think the standard ought to be to put > the <teiHeader> /after/ the <text> element instead of before. The TEI DTD prescribes that the <teiHeader> must come before the <text>. There's nothing we can do about that. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Tue Oct 9 07:08:19 2007 From: jon at noring.name (Jon Noring) Date: Tue, 9 Oct 2007 08:08:19 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <470B53B9.9050804@perathoner.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <470AC0E7.1010208@novomail.net> <470B53B9.9050804@perathoner.de> Message-ID: <727840511.20071009080819@noring.name> Marcello wrote: > Lee Passey wrote: >> Because semantic markup is much more powerful than presentational >> markup, and because those who favor semantic markup tend to be more >> far-sighted and less ego-centric than those caught up in presentation, I >> believe it is possible, with little additional effort, for those in camp >> #2 to produce a product that will satisfy those in camp #1. > I suggest introducing "semantic" and "presentational" rounds at DP. They > are nearly orthogonal and it should be very seldom necessary to change > the tag structure to accomodate presentational-oriented refinements. This is similar to my suggestion that DP "officially" separate when the time is right into the "mastering" and the "facsimile" groups. 
Jon Noring From lee at novomail.net Tue Oct 9 11:40:22 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 09 Oct 2007 12:40:22 -0600 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <1649568367.20071008225655@noring.name> References: <2835006.1191873904270.JavaMail.?@fh1036.dia.cp.net> <589228901.20071008142419@noring.name> <470AA4A7.4010908@perathoner.de> <1414923511.20071008175657@noring.name> <470AD81B.6060301@novomail.net> <1649568367.20071008225655@noring.name> Message-ID: <470BCB16.3010306@novomail.net> Jon Noring wrote: [snip] > It still seems to me that for using TEI solely as a master, <divGen> > is not needed for generating title pages and navigational lists, since > the best "locations" for these are heavily platform/format dependent. I tend to agree. If one is generating a title page (or table of contents, or list of whatever) via a script, the script is probably in the best position to decide where it should go. Having a command to "generate a title page, and put it here" embedded in the file is probably unnecessary. I suspect it's mostly a result of the script author encountering the <divGen> element and saying, "this is cool, let's see if we can make it work." The <divGen> element is, however, interesting evidence of TEI's somewhat schizophrenic nature. This schizophrenia is a result of the fact that TEI can be used to transcribe existing works, as well as to encode new, never before published works. As mentioned earlier, the P5 draft specification states that "[the divGen] element is intended primarily for use in document production or manipulation, rather than in the transcription of pre-existing materials; it makes it easier to specify the location of indices, tables of contents, etc., to be generated by text preparation or word processing software." 
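To make the "function call" reading of <divGen> concrete, here is a minimal Python sketch of the conditional described earlier in this thread: generate a title page when a <divGen type="titlepage"> is present (suppressing any transcribed <titlePage>), and otherwise fall back to the transcribed one. It assumes un-namespaced TEI markup. The function name title_page_action, its return labels and the sample document are all invented for illustration; this is not how the PG XSL scripts are actually written.

```python
import xml.etree.ElementTree as ET


def title_page_action(tei_root):
    """Decide how a converter might produce the title page.

    Returns "generate" when a <divGen type="titlepage"> asks for a
    generated page, "transcribed" when only a <titlePage> is present,
    and "none" when the document carries neither.
    """
    has_divgen = any(
        el.get("type") == "titlepage" for el in tei_root.iter("divGen")
    )
    has_transcribed = tei_root.find(".//titlePage") is not None
    if has_divgen:
        # The generated page wins; a transcribed <titlePage> is suppressed.
        return "generate"
    if has_transcribed:
        return "transcribed"
    return "none"


# Hypothetical sample document, made up for this sketch.
SAMPLE = """
<TEI.2>
  <teiHeader><fileDesc><titleStmt>
    <title>An Example Book</title>
  </titleStmt></fileDesc></teiHeader>
  <text>
    <front>
      <divGen type="titlepage"/>
      <titlePage><docTitle><titlePart>An Example Book</titlePart></docTitle></titlePage>
    </front>
    <body><p>Text.</p></body>
  </text>
</TEI.2>
"""

root = ET.fromstring(SAMPLE)
print(title_page_action(root))
```

Swapping the order of the two tests gives the other alternative mentioned in the thread, where a title page is generated only if no <titlePage> was transcribed.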
It appears to me that use of the <divGen> element in PG texts is probably inappropriate; if I'm transcribing an existing work I would probably want to include a <titlePage> element in the <front> section, placed where the title page occurred in the existing work, and containing all the data included on that existing title page, and nothing more. After all, it's a transcription. On the other hand, if I'm writing "TEI for Dummies," it might make sense to use the <divGen> (only inside a <front> element) to say, "generate a title page for me here following corporate standards." In this case, it's part of a document production work flow, not a transcription. This same schizophrenia frequently manifests itself in the use of the "rend" attribute. In the case of my "For Dummies" book, if I use the element <hi rend="italic"> I mean "I don't know why, but when this phrase gets rendered, it should be rendered in an italic font." On the other hand, if I use the same element in a transcription of an existing book I mean "I can't figure out why this phrase was rendered in an italic font, but it was." Hopefully you can see the distinction; in one case it indicates how it was done in the past, and in the other it indicates how it should be done in the future. The two cases are not necessarily equivalent. Personally, for PG's purposes I would think the focus should be on transcription (how it was) and not document production (how it should be). [snip] > Hmmm, are you saying the complaints are because: > > 1) the title page is generated from the metadata. Period. Or > > 2) the title page markup generated by Marcello's script is somehow not > right? > > If the first, then I'd like to know why. If the second, what is the > fix to Marcello's script to make "better" title pages? The second. Although I wouldn't say that the complaint is that the title page markup is not right, but rather that it produces a result which is aesthetically unpleasing to the vast majority of consumers (i.e. 
me). XSL is a fairly complex and esoteric scripting language. I've glanced at Mr. Perathoner's conversion script, but nowhere near closely enough to know where the code is that generates the title page. But if one knows XSL programming, one could certainly take the script and modify it to generate the kind of title page one prefers. If someone wanted Mr. Perathoner to do the work for him, he would probably have to generate sample title pages in HTML and PDF, present them to Mr. Perathoner, and then convince him that the samples are superior to the current output and that he should modify the scripts to generate output similar to the sample markup. [snip] > O.k. I had assumed from the prior comment that the metadata in the > <teiHeader> cannot easily simultaneously contain: > > 1) Work/Expression metadata, > > 2) Source book (Manifestation) metadata, and > > 3) PG-related metadata. > > But I assume by your comment, Lee, that indeed all three types of > metadata can coexist in <teiHeader> and be unambiguously identified > for what they are by scripts. I've not yet closely studied the > metadata facility in TEI. Yep. In fact, I think that TEI support is probably much superior to Dublin Core. A <teiHeader> can contain a <fileDesc> which "provides a title and statements of responsibility together with details of the publication or distribution of the file, of any series to which it belongs, and detailed bibliographic notes for matters not addressed elsewhere in the header. It also contains a full bibliographic description for the source or sources from which the electronic text was derived." It also can contain a <encodingDesc> element which "documents the relationship between an electronic text and the source or sources from which it was derived." All in all, the <teiHeader> information is very powerful. 
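As a rough illustration of how a script could get at that information, the following Python sketch pulls the title, the author and the statements of responsibility out of a <fileDesc>. It assumes an un-namespaced header. The sample data and the function name header_fields are invented for the example, and the sketch deliberately reads only <titleStmt>, with no fallback to other parts of the header.

```python
import xml.etree.ElementTree as ET

# Hypothetical header, made up for this sketch.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Book</title>
      <author>Jane Doe</author>
      <respStmt><resp>Illustrator</resp><name>John Roe</name></respStmt>
    </titleStmt>
    <publicationStmt><publisher>Project Gutenberg</publisher></publicationStmt>
    <sourceDesc><bibl>Transcribed from a hypothetical 1900 edition.</bibl></sourceDesc>
  </fileDesc>
</teiHeader>
"""


def header_fields(header):
    """Collect title-page fields from <fileDesc>/<titleStmt> only."""
    stmt = header.find("fileDesc/titleStmt")
    if stmt is None:
        return {"title": None, "author": None, "roles": []}
    # Each <respStmt> pairs a responsibility (<resp>) with a <name>.
    roles = [
        (r.findtext("resp", default=""), r.findtext("name", default=""))
        for r in stmt.findall("respStmt")
    ]
    return {
        "title": stmt.findtext("title"),    # None if the element is absent
        "author": stmt.findtext("author"),  # None if the element is absent
        "roles": roles,
    }


fields = header_fields(ET.fromstring(HEADER))
print(fields["title"], "by", fields["author"])
for resp, name in fields["roles"]:
    print(resp + ":", name)
```

A field that is missing from <titleStmt> simply comes back as None here, so a title page built from this dictionary would be only as complete as those particular elements.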
On the other hand, IIRC the <divGen> process gets its information from certain fields in the <teiHeader> with no fallback to other fields if the preferred fields are absent. Thus, you may cram all sorts of good data into the <teiHeader> and still end up with a sparsely generated title page. It would probably be a good thing if we had some documentation that says, in essence, "the PG XSL script generates a title page from information found in these elements; if you want to have a complete title page you must include data in those elements." [snip] > Can the OEBPS/OPS "role" be implemented in TEI? Yes. See <respStmt> (http://www.tei-c.org/release/doc/tei-p5-doc/html/CO.html#COBICOR). >> You know, in XML order of child elements is (typically) not important. > > It depends upon the content model in the DTD. For example, if wanted, > one could build a TEI-subset DTD where order is important of sibling > elements. > > There are times when flexibility of order are important, but there are > other times when fixing order is important. It is a case-by-case sort > of thing so as to meet specific requirements. True, which is why I added the qualifier "typically." I have rarely encountered a DTD where element order /is/ important, but they do exist, and, in fact, the TEI DTD is one of them (the notion that you can't have a <div> after a <p> still boggles my mind). >> Just to make things a little more forgiving for those who /don't/ have a >> good CSS style sheet for TEI, I think the standard ought to be to put >> the <teiHeader> /after/ the <text> element instead of before. > > I assume in the full blow TEI DTD that <teiHeader> may appear after > <text>? Relying on Mr. Perathoner's message, apparently not, although I cannot imagine any reason why it shouldn't. It's probably an oversight in the creation of the TEI DTD, and someone ought to suggest to the TEI-C that the restriction ought to be removed. -- Nothing of significance below this line. 
From Bowerbird at aol.com Tue Oct 9 14:49:58 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 9 Oct 2007 17:49:58 EDT Subject: [gutvol-d] speaking of title-pages Message-ID: <d6a.e9c34e9.343d5186@aol.com> speaking of title-pages... i'm finding that editing the title-pages is one of the more interesting aspects of the conversion from pg-ascii to z.m.l. format... it's fun to make 'em look nice. anyway, i've produced an offline application to grab the first chunk of text from an e-text so that i can do the edits and previews offline, and then the app sends it back up to the site... if anyone wants to try it out to see if they would also enjoy doing this -- only if you will enjoy it, i'm not looking for "volunteers" to do "work" -- backchannel me and i will send you the program and some general instructions how to do the edits. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071009/4f34b1fd/attachment.htm From jon at noring.name Tue Oct 9 17:06:27 2007 From: jon at noring.name (Jon Noring) Date: Tue, 9 Oct 2007 18:06:27 -0600 Subject: [gutvol-d] speaking of title-pages In-Reply-To: <d6a.e9c34e9.343d5186@aol.com> References: <d6a.e9c34e9.343d5186@aol.com> Message-ID: <926033296.20071009180627@noring.name> Bowerbird wrote: > i'm finding that editing the title-pages is > one of the more interesting aspects of the > conversion from pg-ascii to z.m.l. format... > > it's fun to make 'em look nice. Yes, I agree. The title page is something where the publisher can do some very creative things for both branding and improving the quality of the reading experience. One suggestion to PG and DP is to consider coming up with a "standard" title page template, which can be quite elaborate and/or ornate. All that's needed is to fill the fields with the title page information. 
This would be used for XHTML versions of the texts, and other higher typographic resolution formats like PDF. Even SVG and raster image versions could be built. The "master" for the title page could be SVG, and that's of real interest since a lot can be done with that (more than XHTML+CSS), such as incorporation into PDF, conversion to raster graphics, and direct rendering on SVG-aware browsers. The folk at DP have looked at enough elaborate and ornate title pages that someone there with an artistic flair could come up with a pretty good title page template based on either one particular book, or a "composite" from a number of title pages. Jon Noring From marcello at perathoner.de Wed Oct 10 03:53:21 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 10 Oct 2007 12:53:21 +0200 Subject: [gutvol-d] speaking of title-pages In-Reply-To: <926033296.20071009180627@noring.name> References: <d6a.e9c34e9.343d5186@aol.com> <926033296.20071009180627@noring.name> Message-ID: <470CAF21.5090709@perathoner.de> Jon Noring wrote: > Yes, I agree. The title page is something where the publisher can do > some very creative things for both branding and improving the quality > of the reading experience. Usually the "creative" branding is done on the book cover and the actual title page is quite dull. The title page, being somewhere in no man's land between four-color front cover and where the jumble of words actually starts, is probably the most never-looked-at page in the whole book (bar the empty ones). Contents, yes. People do sometimes look at the contents page. Index also. People sometimes use the index. But title page? What title page? I bet 9 people out of 10, if you give them a book, can't even tell you where the title page *is*. Until recently the book cover was not provided by the book publisher at all, but by the book binder, and was selected to match the cover of all the other books in your library. 
We must also consider here that cover art may have a different copyright status than the book contents, especially in life + x countries. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Wed Oct 10 07:29:26 2007 From: jon at noring.name (Jon Noring) Date: Wed, 10 Oct 2007 08:29:26 -0600 Subject: [gutvol-d] speaking of title-pages In-Reply-To: <470CAF21.5090709@perathoner.de> References: <d6a.e9c34e9.343d5186@aol.com> <926033296.20071009180627@noring.name> <470CAF21.5090709@perathoner.de> Message-ID: <832724833.20071010082926@noring.name> Marcello wrote: > Jon Noring wrote: >> Yes, I agree. The title page is something where the publisher can do >> some very creative things for both branding and improving the quality >> of the reading experience. > Usually the "creative" branding is done on the book cover and the actual > title page is quite dull. For paper books, yes, most title pages are rather dull. > The title page, being somewhere in no man's land between four-color > front cover and where the jumble of words actually starts, is probably > the most never-looked-at page in the whole book (bar the empty ones). For digital versions of public domain books, we no longer have the "luxury" of having a binding, front cover, etc. So all that's really left in the lead-up to the actual textual content is the title page. (Some public domain books did have something like a front cover graphic, but most did not. I somehow doubt PG will create new front covers for every etext that didn't have one in the original paper edition.) Since Greg Newby has stated many times that he considers PG a publisher with its own brand name ("Project Gutenberg"), the PG and DP folk might consider coming up with a flashy title page design that may be used for those digital rendition formats that don't already include their own built-in title page mechanism. SVG is an intriguing "mastering" format for such title pages. 
And certainly the title page may use color -- RGB digital ink costs the same as digital black ink. <lol/> Jon Noring From ralf at ark.in-berlin.de Wed Oct 10 02:45:16 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Wed, 10 Oct 2007 11:45:16 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <20071009075531.GA27456@ark.in-berlin.de> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <6d99d1fd0710051746r5cf61006v9e83091543fcb830@mail.gmail.com> <4707A109.5070000@perathoner.de> <20071008070835.GA27881@ark.in-berlin.de> <4709F708.20604@perathoner.de> <20071008093652.GA24464@ark.in-berlin.de> <470AA264.2070908@perathoner.de> <20071009075531.GA27456@ark.in-berlin.de> Message-ID: <20071010094516.GA29264@ark.in-berlin.de> > > I can put in a quick work around so the book title doesn't automatically > > get assigned to the PDF left page headers. But that will change existing > > PDFs. > > Your decision. It would remove garbled running headers like in > http://www.gutenberg.org/files/19239/19239-pdf.zip Also, it would work around the problem of overly long titles garbling running headers as for example in http://www.gutenberg.org/files/22492/22492-pdf.zip So I'm all for it, at the moment. ralf From Bowerbird at aol.com Wed Oct 10 11:07:17 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 10 Oct 2007 14:07:17 EDT Subject: [gutvol-d] speaking of title-pages Message-ID: <bc8.1b1aecbe.343e6ed5@aol.com> people want to know what i mean about the title-pages... here's a very good example, with a full explanation below it: > http://snowy.arsc.alaska.edu/bowerbird/misc/screen1.html (wow, that work was from june of 2006! seems like yesterday.) the idea is to edit your title-page -text so the zml-viewer knows how to copy-fit it to the screen in the same manner that a relatively decent typographer would set it for a page. 
(the main difference being that the zml-viewer must do this for a _wide_ variety of screen-sizes, which is why i described the task as "copy-fitting", a term that -- though it fits well -- is probably rarely applied to the setting of a print title-page.) here's another one: > http://www.z-m-l.com/go/rieger/oya-cover.html the type-size on the title of that one seems big to me, but... and here's some more: > http://www.z-m-l.com/go/mabie/mabiec001.html > http://www.z-m-l.com/go/mabie/mabiec002.html those last two are reworked versions of the page-scans -- > http://snowy.arsc.alaska.edu/bowerbird/bachwm/bachwmp001.png > http://snowy.arsc.alaska.edu/bowerbird/bachwm/bachwmp004.png > http://snowy.arsc.alaska.edu/bowerbird/bachwm/bachwmp005.png -- but they're a relatively good example of what i aim for. the psychedelia makes it fairly clear this is a _new_ page... the "title-page" in a z.m.l. file is a cross between a "cover" in the traditional sense and a typical p-book "title-page"... it is the first thing that people see, and -- as it says in the "zandbox" manual -- the first text on it must be the title: > http://z-m-l.com/go/zandbox_manual.zml it's a reaction against the p.g. header, which -- to my mind -- fails its main mission, i.e., to inform readers what this file _is_. because i feel free to discard the publisher information that is typically found at the bottom of the title-page, i can often use the cover as the "official" page-scan for my title-page: > http://z-m-l.com/go/myant/myantc001.html > http://snowy.arsc.alaska.edu/bowerbird/betsy/betsy001.jpg > http://snowy.arsc.alaska.edu/bowerbird/sgfhb/sgfhbc001.jpg but i'll include relevant text from the title-page (e.g., on illustrators): > http://www.z-m-l.com/go/myant/myantf003.png > http://snowy.arsc.alaska.edu/bowerbird/betsy/betsy003.png more than just title-pages, the task is on all frontmatter pages. 
section 2 in a .zml file is reserved for the hotlinked table of contents: > http://www.z-m-l.com/go/mabie/mabiec002.html for an example of how i edit the table of contents page, compare this text as it appeared originally in the p.g. e-text: > http://snowy.arsc.alaska.edu/bowerbird/misc/screen2before.jpg with the edited version that would appear in the .zml file: > http://snowy.arsc.alaska.edu/bowerbird/misc/screen2after.jpg which will then be displayed in the zml-viewer like this: > http://snowy.arsc.alaska.edu/bowerbird/misc/screen2.jpg where, of course, the entries will be hotlinked to their chapters. (the look of this page compares well with the original, and also corresponds to the form it takes in the .html versions from d.p.) as this last series demonstrates, i feel very little compulsion to "match the original". people can look at the page-scan for that. what i'm doing is creating an "in-house style" such that _all_ of the books in my library have a consistent look-and-feel to them. if the original book had a frontispiece, that would go in section 3. > http://snowy.arsc.alaska.edu/bowerbird/betsy/betsy002.jpg plus, to preserve the flavor of the p-book's original facing-spread if a book had a frontispiece, i'll include the title-page as section 4. (sections 3 and 4 might "be" pages 1 and 2, i.e., p001 and p002, in the case that sections 1 and 2 were named as c001 and c002. the renaming/renumbering of frontmatter pages is case-by-case, depending on frontmatter idiosyncrasies of the original p-book.) it goes on for other frontmatter pages. a dedication page has its text centered and displayed in the upper third of the screen. at one time, i even did the programming to "properly" format the typical p-book verso with the library of congress cataloging info. (if memory serves correctly, i did that for lawrence lessig's book.) anyway... to boil it all down again, it's just a matter of proper copy-fitting... -bowerbird p.s. 
it's kind of cute that, in the recent pg-ascii files out of d.p., all of the title-page information is actually "centered" in the text. that is, they've used leading spaces to create a "centered" look. (of course, when you have a 23-inch cinema-screen like i do, and the window is expanded to the size of the screen, the text isn't really "centered" at all. but that's kind of beside the point.) i was wondering why i was getting funny results, when i had my viewer center the text, and discovered it was the leading spaces. no problem, they're easy enough to have the ap get rid of 'em... ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071010/f9086d36/attachment.htm From jon at noring.name Wed Oct 10 12:23:34 2007 From: jon at noring.name (Jon Noring) Date: Wed, 10 Oct 2007 13:23:34 -0600 Subject: [gutvol-d] speaking of title-pages In-Reply-To: <bc8.1b1aecbe.343e6ed5@aol.com> References: <bc8.1b1aecbe.343e6ed5@aol.com> Message-ID: <477696541.20071010132334@noring.name> Bowerbird wrote: > as this last series demonstrates, i feel very little compulsion to > "match the original". people can look at the page-scan for that. > what i'm doing is creating an "in-house style" such that _all_ of > the books in my library have a consistent look-and-feel to them. This is a good way to put it. So long as PG/DP considers the work product they produce to be "brandable" (e.g., PG considers itself a publisher), then it makes sense that a consistent and brandable title page be generated for most, if not all, digital renditions which are put online. But then, maybe this consistency is something PG does not want since then it becomes a sort of "requirement" which stifles individuality. But at least some in DP, when they produce an XHTML rendition, might consider doing this...
Jon Noring From piggy at netronome.com Thu Oct 11 10:02:24 2007 From: piggy at netronome.com (La Monte Henry Piggy Yarroll) Date: Thu, 11 Oct 2007 13:02:24 -0400 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <200710051727.42671.rolsch@verizon.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> Message-ID: <470E5720.6030207@netronome.com> Roland Schlenker wrote: > On Thursday 04 October 2007 6:20 pm, Lee Passey wrote: > >> Jeroen Hellingman (Mailing List Account) wrote: >> >>> I think the biggest barrier here is the steep learning curve of TEI (20% >>> of the tags cover 80% of the things you encounter, but every other book >>> you will need something from those remaining 80%, and, oh gosh, which >>> tag can I use then) .... >>> >> I am intrigued by this comment (and not only because it mirrors my own >> experience). So by way of information gathering among those who use TEI >> on a regular basis, I would you to tell me, perhaps simply as an ordered >> list, what TEI tags you believe are most used and most valuable (not >> necessarily the same thing). In other words, what are the 20% of the >> tags that cover 80% of the need, and from the remaining 80% what seems >> to come up the most often? >> >> I'm thinking of writing a little script that will try to automate the >> collection of usage data from current Gutenberg TEI texts. >> > > >From my lastest project, Marcia Schuyler, by Grace Livingston Hill Lutz: > > <p> - 1687 > <q> - 934 > <anchor> - 434 > <pb> - 358 > <hi> - 204 > Thanks for the lists! I'm particularly interested in the typical range of values used for the rend attribute on <hi>. I've added the (I think) fictional rend="gesperrt" to the book I'm working on. At some point, I'll have to figure out the right way to do that. 
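For anyone who wants to see the range of rend values in their own files, a tally is easy to script. The following is only a minimal sketch in Python, assuming the file is well-formed XML with no unresolved SGML entities; the function names and the sample text are made up for illustration and are not part of any PGTEI tooling.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def count_hi_rend(xml_text):
    """Tally the rend values found on <hi>; a missing rend counts as '(none)'."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        if local_name(elem.tag) == "hi":
            counts[elem.get("rend", "(none)")] += 1
    return counts


# Hypothetical sample, not taken from any posted PG text.
SAMPLE = """<text><body>
<p>A <hi rend="italic">first</hi> and a <hi rend="italic">second</hi>,
then <hi rend="gesperrt">spaced</hi> and <hi>bare</hi>.</p>
</body></text>"""

if __name__ == "__main__":
    for value, n in count_hi_rend(SAMPLE).most_common():
        print(f"{n:5d}  {value}")
```

Run over a whole directory of files, the same tally would show at a glance whether a value such as "gesperrt" is already in use elsewhere.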
The biggest problem I've had getting started with TEI is the staggering plethora of documentation. There doesn't seem to be a strong consensus on best documentation or best tools. I didn't even notice "The Guide to PGTEI" until I'd been playing with TEI for a couple weeks. A good solution for me would be to have a few of the TEI experts start hanging out on #pgdp. If there's a better place to find live humans to ask questions of, I'd love to hear about it. This list appears to be the best place I've found so far and I'm a little uncomfortable with asking questions that I OUGHT to be able to extract from that great pile of documentation. Well, back to the firehose for another sip... > <item> - 118 > <ref> - 76 > <lb/> - 71 > <index> - 63 > <div> - 41 > <list> - 39 > <corr> - 38 > <head> - 37 > <milestone> - 27 > <l> - 25 > <quote> - 8 > <lg> - 5 > <divGen> - 5 > <figure> - 4 > <figDesc> - 4 > <title> - 3 > <name> - 3 > <date> - 3 > <publisher> - 2 > <idno> - 2 > <classCode> - 2 > <bibl> - 2 > <author> - 2 > <titleStmt> - 1 > <textClass> - 1 > <text> - 1 > <teiHeader> - 1 > <taxonomy> - 1 > <sourceDesc> - 1 > <revisionDesc> - 1 > <respStmt> - 1 > <publicationStmt> - 1 > <pubPlace> - 1 > <projectDesc> - 1 > <profileDesc> - 1 > <language> - 1 > <langUsage> - 1 > <keywords> - 1 > <imprint> - 1 > <front> - 1 > <fileDesc> - 1 > <encodingDesc> - 1 > <editorialDecl> - 1 > <editionStmt> - 1 > <edition> - 1 > <classDecl> - 1 > <change> - 1 > <body> - 1 > <back> - 1 > <availability> - 1 > <TEI.2> - 1 > From jon at noring.name Thu Oct 11 11:23:21 2007 From: jon at noring.name (Jon Noring) Date: Thu, 11 Oct 2007 12:23:21 -0600 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <470E5720.6030207@netronome.com> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> Message-ID: 
<1354905005.20071011122321@noring.name> La Monte Henry Piggy Yarroll wrote: > The biggest problem I've had getting started with TEI is the staggering > plethora of documentation. There doesn't seem to be a strong consensus > on best documentation or best tools. I didn't even notice "The Guide to > PGTEI" until I'd been playing with TEI for a couple weeks. > > A good solution for me would be to have a few of the TEI experts start > hanging out on #pgdp. If there's a better place to find live humans to > ask questions of, I'd love to hear about it. This list appears to be the > best place I've found so far and I'm a little uncomfortable with asking > questions that I OUGHT to be able to extract from that great pile of > documentation. An approach which I previously discussed is based on the recognition that maybe 80% (a guesstimate for now) of all the books are quite simple in overall structure, and the TEI subset to properly add document structure and inline text semantics for these books is pretty small and manageable by most anyone with familiarity in marking up documents. This brings up the possibility of PG/DP coming up with such a basic TEI subset (and associated DTD and usage ruleset) for marking up books, along with a "usage manual". Those 20% (or so) of books which are more complex would then be turned over to those familiar with the full TEI vocabulary. I previously gave a few other requirements I think we should have in using TEI this way, but won't repeat them in this message. Jon Noring From Bowerbird at aol.com Thu Oct 11 11:39:05 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Oct 2007 14:39:05 EDT Subject: [gutvol-d] here's another one Message-ID: <c4b.1f2b7fef.343fc7c9@aol.com> here's another form of light-markup, asciidoc: > http://www.methods.co.nz/asciidoc/ from the home-page: > AsciiDoc is a text document format for writing short documents, > articles, books and UNIX man pages. > ... 
> The asciidoc(1) command translates AsciiDoc files to HTML, XHTML > and DocBook markups. DocBook can be post-processed to > presentation formats such as HTML, PDF, roff, and Postscript > using readily available Open Source tools. asciidoc seems to be very well developed, compared to many light-markup systems. asciidoc does music (lilypond and abc) and math (asciimathml and latexmathml). among an interesting mix of projects using asciidoc is the linux kernel git source code management system... *** here's the asciidoc user-guide web-page: > http://www.methods.co.nz/asciidoc/userguide.html the inventor of asciidoc was using docbook before, but said: > But DocBook is a complex language, the marked up text is > difficult to read and even more difficult to write directly -- > I found I was spending more time typing markup tags, > consulting reference manuals and fixing syntax errors, > than I was writing... i'd say that about sums it up... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071011/9448394e/attachment.htm From jon at noring.name Thu Oct 11 13:15:08 2007 From: jon at noring.name (Jon Noring) Date: Thu, 11 Oct 2007 14:15:08 -0600 Subject: [gutvol-d] here's another one In-Reply-To: <c4b.1f2b7fef.343fc7c9@aol.com> References: <c4b.1f2b7fef.343fc7c9@aol.com> Message-ID: <1899484476.20071011141508@noring.name> Bowerbird wrote: > here's another form of light-markup, asciidoc: > > [snip] > > i'd say that about sums it up... I do believe that ZML needs to be given a chance to show its stuff. So again I ask the PG/DP folk to select a good representative cross- section of texts, something like 10 or so, from simple to complex, and ask Bowerbird to put them into ZML. Then we can analyze them and see if the "ZML markup" is sufficient to represent these books for PG purposes. 
If ZML shows itself sufficient for all of them, then great, we can discuss where to go from there. If there are some areas where ZML is deemed deficient, then we can see if ZML can be tweaked while staying within what Bowerbird deems "ZMLness" (only he knows what that is). Then we go from there. So, PG/DPers: can you suggest, by number, some PG etexts that Bowerbird should consider converting to ZML? Jon Noring From joshua at hutchinson.net Thu Oct 11 14:17:28 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Thu, 11 Oct 2007 21:17:28 +0000 (UTC) Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? Message-ID: <5235198.1192137448655.JavaMail.?@fh1037.dia.cp.net> Ok, I have an issue of the magazine Punch available for comment. I *had* hoped to have a few more, but time is ever a fleeting thing. :) http://pglaf.org/~joshua/punch/ There is a version created with the built-in macro for the title page and a version with a manually created title page that is properly centered. The txt versions are also there. The remainder of each version is the same. Only the title pages have been changed. To keep the discussion on topic, I ask that you refrain from comments on the markup used (yes, I know I went lazy and marked up the italics with <hi> instead of <emph>, etc). I want to know what people think of the built-in macro versus the manually created title page and what could be improved on each. Thanks, Josh >----Original Message---- >From: joshua at hutchinson.net >Date: Oct 8, 2007 14:50 >To: <gutvol-d at lists.pglaf.org> >Subj: Re: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? > >Ok, everyone, David has a point. A very GOOD point. > >Master formats are great and all, but the final output has to look >nice or no one will want to use it. > >Let's table this part of the discussion for a few days. I'll try to >fix up some examples of nice looking title pages and post them up.
>Hopefully, we can get some consensus on whether TEI can do the job (and >some feedback on how hard it was/wasn't from me). > >Josh > >>----Original Message---- >>From: prosfilaes at gmail.com >> >>Great. Then let's see it done. I'm tired of hearing about theory, >Jon. >>I wouldn't have brought it up if we were looking beautiful title >>pages. But we've all heard huge promise made for things that never >>worked out. I'm not going to work on TEI, nor will I encourage others >>to, until it actually works right. > >_______________________________________________ >gutvol-d mailing list >gutvol-d at lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > From lee at novomail.net Thu Oct 11 16:41:48 2007 From: lee at novomail.net (Lee Passey) Date: Thu, 11 Oct 2007 17:41:48 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470E5720.6030207@netronome.com> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> Message-ID: <470EB4BC.9090201@novomail.net> La Monte Henry Piggy Yarroll wrote: > Roland Schlenker wrote: > >> On Thursday 04 October 2007 6:20 pm, Lee Passey wrote: [snip] >>> I'm thinking of writing a little script that will try to automate the >>> collection of usage data from current Gutenberg TEI texts. [snip] Yet more grist for the mill: I wrote a program that downloaded each of the alleged 112 TEI files stored at Project Gutenberg. Of these, on three occasions the PG server responded with a 404 or 406, leaving a total of 109 files for analysis. I loaded each file into a DOM, and then counted all of the elements used in the <text> element. As a result of this strategy 1. what is identified are only those elements used to transcribe the document, not those used to record metadata, and 2. certain elements may be under-counted if they are used both in the <teiHeader> element and the <text> element. 
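The counting strategy just described can be sketched in a few lines. This is not the program that produced the figures in this message; it is a minimal Python illustration, assuming well-formed XML input, and the function names and sample documents are hypothetical. For each element name it reports how many documents use it, that share as a percentage, the total number of uses, and the mean over only those documents that use it.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def count_text_elements(xml_text):
    """Count element names below the first <text>; <teiHeader> content is skipped."""
    root = ET.fromstring(xml_text)
    text = next((e for e in root.iter() if local_name(e.tag) == "text"), None)
    if text is None:
        return Counter()
    return Counter(local_name(e.tag) for e in text.iter() if e is not text)


def summarize(documents):
    """Return rows of (docs, percent, total uses, avg per using doc, name)."""
    per_doc = [count_text_elements(d) for d in documents]
    docs_using = Counter()
    total_uses = Counter()
    for counts in per_doc:
        docs_using.update(counts.keys())   # one tick per document using the element
        total_uses.update(counts)          # add that document's use counts
    rows = []
    for name, docs in docs_using.items():
        total = total_uses[name]
        rows.append((docs, 100.0 * docs / len(per_doc), total, total / docs, name))
    rows.sort(key=lambda r: (-r[0], -r[2], r[4]))
    return rows


# Two tiny made-up documents; <title> sits only in the header and is not counted.
DOC_A = ("<TEI.2><teiHeader><fileDesc><titleStmt><title>x</title></titleStmt>"
         "</fileDesc></teiHeader><text><front/><body><p>a <hi>b</hi></p><p>c</p>"
         "</body><back/></text></TEI.2>")
DOC_B = ("<TEI.2><teiHeader/><text><front/><body><lg><l>x</l></lg></body>"
         "<back/></text></TEI.2>")

if __name__ == "__main__":
    for docs, pct, total, avg, name in summarize([DOC_A, DOC_B]):
        print(f"{docs}, {pct:.2f}, {total}, {avg:.2f}, {name}")
```

Because only the subtree under <text> is walked, an element that also occurs in the header is counted for its body uses alone, which is the under-counting caveat noted in point 2.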
(Analysis of <teiHeader> elements is down the road).

The data has been presented as a table of comma-separated values; it should be possible to copy this table and save it as a .csv file which can then be opened in any reasonably capable spreadsheet program. The first line is a header line. The first column counts the total number of documents in which the named element appears; the second column is how that relates as a percentage of the total number of documents scanned. The third column records the total number of uses of the named element, and the fourth column is the average number of uses for those documents in which the element is used (not the total number of documents scanned). The last column is the name of the element.

Total Docs, Percentage, Total Use, Avg per Doc, Element Name
109, 100.00, 9928, 91.08, head
109, 100.00, 9654, 88.57, div
109, 100.00, 623, 5.72, divGen
109, 100.00, 109, 1.00, front
109, 100.00, 109, 1.00, body
109, 100.00, 109, 1.00, back
108, 99.08, 87341, 808.71, p
106, 97.25, 18293, 172.58, index
95, 87.16, 28950, 304.74, hi
69, 63.30, 5170, 74.93, lb
66, 60.55, 8796, 133.27, note
63, 57.80, 175, 2.78, then
63, 57.80, 175, 2.78, pgIf
63, 57.80, 174, 2.76, else
52, 47.71, 9189, 176.71, pb
49, 44.95, 8731, 178.18, anchor
34, 31.19, 640, 18.82, figure
34, 31.19, 640, 18.82, figDesc
32, 29.36, 11891, 371.59, l
32, 29.36, 899, 28.09, lg
30, 27.52, 3898, 129.93, milestone
22, 20.18, 33, 1.50, titlePart
22, 20.18, 23, 1.05, docImprint
22, 20.18, 22, 1.00, docTitle
22, 20.18, 22, 1.00, titlePage
21, 19.27, 22, 1.05, docAuthor
20, 18.35, 165, 8.25, quote
20, 18.35, 21, 1.05, byline
16, 14.68, 9818, 613.63, cell
16, 14.68, 4167, 260.44, row
16, 14.68, 313, 19.56, table
14, 12.84, 5392, 385.14, ref
14, 12.84, 14, 1.00, docDate
13, 11.93, 1038, 79.85, item
13, 11.93, 177, 13.62, list
11, 10.09, 134, 12.18, corr
8, 7.34, 7161, 895.13, q
7, 6.42, 7, 1.00, docEdition
5, 4.59, 559, 111.80, emph
4, 3.67, 152, 38.00, title
3, 2.75, 494, 164.67, abbr
3, 2.75, 212, 70.67, foreign
3, 2.75, 81, 27.00, reg
3, 2.75, 8, 2.67, name
2, 1.83, 671, 335.50, formula
2, 1.83, 28, 14.00, sic
2, 1.83, 27, 13.50, bibl
2, 1.83, 26, 13.00, author
2, 1.83, 2, 1.00, epigraph
2, 1.83, 2, 1.00, date
2, 1.83, 2, 1.00, add
2, 1.83, 2, 1.00, trailer
1, 0.92, 13, 13.00, label
1, 0.92, 8, 8.00, argument
1, 0.92, 2, 2.00, del
1, 0.92, 1, 1.00, eg
1, 0.92, 1, 1.00, cit

I'm sure this data will reveal some oddities and insights, more than just the ones that jump out at me at first blush.

Looking at paragraphs, I'm sure this number is over-inflated because there is so much paragraph abuse apparent in PGTEI texts (pretty much every block of text is labeled a paragraph, even those which are obviously not). I'm sure a little more discretion in the use of the <p> element would result in an increase in those elements which are part of the <titlePage>.

Regarding paragraphs, one of the oddities is that there is apparently one document that doesn't contain a single paragraph! I thought perhaps it was the TEI version of the American Declaration of Independence, but that proved not to be the case.

I note that 100% of the files contained a <divGen> element (usually used to create a table of contents). I also note that 22 of the files have a <titlePage> element, which means that in some cases there may be both a generated title page and a hand-crafted one.

My perusal of PGTEI files indicates that the <index> element is used almost exclusively to support the PGTEI XSL scripts which generate title pages, tables of contents, and lists of illustrations. My guess is that the <index> count is also over-inflated, and would be lower if you were to disregard the effects of the PGTEI conversion scripts.

Yet another oddity is that the <quote> element is used in 20 documents while the <q> element is used only in 8.
This seemed odd to me at first, as the <quote> element is only used for those quotations which are attributed by the author to some agency external to the text. And in TEI-Lite, <quote> has been deprecated in favor of <q>, which is kind of a "catch-all" element in TEI. As I think about it, however, the TEI specification does suggest that it is acceptable to not use the <q> element at all, instead retaining the original quotation marks. Maybe the disparity between <q> and <quote> is really not odd after all. > Thanks for the lists! > > I'm particularly interested in the typical range of values used for the > rend attribute on <hi>. I've added the (I think) fictional > rend="gesperrt" to the book I'm working on. At some point, I'll have to > figure out the right way to do that. One of the data points I found most interesting is that while the <hi> tag (typically used to record italicization when the reason is not discernible) is used in 95% of all the documents, the <emph> element (indicating emphasized text) which is probably the most common use of italicization, was used in only 5 of the texts, and the <foreign> tag, indicating a word foreign to the language of the text, was used in only 3 of the texts. The preponderance of the <hi> element (which is almost purely presentational) together with the amount of paragraph abuse leads me to conclude that even those people who are using TEI because it is semantic, not presentational, are still marking up text using a presentational philosophy. I would bet that when I do an analysis of the "rend" attribute on the <hi> element we will find that when it exists it will almost always be 'italic'. Thus, as we work towards a tutorial on the use of TEI one of the major focuses will need to be how to avoid the tendency to use presentational markup. > The biggest problem I've had getting started with TEI is the staggering > plethora of documentation. There doesn't seem to be a strong consensus > on best documentation or best tools. 
I didn't even notice "The Guide to > PGTEI" until I'd been playing with TEI for a couple weeks. And if you look at "The Guide to PGTEI" I think you will find that the only thing it does is document the PG extensions to TEI; it does nothing to actually help novices get started with TEI itself, nor does it expose any Known Best Practices. That is what I am trying to develop in this thread. The questions I want resolved are:

1. What are the core TEI elements which everyone /must/ understand before attempting to encode /any/ work (examples include <text>, <front>, <body>, and <back>)?
2. What are the most commonly encountered elements that I should understand to correctly encode 80% of the books in the database?
3. What are the less common but still useful TEI elements that I will need to understand occasionally?
4. What are the common mistakes made in encoding books in TEI, and how can and should one avoid them?
5. What are the TEI elements that can safely be ignored (unless you're creating a dictionary, I know a whole passel of them)?

Thus, while I agree with Mr. Noring's suggestion of creating a "usage manual" for the most common and useful TEI elements, I don't agree with the suggestion of creating a DTD for any "approved" subset of TEI. You see, DTDs are only useful in detecting deviations from a standard. They are no good whatsoever in helping people decide which parts of the standard are appropriate in any given circumstance. I think any usage that satisfies the full TEI DTD ought to be acceptable (and I wouldn't be opposed to the development of a "liberalized" TEI DTD that removes some of the restrictions of the current TEI DTD). But restricting documents to a certain subset of elements does nothing to promote an understanding of those elements. We don't need a document which tells us which elements and usages are acceptable; we need a document which tells us which elements and usages are important and appropriate in given scenarios.
Looking at the numbers above, there appear to be about 40 documents that were created using only the base TEI structural elements with <hi> and <p>. These documents are no doubt valid (i.e. they satisfy the TEI DTD) but they can't possibly all be good. The right document should help us create documents which are valid /and/ good. I'd be very interested in hearing what other observations and insights these data provoke in other people. [remainder snipped] -- Nothing of significance below this line. From marcello at perathoner.de Thu Oct 11 16:49:58 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri, 12 Oct 2007 01:49:58 +0200 Subject: [gutvol-d] gnutenberg-press maintenance offer (was Re: Proposal to add OpenDocument as an additional In-Reply-To: <470E5720.6030207@netronome.com> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> Message-ID: <470EB6A6.3030308@perathoner.de> La Monte Henry Piggy Yarroll wrote: > I'm particularly interested in the typical range of values used for the > rend attribute on <hi>. I've added the (I think) fictional > rend="gesperrt" to the book I'm working on. At some point, I'll have to > figure out the right way to do that. rend="letter-spacing: 0.15em" PGTEI aims to support all attributes of CSS 2.1 / 3 at some point in the future. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Thu Oct 11 17:03:16 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Oct 2007 20:03:16 EDT Subject: [gutvol-d] if you wanteded to convince people that t.e.i. is easy Message-ID: <c56.1fa41180.344013c4@aol.com> if you wanted to convince people that t.e.i. is easy, these humongous threads are not the way to do it. my finger is tired from deleting so many messages from my spam folder... 
;+) -bowerbird From Bowerbird at aol.com Thu Oct 11 17:07:16 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 11 Oct 2007 20:07:16 EDT Subject: [gutvol-d] if you wanteded to convince people that t.e.i. is easy Message-ID: <cba.1c94b728.344014b4@aol.com> i said: > my finger is tired from deleting so many messages > from my spam folder... ;+) on the other hand -- which would be my left one -- the fingers are still quite spry and ready for action, which is why the middle finger pressed "ed" twice, thus explaining the "wanteded" typo in the subject. -bowerbird From marcello at perathoner.de Thu Oct 11 17:10:09 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri, 12 Oct 2007 02:10:09 +0200 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <5235198.1192137448655.JavaMail.?@fh1037.dia.cp.net> References: <5235198.1192137448655.JavaMail.?@fh1037.dia.cp.net> Message-ID: <470EBB61.4080407@perathoner.de> joshua at hutchinson.net wrote: > I want to know what people think of > the built-in macro versus the manually created title page and what > could be improved on each. I would use rend="margin-top: 4em" instead of <lb/><lb/>. You were using <lb/> for purely presentational purposes (to create a vertical gap). There ain't no such thing as 2 adjacent linebreaks anyway. Maybe <docImprint> instead of <docEdition> ?
<titlePage rend="page-break-before: right; text-align: center">
  <docTitle>
    <titlePart type="main" rend="font-size: xx-large">Punch</titlePart><lb />
    <titlePart type="sub" rend="font-size: x-large">or the London Charivari</titlePart>
  </docTitle>
  <docImprint rend="margin-top: 4em">
    Volume 98<lb />
    <docDate value="1890-01-11">11th January 1890</docDate>
  </docImprint>
</titlePage>

-- Marcello Perathoner webmaster at gutenberg.org

From prosfilaes at gmail.com Fri Oct 12 07:30:11 2007 From: prosfilaes at gmail.com (David Starner) Date: Fri, 12 Oct 2007 10:30:11 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <5235198.1192137448655.JavaMail.?@fh1037.dia.cp.net> References: <5235198.1192137448655.JavaMail.?@fh1037.dia.cp.net> Message-ID: <6d99d1fd0710120730n75c56f6cy62839239e727fb3a@mail.gmail.com> On 10/11/07, joshua at hutchinson.net <joshua at hutchinson.net> wrote: > To keep the discussion on topic, I ask that you refrain from comments > on the markup used (yes, I know I went lazy and marked up the italics > with <hi> instead of <emph>, etc). I want to know what people think of > the built-in macro versus the manually created title page and what > could be improved on each. The manual page looks fine, and I assume it compares well to the original. My biggest gripe on the built-in page is the comma between "First Project Gutenberg Edition" and "(October 2007)"; parenthetical clauses are never set off with a comma. I don't know how flexible the macro page is, but given that the macro is the one-size-fits-all title page, I'd like to see more, including information about the original publisher, publication place and date.
From jeroen.mailinglist at bohol.ph Fri Oct 12 11:55:39 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Fri, 12 Oct 2007 20:55:39 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470EB4BC.9090201@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> Message-ID: <470FC32B.5060409@bohol.ph> Thanks Lee for the analysis. I grabbed the 300-plus .tei files on my disk (the masters of both the HTML files I've posted to PG and those in progress), and am currently converting them to XML (I use SGML in pure ASCII, combined with various transcription methods that most tools can't handle). If interested, I will create an archive of these, so you can repeat the analysis. Is there a place where I can drop this (rather large) archive? Looking at your observations: <p> is probably the most common occurrence. The only documents I can imagine with very few <p> tags are those exclusively dealing with poetry or plays, as these will use the <l> tag. TEI often requires <p> in places you won't expect them, such as inside an argument. I hardly use <index>, which I believe would be the best way to deal with pre-existing indexes, but would be extremely labour intensive (unless somebody takes the time to build a tool for it). Ideally, you would resolve every term appearing in the index, and replace it with an <index> tag at the exact place intended by the indexer. This could be partly automated, but good indexes are often smart, that is, they would refer to a person mentioned on a page using a normalized name, which would break an automated index-resolving tool. Some of my texts use <ref> to the extreme, as I let every entry in the index point back to the <pb> before the page they point to as a poor man's alternative.
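The poor man's alternative in the last sentence lends itself to a small script. What follows is only a sketch in Python, not the tooling actually used for these texts: it assumes each index entry is one plain-text line whose page numbers follow a comma and a space, and that every <pb> carries an id of the hypothetical form "pb" plus the page number.

```python
import re
from xml.sax.saxutils import escape


def index_entry_to_tei(line, id_prefix="pb"):
    """Wrap each page number of 'Term, 12, 45' in a <ref> aimed at a <pb> id.

    The id scheme (id_prefix + page number) is an assumption for this example.
    """
    def link(match):
        page = match.group(0)
        return f'<ref target="{id_prefix}{page}">{page}</ref>'

    # Only numbers directly after ", " are treated as page references, so a
    # year inside the term itself (e.g. "War of 1812, 33") is left alone.
    body = re.sub(r"(?<=, )\d+\b", link, escape(line.strip()))
    return f"<item>{body}</item>"


if __name__ == "__main__":
    print(index_entry_to_tei("Abacus, 12, 45"))
```

Page ranges, roman numerals and "see also" entries are not handled here; they are the sort of smart indexing that would still need a human eye.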
I never used PGTEI, for the following reasons: * Use of SGML. I use the SGML version of TEI, which is slightly easier for human editors than its XML reincarnation. Since I employ an automated conversion from SGML to XML, this is no problem. The automatic conversion is performed with J. Clark's SX tool, available at www.jclark.com <http://www.jclark.com/sp/>. After this I run the tei2tei.xsl <http://www.tei-c.org/Activities/MI/Tools/tei2tei.xsl> stylesheet. * Use of ASCII only. Since my SGML work predates Unicode, I don't use Unicode and stick to ASCII only. All characters outside ASCII are encoded with entities. When including sections in non-Latin script, such as Greek, I use ad-hoc transcription schemes. Since I have tools to convert these to Unicode, this is no problem. * Use of extensions. I try to avoid extensions to TEI, and stick exclusively to TEILite, or borrow elements from the full-blown TEI on a case-by-case basis when required. * Use of the |rend| attribute. We both use the |rend| attribute to provide hints on rendering elements. I use the concepts of rendition ladders, whereas PGTEI uses (since version 0.4) slightly modified CSS. Since this is mainly a syntactic distinction, I may migrate to CSS in future. (Which means I'll have to write a conversion tool for this purpose.) * Use of |<divGen>| for tables of contents. Since we are digitizing pre-existing texts, I avoid the use of the |<divGen>| attribute for tables of contents and similar sections in favor of encoding these as they appear in the source. The only exception is where the source has no table of contents. Note that titles in original tables of contents often differ considerably from the actual headings used. Sometimes this is (apparently) intentional, sometimes a mistake, which I will then correct. * Use of |<divGen>| for footnotes. I automatically generate footnote sections at the end of the chapter they appear in. 
This requires some tweaks with quoted sections, etc., but these can be handled by software easily. * Use of |<q>| elements. I try to follow the principle that TEI texts should be the characters in the source plus tagging, and thus encode all quotation marks with the proper characters or character entities, as they appear in the source. I only use the |<q>| element when required to add attributes to a run-in quotation, but do not object to tagging it, but insist on keeping the quotation marks. Jeroen. Lee Passey wrote: > > [snip] > > Yet more grist for the mill: > > I wrote a program that downloaded each of the alleged 112 TEI files > stored at Project Gutenberg. From lee at novomail.net Fri Oct 12 12:59:09 2007 From: lee at novomail.net (Lee Passey) Date: Fri, 12 Oct 2007 13:59:09 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470FC32B.5060409@bohol.ph> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> Message-ID: <470FD20D.1020000@novomail.net> Jeroen Hellingman (Mailing List Account) wrote: > Thanks Lee for the analysis. > > I grabbed the over 300 .tei files on my disk (master of both the posted > HTML files I've posted to PG, and those in progress), and am currently > converting them to XML (I use SGML in pure ASCII, combined with various > transcription methods that most tools can't handle). > > If interested, I will create an archive of these, so you can repeat the > analysis. Is there a place where I can drop this (rather large) archive? I would love to analyze your corpus. The way my program works (it is a 'C' program, not a script) is it looks in an XHTML table for a <td> element containing an "href" attribute. If the value of that attribute starts with "file://" it looks in the local file system for the TEI file to parse. 
If the value of the attribute starts with "http://" it parses the URL, opens a socket to the remote machine and transmits an http GET command for the file. It then parses the file from the incoming socket stream, and the file is never stored on the local file system. Thus, if your files are available on an HTTP (web) server, all I need is a list of the URLs. > Looking at your observation. > > <p> is probably the most common occurance. The only documents I can > imagine with very few <p> tags are those exclusively dealing with poetry > or plays, as these will use the <l> tag. TEI often requires <p> in > places you won't expect them, such as inside an argument. The predominance of <p> is not surprising; when we write we almost always write in paragraphs. A problem, however, is that people who are steeped in the word-processing paradigm tend to mark every block of text that looks like a paragraph with the <p> tag, even when it's not a paragraph. I wish there were a good automated way to detect these paragraph abuses, but so far I haven't figured one out. > I hardly use <index>, which I believe would be the best way to deal with > pre-existing indexes, but would be extremely labour intensive (unless > somebody takes the time to build a tool for it. Ideally, you would > resolve every term appearing in the index, and replace it with an > <index> tag on the exact place intended by the indexer. This could be > partly automated, but good indexes are often smart, that is, they would > refer to a person mentioned on a page using a normalized name, which > would break automated index-resolving tool. Some of my texts use <ref> > to the extreme, as I let every entry in the index point back to the <pb> > before the page they point to as a poor-mans alternative. I think you've hit the proper use of the index element right on the head. As it turns out, the PGTEI files use the element differently, primarily as targets for the <divGen> function. 
When I wrote my tei2html program I used a somewhat different approach to the automated creation of tables of contents and illustrations. For the table of contents I scanned for the <head>ers of <div>s and used them to create a hierarchical table with the same hierarchy that the <div>isions had; for the list of illustrations I just looked for every <figure> tag. I don't like the way PG uses the <index> element, which is why I think it's not one of the elements that deserves inclusion in the group of "crucial" elements. > I never used PGTEI, for the following reasons: [snip] In my view, the PGTEI constraints are focused very strongly on supporting the XSLT transformations built into the PG web site. Because I find the output from those scripts unacceptable, the scripts, and therefore the markup supporting those scripts, are of very little importance to me. They certainly have little, if any, general applicability. The use of the "rend" attribute throughout is very problematic, because it has two, somewhat different, meanings. It can mean "in the original text this region had this presentation," or it can mean "when you convert this document to a presentation document, force it to have this presentation." Because I strongly support the proposition that the end user should have simple and direct control over the presentation of a document, I fully support the first use of the rend attribute, but give only qualified support to the second. Finding all the "rend" attributes, and listing the array of values and what elements they are used on will be one of my next projects. -- Nothing of significance below this line. 
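The "rend" survey proposed in the message above (every value, and the elements it appears on) could be sketched roughly as follows. This is only an illustration, not Lee's C program: the function name rend_usage is hypothetical, the documents are assumed to be already converted to well-formed XML and passed in as strings, and namespace prefixes are simply stripped.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def rend_usage(documents):
    """Map each rend value to a Counter of the element names that carry it.

    `documents` is an iterable of XML strings (an assumption of this sketch;
    reading from files or sockets is left out).
    """
    usage = defaultdict(Counter)
    for xml_text in documents:
        for el in ET.fromstring(xml_text).iter():
            rend = el.get("rend")
            if rend is None:
                continue
            # Drop a "{namespace}" prefix if the parser added one.
            name = el.tag.rsplit("}", 1)[-1]
            usage[rend.strip()][name] += 1
    return usage


if __name__ == "__main__":
    sample = (
        '<text><p>An <hi rend="italic">old</hi> and '
        '<emph rend="italic">new</emph> <hi rend="bold">way</hi>.</p></text>'
    )
    for value, elements in sorted(rend_usage([sample]).items()):
        print(value, dict(elements))
```

Keying the result by rend value rather than by element makes it easy to see whether one value (say, italic) is spread over both <hi> and <emph>, which is the distinction the thread cares about.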
From lee at novomail.net Fri Oct 12 13:59:12 2007 From: lee at novomail.net (Lee Passey) Date: Fri, 12 Oct 2007 14:59:12 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470EB4BC.9090201@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> Message-ID: <470FE020.3080608@novomail.net> Lee Passey wrote: > Yet more grist for the mill: [snip] > Total Docs, Percentage, Total Use, Avg per Doc, Element Name [snip] > 69, 63.30, 5170, 74.93, lb [snip] > I'm sure this data will reveal some oddities and insights, more than > just the ones that jump out at me at first blush. A couple of other observations that have recently occurred to me: If you remove the items in the list which are obviously included only to support the PG XSLT scripts (divGen, index, pgIf, then, else) over 50% of the TEI documents in the PG database are created using only 9 tags in the body text (i.e. not including metadata). While I suspect that this is a result of a certain amount of /over/-simplification, it is evidence that TEI is not nearly as complex as its detractors would have us believe. 63% of the PG texts included the <lb> element averaging 75 uses per document. There are some individuals who believe that line endings in the original source document should be preserved in the TEI transcription, and this is obviously one use for the <lb> element. Others obviously view the <lb> element as being analogous to the HTML <br> element which instructs the user agent to begin a new line when the document is being presented. These two uses are to some extent contradictory. If all the PG documents which use the <lb> element used it to record line breaks in the text, I can't believe that there would average only 75 instances per text. 
On the other hand, it seems to me that if all the uses were to indicate a desired presentation, there wouldn't average 75 in a document. I suspect that most of the documents scanned use the <lb> element to force a line break on presentation, not to memorialize the format of the original document, but there are probably a few documents in the mix which /do/ use it for its intended purpose, which skews the numbers (some way of identifying outliers would be useful). Hopefully, all of the documents contain some indication of how the <lb> element is used in that particular document. In any event, the <lb> element needs to join the <hi> element in the list of elements prone to abuse that need to be distinguished. -- Nothing of significance below this line. From jon at noring.name Fri Oct 12 14:03:03 2007 From: jon at noring.name (Jon Noring) Date: Fri, 12 Oct 2007 15:03:03 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470FD20D.1020000@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <470FD20D.1020000@novomail.net> Message-ID: <1603744685.20071012150303@noring.name> Lee Passey wrote: > Finding all the "rend" attributes, and listing the array of values and > what elements they are used on will be one of my next projects. It will be interesting to see how the "rend" attribute has been used. I do like the idea that PG/DP needs to standardize on something when using the "rend" attribute (and, yes, I agree with Lee that the value in "rend" should simply describe how the element was rendered in the paper source.) I like Marcello's approach of using CSS 2.1/3.0 as the attribute value. 
In cases where CSS cannot be used (can't think of any, but no doubt it will occur for real odd stuff), then the PG/DP folk need to standardize on a set of values. Jon Noring From klofstrom at gmail.com Fri Oct 12 14:15:08 2007 From: klofstrom at gmail.com (Karen Lofstrom) Date: Fri, 12 Oct 2007 11:15:08 -1000 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <1603744685.20071012150303@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <470FD20D.1020000@novomail.net> <1603744685.20071012150303@noring.name> Message-ID: <1e8e65080710121415icf7cba3vf89bf01c896ea46e@mail.gmail.com> A question, from someone who hasn't grappled with TEI yet: Perhaps it would be possible to do the TEI in two stages? One, a plain vanilla TEI. Academic quality. Two, this TEI marked up into PGTEI (a markup as automated as possible), a specialized format designed for easy on-the-fly generation of ebooks in various formats. This would make the plain vanilla TEI into the archival master, the PGTEI into a completely behind the scenes format that could be changed as formats and ebook readers change. Just a thought. May or may not work. -- Karen Lofstrom From lee at novomail.net Fri Oct 12 14:29:30 2007 From: lee at novomail.net (Lee Passey) Date: Fri, 12 Oct 2007 15:29:30 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470EB4BC.9090201@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> Message-ID: <470FE73A.6060302@novomail.net> Forwarded without further comment. 
Metadata elements (<teiHeader> contents) from the PG TEI corpus:

Total Docs, Percent, Total Use, Avg / Doc, Element Name
       109,  100.00,       637,      5.84, p
       109,  100.00,       276,      2.53, date
       109,  100.00,       275,      2.52, name
       109,  100.00,       244,      2.24, title
       109,  100.00,       150,      1.38, item
       109,  100.00,       143,      1.31, publisher
       109,  100.00,       138,      1.27, respStmt
       109,  100.00,       135,      1.24, change
       109,  100.00,       126,      1.16, idno
       109,  100.00,       124,      1.14, language
       109,  100.00,       109,      1.00, langUsage
       109,  100.00,       109,      1.00, publicationStmt
       109,  100.00,       109,      1.00, availability
       109,  100.00,       109,      1.00, titleStmt
       109,  100.00,       109,      1.00, sourceDesc
       109,  100.00,       109,      1.00, profileDesc
       109,  100.00,       109,      1.00, fileDesc
       109,  100.00,       109,      1.00, encodingDesc
       109,  100.00,       109,      1.00, revisionDesc
       108,   99.08,       151,      1.40, author
       104,   95.41,       104,      1.00, edition
       104,   95.41,       104,      1.00, editionStmt
       101,   92.66,       200,      1.98, bibl
       100,   91.74,       100,      1.00, taxonomy
       100,   91.74,       100,      1.00, classDecl
        99,   90.83,        99,      1.00, textClass
        40,   36.70,        83,      2.08, lb
        34,   31.19,        38,      1.12, classCode
        33,   30.28,        44,      1.33, pubPlace
        33,   30.28,        33,      1.00, list
        33,   30.28,        33,      1.00, keywords
        31,   28.44,        31,      1.00, imprint
        10,    9.17,        10,      1.00, projectDesc
         7,    6.42,         8,      1.14, editor
         4,    3.67,         4,      1.00, editorialDecl
         3,    2.75,         4,      1.33, xref
         3,    2.75,         3,      1.00, resp

-- Nothing of significance below this line.
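For anyone wanting to repeat this kind of count on another corpus, here is a rough sketch of how the four columns can be derived. It is not the C program described earlier in the thread; the names element_counts and element_stats are hypothetical, and the input is assumed to be well-formed XML held in strings. The "Avg / Doc" column is taken to be total uses divided by the number of documents that use the element at all, which agrees with the figures above (for example 83 / 40 is about 2.08 for lb).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Reduce '{namespace}p' to 'p'; plain names pass through unchanged."""
    return tag.rsplit("}", 1)[-1]


def element_counts(xml_text, within=None):
    """Count element names in one document.

    If `within` is given (e.g. "teiHeader"), only descendants of elements
    with that name are counted, not the container itself.
    """
    root = ET.fromstring(xml_text)
    if within is None:
        scope = root.iter()
    else:
        scope = (
            child
            for parent in root.iter()
            if local_name(parent.tag) == within
            for child in parent.iter()
            if child is not parent
        )
    return Counter(local_name(el.tag) for el in scope)


def element_stats(documents, within=None):
    """Return rows of (docs using, percent of docs, total use, avg per doc, name)."""
    docs_using, total_use, n_docs = Counter(), Counter(), 0
    for xml_text in documents:
        n_docs += 1
        counts = element_counts(xml_text, within)
        total_use.update(counts)          # adds up occurrences
        docs_using.update(counts.keys())  # adds one per document
    rows = [
        (docs, 100.0 * docs / n_docs, total_use[name], total_use[name] / docs, name)
        for name, docs in docs_using.items()
    ]
    # Most widely used first, ties broken by total use, then by name.
    rows.sort(key=lambda r: (-r[0], -r[2], r[4]))
    return rows


if __name__ == "__main__":
    docs = [
        "<TEI.2><teiHeader><p>a</p><p>b</p><date/></teiHeader><text><p>x</p></text></TEI.2>",
        "<TEI.2><teiHeader><p>c</p></teiHeader><text/></TEI.2>",
    ]
    for docs_n, pct, total, avg, name in element_stats(docs, within="teiHeader"):
        print(f"{docs_n:>10}, {pct:>7.2f}, {total:>9}, {avg:>9.2f}, {name}")
```

Calling element_stats(docs) without the within argument counts the whole document, which is closer to the body-text table quoted earlier in the thread.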
From jon at noring.name Fri Oct 12 14:52:07 2007 From: jon at noring.name (Jon Noring) Date: Fri, 12 Oct 2007 15:52:07 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470FC32B.5060409@bohol.ph> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> Message-ID: <1408403531.20071012155207@noring.name> Jeroen wrote: [A quite informative message in response to Lee's] > I hardly use <index>, which I believe would be the best way to deal with > pre-existing indexes, but would be extremely labour intensive (unless > somebody takes the time to build a tool for it. Ideally, you would > resolve every term appearing in the index, and replace it with an > <index> tag on the exact place intended by the indexer. This could be > partly automated, but good indexes are often smart, that is, they would > refer to a person mentioned on a page using a normalized name, which > would break automated index-resolving tool. Some of my texts use <ref> > to the extreme, as I let every entry in the index point back to the <pb> > before the page they point to as a poor-mans alternative. Back-of-book indexes are difficult to deal with for mastering purposes, and something we almost need to put together a separate working group to hammer out. I'm fortunate to have talked with one of the world's top experts at indexing, and marking up for it. It is complicated. Since we are dealing with existing books with existing indexes (e.g., they will not be expanded or added to in the master), we may determine that using <index> is the best way. (Unlike Tables of Contents and title pages, I view original back-of-book indexes as being *closer* to content, but not exactly there -- a sort of Twilight Zone sort of thing -- oooh, I see Rod Serling walk out now!) 
To be discussed at the appropriate time. > I never used PGTEI, for the following reasons: > > * Use of ASCII only. Since my SGML work predates Unicode, I don't > use Unicode and stick to ASCII only. All characters outside ASCII > are encoded with entities. When including sections in non-Latin > script, such as Greek, I use ad-hoc transcription schemes. Since I > have tools to convert these to Unicode, this is no problem. There's something to be said for this approach, especially for primarily English texts where the use of non-ASCII characters is pretty constrained. Using a mnemonic character entities set, such as that used in TEI, makes sense. (I don't remember: are the HTML mnemonic character entities a subset of those used in TEI?) As time goes on, our text editors will become more and more Unicode conformant to the point where we may never use mnemonic character entities in them at any stage of authorship. Now this is not to say I'd disallow UTF-8 master documents which encode characters beyond ASCII. We'd allow them. There now exist cool freeware tools to convert between encodings and any character entities in them, such as BabelPad for Windows (highly recommended! No doubt Mac has some similar freeware text encoding conversion tools.) (In my case, I love my vi editor, and since it is not Unicode conformant, I simply use either mnemonic or numerical entities to represent characters beyond ASCII -- it's easy for me to type, for example, &mdash; and &ouml;. Later, when I convert the docs to UTF-8 with all character entities converted to encoded characters, I use BabelPad -- and if need be I can use BabelPad to go the other direction.) > * Use of extensions. I try to avoid extensions to TEI, and stick > exclusively to TEILite, or borrow elements from the full-blown TEI > on a case-by-case basis when required. Another good piece of advice. For primarily prose works, the TEI-Lite probably has enough to do the job. And if not, then pull in individual elements as needed. 
> * Use of |<divGen>| for tables of contents. Since we are digitizing > pre-existing texts, I avoid the use of the |<divGen>| attribute > for tables of contents and similar sections in favor of encoding > these as they appear in the source. The only exception is where > the source has no table of contents. Note that titles in original > tables of contents often differ considerably from the actual > headings used. Sometimes this is (apparently) intentional, > sometimes a mistake, which I will then correct. Hmmm, I think we should seriously consider using Digital Talking Book's NCX to "format" the tables of contents and other navigation lists (like "List of Illustrations"), with the targets being ID's placed on the appropriate elements in the content. NCX is quite powerful at structuring such nav-lists, including ways to designate the table of contents item description which, as someone else mentioned, oftentimes is described differently in the original table of contents than the associated header title. It is also hierarchical in structure, and, finally, meets legal requirements for educational use of the books. Oh, and the NCX is now ready for use in EPub. The NCX may be embedded within the "master" TEI, probably using a CDATA section so as to not create problems with DTD validation (one could consider using namespaces, but namespacing, especially with regards to validation, is still a royal mess.) > * Use of |<divGen>| for footnotes. I automatically generate footnote > sections at the end of the chapter they appear in. This requires > some tweaks with quoted sections, etc., but these can be handled > by software easily. Hmmm, I believe the best system is to simply place the annotation (whether it is a footnote, endnote, sidebar, etc.) at the point in the main flow of the text where it naturally fits using the <note> tag. If need be, add the appropriate attribute/value to describe where it was placed originally. 
The problem is that different digital renditions from the master will each have its own best place and method to present the annotations. We must not "force" placement in a particular place or manner. Let the conversion tools take care of placement. ***** Just my thoughts, but I may have overlooked some issues we need to discuss, or there may be even better ways to do these things even given my present understanding. Jon Noring From piggy at netronome.com Fri Oct 12 18:53:52 2007 From: piggy at netronome.com (La Monte Henry Piggy Yarroll) Date: Fri, 12 Oct 2007 21:53:52 -0400 Subject: [gutvol-d] Very strict subset of TEI P5 for most PG/DP books? In-Reply-To: <1941308865.20071005114302@noring.name> References: <6688876.1191591951762.JavaMail.?@fh1064.dia.cp.net> <724651335.20071005100515@noring.name> <15cfa2a50710051017i4b4476acv19a8aa63037c3089@mail.gmail.com> <1941308865.20071005114302@noring.name> Message-ID: <47102530.5000607@netronome.com> Jon Noring wrote: > Robert > >> Jon Noring wrote: >> >>> (Aside, I've always believed that if we are to scan the public >>> domain books, we should do so at sufficient scan quality so the >>> scans will be useful for ludic reading, and not just as feed for >>> OCR, and thus have always advocated higher quality master scans >>> than has been done. This zeal for speed has troubled me, especially >>> in that the bottleneck at DP is not scans, but proofing -- I would >>> hope that DP will begin to encourage book scanners to focus on >>> archival quality -- even presentation quality -- rather than the >>> current "just scan 'em good enough for OCR." Make the scans > ... > So long as DP does not make any effort to encourage those who scan > books to do so at archival or even presentational quality, most won't. > But if it is encouraged, I think most will take the time to do it. If > volunteers are given reasons why to do something to a certain higher > level of quality, most will gladly do so. 
I reject the notion that > *asking* them to take the effort to produce archival quality will > turn them away. The end result is that a lot of high quality scan sets > will result and be made available to the world. > > Good enough for DP should NOT be considered good enough. > Here here! But I don't think this is an issue for DP. We need a CPer site similar to DP suitable for distributing the work associated with producing digital facsimiles. I have about 150 books in various states of preparedness to contribute. If you build it, I will come. :-) From marcello at perathoner.de Sat Oct 13 07:54:10 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 13 Oct 2007 16:54:10 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <1e8e65080710121415icf7cba3vf89bf01c896ea46e@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <470FD20D.1020000@novomail.net> <1603744685.20071012150303@noring.name> <1e8e65080710121415icf7cba3vf89bf01c896ea46e@mail.gmail.com> Message-ID: <4710DC12.7090402@perathoner.de> Karen Lofstrom wrote: > Perhaps it would be possible to do the TEI in two stages? One, a > plain vanilla TEI. Academic quality. Two, this TEI marked up into > PGTEI (a markup as automated as possible), a specialized format > designed for easy on-the-fly generation of ebooks in various formats. People think PGTEI is some sort of degraded, impure, tainted TEI because it adds some tags and defines the rend attribute (which TEI leaves intentionally undefined). Well, it is not. PGTEI is a TEI application (extends TEI). PGTEI is in no way different than TEI-Lite, which is another TEI application. 
TEI was expressly designed with extensibility in mind and ways of building TEI applications have been built into the TEI DTD: > In brief, the TEI Guidelines define a general-purpose encoding scheme > which makes it possible to encode different views of text, possibly > intended for different applications, serving the majority of > scholarly purposes of text studies in the humanities. However, no > predefined encoding scheme can serve all research purposes. > Therefore, the TEI also provides means of modifying and extending the > encoding scheme defined by the Guidelines (see chapter 29 Modifying > and Customizing the TEI DTD). ---- http://www.tei-c.org/P4X/AB.html#ABDPIU -- Marcello Perathoner webmaster at gutenberg.org From lee at novomail.net Sat Oct 13 11:16:13 2007 From: lee at novomail.net (Lee Passey) Date: Sat, 13 Oct 2007 12:16:13 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <1e8e65080710121415icf7cba3vf89bf01c896ea46e@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <470FD20D.1020000@novomail.net> <1603744685.20071012150303@noring.name> <1e8e65080710121415icf7cba3vf89bf01c896ea46e@mail.gmail.com> Message-ID: <47110B6D.50806@novomail.net> Karen Lofstrom wrote: > A question, from someone who hasn't grappled with TEI yet: > > Perhaps it would be possible to do the TEI in two stages? One, a > plain vanilla TEI. Academic quality. Two, this TEI marked up into > PGTEI (a markup as automated as possible), a specialized format > designed for easy on-the-fly generation of ebooks in various formats. > > > This would make the plain vanilla TEI into the archival master, the > PGTEI into a completely behind the scenes format that could be > changed as formats and ebook readers change. > > Just a thought. May or may not work. Mr. 
Perathoner's response was a bit testy and defensive, so you may not have recognized that he answered your question. Yes, what you are proposing could work; but in fact, step 2 is probably unnecessary because a good TEI file should contain all the information needed to generate the various formats without any additional human intervention. As I understand it, PGTEI does extend TEI, in an approved way, but I also believe that those extensions are for the most part inconsequential. More importantly, as Mr. Perathoner points out, PGTEI /refines/ TEI to make it more usable. Consider the "rend" attribute. In TEI it is used to indicate how a particular phrase was presented in the work being transcribed (for original works the rend attribute is not nearly as important). In most printed works, emphasized phrases are presented as italic text. In early PG editions, emphasized phrases are frequently presented as uppercase text. This distinction can be preserved in TEI by using the "rend" attribute: e.g. <emph rend="italic"> vs. <emph rend="uppercase">. In both cases you have noted that the text is emphasized, not merely rendered differently (potentially important in the case of text-to-speech), you permit the end user to decide how s/he prefers emphasized text rendered (ignoring the "rend" attribute) but you have preserved the distinction between the two. The problem with the "rend" attribute is that TEI has provided no controlled vocabulary for the values associated with "rend" attributes. The P4 guidelines /do/ provide the <rendition> attribute which is designed to map "rend" values to some formal language but the element description does not explain how that is to be done (the draft P5 guidelines have remedied that ambiguity). PGTEI refines TEI by stating that only CSS rules may be used as values for the "rend" attribute. This simple statement formalizes TEI in such a way that it can now be used by automated processes designed to transform TEI into a presentation format. 
Small constraints such as this, which cannot be expressed in a DTD, make a huge impact on the usability of a file, but they have no deleterious effect on its "academic quality." Other refinements to TEI, such as the use of the <index> element combined with the <divGen> element to create lists, or the use of the <divGen> element to create a title page when the necessary information has been stored in the <teiHeader> element, are of marginal utility, but they do not make the file any less pure. As I see it, the biggest problem with these constructs is that they encourage people to omit transcribing title pages and tables of contents, believing that they will be generated later on the fly. Limited support for the <divGen> element, combined with the uncommon way it is used by the PG XSLT script, makes files which rely on <divGen> as an alternative to transcription less useful. Nevertheless, this problem is a side effect of the PGTEI extensions, not a result. If PPers make the effort to transcribe these valuable parts of a work, the presence or absence of the <divGen> is irrelevant. As I understand it, there are some PGTEI extensions that may be required if you want to use the PG XSLT scripts to generate PDF files. Because PDF is a pre-print format, not an e-book format, I have paid no attention to these extensions. I do not know how the PG XSLT PDF conversion scripts might respond to a TEI file without those extensions, so someone else will have to fill us in on that score. The bottom line is that a PGTEI conformant file can work as the master as well as a "plain vanilla" TEI file, and vice versa. If there is an automated way to add PGTEI extensions to a TEI file, then the later conversion processes can be modified to incorporate those methods, rendering the addition of PGTEI extensions moot. At this point, while the two-stage model you propose is definitely feasible, I don't see any way it would be useful. 
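To illustrate why a CSS-valued "rend" attribute is easier for automated processes than free-form values, here is a minimal sketch of reading such a value into property/value pairs. It assumes plain "property: value; property: value" declarations; the exact syntax PGTEI accepts is described in the thread only as slightly modified CSS, so the function name parse_rend and the handling of edge cases are assumptions, not the PG implementation.

```python
def parse_rend(rend):
    """Split a CSS-style rend value into a {property: value} dict.

    Pieces without a colon (empty trailing pieces, or bare keywords such
    as "italic" from non-CSS conventions) are skipped rather than guessed at.
    """
    declarations = {}
    for part in rend.split(";"):
        if ":" not in part:
            continue
        prop, value = part.split(":", 1)
        declarations[prop.strip().lower()] = value.strip()
    return declarations


if __name__ == "__main__":
    print(parse_rend("font-style: italic; text-align: center;"))
    print(parse_rend("italic"))  # not a CSS declaration, so nothing is extracted
```

A converter can then act on the properties it understands and ignore the rest, which is what leaves the end user free to override presentation.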
From ralf at ark.in-berlin.de Sat Oct 13 11:28:08 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sat, 13 Oct 2007 20:28:08 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470FC32B.5060409@bohol.ph> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> Message-ID: <20071013182808.GB5263@ark.in-berlin.de> > * Use of |<divGen>| for footnotes. I automatically generate footnote > sections at the end of the chapter they appear in. This requires > some tweaks with quoted sections, etc., but these can be handled > by software easily. That's possible in PGTEI 0.4 too. Use <note place="end"> and a divGen "footnote" at the end of each chapter. However, numbers are not reset, so you'll get marks with hundreds in some books. ralf From ralf at ark.in-berlin.de Sat Oct 13 11:23:04 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sat, 13 Oct 2007 20:23:04 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470EB4BC.9090201@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> Message-ID: <20071013182304.GA5263@ark.in-berlin.de> > The preponderance of the <hi> element (which is almost purely > presentational) together with the amount of paragraph abuse leads me to > conclude that even those people who are using TEI because it is > semantic, not presentational, are still marking up text using a > presentational philosophy. I don't think so. I'm using <hi> as a first translation of <i> and <b>, with the plan to change the <hi>s later to the special markup. I'm sure I've read advice to that effect somewhere in the docs. 
> I would bet that when I do an analysis of the > "rend" attribute on the <hi> element we will find that when it exists it > will almost always be 'italic'. You will see bold and, in my german language text, gesperrt/antiqua, which is then further defined in the style sheet. ralf From klofstrom at gmail.com Sat Oct 13 12:49:34 2007 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sat, 13 Oct 2007 09:49:34 -1000 Subject: [gutvol-d] Kindness to TEI novices Message-ID: <1e8e65080710131249g6930021x648e18c64054b632@mail.gmail.com> Thanks, Lee, for being so kind as to explain further. I've got a charming but tiny, simple book to put through DP. I've sworn to myself that I'm going to shepherd it through all the stages of preparation. When I post-process it, I'm going to prepare a PGTEI version, my first. Expect further stupid questions :) -- Karen Lofstrom aka Zora From marcello at perathoner.de Sat Oct 13 14:10:08 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 13 Oct 2007 23:10:08 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <470FC32B.5060409@bohol.ph> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> Message-ID: <47113430.7050608@perathoner.de> Jeroen Hellingman (Mailing List Account) wrote: > I never used PGTEI, for the following reasons: > * Use of ASCII only. Since my SGML work predates Unicode, I don't > use Unicode and stick to ASCII only. All characters outside ASCII > are encoded with entities. When including sections in non-Latin > script, such as Greek, I use ad-hoc transcription schemes. Since I > have tools to convert these to Unicode, this is no problem. You can do that in English. With other European languages it starts to look bad and becomes completely impossible with cyrillic and asian scripts. 
PGTEI has to accommodate all languages and therefore has full support for Unicode. Of course, it accepts many lesser encodings as input too, so if your editor just won't do Unicode, you can still use those other encodings + entities. > * Use of extensions. I try to avoid extensions to TEI, and stick > exclusively to TEILite, or borrow elements from the full-blown TEI > on a case-by-case basis when required. Funny. TEI-Lite is an extension of TEI. > * Use of |<divGen>| for tables of contents. Since we are digitizing > pre-existing texts, I avoid the use of the |<divGen>| element > for tables of contents and similar sections in favor of encoding > these as they appear in the source. The only exception is where > the source has no table of contents. Note that titles in original > tables of contents often differ considerably from the actual > headings used. Sometimes this is (apparently) intentional, > sometimes a mistake, which I will then correct. Just the same as for title pages, use of <divGen> for building tables of contents is optional. You can build your table of contents by hand. Use <ref> and <anchor> instead of <index> and <divGen>. It will just take you longer. > * Use of |<divGen>| for footnotes. I automatically generate footnote > sections at the end of the chapter they appear in. This requires > some tweaks with quoted sections, etc., but these can be handled > by software easily. Use of <divGen> makes PGTEI more flexible because you can collect footnotes at the end of the chapter OR at the end of the book, whichever you like best. (More exactly: you can collect them *anywhere* a <divGen> is legal markup.) > * Use of |<q>| elements. I try to follow the principle that TEI > texts should be the characters in the source plus tagging, and > thus encode all quotation marks with the proper characters or > character entities, as they appear in the source. 
I only use the > |<q>| element when required to add attributes to a run-in > quotation, but do not object to tagging it, but insist on keeping > the quotation marks. Use the embedded TEI stylesheet to set rend="pre: none; post: none" to <q> and <quote>. PGTEI will then no longer insert any quotation marks of its own. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Sat Oct 13 16:32:19 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 13 Oct 2007 19:32:19 EDT Subject: [gutvol-d] tim over at librarything.com Message-ID: <c27.22dc4d12.3442af83@aol.com> tim over at librarything.com is doing some interesting stuff, providing a new feature called "common knowledge", which is a "fielded wiki" where people can supply semi-structured data about books and authors, such as "important places in a book", "characters in a book", "author's residences", and so on... you can read more about it here: > http://www.librarything.com/blog/2007/10/common-knowledge-social-cataloging.php > http://www.librarything.com/blog/2007/10/common-knowledge-explodes.php -bowerbird From joshua at hutchinson.net Sun Oct 14 06:16:57 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Sun, 14 Oct 2007 13:16:57 +0000 (UTC) Subject: [gutvol-d] The TEI 80/20 rule - empirical data Message-ID: <4994264.1192367817750.JavaMail.?@fh1064.dia.cp.net> Plus, the way TEI works ... if someone were to take our "PGTEI" marked document and run it through their system that doesn't support things like <divGen> and our rend structure ... it'll just ignore them. You wouldn't get an automated Table of Contents, but the rest of it would come across. 
The layout might look different, since it would be ignoring the rend attributes, but the content would still be there. Josh >----Original Message---- >From: marcello at perathoner.de >Date: Oct 13, 2007 10:54 >To: "Project Gutenberg Volunteer Discussion"<gutvol-d at lists.pglaf.org> >Subj: Re: [gutvol-d] The TEI 80/20 rule - empirical data > >Karen Lofstrom wrote: > >> Perhaps it would be possible to do the TEI in two stages? One, a >> plain vanilla TEI. Academic quality. Two, this TEI marked up into >> PGTEI (a markup as automated as possible), a specialized format >> designed for easy on-the-fly generation of ebooks in various formats. > >People think PGTEI is some sort of degraded, impure, tainted TEI because >it adds some tags and defines the rend attribute (which TEI leaves >intentionally undefined). > >Well, it is not. PGTEI is a TEI application (extends TEI). PGTEI is in >no way different than TEI-Lite, which is another TEI application. > >TEI was expressly designed with extensibility in mind and ways of >building TEI applications have been built into the TEI DTD: > >> In brief, the TEI Guidelines define a general-purpose encoding scheme >> which makes it possible to encode different views of text, possibly >> intended for different applications, serving the majority of >> scholarly purposes of text studies in the humanities. However, no >> predefined encoding scheme can serve all research purposes. >> Therefore, the TEI also provides means of modifying and extending the >> encoding scheme defined by the Guidelines (see chapter 29 Modifying >> and Customizing the TEI DTD). 
> > ---- http://www.tei-c.org/P4X/AB.html#ABDPIU > > > >-- >Marcello Perathoner >webmaster at gutenberg.org > From jeroen.mailinglist at bohol.ph Sun Oct 14 12:24:32 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sun, 14 Oct 2007 21:24:32 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <47113430.7050608@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> Message-ID: <47126CF0.7080508@bohol.ph> Marcello Perathoner wrote: > Jeroen Hellingman (Mailing List Account) wrote: > > > PGTEI has to accommodate all languages and therefore has full support for > unicode. Of course, it inputs many lesser encodings too, so if your > editor just won't do unicode, you still can use those other encodings + > entities. > > Ten years ago, support for Unicode was nonexistent, and using any character set but ASCII was a nightmare when working with more than one system. (I had to work on Macs, PCs in DOS and PCs in Windows at that time.) Now that we have Unicode, it has become much easier. >> * Use of extensions. I try to avoid extensions to TEI, and stick >> exclusively to TEILite, or borrow elements from the full-blown TEI >> on a case-by-case basis when required. >> > > Funny. TEI-Lite is an extension of TEI. > > If you belong to the school that considers a subset an extension, I can agree. I will not use extensions where perfectly valid TEI constructs exist, or where I think the purpose of the tagging lies outside the scope of semantic tagging (such as conditional switches in the tagged text, not in the rendering code). 
If I need some really odd, one-of-a-kind construct, I can always include an illustration. However, if I have a need for an extension, I will certainly invent it. > You can build your table of contents by hand. Use <ref> and <anchor> > instead of <index> and <divGen>. It will just take you longer. > > When I build my table of contents by hand, based on the book at hand, I will more accurately capture their contents. If I regenerate from the available heads, they are often quite different. > Use of <divGen> makes PGTEI more flexible because you can collect > footnotes at the end of the chapter OR at the end of the book, whichever > you like best. (More exactly: you can collect them *any* place a > <divGen> is legal markup.) > By tweaking my XSLT, I can put them anywhere I like, and leave that choice to the person rendering, not the person encoding the text. I have XSLT scripts that produce tweaked HTML that goes through Prince, and the footnotes end up as footnotes in a PDF. Jeroen. From jon at noring.name Sun Oct 14 13:10:26 2007 From: jon at noring.name (Jon Noring) Date: Sun, 14 Oct 2007 14:10:26 -0600 Subject: [gutvol-d] "California court tilts towards mandating web accessibility" -- ebook connection? Message-ID: <1421257391.20071014141026@noring.name> [I have already posted the following to The eBook Community, and am reposting it here for discussion specific to PG and DP. I won't go into detail in this preamble what I see are the connections, but briefly it revolves around the use of the PG corpus in public schools, and my suggestion that for TEI mastering, NCX be used for the navigational lists, including the Table of Contents.] Everyone, Large and small publishers of digital text content, such as ebooks, need to be aware that legal requirements (such as the Americans with Disabilities Act, ADA) may eventually force them to adopt accessible- friendly formats, and to implement them in an accessible manner. 
(Btw, this is something I've predicted would happen for the last 10 years on The eBook Community, and we are now seeing the opening salvos.) A recent ruling regarding Target and the accessibility of its web site is a sort of presage of what may come to the digital publishing world: http://www.theregister.co.uk/2007/10/14/california_target_web_accessibility/ At first glance, this court case appears limited to publicly-accessible web sites, but it is clear that any textual content which is digitally readable, whether online or remotely, may be subject to disability laws in the future, especially if such content is used in the public sector, such as for education, the government, public libraries, etc. (The likely scenario is that publishers have to provide at least an accessible version, but because of production costs, this will likely lead to simply using accessible formats for all publications as will be mentioned later.) Now a lot of digital text formats are accessible IF IMPLEMENTED PROPERLY, but the issue is that many publishers today do not implement them properly, as the Target case illustrates. (It's well-known that a lot of web sites are wholly inaccessible because their markup focuses on presentational markup rather than focusing on document structure and important inline text semantics, and using CSS for most visual styling -- refer to CSS Zen Garden, http://www.csszengarden.com/ , for a demo on how web markup *should* be done. The W3C Web Content Accessibility Guidelines, WCAG, should be religiously followed as they apply: http://www.w3.org/WAI/intro/wcag20 .) Now, many ebook formats actually use HTML/XHTML in some fashion, such as the new IDPF "EPub" format. So it is important that publishers use only clean XHTML which minimizes presentational elements and attributes -- and NEVER NEVER NEVER use tables for layout purposes -- and religiously follow WCAG guidelines as they apply. 
(I've run a web site for several years now which uses tables for layout, something I'm very much NOT proud of, and plan to upgrade it very soon.) Unfortunately some tools which generate XHTML from word processing formats produce garbage XHTML, and I mean garbage which is completely inaccessible. Interestingly, the more "web accessible" the markup is, the easier it is to author and edit and maintain, and it is much more repurposeable, so with web accessibility comes greater repurposeability, simpler documents, and ultimately lower cost in the publishing workflow -- and without sacrificing presentation quality. (So publishers can have their cake and eat it, too.) Another aspect of the accessibility of digital texts is navigation, and one that's less understood. It is little known that in the U.S. all textbooks used in the K-12 sector must also be provided in a form that follows DAISY's Digital Talking Book format, a sort of supercharged XHTML (with some TEI influenced markup) that EPub now supports. As part of this requirement, all DTB (and EPub) *must* include what's called NCX, an XML document that contains each publication's "navigational lists", including a required Table of Contents. Publishers need to begin to understand NCX (at least the EPub subset of NCX which is REQUIRED for all EPubs) and implement it in their work flow. 
Jon Noring From jon at noring.name Sun Oct 14 13:15:19 2007 From: jon at noring.name (Jon Noring) Date: Sun, 14 Oct 2007 14:15:19 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <47126CF0.7080508@bohol.ph> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> Message-ID: <1249433446.20071014141519@noring.name> Jeroen wrote: > By tweaking my XSLT, I can put them anywhere I like, and leave that > choice to the person rendering, not the person encoding the text. I have > XSLT scripts that produce tweaked HTML that goes through Prince, and the > footnotes end up as footnotes in a PDF. Cool! I've been following the development of the Prince application for a few years now, and have gotten to know quite well Michael Day who developed Prince. It shows the power of XML+CSS to produce fixed paged output. Jon Noring From jeroen.mailinglist at bohol.ph Sun Oct 14 13:56:56 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sun, 14 Oct 2007 22:56:56 +0200 Subject: [gutvol-d] "California court tilts towards mandating web accessibility" -- ebook connection? In-Reply-To: <1421257391.20071014141026@noring.name> References: <1421257391.20071014141026@noring.name> Message-ID: <47128298.50100@bohol.ph> Jon Noring wrote: > Large and small publishers of digital text content, such as ebooks, > need to be aware that legal requirements (such as the Americans with > Disabilities Act, ADA) may eventually force them to adopt accessible- > friendly formats, and to implement them in an accessible manner. 
> I think (and hope), in the US, this type of thinking will eventually be scrapped by the workings of the first amendment, as such requirements add considerable cost to developing websites, and are thus an impediment to free speech (These things are not just about using sane techniques and semantic mark-up). It should be up to the speaker to decide in which language to speak, and, if such language cannot be understood by a group of people, that should be the speaker's choice. Otherwise, requirements to make websites in basic English only will soon pop-up, and we will only be able to serve the lowest common denominator of the idiots that can somehow operate a computer. In other countries, I also hope common sense will prevail. Leaving principle aside, I strongly believe it is in your own interest, if you want to be heard, to make sites as accessible as possible. I fully believe that government sites, and others paid for by public money should be legally prescribed to be accessible, but I reject the notion that you, as a private individual or company, need to adjust your speech to be heard by everybody, if you don't want that. Any other direction would be a kind of censorship. (Just as those idiotic one sentence French clauses in contracts that state that parties have agreed to use English for the remainder). As you may be aware, there has been a lot of talking about the latest round of accessibility guidelines by various groups, and I believe, with other critics (http://www.theregister.co.uk/2007/10/10/web_accessibility_critic/) that many of them will be ignored, and for good reasons as well (too vague to be of practical applicability, too limiting on artistic expressions, etc.) I've been working hard to make ebooks accessible, by providing HTML versions with reading hints and navigation aids, and will add more of such features in future, but I will never go as far as to rephrase them in simple language (as some of the accessibility guidelines suggest). 
Jeroen Hellingman From marcello at perathoner.de Sun Oct 14 16:18:59 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 15 Oct 2007 01:18:59 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <47126CF0.7080508@bohol.ph> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> Message-ID: <4712A3E3.5030204@perathoner.de> Jeroen Hellingman (Mailing List Account) wrote: >> Funny. TEI-Lite is an extension of TEI. >> > If you belong to the school that considers a subset an extension, I can > agree. TEI-Lite extends TEI in that it adds some tags that are not in TEI. > When I build my table of contents by hand, based on the book at hand, I > will more accurately capture their contents. If I regenerate from the > available heads, they are often quite different. PGTEI allows to build a TOC with entries differing from the chapter heads, like this: <div> <index level1="Chapter 2"> <head>Chapter the Second</head> The TOC will say: Chapter 2. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Sun Oct 14 22:31:41 2007 From: jon at noring.name (Jon Noring) Date: Sun, 14 Oct 2007 23:31:41 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <4712A3E3.5030204@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> Message-ID: <102352901.20071014233141@noring.name> Marcello wrote: > TEI-Lite extends TEI in that it adds some tags that are not in TEI. 
Wow, I did not know this! I was under the impression that TEI-Lite was a pure subset of TEI, meaning that any conceivable XML document valid to TEI-Lite will also validate to TEI. So what are the "additions"? Is that briefly documented somewhere? Such "additions" can be elements, attributes, attribute values (for those attributes having a set of possible values), and element content model differences. Jon Noring From ralf at ark.in-berlin.de Mon Oct 15 01:21:11 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 15 Oct 2007 10:21:11 +0200 Subject: [gutvol-d] Kindness to TEI novices In-Reply-To: <1e8e65080710131249g6930021x648e18c64054b632@mail.gmail.com> References: <1e8e65080710131249g6930021x648e18c64054b632@mail.gmail.com> Message-ID: <20071015082111.GB6969@ark.in-berlin.de> > I've got a charming but tiny, simple book to put through DP. I've > sworn to myself that I'm going to shepherd it through all the stages > of preparation. When I post-process it, I'm going to prepare a PGTEI > version, my first. Expect further stupid questions :) Please see our DP thread http://www.pgdp.net/phpBB2/viewtopic.php?t=16031 You might want to have a look also at the (in construction) http://www.pgdp.net/wiki/Post-Processing_With_PGTEI_0.4 ralf From ralf at ark.in-berlin.de Sun Oct 14 07:40:27 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sun, 14 Oct 2007 16:40:27 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <20071013182808.GB5263@ark.in-berlin.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <20071013182808.GB5263@ark.in-berlin.de> Message-ID: <20071014144027.GA6093@ark.in-berlin.de> (replying to myself) > That's possible in PGTEI 0.4 too. Use <note place="end"> and > a divGen "footnote" at the end iof each chapter. 
However, numbers > are no reset, so you'll get marks with hundreds in some books. Also, what's not possible is multiple marks for one footnote, and a mark inside <sp> or <speaker> for marking the speaker name. Both of which I would need for my current project. ralf From marcello at perathoner.de Mon Oct 15 04:41:46 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 15 Oct 2007 13:41:46 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <102352901.20071014233141@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> Message-ID: <471351FA.2060100@perathoner.de> Jon Noring wrote: > So what are the "additions"? Is that briefly documented somewhere? 
<!-- TEILiteX.dtd: TEI.extensions.dtd file for TEI Lite -->
<!-- Define some additions for the phrase level tags -->
<!-- Revisions: -->
<!-- 2002-01-21 : LB add type attribute for consistency -->
<!-- 2001-12-07 : LB : parameterize for P4 -->
<!-- 1995-02-17 : CMSMcQ : make file after agreements w/LB -->

<!ENTITY % gi 'INCLUDE' >
<![ %gi; [
<!ELEMENT %n.gi; %om.RO; (#PCDATA) >
<!ATTLIST %n.gi; %a.global;
          TEI (yes | no) "yes"
          TEIform CDATA 'gi' >
]]>

<!ENTITY % eg 'INCLUDE' >
<![ %eg; [
<!ELEMENT %n.eg; %om.RR; (#PCDATA) >
<!ATTLIST %n.eg; %a.global;
          TEIform CDATA 'eg' >
]]>

<!ENTITY % code 'INCLUDE' >
<![ %code; [
<!ELEMENT code %om.RO; (#PCDATA) >
<!ATTLIST code %a.global; >
]]>

<!ENTITY % ident 'INCLUDE' >
<![ %ident; [
<!ELEMENT ident %om.RO; (#PCDATA) >
<!ATTLIST ident %a.global;
          type CDATA #IMPLIED >
]]>

<!ENTITY % kw 'INCLUDE' >
<![ %kw; [
<!ELEMENT kw %om.RO; (#PCDATA) >
<!ATTLIST kw %a.global;
          type CDATA #IMPLIED >
]]>

-- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Mon Oct 15 07:23:13 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 08:23:13 -0600 Subject: [gutvol-d] What about P5? (was additions to TEI-Lite) In-Reply-To: <471351FA.2060100@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> <471351FA.2060100@perathoner.de> <1811467071.20071015082313@noring.name> [TEI P5 questions asked at end] Marcello wrote: > Jon Noring wrote: >> So what are the "additions"? Is that briefly documented somewhere? 
> <!-- TEILiteX.dtd: TEI.extensions.dtd file for TEI Lite --> > <!-- Define some additions for the phrase level tags --> > <!-- Revisions: --> > <!-- 2002-01-21 : LB add type attribute for consistency --> > <!-- 2001-12-07 : LB : parameterize for P4 --> > <!-- 1995-02-17 : CMSMcQ : make file after agreements w/LB --> The element additions in TEI-Lite appear to be the elements <gi>, <eg>, <code>, <ident> and <kw>. The attribute addition appears to be to allow the 'type' attribute on <lb/>. Not sure what CMSMcQ is, or if it is even relevant to PG/DP usage. Checking the new P5, it looks like all the elements listed above have been added (although not sure on <kw> -- there is a new <keywords> which appears to do the same thing.) It does not appear that the 'type' attribute has been added to <lb/>. That's only a quick analysis. I'm sure others here can provide more complete analysis. ***** O.k., about the new TEI P5 that seems poised to be issued as version 1.0 -- the usual questions: How do the TEI experts here view P5? Is it an improvement? Should all PG/DP documents be upgraded to conform to P5 (if not already)? Etc. Jon Noring From lee at novomail.net Mon Oct 15 08:33:26 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 15 Oct 2007 09:33:26 -0600 Subject: [gutvol-d] What about P5? 
(was additions to TEI-Lite) In-Reply-To: <1811467071.20071015082313@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> <471351FA.2060100@perathoner.de> <1811467071.20071015082313@noring.name> Message-ID: <47138846.5030705@novomail.net> Jon Noring wrote: > [TEI P5 questions asked at end] > > > Marcello wrote: >> Jon Noring wrote: > >>> So what are the "additions"? Is that briefly documented somewhere? > >> <!-- TEILiteX.dtd: TEI.extensions.dtd file for TEI Lite --> >> <!-- Define some additions for the phrase level tags --> >> <!-- Revisions: --> >> <!-- 2002-01-21 : LB add type attribute for consistency --> >> <!-- 2001-12-07 : LB : parameterize for P4 --> >> <!-- 1995-02-17 : CMSMcQ : make file after agreements w/LB --> > > The element additions in TEI-Lite appear to be the elements <gi>, > <eg>, <code>, <ident> and <kw>. The attribute addition appears to be > to allow the 'type' attribute on <lb/>. Not sure what CMSMcQ is, or if > it is even relevant to PG/DP usage. LB => Lou Burnard CMSMcQ => C. M. Sperberg-McQueen As near as I can tell, (I speak only pidgin DTD) there is nothing in here about the <lb> element. Instead, the DTD indicates that TEI was extended by the addition of 5 new elements, which presumably are documented elsewhere. -- Nothing of significance below this line. 
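[Editorial sketch, not part of the original message.] The reading above (five added elements, nothing about <lb>) can be checked mechanically by listing the element declarations in the quoted DTD. The Python below is only a rough illustration: the fragment is abbreviated to the <!ELEMENT ...> lines from the thread, the function name declared_elements is made up for this example, and it assumes that a parameter entity such as %n.gi; resolves to the plain name "gi" by default.

```python
import re

# Element declarations from the TEILiteX.dtd fragment quoted in the thread,
# abbreviated to the <!ELEMENT ...> lines only.
DTD_FRAGMENT = """
<!-- 2002-01-21 : LB add type attribute for consistency -->
<!ELEMENT %n.gi; %om.RO; (#PCDATA) >
<!ELEMENT %n.eg; %om.RR; (#PCDATA) >
<!ELEMENT code %om.RO; (#PCDATA) >
<!ELEMENT ident %om.RO; (#PCDATA) >
<!ELEMENT kw %om.RO; (#PCDATA) >
"""

# Matches either a renaming entity (%n.NAME;) or a literal element name.
# Assumption: %n.NAME; expands to NAME unless a customization renames it.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(?:%n\.([\w.-]+);|([\w.-]+))")


def declared_elements(dtd_text):
    """Return the element names declared in a DTD fragment, in order."""
    names = []
    for match in ELEMENT_DECL.finditer(dtd_text):
        names.append(match.group(1) or match.group(2))
    return names


print(declared_elements(DTD_FRAGMENT))
# prints ['gi', 'eg', 'code', 'ident', 'kw']
```

Comments such as the "LB add type attribute" revision note are not declarations and are skipped, which matches the point that "LB" is a set of initials rather than an element.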
From hart at pglaf.org Mon Oct 15 08:34:22 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 15 Oct 2007 08:34:22 -0700 (PDT) Subject: [gutvol-d] Research on Listservers Message-ID: <Pine.LNX.4.64.0710150834060.16118@pglaf.org> Research on Listservers It has now been nearly 20 years since I started my first listserver, and I have been wondering if anyone else may have noticed any yearly trends they could report on. Even if you have just a small suspicion that listservers act slightly differently at some times of the year, just let me know what you think, and perhaps we may spot some kind of patterns that may help understand listservers in future of the Internet. Please email me at: hart at pglaf.org I will reply to all such emails, so if you do not get an answer in a few days, please email me again. Please feel free to forward this to other listservers. Thanks!!! Michael S. Hart Founder Project Gutenberg From marcello at perathoner.de Mon Oct 15 08:40:14 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 15 Oct 2007 17:40:14 +0200 Subject: [gutvol-d] What about P5? (was additions to TEI-Lite) In-Reply-To: <1811467071.20071015082313@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> <471351FA.2060100@perathoner.de> <1811467071.20071015082313@noring.name> Message-ID: <471389DE.6040801@perathoner.de> Jon Noring wrote: >> <!-- 2002-01-21 : LB add type attribute for consistency --> > It does not appear that the > 'type' attribute has been added to <lb/>. 
LB == Lou Burnard -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Mon Oct 15 08:48:36 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 09:48:36 -0600 Subject: [gutvol-d] What about P5? (was additions to TEI-Lite) In-Reply-To: <471389DE.6040801@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> <471351FA.2060100@perathoner.de> <1811467071.20071015082313@noring.name> <471389DE.6040801@perathoner.de> Message-ID: <1247294775.20071015094836@noring.name> Marcello wrote: > Jon Noring wrote: >>> <!-- 2002-01-21 : LB add type attribute for consistency --> >> It does not appear that the >> 'type' attribute has been added to <lb/>. > LB == Lou Burnard <laugh type="egg on my face"/> I should have noticed since LB was capitalized! Jon From jon at noring.name Mon Oct 15 08:55:03 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 09:55:03 -0600 Subject: [gutvol-d] What about P5? 
(was additions to TEI-Lite) In-Reply-To: <47138846.5030705@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> <471351FA.2060100@perathoner.de> <1811467071.20071015082313@noring.name> <47138846.5030705@novomail.net> Message-ID: <1936014050.20071015095503@noring.name> Lee Passey wrote: > Jon Noring wrote: >> From TEI-Lite DTD >>> <!-- TEILiteX.dtd: TEI.extensions.dtd file for TEI Lite --> >>> <!-- Define some additions for the phrase level tags --> >>> <!-- Revisions: --> >>> <!-- 2002-01-21 : LB add type attribute for consistency --> >>> <!-- 2001-12-07 : LB : parameterize for P4 --> >>> <!-- 1995-02-17 : CMSMcQ : make file after agreements w/LB --> >> The element additions in TEI-Lite appear to be the elements <gi>, >> <eg>, <code>, <ident> and <kw>. The attribute addition appears to be >> to allow the 'type' attribute on <lb/>. Not sure what CMSMcQ is, or if >> it is even relevant to PG/DP usage. > LB =>> Lou Burnard > CMSMcQ =>> C. M. Sperberg-McQueen As noted in my other reply, I certainly misread that DTD comment big time. The capitalization should have been a clue that they weren't element names. I was fortunate to have met C. Michael Sperberg-McQueen (brother of Roger Sperberg) at an XML Conference in Philadelphia in late 1999. We chatted for a short while. Brilliant person. > As near as I can tell, (I speak only pidgin DTD) there is nothing in > here about the <lb> element. Instead, the DTD indicates that TEI was > extended by the addition of 5 new elements, which presumably are > documented elsewhere. 
It does seem like four of these elements have been added to P5, with the fifth, TEI-Lite <kw>, being represented by the P5 <keywords> (not sure on this, though.) Jon Noring From lee at novomail.net Mon Oct 15 09:18:07 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 15 Oct 2007 10:18:07 -0600 Subject: [gutvol-d] What about P5? (was additions to TEI-Lite) In-Reply-To: <1936014050.20071015095503@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <47113430.7050608@perathoner.de> <47126CF0.7080508@bohol.ph> <4712A3E3.5030204@perathoner.de> <102352901.20071014233141@noring.name> <471351FA.2060100@perathoner.de> <1811467071.20071015082313@noring.name> <47138846.5030705@novomail.net> <1936014050.20071015095503@noring.name> Message-ID: <471392BF.10302@novomail.net> Jon Noring wrote: > It does seem like four of these elements have been added to P5, with the > fifth, TEI-Lite <kw>, being represented by the P5 <keywords> (not sure > on this, though.) The <keywords> tag already exists in P5, P4 and Lite. It is designed to hold a list of keywords, perhaps from some controlled vocabulary. In TEI-Lite, the individual keywords can be indicated using the <kw> tag. In P5, the individual keywords are indicated either by using the <term> tag, or by using a <list> element which is composed of <item>s. -- Nothing of significance below this line. 
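[Editorial sketch, not part of the original message.] The three ways of marking individual keywords described above can be put side by side. The Python below is a hedged illustration only: the sample keyword values are invented, the TEI namespace and any attributes on <keywords> are omitted for brevity, and the <kw> form simply follows the description in this message rather than a checked schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples of the three shapes described in the message.
LITE_KW = "<keywords><kw>ebooks</kw><kw>markup</kw></keywords>"
P5_TERMS = "<keywords><term>ebooks</term><term>markup</term></keywords>"
P5_LIST = (
    "<keywords><list><item>ebooks</item><item>markup</item></list></keywords>"
)

# Tags that carry an individual keyword in one of the three shapes.
KEYWORD_TAGS = ("kw", "term", "item")


def extract_keywords(xml_text):
    """Collect keyword strings from a <keywords> element in any of the shapes."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag in KEYWORD_TAGS and element.text
    ]


for sample in (LITE_KW, P5_TERMS, P5_LIST):
    print(extract_keywords(sample))
```

All three samples yield the same keyword list, which is the practical upshot: a converter only has to normalise the wrapper elements, not the content.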
From piggy at netronome.com Mon Oct 15 10:26:24 2007
From: piggy at netronome.com (La Monte Henry Piggy Yarroll)
Date: Mon, 15 Oct 2007 13:26:24 -0400
Subject: [gutvol-d] The TEI 80/20 rule - empirical data
In-Reply-To: <470EB4BC.9090201@novomail.net>
References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net>
Message-ID: <4713A2C0.1090004@netronome.com>

Lee Passey wrote:
> [snip]
>
> Yet more grist for the mill:
>
> ...
> Total Docs, Percentage, Total Use, Avg per Doc, Element Name
>        109,     100.00,      9928,       91.08, head
>        109,     100.00,      9654,       88.57, div
>        109,     100.00,       623,        5.72, divGen
>        109,     100.00,       109,        1.00, front
>        109,     100.00,       109,        1.00, body
>        109,     100.00,       109,        1.00, back
>        108,      99.08,     87341,      808.71, p
>        106,      97.25,     18293,      172.58, index
>         95,      87.16,     28950,      304.74, hi
>         69,      63.30,      5170,       74.93, lb
>         66,      60.55,      8796,      133.27, note
>         63,      57.80,       175,        2.78, then
>         63,      57.80,       175,        2.78, pgIf
>         63,      57.80,       174,        2.76, else
>         52,      47.71,      9189,      176.71, pb
>         49,      44.95,      8731,      178.18, anchor
>         34,      31.19,       640,       18.82, figure
>         34,      31.19,       640,       18.82, figDesc
>         32,      29.36,     11891,      371.59, l
>         32,      29.36,       899,       28.09, lg
>         30,      27.52,      3898,      129.93, milestone
>         22,      20.18,        33,        1.50, titlePart
>         22,      20.18,        23,        1.05, docImprint
>         22,      20.18,        22,        1.00, docTitle
>         22,      20.18,        22,        1.00, titlePage
>         21,      19.27,        22,        1.05, docAuthor
>         20,      18.35,       165,        8.25, quote
>         20,      18.35,        21,        1.05, byline
>         16,      14.68,      9818,      613.63, cell
>         16,      14.68,      4167,      260.44, row
>         16,      14.68,       313,       19.56, table
>         14,      12.84,      5392,      385.14, ref
>         14,      12.84,        14,        1.00, docDate
>         13,      11.93,      1038,       79.85, item
>         13,      11.93,       177,       13.62, list
>         11,      10.09,       134,       12.18, corr
>          8,       7.34,      7161,      895.13, q
>          7,       6.42,         7,        1.00, docEdition
>          5,       4.59,       559,      111.80, emph
>          4,       3.67,       152,       38.00, title
>          3,       2.75,       494,      164.67, abbr
>          3,       2.75,       212,       70.67, foreign
>          3,       2.75,        81,       27.00, reg
>          3,       2.75,         8,        2.67, name
>          2,       1.83,       671,      335.50, formula
>          2,       1.83,        28,       14.00, sic
>          2,       1.83,        27,       13.50, bibl
>          2,       1.83,        26,       13.00, author
>          2,       1.83,         2,        1.00, epigraph
>          2,       1.83,         2,        1.00, date
>          2,       1.83,         2,        1.00, add
>          2,       1.83,         2,        1.00, trailer
>          1,       0.92,        13,       13.00, label
>          1,       0.92,         8,        8.00, argument
>          1,       0.92,         2,        2.00, del
>          1,       0.92,         1,        1.00, eg
>          1,       0.92,         1,        1.00, cit
>

Fascinating. With all three datasets I see the same rough shape. Simply plot the frequency of each element in rank order.

It looks to me like two separate statistical processes operating on the same data. If you look at the first four, p, hi, index and l, you have the start of something like an exponential distribution. From rank 5 through rank 15, we have something that looks more like a Gaussian hump. This encompasses head, cell, div, pb, note, anchor, q, ref, lb, row and milestone. From rank 16 onward, item, lg, formula, figure, etc... we have a long tail which might be the sum of a Gaussian tail and an exponential tail.

What are the two selection processes? I am going to speculate that presentational markup is the exponential process and structural markup is the Gaussian process.

I would further speculate that we could define a "flatness" metric which would be higher for documents using proportionately more structural markup than presentational markup.
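The rank-order view described above can be tried out in a few lines of Python. This is only a minimal sketch: the counts are the eight largest "Total Use" values copied from the table, and the flatness() function is a hypothetical stand-in (normalized Shannon entropy of the usage shares), not a metric anyone in this thread has actually defined.

```python
import math

# Total-use counts for the eight most-used elements, copied from the table.
TOTAL_USE = {
    "p": 87341, "hi": 28950, "index": 18293, "l": 11891,
    "head": 9928, "cell": 9818, "div": 9654, "pb": 9189,
}


def rank_order(counts):
    """Return (rank, element, count) tuples, most-used element first."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, name, count)
            for rank, (name, count) in enumerate(ordered, start=1)]


def flatness(counts):
    """Hypothetical metric: entropy of the usage shares divided by its
    maximum, so 1.0 means every element is used equally often."""
    total = sum(counts.values())
    shares = [c / total for c in counts.values() if c > 0]
    if len(shares) < 2:
        return 0.0
    entropy = -sum(s * math.log(s) for s in shares)
    return entropy / math.log(len(shares))


for rank, name, count in rank_order(TOTAL_USE):
    # A log scale makes an exponential fall-off show up as a straight line.
    print(f"{rank:2d} {name:6s} {count:6d} log10={math.log10(count):.2f}")
print(f"flatness = {flatness(TOTAL_USE):.3f}")
```

Whether an entropy-style number really separates structural from presentational markup is an open question; it would have to be computed per document rather than over the pooled counts used here.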
From lee at novomail.net Mon Oct 15 10:53:52 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 15 Oct 2007 11:53:52 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <20071013182304.GA5263@ark.in-berlin.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> Message-ID: <4713A930.7060103@novomail.net> Ralf Stephan wrote: > > The preponderance of the <hi> element (which is almost purely > > presentational) together with the amount of paragraph abuse leads > > me to conclude that even those people who are using TEI because it > > is semantic, not presentational, are still marking up text using a > > presentational philosophy. > > I don't think so. I'm using <hi> as a first translation of <i> and > <b>, with the plan to change the <hi>s later to the special mark up. > I'm sure I've read somewhere in the docs an advice to that effect. I think Josh Hutchinson is the biggest proponent of TEI levels, level 1 being bare bones and level 3 (or higher) being only for uber-geeks. Mr. Perathoner alludes to it in his "Guide to PGTEI" (http://pgtei.pglaf.org/marcello/0.3/doc/20000-h.html) when he states: <cite> You can and should mark up a text incrementally. That is: make more than one pass over the whole text and in each pass mark up a subset of elements. You may start marking only the most prominent text features like chapters and paragraphs. Later you make a second pass marking all italicized text. If you still want to do more, make another pass replacing all quotation marks with the <q> element. TODO: a PG working group needs to codify different "levels" of PGTEI markup.
</cite> However, just because you are following this guidance doesn't mean that the markup isn't presentational; indeed it makes it more likely that it /is/ presentational, and not semantic. (Semantic: "of, pertaining to, or arising from the different meanings of words or other symbols.") When you mark a passage with <hi rend="italic"> (or some variation thereof) what you are saying is, "I cannot or will not state with confidence why this passage was italicized, but it was." When you mark a passage with <emph rend="italic"> what you are saying is "this passage is emphasized; in the original text it was italicized, but you may render it in whatever way you render emphasized text." Saying that something was presented in a particular way is presentational markup; saying /why/ it was presented in a particular way is semantic markup. Generally, I support the notion of levels of markup. However, it can lead to some unfortunate consequences. In the first place it can lead to the loss of data. While the <hi> element can contain an attribute indicating how the highlighting was rendered in a particular edition, the "type" attribute is not allowed. Therefore, there is no good way to indicate "this <hi> differs from that <hi>, and the second needs to be revisited to determine its semantic meaning." Neither is there any good way to record any hints about why this particular rendering may have been used. If you view purely presentational markup as "level one," then I would suggest that HTML ought to be level one markup; it contains all the power of purely presentational TEI, and is more directly useful. The second, and perhaps more pernicious, problem with being satisfied with low levels of markup is that once such a text is placed in the PG database the chances of anyone coming along later to fix the markup become vanishingly small. Looking at my data I see that while 87% of the texts use the <hi> (presentational) element, only 4.6% use the <emph> (semantic) element.
What are the chances that anyone is going to go through all those texts and convert all the presentational markup to semantic markup? And how much harder would it have been for the original poster to just use semantic markup in the preparation of the texts in the first place? I'm a firm believer in the old adage that it is easier to do things right than to do them over. My suggestion is that if you're using <hi> in the first pass, with the intent to convert them to semantic markup in a subsequent pass, you probably ought to keep them in your queue and not pass them on to PG until the upgrade has occurred. What I am trying to do is come up with an overview of different levels of markup complexity which still maintain the unique semantic markup which is the hallmark of TEI. As a more complete example, consider the <p> element. This element is supposed to be used for paragraphs, which are a group of one or more complete sentences which express a single thought or topic. Unfortunately, OCR programs are so far incapable of detecting complete sentences to say nothing of single thoughts or topics. So HTML output from an OCR program will mark /every/ block with the <p> element; they simply can't do any better than that. So let's say you've just completed an OCR of "Rip Foster Rides the Gray Planet." You might find a resultant passage such as: ... could hear, "I'll bust the bubble of any son of a space sausage who laughs!"</p> <p style="text-align:center">Chapter Two - Rake That Radiation!</p> <p>The deputy commander and the safety officer got untangled and hurried to ... It's pretty obvious that the second "paragraph" is not really a paragraph, it's the title of the next chapter. If you were to change the "style" keyword to "rend" you would have perfectly legal TEI markup -- but it would still be purely presentational. 
In fact, marking the chapter title as a <p> may be said to be anti-semantic; by doing so you are, in essence, saying, "make this block /look/ like a paragraph, but don't give it the /meaning/ of a paragraph." Now if you wanted to keep the presentational aspects of a paragraph without lying about the semantics, you could replace the <p> in the title with <ab>, which "contains any arbitrary component-level unit of text, ... analogous to, but without the semantic baggage of, a paragraph." And if you wanted to take advantage of the PGTEI XSLT transformation script to auto-generate a table of contents you could add an <index> element. Now you would have: ... could hear, "I'll bust the bubble of any son of a space sausage who laughs!"</p> <index index="toc" level1="Chapter Two - Rake That Radiation!" /> <ab style="text-align:center">Chapter Two - Rake That Radiation!</ab> <p>The deputy commander and the safety officer got untangled and hurried to ... Note: the attributes of the <index> element have changed in P5. If you look carefully, you will see that this markup is still mostly presentational. The only improvement is that now only real paragraphs are marked as <p>. Even the use of the <index> element is presentational; it says "build me a list referring to this point in the text, without any implication as to what the list may signify." What we really want to do is change the <ab> to <head>, because that's what the phrase is: it's the heading on a chapter. Unfortunately, we can't just change the elements, because TEI only allows <head> elements to appear at the beginning of a textual division, before any paragraphs. That's really not a big deal, because we want to identify each chapter as its own division of text anyway. So if we add some <div>s, and change the <ab> to head we might end up with: ... could hear, "I'll bust the bubble of any son of a space sausage who laughs!"</p> </div> <index index="toc" level1="Chapter Two - Rake That Radiation!" 
/> <div rend="page-break-before: always"> <head style="text-align:center">Chapter Two - Rake That Radiation!</head> <p>The deputy commander and the safety officer got untangled and hurried to ... Now we're starting to get some semantic markup in the file. All <p>s are paragraphs, chapter headings are marked as headings, and each chapter is marked as being a unified division of the text. Unfortunately, we still have some presentational cruft hanging around. Maybe we want to keep it around for historical purposes, but maybe we can remove it as being implied in semantic markup (or even make it explicit in a <rendition> element). Our most recent revision doesn't really specify what the nature of the divisions are, so let's start by making that explicit. And if all our chapters start on a new page, the "rend" attribute that tells us that that is how it was done is superfluous. So let's add a "type" attribute to our <div>s, and get rid of the "rend". We now have: ... could hear, "I'll bust the bubble of any son of a space sausage who laughs!"</p> </div> <index index="toc" level1="Chapter Two - Rake That Radiation!" /> <div type="chapter"> <head style="text-align:center">Chapter Two - Rake That Radiation!</head> <p>The deputy commander and the safety officer got untangled and hurried to ... Now we can see that because all chapter headings are centered, and because we now know that the <head> element belongs to a chapter, the "rend" attribute on the <head> element is unnecessary. And because we know where every chapter begins, and we know how to get the title of every chapter, we can build a table of contents if we wanted to without the need of an <index> element. The resulting markup is now: ... could hear, "I'll bust the bubble of any son of a space sausage who laughs!"</p> </div> <div type="chapter"> <head>Chapter Two - Rake That Radiation!</head> <p>The deputy commander and the safety officer got untangled and hurried to ... 
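The claim that a table of contents can be built from the chapter divisions alone, with no <index> element, can be checked with a short Python sketch. The assumptions are loud ones: the wrapping <body> element, the "Chapter One" placeholder and the appendix division are invented here purely so the fragment parses and has something to filter out; only the Chapter Two heading comes from the example above. This is not how the PGTEI stylesheets do it.

```python
import xml.etree.ElementTree as ET

# Stand-in document. Everything except the Chapter Two <head> is placeholder
# content made up for this sketch.
SAMPLE = """\
<body>
  <div type="chapter">
    <head>Chapter One - (placeholder title)</head>
    <p>Placeholder paragraph.</p>
  </div>
  <div type="chapter">
    <head>Chapter Two - Rake That Radiation!</head>
    <p>The deputy commander and the safety officer got untangled ...</p>
  </div>
  <div type="appendix">
    <head>Placeholder appendix, not a chapter</head>
  </div>
</body>
"""


def table_of_contents(root):
    """Collect the <head> text of every <div type="chapter">, in document order."""
    toc = []
    for div in root.iter("div"):
        if div.get("type") != "chapter":
            continue  # other kinds of division stay out of the chapter list
        head = div.find("head")
        if head is not None:
            toc.append("".join(head.itertext()).strip())
    return toc


for number, title in enumerate(table_of_contents(ET.fromstring(SAMPLE)), start=1):
    print(f"{number}. {title}")
```

The point of the sketch is that the type="chapter" attribute carries enough meaning for a script to act on, which the centered <p> in the OCR output did not.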
What we have done here is taken a purely presentational TEI markup and transformed it into a purely semantic TEI markup. And yet, the only tags we have used are <div>, <head> and <p>. This feels to me like it is still "level one;" we have only used basic tags and not very many of them at that. What /has/ changed is our mindset in marking up the text. From jeroen.mailinglist at bohol.ph Mon Oct 15 11:20:30 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Mon, 15 Oct 2007 20:20:30 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <20071014144027.GA6093@ark.in-berlin.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <470FC32B.5060409@bohol.ph> <20071013182808.GB5263@ark.in-berlin.de> <20071014144027.GA6093@ark.in-berlin.de> Message-ID: <4713AF6E.9090606@bohol.ph> I typically resolve the case of multiple markers for a footnote by saying: blah blah<ref target=n123.2 type=noteref>2</ref> blah blah. My XSLT will pick up this reference (based on its type), and render it as a link to the footnote in question, similar to the footnote reference for the footnote, taking care to replace the number with whatever the number of the footnote in the result has become. The speaker marks would require some tweaking of the DTD (and processing scripts) Jeroen. Ralf Stephan wrote: > (replying to myself) > > >> That's possible in PGTEI 0.4 too. Use <note place="end"> and >> a divGen "footnote" at the end iof each chapter. However, numbers >> are no reset, so you'll get marks with hundreds in some books. >> > > Also, what's not possible is multiple marks for one footnote, and > a mark inside <sp> or <speaker> for marking the speaker name. > > Both of which I would need for my current project. 
> > > ralf > > From marcello at perathoner.de Mon Oct 15 11:30:47 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 15 Oct 2007 20:30:47 +0200 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <4713A930.7060103@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> Message-ID: <4713B1D7.7020301@perathoner.de> Lee Passey wrote: > I think Josh Hutchinson is the biggest proponent of TEI levels, level 1 > being bare bones and level 3 (or higher) being only for uber-geeks. Mr. > Perathoner alludes to it his "Guide to PGTEI" > (http://pgtei.pglaf.org/marcello/0.3/doc/20000-h.html) when he states: > > <cite> > You can and should mark up a text incrementally. That is: make more than > one pass over the whole text and in each pass mark up a subset of elements. > > You may start marking only the most prominent text features like > chapters and paragraphs. Later you make a second pass marking all > italicized text. If you still want to do more, make another pass > replacing all quotation marks with the <q> element. > > TODO: a PG working group needs to codify different "levels" of PGTEI markup. > </cite> > > However, just because you are following this guidance, doesn't mean that > the markup isn't presentational; indeed it makes it more likely that it > /is/ presentational, and not semantic. (Semantic: "of, pertaining to, or > arising from the different meanings of words or other symbols.") Doing a complete text in TEI in one big single pass can take many days.
Doing multiple passes has several advantages: - Newbies can do the easy stuff - and experts the hard stuff - Different experts can work their fields of expertise: - teiHeader metadata expert - title page formatting expert - native speakers of LOTE - history expert - You get more consistent markup because you focus on one aspect only - You know what to expect from previous passes (use grep) - Markup levels can be assigned to DP rounds > While the <hi> element can contain an attribute > indicating how the highlighting was rendered in a particular addition, > the "type" attribute is not allowed. Therefore, there is no good way to > indicate "this <hi> differs from that <hi>, and the second needs to be > revisited to determine its semantic meaning." Neither is there any good > way to record any hints about why this particular rendering may have > been used. <hi rend="italic"><!-- fixme: emph or foreign? -->Par Dieu!</hi>. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Mon Oct 15 11:37:15 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 15 Oct 2007 14:37:15 EDT Subject: [gutvol-d] it's good to see the .tei people Message-ID: <be5.18c1e5c6.34450d5b@aol.com> it's good to see the .tei people wasting their time trying to figure it out. i was waiting for a long time for them to start doing that. i welcome it! but don't make the mistake of letting them waste _your_ time! :+) seriously, light markup will give the big benefits _without_ the big costs. for the record, those big benefits are (1) a simple transition from o.c.r. into a plain-text "master format", (2) easy maintenance of that "master", (3) button-click conversion to other formats by users themselves, and (4) new functionalities from developers due to a straightforward format. i mean, jump into the gobbledygook soup if you _like_ that sorta thing, but if it makes your skin crawl, understand that you do _not_ "need" it... 
-bowerbird From jon at noring.name Mon Oct 15 11:52:16 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 12:52:16 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <be5.18c1e5c6.34450d5b@aol.com> References: <be5.18c1e5c6.34450d5b@aol.com> Message-ID: <6410280365.20071015125216@noring.name> Bowerbird wrote: > i mean, jump into the gobbledygook soup if you _like_ that sorta thing, > but if it makes your skin crawl, understand that you do _not_ "need" it... On what objective basis do you make this claim ("you do not need TEI")? As I've asked recently, the proof is in the pudding (to use Bowerbird's own phrase), and the best way to show that ZML is sufficient to properly structure PG texts for *mastering* purposes is to take a representative sample of PG texts (I said 10, but 50 or 100 would be better) and "mark them up" in ZML. Then post for commentary. It is important that the sample include some tough stuff. So let the PG/DP crowd pick the list of representative texts to convert to ZML. Once we have a bunch of files, that puts the burden on the PG and DPers to focus on them, and tell the group here what is missing in the ZML renderings for mastering purposes. This should lead to a discussion of what the master needs to do (which has not yet been put all together into a cogent story), and possibly point out where ZML could be tweaked to improve it. (Of course, some of us believe that ZML is not sufficient for mastering purposes, and of course we eventually have to get down to specifics, just as Bowerbird also needs to get down to specifics rather than making vague statements that ZML is "sufficient".)
Nevertheless, as I've always said, so long as there is a need for the PG collection to include "plain text" versions of the books, ZML is a good candidate for that since it does normalize the texts. This is not the same as mastering. Jon Noring From lee at novomail.net Mon Oct 15 14:53:54 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 15 Oct 2007 15:53:54 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <6410280365.20071015125216@noring.name> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> Message-ID: <4713E172.3070106@novomail.net> Jon Noring wrote: > Nevertheless, as I've always said, so long as there is a need for the > PG collection to include "plain text" versions of the books, ZML is a > good candidate for that since it does normalize the texts. ZML does not have a mechanism to block indent text, which for me is a show-stopper. -- Nothing of significance below this line. From jon at noring.name Mon Oct 15 14:55:47 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 15:55:47 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <4713E172.3070106@novomail.net> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> <4713E172.3070106@novomail.net> Message-ID: <1877073571.20071015155547@noring.name> Lee wrote: > Jon Noring wrote: >> Nevertheless, as I've always said, so long as there is a need for the >> PG collection to include "plain text" versions of the books, ZML is a >> good candidate for that since it does normalize the texts. > ZML does not have a mechanism to block indent text, which for me is a > show-stopper. Can you explain that in a little more detail, maybe with an example? And is this for mastering purposes, or a plain text rendition (not for mastering)? 
Jon From lee at novomail.net Mon Oct 15 15:26:39 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 15 Oct 2007 16:26:39 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <1877073571.20071015155547@noring.name> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> <4713E172.3070106@novomail.net> <1877073571.20071015155547@noring.name> Message-ID: <4713E91F.1040707@novomail.net> Jon Noring wrote: > Lee wrote: >> Jon Noring wrote: >> >>> Nevertheless, as I've always said, so long as there is a need for the >>> PG collection to include "plain text" versions of the books, ZML is a >>> good candidate for that since it does normalize the texts. >> >> ZML does not have a mechanism to block indent text, which for me is a >> show-stopper. > > Can you explain that in a little more detail, maybe with an example? Sure. You're looking at it. Block indentation is when you take a "component-level unit of text" and indent the entire unit. Typically, this kind of text is a lengthy quotation from another source, but it certainly doesn't have to be. Me, quoting you, would be block indented in a message. You, quoting me, quoting you, should probably be nested indents. Me, quoting you, quoting me, quoting you would be yet another nested indent. My e-mail program (Thunderbird) indicates each one of these indent levels with an angle bracket in the first column. A good user agent will actually indent with spaces rather than a visual cue. In HTML, these block indents are marked with the <blockquote> element. So in HTML the above example would be coded as: Jon Noring wrote: <blockquote> Lee wrote: <blockquote> Jon Noring wrote: <blockquote> Nevertheless, as I've always said ... </blockquote> ZML does not have a ... </blockquote> Can you explain that in ... </blockquote> The important thing is that not only must the text be repeatedly indented, when appropriate, the interior text must also be word-wrapped. 
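To make the requirement concrete, here is a minimal Python sketch of nested block indentation with word-wrapping, using the standard textwrap module. The function name, the four-space indent step and the 60-column width are arbitrary choices made for illustration; nothing here is part of ZML or of any PG tool.

```python
import textwrap


def block_indent(text, depth, width=60, step="    "):
    """Re-wrap text to `width` columns, indenting every line by `depth` steps."""
    prefix = step * depth
    words = " ".join(text.split())  # discard the original line breaks
    return textwrap.fill(words, width=width,
                         initial_indent=prefix, subsequent_indent=prefix)


QUOTE = ("Nevertheless, as I've always said, so long as there is a need for "
         "the PG collection to include plain text versions of the books, ZML "
         "is a good candidate for that since it does normalize the texts.")

# The same passage at nesting levels 0, 1 and 2: each level is indented
# further and re-wrapped so that no line runs past the right margin.
for depth in range(3):
    print(block_indent(QUOTE, depth))
    print()
```

A renderer can only do this if the source says where the quoted block starts and ends and how deeply it is nested, and that is exactly the information the markup has to carry.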
In ZML, any whitespace in the first column means that word-wrapping is turned off, so block indentation cannot be accomplished by hand. Some markup, such as the angle brackets used by Thunderbird or the <blockquote> tag in HTML has to be devised; so far it has not been. > And is this for mastering purposes, or a plain text rendition (not for > mastering)? Because the capability doesn't exist, it doesn't matter. It doesn't exist for /either/ purpose. -- Nothing of significance below this line. From jon at noring.name Mon Oct 15 18:29:37 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 19:29:37 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <4713E91F.1040707@novomail.net> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> <4713E172.3070106@novomail.net> <1877073571.20071015155547@noring.name> <4713E91F.1040707@novomail.net> Message-ID: <1272903299.20071015192937@noring.name> Lee Passey wrote: > Jon Noring wrote: >> Lee wrote: >>> ZML does not have a mechanism to block indent text, which for me is a >>> show-stopper. >> Can you explain that in a little more detail, maybe with an example? > In ZML, any whitespace in the first column means that word-wrapping is > turned off, so block indentation cannot be accomplished by hand. Some > markup, such as the angle brackets used by Thunderbird or the > <blockquote> tag in HTML has to be devised; so far it has not been. O.k.! I looked at Bowerbird's online "11 Rules of ZML", and if those are still the complete set of rules, then I see no means to identify block quotes which contain prose (like paragraphs), where the user agent is expected to auto-wrap the block quote content (to differentiate it from fixed line structures like verse lines in poetry.) 
If this is indeed the case (if Bowerbird has not come up with a clever way to identify block quotes in ZML, or I missed how to do this given the rules I've looked at), then ZML is truly insufficient as a mastering format and iffy as a derivative plain text format. Block quote support is a must for any mastering format. (Note that a block quote may contain a mix of prose, verse, and other things -- in essence a block quote can be a standalone ZML document in and of itself.) I've cc'd Bowerbird on this, but his supposed spam filter will probably sieve this one out. I guess a friend of his can forward this to him. <smile/> Jon Noring (p.s., if ZML were to differentiate the purpose between tabs and spaces, then I can see how this might be done. Use tabs for identifying block quotes, one tab for the first level, two tabs for the second level (which will be extremely rare), and so on. And save spaces for no-wrap line situations. Never use tabs except for this sole purpose of identifying a block quote and the level it is at.) From prosfilaes at gmail.com Mon Oct 15 18:43:20 2007 From: prosfilaes at gmail.com (David Starner) Date: Mon, 15 Oct 2007 21:43:20 -0400 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <4713A930.7060103@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> Message-ID: <6d99d1fd0710151843l3e2f1ba6wd9e317062f75c363@mail.gmail.com> On 10/15/07, Lee Passey <lee at novomail.net> wrote: > Generally, I support the notion of levels of markup. However, it can > lead to some unfortunate consequences. In the first place it can lead to > the loss of data. When transcribing an edition, using hi instead of emph isn't a loss of data; that data isn't in the text. It's stating clearly what we know. 
To convert that to emph may add data, but it also adds uncertainty and editorial opinion; the data it adds isn't pure and doesn't come from the text before us. From jon at noring.name Mon Oct 15 20:31:12 2007 From: jon at noring.name (Jon Noring) Date: Mon, 15 Oct 2007 21:31:12 -0600 Subject: [gutvol-d] The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710151843l3e2f1ba6wd9e317062f75c363@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <6d99d1fd0710151843l3e2f1ba6wd9e317062f75c363@mail.gmail.com> Message-ID: <8410580897.20071015213112@noring.name> David wrote: > Lee Passey wrote: >> Generally, I support the notion of levels of markup. However, it can >> lead to some unfortunate consequences. In the first place it can lead to >> the loss of data. > When transcribing an edition, using hi instead of emph isn't a loss of > data; that data isn't in the text. It's stating clearly what we know. > To convert that to emph may add data, but it also adds uncertainty and > editorial opinion; the data it adds isn't pure and doesn't come from > the text before us. Definitely when we try to describe the "why" something is emphasized in the original paper edition, we will certainly sometimes be wrong. Or, it is just plain difficult to know exactly for sure -- there are times when trying to fit it into our "standardized" list of elements and attribute values (which PG/DP should do) may be difficult or ambiguous. (In some cases two or more apply simultaneously.) Nevertheless, it is a good thing to do, and I believe accuracy can be quite high without much thought or effort. 
What I think Lee is really saying is that PG/DP should not consider releasing a TEI document to the public until each and every <hi> is converted to the PG/DP standardized semantic description (and PG/DP should standardize on something). With Lee I agree. Certainly a 2-3 stage markup process may be used where the easy ones are first handled by those less experienced, leaving the few tough ones to the more seasoned veterans. In rare cases a decision may need to be made by "committee", and in some cases the "standardized" list may need to be expanded or tweaked. I expect the need for committee-level treatment to be pretty rare, but frequent enough that it needs to be planned for. Nevertheless, it is a *good* thing in the long run to remove all <hi> and the value of that will also be very instructive for the users of the TEI masters -- they will notice that and over time *understand*. Jon Noring From ralf at ark.in-berlin.de Tue Oct 16 01:05:28 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Tue, 16 Oct 2007 10:05:28 +0200 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <4713A930.7060103@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> Message-ID: <20071016080528.GA4609@ark.in-berlin.de> Lee wrote > ... Saying that > something was presented in a particular way is presentational markup, > saying /why/ it was presented in a particular way is semantic markup. Yes, and that's why a first pass using scripts, the suggested Level 1, always tends to produce presentational markup. Scripts are stupid, they cannot handle the 'why' question _if given presentational markup_. What it looks like to me is that semantic markup can't be had cheaply, can it?
Not even when a bit more intelligent than average humans are involved, as can be clearly supposed being part of DP. > Generally, I support the notion of levels of markup. However, it can > lead to some unfortunate consequences. In the first place it can lead to > the loss of data. Supposing the foofing process in DP didn't already lose it. > What are the chances that anyone is going to go > through all those texts and convert all the presentational markup to > semantic markup? Maybe you should include some (unknown future) AI with 'anyone'. > And how much harder would it have been for the original > poster to just use semantic markup in the preparation of the texts in > the first place? Not the poster, the foofer. > I'm a firm believer in the old adage that it is easier > to do things right than to do them over. My suggestion is that if you're > using <hi> in the first pass, with the intent to convert them to > semantic markup in a subsequent pass, you probably ought to keep them in > your queue and not pass them on to PG until the upgrade has occurred. Why you would declare the poster to be better suited for that task than the foofer, you didn't explain. And, of course it's easier to do it right from the start, but we don't have an AI just now, do we? So we need to build it up stepwise. > Unfortunately, OCR programs are so far incapable of detecting complete > sentences to say nothing of single thoughts or topics. I'm with you there. Let's assume that Google is in the best position to come up with some AI that can "do better". What it would need would be a corpus of semantically marked up texts (for a specific language). Is there any effort outside PG to come up with such a thing? Why would any sane person mark up some text semantically except for being a hopeless bibliophile or for M O N E Y ? 
Just send me your proposals in this respect per eMail, ralf From traverso at posso.dm.unipi.it Tue Oct 16 03:31:05 2007 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Tue, 16 Oct 2007 12:31:05 +0200 (CEST) Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <20071016080528.GA4609@ark.in-berlin.de> (message from Ralf Stephan on Tue, 16 Oct 2007 10:05:28 +0200) References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> Message-ID: <20071016103105.795EE10225@posso.dm.unipi.it> >>>>> "Ralf" == Ralf Stephan <ralf at ark.in-berlin.de> writes: Ralf> Why would any sane person mark up some text semantically Ralf> except for being a hopeless bibliophile or for Ralf> M O N E Y ? Some kinds of semantic markup can make ebooks more accessible (e.g. a foreign tag to drive pronunciation in automated reading); this might be a motivation for some volunteers (I know some projects in DP in which the PM asks to include <foreign> markup). (True, DPers might also qualify for "hopeless bibliophile", and some for "insane".) Carlo From marcello at perathoner.de Tue Oct 16 07:15:35 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 16 Oct 2007 16:15:35 +0200 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <6410280365.20071015125216@noring.name> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> Message-ID: <4714C787.3060609@perathoner.de> Jon Noring wrote: > Nevertheless, as I've always said, so long as there is a need for the > PG collection to include "plain text" versions of the books, ZML is a > good candidate for that since it does normalize the texts. ZML is the worst candidate for that.
The design of ZML is fundamentally flawed. To format a text in ZML you have to use combinations of characters that you cannot distinguish from each other on screen (space, tab and newline). Space and tab look very much the same. (After a word of 7 chars, a space and a tab look *exactly* the same.) You also need trailing tabs on a line, which are also invisible. You clearly see the problem when you read the zml documentation. It has to use the markup `~tab~' instead of the tab character to be at all readable. This should have tipped off BB, that an invisible character is a bad choice for a markup tag. Moreover, some editors chop off trailing whitespace. You can fuck up your ZML text simply by loading and saving. Moreover some editors substitute tabs and spaces without asking. All in all, using non-printing characters as markup tags must be the most bone-headed design decision ever. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Tue Oct 16 08:08:47 2007 From: jon at noring.name (Jon Noring) Date: Tue, 16 Oct 2007 09:08:47 -0600 Subject: [gutvol-d] Establish plain text normalization rules? (was "it's good to see the .tei people") In-Reply-To: <4714C787.3060609@perathoner.de> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> <4714C787.3060609@perathoner.de> Message-ID: <876403294.20071016090847@noring.name> Marcello wrote: > Jon Noring wrote: >> Nevertheless, as I've always said, so long as there is a need for the >> PG collection to include "plain text" versions of the books, ZML is a >> good candidate for that since it does normalize the texts. > All in all, using non-printing characters as markup tags must be the > most bone-headed design decision ever. I agree that using white space characters to communicate document structure for *machine-processing* (i.e., for mastering purposes) is a show stopper, for a few reasons. 
However, my comments had to do with creating plain text renditions whose sole purpose is for direct reading and not to be converted to something else. (The plain text is NOT for machine conversion.) It is clear that in plain text renditions white space *must be used* (and is used looking at all PG plain text renditions) to create document structure for *human-processing*. So plain text editions can't avoid using white spaces for communicating structure to human readers (who are intelligent enough to "figure it out.") Now, as I think about it though, even here, tabs should never be used because how text editors interpret tabs can vary plus the tabs muck up the usability of the text when some text is extracted for reuse. So long as the user *knows* that all the white space there is the ASCII space character plus the usual EOL stuff, then they will know how to process it. But when tabs are mixed in with spaces, that is not good -- it is downright annoying and depending upon the text editor used can lead to unpredictable results (e.g., in some situations, a tab character may visually pass for a single space character.) By and large, the tab character is evil and, in my opinion, should never be used in plain text renditions of books. It may be possible Bowerbird can create a tabless ZML, but now things get tricky since I think he will have to establish rules based on using a specific number of white space characters plus the use of some of the other ASCII characters (such as the ">" character which could be used for block quotes) to communicate structure. But using other ASCII characters in certain situations actually adds "content characters" to the content, and that is something that should be avoided. All in all, Bowerbird is caught between a rock and a hard place to come up with some plain text normalization rules useful for mastering that do not break some "thou shalt not do this" rule. 
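As a rough illustration of the whitespace problems raised above (tabs that look like spaces, invisible trailing whitespace), here is a minimal Python sketch of a checker for a plain text rendition. It is not anything PG, DP or ZML actually uses; the function name `find_whitespace_problems` and the choice to also flag the no-break space (U+00A0) are assumptions made for the example.

```python
def find_whitespace_problems(text):
    """Return (line_number, problem) pairs for whitespace a reader cannot see.

    Hypothetical helper: flags tabs, no-break spaces and trailing whitespace,
    i.e. anything other than plain ASCII spaces inside a line.
    """
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate CRLF line endings
        if "\t" in line:
            problems.append((number, "tab"))
        if "\u00a0" in line:
            problems.append((number, "no-break space"))
        if line != line.rstrip(" \t\u00a0"):
            problems.append((number, "trailing whitespace"))
    return problems


if __name__ == "__main__":
    sample = "A clean line\nA line with a\ttab\nA line with trailing spaces  \n"
    for number, problem in find_whitespace_problems(sample):
        print(number, problem)
```

A check like this only reports; whether to repair a flagged line automatically is a separate decision, since in a format that uses whitespace as markup the "repair" may change the meaning.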
To summarize the two I came up with in this message: 1) Thou shalt not use any white space character other than the ASCII space character and EOL characters, 2) Thou shalt not use any non-white space character in the Unicode set except when that character is actually used in the textual content of the work. ***** So, a question for PG/DP to maybe discuss. Is it important that PG even establish some sort of "normalization rules" for the formatting of plain text renditions of books solely used for direct reading, or are we past that now and it doesn't matter any more? Jon Noring From lee at novomail.net Tue Oct 16 08:16:48 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 16 Oct 2007 09:16:48 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <4714C787.3060609@perathoner.de> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> <4714C787.3060609@perathoner.de> Message-ID: <4714D5E0.1050704@novomail.net> Marcello Perathoner wrote: > All in all, using non-printing characters as markup tags must be the > most bone-headed design decision ever. I don't know, I've seen some pretty bone-headed design decisions in my time. I will agree, however, that it's probably in the top ten. Somebody needs to pass this on to the Distributed Proofreaders so they can fix their proofing process in this regard as well. -- Nothing of significance below this line. From jon at noring.name Tue Oct 16 08:33:15 2007 From: jon at noring.name (Jon Noring) Date: Tue, 16 Oct 2007 09:33:15 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <4714D5E0.1050704@novomail.net> References: <be5.18c1e5c6.34450d5b@aol.com> <6410280365.20071015125216@noring.name> <4714C787.3060609@perathoner.de> <4714D5E0.1050704@novomail.net> Message-ID: <26157054.20071016093315@noring.name> Lee wrote: > Marcello Perathoner wrote: >> All in all, using non-printing characters as markup tags must be the >> most bone-headed design decision ever.
> I don't know, I've seen some pretty bone-headed design decisions in my > time. I will agree, however, that it's probably in the top ten. > > Somebody needs to pass this on to the Distributed Proofreaders so they > can fix their proofing process in this regard as well. Agreed. The biggest abuse I've seen is the use of the non-breaking space character (usually inserted in XML docs using the character entity "&nbsp;" or its numerical equivalent -- it may also be encoded at the bit level in UTF-* encoded texts.) It should *never* be used in either TEI or XHTML in the context PG/DP uses them (I do recognize in some situations it is a quick fix for web authoring, but PG/DP should not be doing "quick fixes"). There is *always* a structural or inline semantic reason why (notice the word "why"?), during visual presentation, one may want to see more space inserted between chunks of text -- in this case mark it up properly and use CSS to add the necessary space if really, really, really needed. Jon Noring From joshua at hutchinson.net Tue Oct 16 09:28:09 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Tue, 16 Oct 2007 16:28:09 +0000 (UTC) Subject: [gutvol-d] it's good to see the .tei people Message-ID: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> Ok, time to disagree. &nbsp; is very useful if you have something that you don't want broken up in a word wrap. For instance, let's say I put my initials in here: J. H. A word wrapping program could wrap that to J. H. Not real nice looking. But, if you put in J.&nbsp;H. ... it'll never get split apart, which is where the vast majority of &nbsp; get used in DP texts. Josh >----Original Message---- >From: jon at noring.name >Date: Oct 16, 2007 11:33 >To: "Project Gutenberg Volunteer Discussion"<gutvol-d at lists.pglaf.
org> >Subj: Re: [gutvol-d] it's good to see the .tei people > >Lee wrote: >> Marcello Perathoner wrote: > >>> All in all, using non-printing characters as markup tags must be the >>> most bone-headed design decision ever. > >> I don't know, I've seen some pretty bone-headed design decisions in my >> time. I will agree, however, that it's probably in the top ten. >> >> Somebody needs to pass this on to the Distributed Proofreaders so they >> can fix their proofing process in this regard as well. > >Agreed. The biggest abuse I've seen is the use of the non-breaking >space character (usually inserted in XML docs using the character >entity "&nbsp;" or its numerical equivalent -- it may also be encoded >at the bit level in UTF-* encoded texts.) > >It should *never* be used in either TEI or XHTML in the context PG/DP >uses them (I do recognize in some situations it is a quick fix for web >authoring, but PG/DP should not be doing "quick fixes"). > >There is *always* a structural or inline semantic reason why (notice >the word "why"?), during visual presentation, one may want to see more >space inserted between chunks of text -- in this case mark it up properly >and use CSS to add the necessary space if really, really, really needed. > >Jon Noring > >_______________________________________________ >gutvol-d mailing list >gutvol-d at lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > From jon at noring.name Tue Oct 16 09:52:53 2007 From: jon at noring.name (Jon Noring) Date: Tue, 16 Oct 2007 10:52:53 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> Message-ID: <447877789.20071016105253@noring.name> Joshua wrote: > Ok, time to disagree. Great! :^) > &nbsp; is very useful if you have something that you don't want broken > up in a word wrap. > > For instance, let's say I put my initials in here: J. H.
> > A word wrapping program could wrap that to J. > H. > > Not real nice looking. > > But, if you put in J.&nbsp;H. ... it'll never get split apart, which > is where the vast majority of &nbsp; get used in DP texts. In XHTML (there's a TEI equivalent): <span class="keeptogether">J. H.</span> <!-- use whatever classname you want, example only --> In CSS, one may then, if they wish to: span.keeptogether {white-space: nowrap} (see: http://www.w3.org/TR/CSS21/text.html#propdef-white-space ) (The value of the above is that we now have a better semantic idea *why* we are doing something. Putting in &nbsp; we have a lesser idea why, and in some cases it may confuse someone reading the document, or in some situations be ambiguous. Also note that we are, using both techniques, adding presentationally-oriented markup. One can imagine just leaving it out entirely, and letting conversion systems hunt down those instances and treat them as desired.) There are times, such as for extremely limited displays or space, where forcing nowrapping creates a situation worse than allowing the J. and H. to be broken on separate lines. After all, if we begin to be worried about breaking the J. and H., then we have to be equally "anal" about things like orphans and widows -- now we move into the realm of typesetting engines and the like... Is this the role of the master format to be worried about? Now, granted, I had not thought of this situation, even though I am aware of it, since in *so many* PG (X)HTML texts I've looked at, &nbsp; is rampantly being abused, such as for indentation of paragraphs and verse lines, etc. It's better to see &nbsp; being used only for keeping words together rather than forcing spacing in visual presentation (since that is its purpose.) Yet, &nbsp; is still something I am not fond of using in virtually any circumstance, especially in that in most instances there is a markup solution, as illustrated above.
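The idea of "letting conversion systems hunt down those instances" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `wrap_initials` is made up, the regular expression is a crude heuristic for runs of single-letter initials, and the class name is the "example only" one from the message above.

```python
import re

# Two or more single capital-letter initials separated by an ordinary space
# or a no-break space, e.g. "J. H." or "J.\u00a0H." (heuristic, not exhaustive).
INITIALS = re.compile(r"\b[A-Z]\.(?:[ \u00a0][A-Z]\.)+")


def wrap_initials(text, class_name="keeptogether"):
    """Wrap runs of initials in a span so a stylesheet can keep them together."""
    def replace(match):
        # Normalise any no-break space back to a plain space inside the span.
        joined = match.group(0).replace("\u00a0", " ")
        return '<span class="%s">%s</span>' % (class_name, joined)
    return INITIALS.sub(replace, text)


if __name__ == "__main__":
    print(wrap_initials("Signed J. H. this morning."))
```

A heuristic like this will misfire on some inputs (an abbreviation ending in a capital letter followed by another initial, for instance), which is one reason a human pass or explicit markup is still attractive.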
Jon Noring From lee at novomail.net Tue Oct 16 09:56:08 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 16 Oct 2007 10:56:08 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <20071016080528.GA4609@ark.in-berlin.de> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> Message-ID: <4714ED28.6070008@novomail.net> Ralf Stephan wrote: > Lee wrote >> ... Saying that >> something was presented in a particular way is presentational markup, >> saying /why/ it was presented in a particular way is semantic markup. > > Yes, and that's why a first pass using scripts, the suggested Level 1, > always tends to produce presentational markup. Scripts are stupid, > they cannot handle the 'why' question _if given presentational markup_. I agree, a first pass /using scripts/ will always produce presentational markup. But where is it suggested that a first pass using scripts is sufficient to create a Level 1 document? And where is it suggested that Level 1 documents are sufficient to "check in" to PG? TEI is inherently a semantic/structural markup language. When you create a TEI file you are making an implicit promise that it contains at least a modicum of semantic markup. When you create a TEI file that is purely presentational you are breaking that promise. There are, of course, other markup languages that are much better than TEI for carrying presentational information, not the least of which is XHTML. It is at least as easy to write an XSLT script to convert XHTML to PDF or RTF as it is to write a script to convert TEI. Plus, XHTML is directly usable by most User Agent software without conversion. 
I would suggest that if one is going to create files that are purely presentational then XHTML is a better choice. When the time comes to add semantic markup XHTML can easily first be converted to TEI. > What it looks to me is that semantic markup can't be had cheaply, > can it? Not even when a bit more intelligent than average humans are > involved, as can be clearly supposed being part of DP. I disagree. I think that a significant amount of semantic markup /can/ be had cheaply, particularly when humans are involved as in DP. Consider, for example, the DP proofing rules regarding the beginning of chapters. The last time I looked, the DP rules were: <cit> Put 4 blank lines before the "CHAPTER XXX".... Then leave one blank line between each additional part of the chapter header, such as a chapter description, opening quote, etc., and finally leave two blank lines before the start of the text of the chapter. <xptr>http://www.pgdp.net/c/faq/document.php#chap_head</xptr> </cit> Overlooking the fact that it is a boneheaded design to use non-printing characters as markup tags, how is the existing rule any easier for people to use that a rule such as: <ab> Begin each chapter with '<div type="chapter">.' Chapter headers should begin with <head> and end with </head>, and should appear immediately after the "<div>" line, e.g.: <div type="chapter"> <head>CHAPTER XXX</head> </ab>? >> Generally, I support the notion of levels of markup. However, it can >> lead to some unfortunate consequences. In the first place it can lead to >> the loss of data. > > Supposing the foofing process in DP didn't already lose it. Well, that can be a problem. But I'm trying to establish some parameters for the use of TEI, regardless of whether or not DP is involved in the process. >> What are the chances that anyone is going to go >> through all those texts and convert all the presentational markup to >> semantic markup? > > Maybe you should include some (unknown future) AI with 'anyone'. 
Oh, I have. I know that computer scientists have been researching the problems of natural language processing for more than four decades now, and while great strides have been made, I still don't think that we will have in my lifetime, or in my children's lifetimes, an AI that can read a page and say "that phrase is italicized because it is emphasized." I think that true TEI texts are valuable today, and that means that some human intervention will be required. If we are going to wait for some hypothetical AI in the future, we would be better off spending our time preserving books as paper artifacts rather than trying to convert them to /any/ electronic format. >> And how much harder would it have been for the original >> poster to just use semantic markup in the preparation of the texts in >> the first place? > > Not the poster, the foofer. The person or persons who caused a TEI file to be prepared and added to the PG database. Use whatever term you like for him/her/them. >> I'm a firm believer in the old adage that it is easier >> to do things right than to do them over. My suggestion is that if you're >> using <hi> in the first pass, with the intent to convert them to >> semantic markup in a subsequent pass, you probably ought to keep them in >> your queue and not pass them on to PG until the upgrade has occurred. > > Why you would declare the poster to be better suited for that task than > the foofer, you didn't explain. > > And, of course it's easier to do it right from the start, but we don't > have an AI just now, do we? So we need to build it up stepwise. Yes. And the first step /requires/ human input, and those humans need to be taught to think in terms of semantics not presentation. That human input may be an individual who cares deeply about a single work and is capable of carrying the process from beginning to end (see http://shinparam.org/Sam/Projects/TEI-CSS/Bronte-Shirley-draft.xml). 
That human input may be a more or less formal organization where some individuals are responsible for scanning the books, others are responsible for proofreading the content, and others are responsible for assembling the completed work. I don't think that Level 1 texts need to be /complete/ semantic markup; but I do think they ought to be semantic markup. Mr. Perathoner has suggested a hack (in the most positive sense of the word) whereby XML comments could be included in a TEI file to explain why a purely presentational markup was used instead of the expected semantic markup. I think that even Level 1 texts should have these kinds of comments whenever any kind of non-semantic element (e.g. <ab>, <hi>, <seg>, etc.) is used, explaining why a semantic element could not be chosen. >> Unfortunately, OCR programs are so far incapable of detecting complete >> sentences to say nothing of single thoughts or topics. > > I'm with you there. > > Let's assume that Google is in the best position to come up with > some AI that can "do better". What it would need would be a corpus > of semantically marked up texts (for a specific language). > > Is there any effort outside PG to come up with such a thing? Yes, in Colleges and Universities around the world. Natural language processing is a hot topic, and there is almost always some research going on in the area (my brother-in-law did his master's thesis on the topic). > Why would any sane person mark up some text semantically except > for being a hopeless bibliophile or for > > M O N E Y ? Because some people are altruists and believe that what they are doing is for the good of mankind. I suspect that this was Michael Hart's original motivation and is the motivation of virtually everyone who participates in Distributed Proofreaders. And unlike Michael Hart, I believe that if those volunteers are given instructions and guidelines as to how to produce a better work product they would happily accept and adopt those guidelines. 
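To make the "keep them in your queue until the upgrade has occurred" suggestion concrete, here is a minimal Python sketch that counts the non-semantic elements named above (<hi>, <ab>, <seg>) still present in a TEI file. It is an assumption-laden illustration, not part of any existing PG or DP tool: the function name `count_non_semantic` is hypothetical, and it only counts elements rather than checking for explanatory comments.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Elements treated here as "non-semantic", following the examples in the message.
NON_SEMANTIC = {"hi", "ab", "seg"}


def count_non_semantic(tei_source):
    """Count remaining <hi>, <ab> and <seg> elements in a TEI document string."""
    root = ET.fromstring(tei_source)
    counts = Counter()
    for element in root.iter():
        # Drop a namespace prefix such as "{http://www.tei-c.org/ns/1.0}" if present.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in NON_SEMANTIC:
            counts[local_name] += 1
    return counts


if __name__ == "__main__":
    sample = (
        '<text><body><p>A <hi rend="italic">word</hi> and '
        "<emph>another</emph>.</p><ab>unclassified block</ab></body></text>"
    )
    print(dict(count_non_semantic(sample)))
```

A non-zero count would simply mean the text is not yet ready to leave the queue under the rule proposed here.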
-- Nothing of significance below this line. From jon at noring.name Tue Oct 16 10:26:45 2007 From: jon at noring.name (Jon Noring) Date: Tue, 16 Oct 2007 11:26:45 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <4714ED28.6070008@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> Message-ID: <177426060.20071016112645@noring.name> Lee Passey wrote: > [snip of a lot of excellent insights] > > Oh, I have. I know that computer scientists have been researching the > problems of natural language processing for more than four decades now, > and while great strides have been made, I still don't think that we will > have in my lifetime, or in my children's lifetimes, an AI that can read > a page and say "that phrase is italicized because it is emphasized." I've always said that when we have AI at the level of a Commander Data in Star Trek, then we can turn over all our text digitization completely to machines, at least to do it properly and completely and perfectly, the way we know it needs to be done. Such AI has to essentially be "sentient-level", and has to learn language as a human and understand human nature and social systems as a human, and must especially understand the language and culture associated with a particular text being transcribed and structured. When will that happen? 
Nobody really knows, but to be a little bit off topic here, I think we are closer to this than many might think, but it will be based on understanding how the human brain really works and building machines to mimic that (I believe, but it is only a belief, that true intelligence is only possible as a result of quantum effects, so quantum computing now under development may be a component of this AI revolution... Back in the 90's I had some fascinating private talks with some of the quantum physicists at LLNL on this very topic. Of course, a few quantum physicists believe all solutions to all problems are based on applying quantum mechanics to them! A spooky lot these quantum physicists are. LOL) > I think that true TEI texts are valuable today, and that means that some > human intervention will be required. If we are going to wait for some > hypothetical AI in the future, we would be better off spending our time > preserving books as paper artifacts rather than trying to convert them > to /any/ electronic format. Well said! > Because some people are altruists and believe that what they are doing > is for the good of mankind. I suspect that this was Michael Hart's > original motivation and is the motivation of virtually everyone who > participates in Distributed Proofreaders. And unlike Michael Hart, I > believe that if those volunteers are given instructions and guidelines > as to how to produce a better work product they would happily accept and > adopt those guidelines. Yes, agreed on this, too. In many ways, PG's laxness grew as a result of Michael's personality which is very individualistic ("nobody tells me what to do"). And that's fine -- we need those people in the world. 
Personally, I think PG would have gotten *more* people involved had it had a little more structure and stricter guidelines from the start because most people who are altruistic are also those who need and gladly accept guidance, and see the value in doing things right: "give me a 'to-do' check list, and I'll follow it exactly." For example, I decided not to get involved back in 1994 with PG at the text production level for the reason that PG was too lax in a number of important ways. It is unfortunate that it is the self-motivated, individualistic types who often get projects started, and the idea of providing a strict check-list of guidelines is something that is alien to them -- they assume everyone else is just like them. Jon Noring From vze3rknp at verizon.net Tue Oct 16 11:25:11 2007 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Tue, 16 Oct 2007 14:25:11 -0400 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <447877789.20071016105253@noring.name> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> <447877789.20071016105253@noring.name> Message-ID: <47150207.9010203@verizon.net> Jon Noring wrote: > Joshua wrote: > > >> Ok, time to disagree. >> > > Great! :^) > > >> &nbsp; is very useful if you have something that you don't want broken >> up in a word wrap. >> >> For instance, let's say I put my initials in here: J. H. >> >> A word wrapping program could wrap that to J. >> H. >> >> Not real nice looking. >> >> But, if you put in J.&nbsp;H. ... it'll never get split apart, which >> is where the vast majority of &nbsp; get used in DP texts. >> > > In XHTML (there's a TEI equivalent): > > <span class="keeptogether">J. H.</span> > > <!-- use whatever classname you want, example only --> > > In CSS, one may then, if they wish to: > > span.keeptogether {white-space: nowrap} > > (see: http://www.w3.org/TR/CSS21/text.html#propdef-white-space ) Other places that might well get non-breaking spaces are abbreviations such as i.e. or e.g.
Here I've rendered them without a space, but they are clearly spaced in some books and we have ongoing debates about how to handle them. We first standardized on always closing up the space (so that they wouldn't rewrap if a line break fell between the letters) but that proved confusing. (What are initials? What aren't? Which abbreviation should be closed up? How do we know? etc, etc) We've now moved to transcribing as it is printed in the book. Leave a space if one appears, don't if one doesn't. Note that these rules also cover initials as in Josh's example. Also, please, please note that this is how the proofers and formatters transcribe things in the rounds. What the post-processor does is different. He/she can space the initials with no non-breaking space, add the non-breaking space, or choose not to space the initials at all. Our only requirement is that whatever method is chosen be used consistently throughout that book. Still another place where non-breaking spaces can come up is in simple, in-line equations. x + y = z for example. And one more significant use for the non-breaking space is in indenting poetry. We have agreed that indentation in poetry is part of the author's intent and that we need to preserve it. I gather that there aren't easy ways of doing this in xhtml and thus that the non-breaking spaces are often used for that purpose. I'm not saying that the way our post-processors use non-breaking spaces is always correct. Just pointing out some more places where they arise.
JulietS From vze3rknp at verizon.net Tue Oct 16 11:59:18 2007 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Tue, 16 Oct 2007 14:59:18 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <4714ED28.6070008@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> Message-ID: <47150A06.1010207@verizon.net> Lee Passey wrote: > I disagree. I think that a significant amount of semantic markup /can/ > be had cheaply, particularly when humans are involved as in DP. > Consider, for example, the DP proofing rules regarding the beginning of > chapters. The last time I looked, the DP rules were: > > <cit> > Put 4 blank lines before the "CHAPTER XXX".... Then leave one blank line > between each additional part of the chapter header, such as a chapter > description, opening quote, etc., and finally leave two blank lines > before the start of the text of the chapter. > <xptr>http://www.pgdp.net/c/faq/document.php#chap_head</xptr> > </cit> > > Overlooking the fact that it is a boneheaded design to use non-printing > characters as markup tags, how is the existing rule any easier for > people to use that a rule such as: > > <ab> > Begin each chapter with '<div type="chapter">.' Chapter headers should > begin with <head> and end with </head>, and should appear immediately > after the "<div>" line, e.g.: > > <div type="chapter"> > <head>CHAPTER XXX</head> > </ab>? Please remember that there is a big difference between what we do in the formatting rounds and what gets produced by the post-processors. What we ask our formatters to do has to be quickly learned, easily remembered, and easily typed. 
By these criteria, the blank line rules work quite well. It would probably work as well to use some markup that says "chapter" with an opening and closing tag, but for historical reasons, we don't. Nonetheless, the post-processing software can find the 4-stuff-2 spacing and convert that automatically into whatever chapter heading markup is appropriate. Similarly with the 2 before, 1 after line spacing for sections of a chapter. The post-processor will have to look at each of these and determine that they really are chapters, sections, subsections, etc. but hopefully the worst of the grunt work will have been done. Just out of curiosity, how would you markup a two line chapter heading? Something like CHAPTER 1 Missy Goes to Space Would you just assume that both lines are part of the title, that that is all that matters semantically and ignore the fact that the whole thing was printed on two lines? That would seem to be a strictly semantic approach. But what happens if someone decides that rendering chapters with the chapter number first and then the title on another line would look better. I understand that this is covered by the transform that converts the semantic information into something presentational. But how can that transform offer the option of two line chapter headers if the information about where to break the line has been lost? Or, in a more complicated case, how would you handle the article header seen at http://www.pgdp.net/c/tools/project_manager/displayimage.php?project=projectID40faf2a2aaaff&imagefile=226.png <http://www.pgdp.net/c/tools/project_manager/displayimage.php?project=projectID40faf2a2aaaff&imagefile=226.png> I'm assuming that you'd use some combination of Chapter Title, Chapter Subtitle, Author, InfoReAuthor, and maybe something to indicate the opening quote as being different from most other quotes (or not). That makes perfect sense to me. 
What concerns me, however, is whether the currently existing PGTEI transforms that would render this would make something that looks decent and that makes sense of the information to the reader. Not that the exact typography be reproduced, but so that the reader can tell at a glance what's what and how the various lines of information relate to each other. Which brings me to my final point. I believe that much of the reluctance among the DP post-processors about using PGTEI is not because it is semantic markup, but because they don't trust the current system of rendering the semantic information to produce something that look acceptable. Jon, this is quite different from wanting to exactly reproduce the typography. And Marcello, it's all very well to say that anyone can write their own XSLT (or style sheets or whatever the thing is called that converts the TEI to html, plain text, etc) but the facts are that the DP volunteers don't have the skills to do it. I'd say that for many of the books we produce, it isn't terribly important exactly how chapter titles are rendered, for example, but it is important that the result look nice. I know that Josh said he would see about producing something that makes a better looking title page. That is the sort of change that will be necessary for DP to adopt TEI for at least the simple projects. JulietS From joshua at hutchinson.net Tue Oct 16 12:41:15 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Tue, 16 Oct 2007 19:41:15 +0000 (UTC) Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data Message-ID: <10536393.1192563675046.JavaMail.?@fh1064.dia.cp.net> >----Original Message---- >From: vze3rknp at verizon.net > >Just out of curiosity, how would you markup a two line chapter heading? 
>Something like > >CHAPTER 1 >Missy Goes to Space > Here is how I've handled it in the past (which may or may not be the "best" way): <div> <index index="toc" level1="Missy Goes to Space" /> <index index="pdf" level1="Missy Goes to Space" /> <head>CHAPTER 1 - Missy Goes to Space</head> <p>...</p> </div> If you wanted it on two lines, you could "force" a line break with a <lb /> after the Chapter 1, I suppose. > >Or, in a more complicated case, how would you handle the article header >seen at >http://www.pgdp.net/c/tools/project_manager/displayimage.php? project=projectID40faf2a2aaaff&imagefile=226.png ><http://www.pgdp.net/c/tools/project_manager/displayimage.php? project=projectID40faf2a2aaaff&imagefile=226.png> Here is a first pass attempt. My apologies in advance if I type something stupid: <div> <index index="toc" /> <index index="pdf" /> <head rend="text-align: center">VI - AMERICAN BUSINESS IN THE WAR</head> <head type="sub" rend="text-align: center">Voluntary Cooperation of Experts and Loyal Support of Labor Put Our Industries on a War Basis</head> <p rend="text-align: center">By Grosvenor B. Clarkson</p> <p rend="text-align: center">Director of the U.S. Council of National Defense and of Its Advisory Commission</p> <p>...</p> </div> You could add semantic tags such as marking Clarkson as a name, but that isn't necessary. > >Which brings me to my final point. I believe that much of the reluctance >among the DP post-processors about using PGTEI is not because it is >semantic markup, but because they don't trust the current system of >rendering the semantic information to produce something that look >acceptable. > I agree and add one more major stumbling block ... lack of tools/scripts. The HTML crowd has a lot of existing tools to help go from DP output to HTML final product. TEI does not (and I wish I had the time and ability to create such tools!). 
Josh From Bowerbird at aol.com Tue Oct 16 13:00:36 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 16 Oct 2007 16:00:36 EDT Subject: [gutvol-d] it's good to see the .tei people Message-ID: <ceb.1e171e0b.34467264@aol.com> wow, i see jon and lee have dragged people into another t.e.i. black-hole. oh well, i'm working on my .pdf converter right now, so i can't be bothered. *** juliet said: > What the post-processor does is different. > He/she can space the initials with no non-breaking space, > add the non-breaking space, or choose not to space the initials at all. > Our only requirement is that whatever method is chosen > be used consistently throughout that book. and that "requirement" means individual _books_ are "uniform", but that the _library_ is _inconsistent_. oh well, i'll deal with it... it would be nice, though, if other people thought of the library as a whole, as developers cannot build on an inconsistent base. -bowerbird 
From marcello at perathoner.de Tue Oct 16 14:29:52 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 16 Oct 2007 23:29:52 +0200 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47150A06.1010207@verizon.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> Message-ID: <47152D50.3000205@perathoner.de> Juliet Sutherland wrote: > Just out of curiosity, how would you markup a two line chapter heading? > Something like > > CHAPTER 1 > Missy Goes to Space First do it semantically: <div> <index type="toc" level1="Chapter 1: Missy goes to space" /> <head>CHAPTER 1</head> <head type="sub">Missy Goes to Space</head> Then add presentational stuff: <div> <index type="toc" level1="Chapter 1: Missy goes to space" /> <head rend="font-size: 200%; text-align: center">CHAPTER 1</head> <head type="sub" rend="font-size: 80%; text-align: center">Missy Goes to Space</head> > Or, in a more complicated case, how would you handle the article header > seen at > http://www.pgdp.net/c/tools/project_manager/displayimage.php?project=projectID40faf2a2aaaff&imagefile=226.png First do it semantically: <group> <text> <front> <titlePage> <docTitle> <titlePart>VI - American Business in the War</titlePart> <titlePart>Voluntary Coöperation ... Basis</titlePart> </docTitle> <byline> By <docAuthor>Grosvenor B. Clarkson</docAuthor><lb/> Director of ... Commission </byline> <epigraph> <cit> <q>Modern wars are not won by ... 
force.</q> <bibl>—Woodrow Wilson.</bibl> </cit> </epigraph> </titlePage> </front> <body> <p>War today means ... Then add the presentational stuff: <index index="toc" level1="VI - American Business in the War" /> <docTitle> <titlePart rend="display: block; text-align: center; font-size: 200%; text-transform: uppercase" >VI - American Business in the War</titlePart> <titlePart rend="display: block; text-align: center; font-size: 150%" >Voluntary Coöperation ... Basis</titlePart> The rest is left as an exercise to the reader. -- Marcello Perathoner webmaster at gutenberg.org From prosfilaes at gmail.com Tue Oct 16 15:43:05 2007 From: prosfilaes at gmail.com (David Starner) Date: Tue, 16 Oct 2007 18:43:05 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47152D50.3000205@perathoner.de> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47152D50.3000205@perathoner.de> Message-ID: <6d99d1fd0710161543n3fe18eb1wccadceb312a14b03@mail.gmail.com> On 10/16/07, Marcello Perathoner <marcello at perathoner.de> wrote: > Juliet Sutherland wrote: > > > Just out of curiosity, how would you markup a two line chapter heading? > > Something like > > > > CHAPTER 1 > > Missy Goes to Space > > First do it semantically: > > <div> > <index type="toc" level1="Chapter 1: Missy goes to space" /> > <head>CHAPTER 1</head> > <head type="sub">Missy Goes to Space</head> > > Then add presentational stuff: > > <div> > <index type="toc" level1="Chapter 1: Missy goes to space" /> > <head rend="font-size: 200%; text-align: center">CHAPTER 1</head> > <head type="sub" rend="font-size: 80%; text-align: center">Missy Goes > to Space</head> That's at least double the work of HTML. 
The whole promise of TEI is that we shouldn't have to add the presentational stuff to make it come out right. The TEI has to produce the equivalent of that presentational stuff in the HTML and PDF editions to make this whole thing worth our time. From lee at novomail.net Tue Oct 16 15:46:04 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 16 Oct 2007 16:46:04 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47150A06.1010207@verizon.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> Message-ID: <47153F2C.8060308@novomail.net> Juliet Sutherland wrote: [snip] > Please remember that there is a big difference between what we do in the > formatting rounds and what gets produced by the post-processors. What we > ask our formatters to do has to be quickly learned, easily remembered, > and easily typed. By these criteria, the blank line rules work quite > well. Not for me. The requirement that I put a cursor in a spot on the screen and then count keystrokes as I move it is quite annoying, and one of the things that keeps me from doing more work at DP. And it's really hard to just look at a gap on the screen and know whether it's 3 lines, or 4, or 5. If the markup is explicit (i.e., doesn't rely on non-printing characters) I can see in a glance if the markup is correct. > It would probably work as well to use some markup that says > "chapter" with an opening and closing tag, but for historical reasons, > we don't. Then perhaps it's time to re-examine the process, and change it if it makes sense. 
"Because that's the way we've always done it" is about the /worst/ reason I can imagine to justify anything. [snip] > Just out of curiosity, how would you markup a two line chapter heading? > Something like > > CHAPTER 1 > Missy Goes to Space <div type="chapter"> <head>CHAPTER 1</head> <head>Missy Goes to Space</head> My recollection is that TEI allows as many <head> elements as you want just so long as they all come before other elements. Or... <div type="chapter"> <head>CHAPTER 1 <lb/>Missy Goes to Space</head> This is saying, "there is a single header semantically, which was broken into two lines in the original." (According to the TEI specification, the <lb/> element should be inserted at the /beginning/ of the new line, not the end of the old one. I added a space so it would look acceptable if you were viewing the file natively in a software User Agent that didn't have good support for CSS). > Would you just assume that both lines are part of the title, that that > is all that matters semantically and ignore the fact that the whole > thing was printed on two lines? That would seem to be a strictly > semantic approach. But what happens if someone decides that rendering > chapters with the chapter number first and then the title on another > line would look better. I understand that this is covered by the > transform that converts the semantic information into something > presentational. But how can that transform offer the option of two line > chapter headers if the information about where to break the line has > been lost? 
> > Or, in a more complicated case, how would you handle the article header > seen at > http://www.pgdp.net/c/tools/project_manager/displayimage.php?project=projectID40faf2a2aaaff&imagefile=226.png Assuming that what you have posted is a chapter, off the top of my head (I may have violated some picky TEI DTD requirement): <div type="chapter"> <head>VI—AMERICAN BUSINESS IN THE WAR</head> <head type="sub">Voluntary Cooperation of Experts and Loyal Support of Labor Put Our Industries on a War Basis</head> <byline>By <docAuthor>Grosvenor B. Clarkson</docAuthor> <lb />Director of the U.S. Council of National Defense and of Its Advisory Commission</byline> <epigraph> <cit> <p>Modern wars are not won by mere numbers. They are not won by mere enthusiasm. They are not won by mere national spirit. They are won by the scientific conduct of war, the scientific application of irresistible force.</p> <bibl><author>Woodrow Wilson</author></bibl> </cit> </epigraph> <p>War today means that for every man on the fighting line... </div> > I'm assuming that you'd use some combination of Chapter Title, Chapter > Subtitle, Author, InfoReAuthor, and maybe something to indicate the > opening quote as being different from most other quotes (or not). That > makes perfect sense to me. What concerns me, however, is whether the > currently existing PGTEI transforms that would render this would make > something that looks decent and that makes sense of the information to > the reader. Not that the exact typography be reproduced, but so that the > reader can tell at a glance what's what and how the various lines of > information relate to each other. I don't know how the PGTEI XSL script would handle this. 
My own tei2html program rendered this fragment as: <DIV class="tei-div chapter"> <H3 class="tei-head">VI—AMERICAN BUSINESS IN THE WAR</H3> <H3 class="tei-head sub">Voluntary Cooperation of Experts and Loyal Support of Labor Put Our Industries on a War Basis</H3> <H1 class="tei-byline">By <SPAN class="tei-docAuthor">Grosvenor B. Clarkson</SPAN> <BR class="tei-lb" />Director of the U.S. Council of National Defense and of Its Advisory Commission</H1> <DIV class="tei-epigraph"> <blockquote class="tei-cit"> <P>Modern wars are not won by mere numbers. They are not won by mere enthusiasm. They are not won by mere national spirit. They are won by the scientific conduct of war, the scientific application of irresistible force.</P> <span class="tei-bibl"><SPAN class="tei-author">Woodrow Wilson</SPAN></span> </blockquote> </DIV> <P>War today means that for every man on the fighting line... </P> </DIV> You have to have a CSS stylesheet that sets off the epigraph and right-aligns the "tei-bibl", but I think it looks pretty good (I'll e-mail you a screen shot if you'd like). > Which brings me to my final point. I believe that much of the reluctance > among the DP post-processors about using PGTEI is not because it is > semantic markup, but because they don't trust the current system of > rendering the semantic information to produce something that look > acceptable. I think by your wording you have hit the nail on the head: "they don't trust..." It really is a matter of trust. Because I understand how XML transformations can occur, and how CSS works, and even the XSL scripting language a little, I have complete confidence that semantic TEI markup is capable of preserving all the data necessary to re-render the text in a way that is aesthetically pleasing to me (I refuse to make any judgments as to whether or not it is aesthetically pleasing to anyone else). Even if I thought the HTML output from Mr. 
Perathoner's XSLT scripts sucks (which I do) it wouldn't make any difference to me because all the data is still there in the master file. Those volunteers who are contemplating using TEI markup will have to learn to trust that if they do so there will be tools, either existing or in the future, that will create output that doesn't suck (I believe it is possible to create a CSS file that will let people see good output from the TEI files directly in a web browser). I understand that this can be a hard thing to do, and for many people the only way they will develop this level of trust is to actually see it. But I would encourage everyone to take the leap of faith. [snip] -- Nothing of significance below this line. From prosfilaes at gmail.com Tue Oct 16 15:49:50 2007 From: prosfilaes at gmail.com (David Starner) Date: Tue, 16 Oct 2007 18:49:50 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <4714ED28.6070008@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <47013DAF.5090400@bohol.ph> <47056722.9000801@novomail.net> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> Message-ID: <6d99d1fd0710161549m9fd7ee7v887ba662dca0fe66@mail.gmail.com> On 10/16/07, Lee Passey <lee at novomail.net> wrote: > TEI is inherently a semantic/structural markup language. When you create > a TEI file you are making an implicit promise that it contains at least > a modicum of semantic markup. When you create a TEI file that is purely > presentational you are breaking that promise. The hi tag is in TEI, so by using it you aren't breaking any promises. > I would suggest that if one is going to create files that are purely > presentational then XHTML is a better choice. Whatever that means. 
XHTML is distinctly lacking in several regards for the type of documents I think we should be creating. There's lots of things TEI does better than XHTML, like sidenotes and footnotes and page numbers and chapter markings, things that require little to no editorial interpretation to figure out. > Overlooking the fact that it is a boneheaded design to use non-printing > characters as markup tags, how is the existing rule any easier for > people to use that a rule such as: We've found that the more complex the markup, the more likely it is to get mistyped. The chapter markup is probably on the way out some time, but never for raw TEI. > I think that true TEI texts are valuable today, If a lot of people found them valuable enough, they'd be common. As it is, plain text and HTML and PDF etexts are common; TEI etexts are rare. They're rare because it's hard to make them, and people don't find the additional markup worth it. You could make so many etexts in TEI and make your collection so important that people will choose to work in TEI to work with you; but if you decide to choose to get people working on an existing project to change, then you have to understand that they don't think it's valuable right now. > If we are going to wait for some > hypothetical AI in the future, we would be better off spending our time > preserving books as paper artifacts rather than trying to convert them > to /any/ electronic format. Plain text and HTML have enough value to do today; the users have shown that. That doesn't mean your changes have enough value to do today. > I think that even Level 1 texts should have these kinds of comments > whenever any kind of non-semantic element (e.g. <ab>, <hi>, <seg>, etc.) > is used, explaining why a semantic element could not be chosen. I think that's a waste of time. These elements are perfectly good TEI. 
From prosfilaes at gmail.com Tue Oct 16 16:11:02 2007 From: prosfilaes at gmail.com (David Starner) Date: Tue, 16 Oct 2007 19:11:02 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47153F2C.8060308@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> Message-ID: <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> On 10/16/07, Lee Passey <lee at novomail.net> wrote: > If the markup is explicit (i.e., doesn't rely on non-printing > characters) I can see in a glance if the markup is correct. Can you honestly see in a glance if <div type="chapter"> <head>VI&emdash;AMERICAN BUSINESS IN THE WAR</head> <head type="sub">Voluntary Cooperation of Experts and Loyal Support of Labor Put Our Industries on a War Basis</head> <byline>By <docAuthor>Grosvenor B. Clarkson</docAuthor> <lb/>Director of the U.S. Council of National Defense and of Its Advisory Commission</byline> <epigraph> <cit> <p>Modern wars are not won by mere numbers. They are not won by mere enthusiasm. They are not won by mere national spirit. They are won by the scientific conduct of war, the scientific application of irresistible force.</p> <bibl><author>Woodrow Wilson</author></bibl> </cit> </epigraph> <p>War today means that for every man on the fighting line... </div> is correct? Noisy markup is not easy to verify. > "Because that's the way we've always done it" is about the > /worst/ reason I can imagine to justify anything. Every time we change markup at DP, we get a long period where people get confused about what the right way to do things is. 
There's some definite frustration at DP every so often about how proofers and formatters are constantly having to learn new rules. Not changing things unless there's a definite large benefit is a good reason. If you're familiar with computers at all, there's a huge history of that. Why do AMD-64s boot up in 16-bit real mode? Because every chip that has decided that "that's the way we've always done it" was a bad reason to do things and looked to compete with the Intel x86 line has failed. Why does UTF-8--an ASCII compatible encoding for Unicode--exist? It wasn't part of the original design of Unicode, which specified 16 (or 32) bit characters only; but Unix programmers weren't about to give up dealing with ASCII bytes. I'm sure Unicode could have fought that, but I don't think that would have been a good move for them. > I think by your wording you have hit the nail on the head: "they don't > trust..." It really is a matter of trust. There's an old Arabic saying: "Trust in Allah; but tie up your camel". I've seen obscure formats for text come and go; we want to see it working. > I have complete confidence that semantic TEI markup > is capable of preserving all the data necessary to re-render the text in > a way that is aesthetically pleasing to me The original scans preserve all this data. We need to see it in practice. > Those volunteers who are contemplating using TEI markup will have to > learn to trust that if they do so there will be tools, either existing > or in the future, that will create output that doesn't suck I don't see why they should, and I don't think they will. We want to know that our work is usable today, and we don't want to worry about such tools not appearing. > But I would encourage > everyone to take the leap of faith. And I would encourage everyone not to dedicate work to a text format until the tools do the things they want. 
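For readers who want something concrete to argue over: the blank-line convention discussed in this thread (four blank lines before a chapter header, one blank line between header parts, two blank lines before the chapter text) is regular enough to be turned into explicit markup by a short script. The following is only a minimal sketch in Python, not any existing DP or PGTEI tool. The function name chapters_to_tei and the sample text are hypothetical, the output follows the multiple-<head> style suggested earlier in the thread, and it assumes the input contains nothing but chapters (front matter and section breaks would need extra handling, and a human would still have to confirm that each detected block really is a chapter).

```python
import re
from xml.sax.saxutils import escape


def chapters_to_tei(text: str) -> str:
    """Turn blank-line chapter breaks into explicit TEI-like markup.

    Assumed convention (hypothetical simplification of the DP rule):
    4 blank lines before a chapter header, 1 blank line between header
    parts, 2 blank lines before the chapter text.
    """
    out = []
    # Four blank lines show up as five (or more) consecutive newlines.
    for chunk in re.split(r"\n{5,}", text.strip("\n")):
        # Two blank lines (three newlines) separate header from body.
        header, _, body = chunk.partition("\n\n\n")
        out.append('<div type="chapter">')
        for part in header.split("\n\n"):
            out.append(f"<head>{escape(' '.join(part.split()))}</head>")
        for para in body.split("\n\n"):
            if para.strip():
                out.append(f"<p>{escape(' '.join(para.split()))}</p>")
        out.append("</div>")
    return "\n".join(out)


# Hypothetical sample in the blank-line style.
sample = (
    "\n\n\n\nCHAPTER 1\n\nMissy Goes to Space\n\n\n"
    "It was a dark night.\nVery dark.\n\nSecond paragraph."
    "\n\n\n\n\nCHAPTER 2\n\n\nMore text.\n"
)
result = chapters_to_tei(sample)
print(result)
```

Because the conversion is mechanical, the same script could also be run in reverse as a checker: any chunk that does not split cleanly into header and body is a candidate for a miscounted blank line.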
From lee at novomail.net Tue Oct 16 21:24:40 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 16 Oct 2007 22:24:40 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> Message-ID: <47158E88.8080404@novomail.net> David Starner wrote: > On 10/16/07, Lee Passey <lee at novomail.net> wrote: > >> If the markup is explicit (i.e., doesn't rely on non-printing >> characters) I can see in a glance if the markup is correct. >> > > Can you honestly see in a glance if > [snip] > is correct? Noisy markup is not easy to verify. > Well, yes, I can see it easily in a glance. Although I will admit it would probably take your average DP volunteer two or three tries to rise to that level of proficiency. However, I believe you have mis-stated the proposition, which was that <div type="chapter"> <head>CHAPTER VII</head> <head>THE MERRY LITTLE BREEZES HELP LIGHTFOOT</head> is easier to validate at a glance than is: CHAPTER VII THE MERRY LITTLE BREEZES HELP LIGHTFOOT Could you have seen the hunter... Or how about: VI--AMERICAN BUSINESS IN THE WAR Voluntary Cooperation of Experts and Loyal Support of Labor Put Our Industries on a War Basis By Grosvenor B. Clarkson Director of the U.S. Council of National Defense and of Its Advisory Commission Modern wars are not won by mere numbers. They are not won by mere enthusiasm. They are not won by mere national spirit. They are won by the scientific conduct of war, the scientific application of irresistible force. 
Woodrow Wilson War today means that for every man on the fighting line... Did you see the errors in the foregoing? "Noisy" markup may be hard for humans to validate (although it is trivial for automation to validate), but ambiguous, invisible, and subtle markup is even harder. >> "Because that's the way we've always done it" is about the >> /worst/ reason I can imagine to justify anything. >> > > Every time we change markup at DP, we get a long period where people > get confused about what the right way to do things is. There's some > definite frustration at DP every so often about how proofers and > formatters are constantly having to learn new rules. Not changing > things unless there's a definite large benefit is a good reason. > Well, not changing things unless the benefit outweighs the cost is a good reason. But you're changing the argument from "Because that's the way we've always done it" to "the DP volunteers have such a hard time adapting to new processes that making any change at all is too disruptive to our work." That's definitely a valid argument, I just don't believe it. [snip] > Why does UTF-8--an ASCII compatible encoding for > Unicode--exist? It wasn't part of the original design of Unicode, > which specified 16 (or 32) bit characters only; but Unix programmers > weren't about to give up dealing with ASCII bytes. I'm sure Unicode > could have fought that, but I don't think that would have been a good > move for them. > The 'C' standard libraries contain a number of routines designed to manipulate strings, which were defined as a series of 7 bit characters terminated with the "null" (0) character. If programmers would have started using strings where each character was 16 bits (2 bytes, UCS-2 encoding) the standard 'C' libraries could not have been used, because the strings would have contained embedded zeros. Even now, many databases cannot store double-byte strings except as Binary Large OBjects (BLOBs). 
UTF-8 was developed because it enabled programmers to store Unicode characters in strings without encountering embedded nulls, while continuing to use a large corpus of code written initially for 7-bit characters. UTF-8 exists not because Unix programmers were unwilling to change, but because it enabled the continued use of a large body of existing code and programs. [snip] >> I have complete confidence that semantic TEI markup >> is capable of preserving all the data necessary to re-render the text in >> a way that is aesthetically pleasing to me >> > > The original scans preserve all this data. We need to see it in practice. > True, the original scans preserve all the data, if they are complete and of sufficient quality, but at a huge cost in usability. I simply cannot envision reading by looking at image scans on my 2.8 inch PDA screen. TEI encoding can preserve all the same data, but with an exponential increase in usability. >> Those volunteers who are contemplating using TEI markup will have to >> learn to trust that if they do so there will be tools, either existing >> or in the future, that will create output that doesn't suck >> > > I don't see why they should, and I don't think they will. We want to > know that our work is usable today, and we don't want to worry about > such tools not appearing. > TEI (or a similar markup) is the future. I just don't want to have to redo your work next year, because what you did today is inadequate. 
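The embedded-null point in the message above is easy to check. Here is a small Python sketch (not from the original thread; the helper name has_embedded_nul and the sample strings are illustrative assumptions) that encodes the same text as UTF-16, which for these characters matches the two-byte UCS-2 layout, and as UTF-8, then looks for the zero bytes that would end a C-style string early.

```python
def has_embedded_nul(data: bytes) -> bool:
    """Return True if the byte string contains a zero byte."""
    return b"\x00" in data


sample = "CHAPTER VII"  # hypothetical sample text, pure ASCII

utf16 = sample.encode("utf-16-le")  # two bytes per character for this text
utf8 = sample.encode("utf-8")       # one byte per ASCII character

# Two-byte encoding: every ASCII character carries a zero high byte.
print("UTF-16 has zero bytes:", has_embedded_nul(utf16))

# UTF-8: no zero bytes, and the bytes are identical to plain ASCII.
print("UTF-8 has zero bytes:", has_embedded_nul(utf8))
print("UTF-8 equals ASCII:", utf8 == sample.encode("ascii"))

# Non-ASCII characters become multi-byte sequences in UTF-8,
# but still without any zero byte.
accented = "Coöperation"
print("Accented UTF-8 has zero bytes:",
      has_embedded_nul(accented.encode("utf-8")))
```

Code that treats a zero byte as end-of-string therefore keeps working on UTF-8 data, which is the compatibility argument made above.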
From marcello at perathoner.de Wed Oct 17 02:22:27 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 17 Oct 2007 11:22:27 +0200 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710161543n3fe18eb1wccadceb312a14b03@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47152D50.3000205@perathoner.de> <6d99d1fd0710161543n3fe18eb1wccadceb312a14b03@mail.gmail.com> Message-ID: <4715D453.2020907@perathoner.de> David Starner wrote: >> <div> >> <index type="toc" level1="Chapter 1: Missy goes to space" /> >> <head rend="font-size: 200%; text-align: center">CHAPTER 1</head> >> <head type="sub" rend="font-size: 80%; text-align: center">Missy Goes >> to Space</head> > > That's at least double the work of HTML. No, it's not. And it gets you 3 user formats in one markup run. BTW, the code above was written to answer Juliet's question. In a production environment you would use a PGTEI stylesheet, so you would only write those formatting rules down once for all heads and subheads. > The whole promise of TEI is > that we shouldn't have to add the presentational stuff to make it come > out right. Definitely! The TEI converter should send tiny experimental signals down the neural pathways to the visual and spatial cognition centers of the PPer's brain to see what kind of visual formatting is most likely to please this one particular PPer. The next version of PGTEI will support the "Sirius Cybernetics USB 2.0 Synaptic Scanner". Until then, you can work around it using stylesheets and the "rend" attribute, just like you are accustomed to doing in HTML. 
-- Marcello Perathoner webmaster at gutenberg.org From joshua at hutchinson.net Wed Oct 17 05:50:29 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Wed, 17 Oct 2007 12:50:29 +0000 (UTC) Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data Message-ID: <3466526.1192625429600.JavaMail.?@fh1064.dia.cp.net> Ok, all sarcastic commentary aside ... the default render for <head> and <head type="sub"> gets the job done. ie. <head>CHAPTER 1</head> <head type="sub">Missy Goes to Space</head> Renders with a large "CHAPTER 1" and a slightly smaller "Missy Goes to Space". The only rend attribute necessary is a centering attribute (the same thing you'd need to add in an HTML document because <h1> isn't automatically centered). Now, just like HTML, you can center each <head> individually (<head rend="text-align: center">) or you can put a line in the stylesheet section at the beginning telling it to center every <head> element. Honestly, for 99% of the stuff we see, the TEI code is no more complex than the HTML code equivalent. It's just that it is DIFFERENT from the HTML code equivalent and therefore needs different tools/scripts if you want to automate any part of it. Marcello has said, repeatedly, he's not interested nor has the time to write such tools and scripts. I've admitted I don't have the ability. Lee sounds like he has the ability, but perhaps not the time or inclination. If someone was willing to step up and start creating a tool/regex scripts/perl scripts/whatever, I'd be happy to work with them and I'm positive so would the couple of other people active in trying to work with TEI (Ralf and others from DP). Right now, the arguments are going to be fairly useless and circular in nature simply because we don't have the tools to take the process to the next level. And that is the main reason I haven't been actively stumping for TEI in quite a while (just answering questions as I see them). 
Josh >_______________________________________________ >gutvol-d mailing list >gutvol-d at lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > From piggy at netronome.com Wed Oct 17 06:14:45 2007 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Wed, 17 Oct 2007 09:14:45 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47158E88.8080404@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> Message-ID: <47160AC5.2060007@netronome.com> Lee Passey wrote: > David Starner wrote: > >> On 10/16/07, Lee Passey <lee at novomail.net> wrote: >> >> >> ... >>> "Because that's the way we've always done it" is about the >>> /worst/ reason I can imagine to justify anything. >>> >>> >> Every time we change markup at DP, we get a long period where people >> get confused about what the right way to do things is. There's some >> definite frustration at DP every so often about how proofers and >> formatters are constantly having to learn new rules. Not changing >> things unless there's a definite large benefit is a good reason. >> >> > > Well, not changing things unless the benefit outweighs the cost is a > good reason. But you're changing the argument from "Because that's the > way we've always done it" to "the DP volunteers have such a hard time > adapting to new processes that making any change at all is too > disruptive to our work." That's definitely a valid argument, I just > don't believe it. > Experiments at PGDP are pretty easy to conduct. Could we hear from someone who has run a book through PGDP asking the F* rounds to use PGTEI instead of normal PGDP markup? I have some easy novels similar to novels already in PG which I am willing to make available to someone willing to run such an experiment. 
Can we even get enough interest from formatters to complete one such book? >>> I have complete confidence that semantic TEI markup >>> is capable of preserving all the data necessary to re-render the text in >>> a way that is aesthetically pleasing to me >>> >>> >> The original scans preserve all this data. We need to see it in practice. >> >> > > True, the original scans preserve all the data, if they are complete and > of sufficient quality, but at a huge cost in usability. I simply cannot > envision reading by looking at image scans on my 2.8 inch PDA screen. > TEI encoding can preserve all the same data, but with an exponential > increase in usability. > I tend to agree that TEI is more easily manipulated than even very good page scans, but I take issue with the claim that it can preserve all the same data. I have a large collection of high-resolution color and grayscale scans of blank paper. I'm more than happy to provide one or two to anyone who would like to attempt TEI markup of the paper properties I'm interested in. TEI is great, but we need to preserve high-grade scans too. > >>> Those volunteers who are contemplating using TEI markup will have to >>> learn to trust that if they do so there will be tools, either existing >>> or in the future, that will create output that doesn't suck >>> >>> >> I don't see why they should, and I don't think they will. We want to >> know that our work is usable today, and we don't want to worry about >> such tools not appearing. >> >> > > TEI (or a similar markup) is the future. I just don't want to have to > redo your work next year, because what you did today is inadequate. > If I'm researching the metrical structure of Shakespearean sonnets, I might START with PG TEI, but the TEI markup I'm looking for is most likely missing or not very reliable. No TEI text will be sufficient for every potential user. I think we should be able to make a good case for TEI without making unnecessarily broad claims. 
Whatever we do today WILL be inadequate for some future user. Let's do what we can to give them a good starting point. From prosfilaes at gmail.com Wed Oct 17 07:04:17 2007 From: prosfilaes at gmail.com (David Starner) Date: Wed, 17 Oct 2007 10:04:17 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47158E88.8080404@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> Message-ID: <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> On 10/17/07, Lee Passey <lee at novomail.net> wrote: > TEI (or a similar markup) is the future. No one knows what's in the future. From my vantage point, I've seen a lot of movement towards HTML, with all those devices that supposedly need specialized handling getting more and more artful at dealing with the ever-present HTML. > I just don't want to have to > redo your work next year, because what you did today is inadequate. If you have the time to go around redoing inadequate work, you mind if I upload my scans of the Grammar of the Lau Language to you? Between the ASCIIifaction and the loss of graphics, it's been on my list for redoing for a long time. Thanks. 
From lee at novomail.net Wed Oct 17 08:53:59 2007 From: lee at novomail.net (Lee Passey) Date: Wed, 17 Oct 2007 09:53:59 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> Message-ID: <47163017.5090605@novomail.net> David Starner wrote: > > If you have the time to go around redoing inadequate work, you mind if > I upload my scans of the Grammar of the Lau Language to you? Between > the ASCIIifaction and the loss of graphics, it's been on my list for > redoing for a long time. Thanks. Ordinarily I'm open to these kinds of suggestions, but there are a few extra considerations. Most importantly, I want to get the biggest bang for my buck. There are, no doubt, hundreds, if not thousands, of works in the early PG corpus that not only need to be redone, but which are quite popular. While The Grammar of the Lau Language is no doubt of interest to you, it seems to be a rather esoteric work, devoid of interest to the public at large. Mr. Perathoner claims that the most popular work at PG is Jane Austen's Pride and Prejudice. He has offered no evidence to support this claim, but it doesn't seem unlikely to me. So after I complete the two conversions I have in my queue right now, and if I can't find a reputable TEI version of Pride and Prejudice, that will probably be my next project. 
It would be really great if we could figure out a way to gather download statistics from PG over the past 4-5 years, so we could get a better handle on just what works /are/ of the greatest interest to the general public, and then focus our effort on re-doing those works. The Gutenberg web site only lists the most popular downloads in the past 30 days, but I note that the Internet Archive's Wayback Machine has archived these pages since Sept. 2004, so I might be able to write a tool that can aggregate these pages. I thought I had read in the PG FAQ that PG is not really interested in archiving multiple editions of the same work. After all, Project Gutenberg is, in point of fact, an e-publisher that publishes its own editions. So I don't think that PG would be open to archiving different editions of the same work. I also think that Mr. Hutchinson is right when he says that while the submission of a degraded text version of any particular work is not a de jure rule, it /is/ a de facto one. I have absolutely no interest in creating degraded text versions of any of the works I transcribe, nor do I have any interest in encouraging other people to do so. So, in all likelihood, I will not submit to Project Gutenberg any versions of its most popular downloads that I have redone. These will be submitted to the Internet Archive instead. If anyone would like to join me in my efforts, I would be glad to have the help. -- Nothing of significance below this line. 
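The aggregation tool floated in the message above could start from something like the following. This is only a sketch under stated assumptions: it assumes the archived "top downloads" pages have already been fetched and parsed into ranked lists (that part is not shown), the position-based scoring is one arbitrary choice among many, and the titles "Work A" to "Work D" are placeholders, not real download data.

```python
from collections import defaultdict


def aggregate_rankings(snapshots, list_size=100):
    """Combine several ranked top-N lists into one long-term ranking.

    `snapshots` maps a snapshot label (for example a date) to a list of
    titles, most downloaded first. Each appearance earns
    (list_size - position) points, so a title missing from a snapshot
    simply earns nothing for it.
    """
    scores = defaultdict(int)
    for titles in snapshots.values():
        for position, title in enumerate(titles[:list_size]):
            scores[title] += list_size - position
    # Highest total first; ties are broken alphabetically for stability.
    return sorted(scores, key=lambda title: (-scores[title], title))


# Placeholder data standing in for parsed snapshot pages.
snapshots = {
    "2004-09": ["Work A", "Work B", "Work C"],
    "2005-09": ["Work B", "Work A", "Work D"],
    "2006-09": ["Work B", "Work C", "Work A"],
}

ranking = aggregate_rankings(snapshots, list_size=3)
print(ranking)
```

Scoring by list position rather than by raw download counts is deliberate here, since a ranked list is all that a "most popular in the past 30 days" page is known to provide.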
From jon at noring.name Wed Oct 17 09:13:31 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 10:13:31 -0600 Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads In-Reply-To: <47160AC5.2060007@netronome.com> References: <20071001081923.GA29575@ark.in-berlin.de> <200710051727.42671.rolsch@verizon.net> <470E5720.6030207@netronome.com> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <47160AC5.2060007@netronome.com> Message-ID: <121591611.20071017101331@noring.name> La Monte H.P. Yarroll wrote: > I think we should be able to make a good case for TEI without making > unnecessarily broad claims. Whatever we do today WILL be inadequate > for some future user. Let's do what we can to give them a good > starting point. This is an excellent comment. The common understandings we have in this set of TEI-related threads are the following: 1) Each text project will use a known source book, and the final digitized text, in whatever form, will be "accurate" to that source book, and will include metadata referencing that source book. (Note that in this statement "accurate" remains undefined.) 2) Each text project will always make available the source book scanset in (at least) sufficient quality for OCR, human proofing, verifying text accuracy by end-users, and discerning the original typography. (I believe every scanset should be archival quality but this is an issue not germane to this particular discussion.) 3) Each text project will produce a "digital master" from which all user renditions, and other types of uses, will be derived. 4) The "digital master" will be an XML document marked up with some "flavor" of TEI. 
[Note: There may be a couple other common understandings that I've not included in the above list, and certainly mention them if you think of them. But I think this is a good starting point of where I believe most of us participating in these threads agree with. However, if we don't have super-majority agreement on the above four items, then the gap in views is wider than I suspected, and I doubt we can ever get to any agreement at all on the specifics of implementation if we can't even agree on the general principles. I'll assume in the comments below that we have collective majority agreement on the above general understandings.] What is obvious, though, is that these common understandings are not of sufficient completeness that the specifics of implementation become crystal clear -- they just don't fall out. So that's the reason for our discussions, to clarify each understanding and maybe also add to the list. And this is proving difficult because we tend to fall into different "camps" as Josh, and then Lee, so eloquently explained. Alright, now to provide maybe a little more on the above from my perspective... Obviously, a dream we all have is that the "master" will have all that is needed to allow push-button auto-conversion, using today's technology, for *all conceivable renditions and uses* we can ever imagine. But the reality is that this is unreasonable and probably impossible. I think we do agree on this. Thus, I see a "master" as a sort of intermediary which captures the most important information common to all conceivable uses, and maybe with some added support for some select uses. Thus the "master" becomes, as La Monte says, a good starting point. I believe, then, that we come to some agreement about what minimum information the master should capture, and what form the information will take in the master. This is where we disagree on the specifics. 
[As an aside, hopefully we can work towards some general set of requirements to aid in decision making -- this is what any competent engineering organization does in project development. But we haven't yet taken an objective requirements approach to this, and if we don't, then agreement will never be reached. Rather it will become a Darwinian race between various factions to see whose views prevail, and contrary to what others may otherwise think, oftentimes such races do not lead to the best long-term result for the common vision we all share. The best long-term result might happen, but then it is more likely not -- it depends upon the views of those pushing their particular solution.] Obviously, I think we can agree that if some information is pretty trivial (effort-wise) to add to the master during mastering which is useful to certain *recognized* end-uses, and such information does not inhibit other important uses, then it makes sense to add it to the master. We might, too, consider which end-uses are the most important and make sure we have enough information to allow for full auto-conversion for those uses, or get it very close to that level. So that may be a discussion thread: what are the most important user renditions/uses we need to support above all others? (The one clinker in this last consideration has less to do with typography and more to do with "accuracy" -- error correction. I believe the master must, for a couple reasons I've noted previously, faithfully preserve the original text in the source book, including author's, publisher's, and typesetter's errors. And, yes, sometimes decisions have to be made about what exactly to transcribe to be "accurate" to the original. But this does not preclude marking up, in the "master", corrections to such errors. 
Conversion systems for some end-use purpose can then decide whether to use the "original text warts and all", or use the corrections, or a particular set of corrections since we should, I believe, allow for different sets of corrections based on different perspectives or end-uses.) Anyway, I could go on, but I'll mercifully end this message. Thoughts? Additions to the general agreements list? Jon From traverso at posso.dm.unipi.it Wed Oct 17 09:13:52 2007 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Wed, 17 Oct 2007 18:13:52 +0200 (CEST) Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47163017.5090605@novomail.net> (message from Lee Passey on Wed, 17 Oct 2007 09:53:59 -0600) References: <20071001081923.GA29575@ark.in-berlin.de> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> <47163017.5090605@novomail.net> Message-ID: <20071017161352.5A02410231@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> I thought I had read in the PG faq that PG is not really Lee> interested in archiving multiple editions of the same Lee> work. After all, Project Gutenberg is, in point of fact, an Lee> e-publisher that publishes its own editions. So I don't think Lee> that PG would be open to archiving different editions of the Lee> same work. I don't know if this is written in the FAQ, but practically it is absolutely false. PG has many examples of multiple editions, (transcriptions of different editions marginally different), and many are added regularly. 
It also has a few multiple transcriptions of the same edition, but this has stopped: new transcriptions are used to produce a merged new edition. A new transcription is never discarded. Carlo Traverso From jon at noring.name Wed Oct 17 09:25:23 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 10:25:23 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <470EB4BC.9090201@novomail.net> <20071013182304.GA5263@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> Message-ID: <1388279505.20071017102523@noring.name> David Starner wrote: > No one knows what's in the future. From my vantage point, I've seen a > lot of movement towards HTML, with all those devices that supposedly > need specialized handling getting more and more artful at dealing with > the ever-present HTML. Well, David's comment seems to make the assumption that the end-user version is the same as the "master" version. Lee's comments that led to this assume a "master" from which other renditions, like XHTML, are derived. Is the concept of an intermediary "master" from which all end-user renditions are derived something on which we have not yet come to a collective agreement? 
Jon Noring From prosfilaes at gmail.com Wed Oct 17 09:35:55 2007 From: prosfilaes at gmail.com (David Starner) Date: Wed, 17 Oct 2007 12:35:55 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <1388279505.20071017102523@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> <1388279505.20071017102523@noring.name> Message-ID: <6d99d1fd0710170935i76b58b9en3aa954c579785371@mail.gmail.com> On 10/17/07, Jon Noring wrote: > David Starner wrote: > > > No one knows what's in the future. From my vantage point, I've seen a > > lot of movement towards HTML, with all those devices that supposedly > > need specialized handling getting more and more artful at dealing with > > the ever-present HTML. > > Well, David's comment seems to make the assumption that the end-user > version is the same as the "master" version. It doesn't make that assumption. Today, HTML is the most common master format for ebooks, at least the non-commercial kind. And partially because of that, most ebook readers read HTML, either directly, or by converting it into a different end-user version. Frankly, I don't see HTML being unseated as the primary master ebook format by TEI; people are going to use HTML because it's easy to create, and everyone can read it directly. The question for me is whether TEI is going to be a major format for PG. 
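The master-versus-rendition distinction at issue in this exchange can be made concrete with a toy converter. The sketch below is not PGTEI and not the "gnutenberg press"; it assumes a tiny TEI-like fragment without namespaces, borrows the chapter heads from the example earlier in the thread, and invents the paragraph text and the function names `to_html` and `to_text` purely for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny TEI-like "master". The <p> content is invented for this example.
MASTER = """<div>
  <head>CHAPTER 1</head>
  <head type="sub">Missy Goes to Space</head>
  <p>Missy packed her bag.</p>
</div>"""


def to_html(div):
    """Render heads and paragraphs as HTML; centering is decided once here."""
    out = []
    for el in div:
        text = escape(" ".join("".join(el.itertext()).split()))
        if el.tag == "head":
            tag = "h3" if el.get("type") == "sub" else "h2"
            out.append(f'<{tag} style="text-align: center">{text}</{tag}>')
        elif el.tag == "p":
            out.append(f"<p>{text}</p>")
    return "\n".join(out)


def to_text(div, width=40):
    """Render the same master as plain text, centering heads with spaces."""
    out = []
    for el in div:
        text = " ".join("".join(el.itertext()).split())
        if el.tag == "head":
            out.append(text.center(width).rstrip())
        elif el.tag == "p":
            out.append(text)
    return "\n".join(out)


div = ET.fromstring(MASTER)
html_edition = to_html(div)
text_edition = to_text(div)
print(html_edition)
print(text_edition)
```

Because the centering rule lives in the converter rather than in the master, changing it affects every head in every rendition at once, which is roughly the role a stylesheet line plays in the discussion above.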
From joshua at hutchinson.net Wed Oct 17 09:36:29 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Wed, 17 Oct 2007 16:36:29 +0000 (UTC) Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data Message-ID: <14178793.1192638989699.JavaMail.?@fh1064.dia.cp.net> The problem is that "markup nuts" (and I use the term affectionately), like you (Jon) and Lee, view a master document on its own, assuming that conversion to end user formats is possible and nothing to be worried about. Folks like David, judge a master document by the end-user format that results from it. So the HTML and PDF that outputs from a TEI master are the yardstick by which TEI is being measured. Both are a valid yardstick, but they make the arguments back and forth seem to have a conceptual gap between them. Josh From prosfilaes at gmail.com Wed Oct 17 09:42:55 2007 From: prosfilaes at gmail.com (David Starner) Date: Wed, 17 Oct 2007 12:42:55 -0400 Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads In-Reply-To: <121591611.20071017101331@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <47160AC5.2060007@netronome.com> <121591611.20071017101331@noring.name> Message-ID: <6d99d1fd0710170942k28c03fb1q6cb55f7936b92ca6@mail.gmail.com> On 10/17/07, Jon Noring wrote: > I > believe the master must, for a couple reasons I've noted previously, > faithfully preserve the original text in the source book, including > author's, publisher's, and typesetter's errors. Eh. There are some volumes for which this is important. But if you really want to go through French and Oriental Love in a Harem (http://www.gutenberg.org/etext/21868) and pick out all the times where the typesetter forgot which way u's go or ran out of b's and started using h's, go ahead. 
From prosfilaes at gmail.com Wed Oct 17 09:51:38 2007 From: prosfilaes at gmail.com (David Starner) Date: Wed, 17 Oct 2007 12:51:38 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <47163017.5090605@novomail.net> References: <20071001081923.GA29575@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> <47163017.5090605@novomail.net> Message-ID: <6d99d1fd0710170951r14a2a14r61db24125666032a@mail.gmail.com> On 10/17/07, Lee Passey wrote: > Mr. Perathoner claims that the most popular work at PG is Jane Austen's > Pride and Prejudice. He has offered no evidence to > support this claim, but it doesn't seem unlikely to me. So after I > complete the two conversions I have in my queue right now, and if I > can't find a reputable TEI version of Pride and Prejudice > that will probably be my next project. But is there any evidence that the PG edition of Pride and Prejudice is lacking anything? It seems very political to go after the most highly visible material, rather than working on stuff that most needs doing. > So, in all likelihood, I will not submit to Project Gutenberg any > versions of its most popular downloads that I have redone. These will be > submitted to the Internet Archive instead. Then why are you posting here? This is a list of PG volunteers to discuss PG. If you want to discuss ebooks unrelated to PG, the bookpeople list is quite open to that. > If anyone would like to join me in my efforts, I would be glad to have > the help. It's generally considered tacky to try and drag away volunteers from a competing project on their mailing lists. 
From piggy at netronome.com Wed Oct 17 10:08:51 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 17 Oct 2007 13:08:51 -0400 Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads In-Reply-To: <6d99d1fd0710170942k28c03fb1q6cb55f7936b92ca6@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <47160AC5.2060007@netronome.com> <121591611.20071017101331@noring.name> <6d99d1fd0710170942k28c03fb1q6cb55f7936b92ca6@mail.gmail.com> Message-ID: <471641A3.2010704@netronome.com> David Starner wrote: > On 10/17/07, Jon Noring wrote: > >> I >> believe the master must, for a couple reasons I've noted previously, >> faithfully preserve the original text in the source book, including >> author's, publisher's, and typesetter's errors. >> > > Eh. There are some volumes for which this is important. But if you > really want to go through French and Oriental Love in a Harem > (http://www.gutenberg.org/etext/21868) and pick out all the times > where the typesetter forgot which way u's go or ran out of b's and > started using h's, go ahead. > David brings out two very nice points: 1) Not every work deserves the same amount of effort. 2) In a volunteer project, each volunteer gets to decide what deserves their effort. I'm experimenting with TEI because I have been frustrated with getting both HTML and text editions that are consistent with each other and make me feel good about having produced them. 
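The disagreement quoted above, whether to keep typesetter's errors or to read a corrected text, need not be settled in the master if corrections are recorded as markup. The sketch below assumes the TEI P5 convention of wrapping a `sic`/`corr` pair in `choice`; the thread does not say how PGTEI encodes corrections, and the sample sentence, the absence of namespaces, and the function name `render` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: a typesetter who "ran out of b's and started using h's".
SAMPLE = (
    "<p>He opened the <choice><sic>hook</sic><corr>book</corr></choice>"
    " and read.</p>"
)


def render(elem, use_corrections):
    """Flatten an element to text, picking <corr> or <sic> inside <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            wanted = "corr" if use_corrections else "sic"
            picked = child.find(wanted)
            if picked is not None:
                parts.append(render(picked, use_corrections))
        else:
            parts.append(render(child, use_corrections))
        # Text following the child element belongs to the parent's flow.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
as_printed = render(paragraph, use_corrections=False)
as_corrected = render(paragraph, use_corrections=True)
print(as_printed)
print(as_corrected)
```

A volunteer who does not think a given book deserves this effort can simply transcribe the corrected reading; the markup only matters where someone chooses to record both.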
From jon at noring.name Wed Oct 17 10:21:48 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 11:21:48 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710170951r14a2a14r61db24125666032a@mail.gmail.com> References: <20071001081923.GA29575@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <6d99d1fd0710170704w3b2ee680k87e8e99da324d2a@mail.gmail.com> <47163017.5090605@novomail.net> <6d99d1fd0710170951r14a2a14r61db24125666032a@mail.gmail.com> Message-ID: <1839225867.20071017112148@noring.name> David Starner wrote: > Lee Passey wrote: >> Mr. Perathoner claims that the most popular work at PG is Jane Austen's >> Pride and Prejudice. He has offered no evidence to >> support this claim, but it doesn't seem unlikely to me. So after I >> complete the two conversions I have in my queue right now, and if I >> can't find a reputable TEI version of Pride and Prejudice >> that will probably be my next project. > But is there any evidence that the PG edition of Pride and Prejudice > is lacking anything? It seems very political to go after the most > highly visible material, rather than working on stuff that most needs > doing. Well, without rehashing the *many* reasons, the most popular works of the Public Domain in the PG corpus need to be redone from scratch. The PG versions, particularly the pre-DP ones, are wholly unsatisfactory for one or more reasons. And regarding "what most needs doing", DP is working at a fast clip, so I don't have any worries that Lee will somehow slow DP down in any way. He won't. And those who work outside of DP will continue to do what they will continue to do. 
>> So, in all likelihood, I will not submit to Project Gutenberg any >> versions of its most popular downloads that I have redone. These will be >> submitted to the Internet Archive instead. > Then why are you posting here? This is a list of PG volunteers to > discuss PG. If you want to discuss ebooks unrelated to PG, the > bookpeople list is quite open to that. Well, maybe Lee is finding who here might wish to join him. :^) After all, the goal of PG is to digitize the Public Domain books, and make them freely available to the world. I know Michael is *supportive* of other projects that do the same. He himself has said it. So Lee's comment is very much appropriate to gutvol-d. Now if Greg and Michael think differently than I wrote above, then I hope they chime in since they are the ones who actually administer this group. >> If anyone would like to join me in my efforts, I would be glad to have >> the help. > It's generally considered tacky to try and drag away volunteers from a > competing project on their mailing lists. See my comment above. What Lee proposes is not competitive -- it is in the spirit of PG as envisioned by Michael Hart. So, by definition, it cannot take away volunteers from PG. If anything, Lee's proposal will add volunteers to this goal. Btw, count me in as joining Lee in his effort. Anyone else here interested in helping Lee out, contact him directly -- or contact me if you can't get a hold of him. 
Jon Noring From jon at noring.name Wed Oct 17 10:38:39 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 11:38:39 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <14178793.1192638989699.JavaMail.?@fh1064.dia.cp.net> References: <14178793.1192638989699.JavaMail.?@fh1064.dia.cp.net> Message-ID: <751740422.20071017113839@noring.name> Josh wrote: > The problem is that "markup nuts" (and I use the term affectionately), > like you (Jon) and Lee, view a master document on its own, assuming > that conversion to end user formats is possible and nothing to be > worried about. LOL. To the contrary, we consider the conversion to end-user renditions to be critical, but we are looking at the even bigger picture of ebook formats, various platforms from 2" screens and larger, etc. (I think many of the "beautiful" XHTML editions being produced at DP will not do well on such platforms.) And also the needs of the accessibility community. Nevertheless, I don't think the gap between the two camps is as large as some may think it to be since the arguments have more to do with details rather than general philosophy. I also suspect for 80% to 90% of the books, there are not that many issues that lead to wild disagreements regarding markup. As another example, Juliet mentioned about some poetry where the verse lines are indented various lengths in the original, and that is important information that needs to be preserved since the author had some reason for variable indentation. So I see value for some renditions needing to reproduce that, and the "rend" attribute seems sufficient to communicate that information. But note that for some end-user renditions, such as for 2" screens, you *don't* want to force much if any indentation of the verse because the verse will become unreadable. And what about text-to-speech for the blind? 
I don't think Lee and I are saying we will dump certain typographic information, but that it must be preserved the right way, and used where appropriate, and NOT used where inappropriate. We must not FORCE all end-user renditions to present the texts a certain way. And I see some of the DP work product getting perilously close to this border. That is, DP may be releasing end-user renditions as "masters" that do not make good "masters" and also do not make good renditions for certain platforms. > Folks like David, judge a master document by the end-user format that > results from it. So the HTML and PDF that outputs from a TEI master > are the yardstick by which TEI is being measured. > > Both are a valid yardstick, but they make the arguments back and forth > seem to have a conceptual gap between them. Yes, agreed. In my other message where I propose the understandings we do seem to share, I make note that the "master" may certainly include information of value for the more recognized end-use renditions. Refer to that for more details. And of course my example mentioned above. Jon From jon at noring.name Wed Oct 17 10:53:51 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 11:53:51 -0600 Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads In-Reply-To: <471641A3.2010704@netronome.com> References: <20071001081923.GA29575@ark.in-berlin.de> <4713A930.7060103@novomail.net> <20071016080528.GA4609@ark.in-berlin.de> <4714ED28.6070008@novomail.net> <47150A06.1010207@verizon.net> <47153F2C.8060308@novomail.net> <6d99d1fd0710161611n5005bad3he9364a9b37b30e75@mail.gmail.com> <47158E88.8080404@novomail.net> <47160AC5.2060007@netronome.com> <121591611.20071017101331@noring.name> <6d99d1fd0710170942k28c03fb1q6cb55f7936b92ca6@mail.gmail.com> <471641A3.2010704@netronome.com> Message-ID: <371969742.20071017115351@noring.name> La Monte H.P. Yarroll wrote: > David Starner wrote: >> Eh.
There are some volumes for which this is important. But if you >> really want to go through French and Oriental Love in a Harem >> (http://www.gutenberg.org/etext/21868) and pick out all the times >> where the typesetter forgot which way u's go or ran out of b's and >> started using h's, go ahead. To answer David, one reason is for purposes of aligning the master with the digital scans for future proofing, OCRing, etc. Plus, there are scholars who may be interested in this. Certainly for these kinds of texts which PG/DP is now doing, they will correct it *anyway*, so we can certainly provide the original *and* the marked up corrections. That is, DP gets the original information, and then throws it away with the corrections. Who says the original information needs to be thrown away? (As an aside, I do believe the "master" must preserve the original line breaks, even "in-word", since this is important for alignment plus the event someone didn't correctly join an EOL broken word. In addition, it will be useful for those who may wish to produce an exact facsimile reproduction. After all, it is information that the OCR or key entry gives us, so stripping it out *removes* information we already have which may prove useful to both the project and some users of the "master".) > David brings out two very nice points: > > 1) Not every work deserves the same amount of effort. Exactly! > 2) In a volunteer project, each volunteer gets to decide what deserves > their effort. > > I'm experimenting with TEI because I have been frustrated with getting > both HTML and text editions that are consistent with each other and make > me feel good about having produced them. This again is the idea of a "master", to capture the critical information in the most processible manner so others who wish to produce specific renditions meeting specific requirements have a reasonable starting point to build upon. 
Certainly we should strive for auto-conversion for the most important renditions which are most widely used, but other more specialized renditions may require some human effort, and whether anyone will do the effort depends upon whether the work is worth it to them. Apparently David Starner feels that "French and Oriental Love in a Harem" is important enough for him to produce a very fancy rendition of it, while to others that book will not be considered important enough and the original page scans and the "raw accurate text" sufficient for their needs... Jon Noring From grythumn at gmail.com Wed Oct 17 11:29:18 2007 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 17 Oct 2007 14:29:18 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <751740422.20071017113839@noring.name> References: <14178793.1192638989699.JavaMail.?@fh1064.dia.cp.net> <751740422.20071017113839@noring.name> Message-ID: <15cfa2a50710171129u26d95c6cldca8cd6e406d74a2@mail.gmail.com> On 10/17/07, Jon Noring wrote: > As another example, Juliet mentioned about some poetry where the verse > lines are indented various lengths in the original, and that is > important information that needs to be preserved since the author had > some reason for variable indentation. So I see value for some renditions > needing to reproduce that, and the "rend" attribute seems sufficient to > communicate that information. But note that for some end-user > renditions, such as for 2" screens, you *don't* want to force much if > any indentation of the verse because the verse will become unreadable. Just to have an example handy, Christmas and its associations has a poem formatted in the shape of a stocking. I didn't PP this one, I just scanned it. 
http://www.gutenberg.org/files/22042/22042-h/22042-h.htm http://www.gutenberg.org/files/22042/22042-h/images/stocking.jpg R C From Bowerbird at aol.com Wed Oct 17 11:39:22 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Oct 2007 14:39:22 EDT Subject: [gutvol-d] pushing the merry-go-round Message-ID: my spam folder is overflowing! it looks like lee is pushing the noring merry-go-round, and taking the whole listserve for a ride... :+) i'm glad i got off that thing, or i would be _very_ dizzy. meanwhile, my .pdf converter is starting to look good! -bowerbird From jon at noring.name Wed Oct 17 11:41:55 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 12:41:55 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <15cfa2a50710171129u26d95c6cldca8cd6e406d74a2@mail.gmail.com> References: <14178793.1192638989699.JavaMail.?@fh1064.dia.cp.net> <751740422.20071017113839@noring.name> <15cfa2a50710171129u26d95c6cldca8cd6e406d74a2@mail.gmail.com> Message-ID: <769730035.20071017124155@noring.name> Robert wrote: > Jon Noring wrote: >> As another example, Juliet mentioned about some poetry where the verse >> lines are indented various lengths in the original, and that is >> important information that needs to be preserved since the author had >> some reason for variable indentation. So I see value for some renditions >> needing to reproduce that, and the "rend" attribute seems sufficient to >> communicate that information. But note that for some end-user >> renditions, such as for 2" screens, you *don't* want to force much if >> any indentation of the verse because the verse will become unreadable.
> Just to have an example handy, Christmas and its associations has a > poem formatted in the shape of a stocking. I didn't PP this one, I > just scanned it. > > http://www.gutenberg.org/files/22042/22042-h/22042-h.htm > http://www.gutenberg.org/files/22042/22042-h/images/stocking.jpg LOL, cool. :^) In many prior messages I do note that when the typography itself becomes content, that one may consider SVG. This way the content is still accessible, and has fallback to the core text, but it will also reproduce the layout. Fortunately things like this are not that common -- they occur, but only in a small percentage. This is one of those things that really challenges cross-platform presentation. Yet we must not throw up our hands in despair and say it can't be done, but simply compromise as to what will be presented on some platforms. Trying to *force* the stocking typography on 2" cellphone screens may end up making the content inside the stocking, which is the core content, very difficult if not impossible to read. Is that what we want to do? Thus, our solutions have to take into account the wide cross-platform visual presentation support we'd like to give. And of course, for the stocking, how does one render that in text-to-speech? (Ideally, a human would say, "and the following text is shaped on the paper in the form of a stocking...".) Jon Noring From ralf at ark.in-berlin.de Wed Oct 17 03:32:30 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Wed, 17 Oct 2007 12:32:30 +0200 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <47150207.9010203@verizon.net> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> <447877789.20071016105253@noring.name> <47150207.9010203@verizon.net> Message-ID: <20071017103230.GA17928@ark.in-berlin.de> Wouldn't it be better, before discussing   to look what current browsers already do automatically when rendering text? 
I have the impression that Firefox already handles some of the cases but can't put the finger on it. ralf From jon at noring.name Wed Oct 17 12:09:03 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 13:09:03 -0600 Subject: [gutvol-d] pushing the merry-go-round In-Reply-To: References: Message-ID: <6710107542.20071017130903@noring.name> Bowerbird wrote: > my spam folder is overflowing! I bet. LOL. > it looks like lee is pushing the noring merry-go-round, > and taking the whole listserve for a ride... :+) Bringing in names is irrelevant in this discussion. Calling it a "merry-go-round" is fine, but adding a name is a form of disparagement that creates a hostile discussion environment. Some may even classify what you are saying as a form of hate speech by your focusing on the individual rather than on the thoughts and ideas being discussed. > i'm glad i got off that thing, or i would be _very_ dizzy. No comment, but I am so tempted... > meanwhile, my .pdf converter is starting to look good! Great! Btw, did you not realize that there does seem to be consensus that your ZML is insufficient for the purpose you are promoting it for? And reasons have been given? (I have a couple other reasons but have not taken the time to elaborate on them.) You can demonstrate all the conversion "toolz" you want, but the core input itself is insufficient for PG/DP purposes of a "master" format, therefore whatever you demonstrate in output will be wasted effort. You probably disagree, and the best way to prove us wrong is to get a list of 10 to 100 representative PG texts that the PG/DP experts suggest, format them in ZML, then see what you can do with them. (Of course, a couple of the books have to include block quotes, and maybe notes inside of notes.) Of course, I've repeatedly called for the PG/DP folk to recommend a list (and this is another plea), but no one has.
Either no one is reading my messages (I'm getting a lot of replies on my messages so it can't be that), or they simply don't want to provide *you* with such a list. If they don't want to, I wonder why? Jon Noring From jon at noring.name Wed Oct 17 12:11:02 2007 From: jon at noring.name (Jon Noring) Date: Wed, 17 Oct 2007 13:11:02 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <20071017103230.GA17928@ark.in-berlin.de> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> <447877789.20071016105253@noring.name> <47150207.9010203@verizon.net> <20071017103230.GA17928@ark.in-berlin.de> Message-ID: <1819197792.20071017131102@noring.name> Ralf wrote: > Wouldn't it be better, before discussing   to look what > current browsers already do automatically when rendering text? > > I have the impression that Firefox already handles some of the > cases but can't put the finger on it. Yes, this is interesting. In many respects it is the quality and sophistication of the typographic rendering engine used. Jon Noring From Bowerbird at aol.com Wed Oct 17 12:47:49 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Oct 2007 15:47:49 EDT Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads Message-ID: piggy said: > I have some easy novels similar to novels already in PG which I am > willing to make available to someone willing to run such an experiment. why work on simple books? you need to complete a _test-suite_ which will give you the confidence that you need to proceed fully. > Can we even get enough interest from formatters to complete one such book? wouldn't it be nice if some of the .tei advocates took on this task? really, if they won't even do this little bit of work, why should you?
the advice is exactly the same as what i gave to that gentleman who recently proposed the e-texts be repurposed in the opendoc format: mount a mirror of the library where the e-texts are in such a format and -- if the format is really useful to people -- your mirror will prevail. convert the e-texts that are released daily, then work on the backlog. the .tei people want _someone_else_ to do the work. (and i don't blame 'em, because .tei is a lot of work...) *** piggy said: > I'm experimenting with TEI because I have been frustrated with > getting both HTML and text editions that are consistent with each other > and make me feel good about having produced them. when you tire of experimenting with .tei, take a look at z.m.l. since the .html is auto-generated from the text version, they will stay consistent. and you will feel good about the quality. -bowerbird From Bowerbird at aol.com Wed Oct 17 12:49:38 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 17 Oct 2007 15:49:38 EDT Subject: [gutvol-d] taking my own advice Message-ID: so, i'm now capable of easily churning out the conversion of a couple of hundred pg-ascii e-texts into z.m.l. at a stretch... so i expect to be releasing 1,000 at a time in the near future, in the building of my z.m.l. mirror of the p.g. library. game on. if anyone has any preferences as to _where_ i should start in the p.g. library, please say so, either here or backchannel (e.g., start at 12000, or start at 16,500, or so on)... -bowerbird
From rolsch at verizon.net Wed Oct 17 13:18:47 2007 From: rolsch at verizon.net (Roland Schlenker) Date: Wed, 17 Oct 2007 16:18:47 -0400 Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads In-Reply-To: <121591611.20071017101331@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <47160AC5.2060007@netronome.com> <121591611.20071017101331@noring.name> Message-ID: <200710171618.47926.rolsch@verizon.net> On Wednesday 17 October 2007 12:13:31 pm Jon Noring wrote: > La Monte H.P. Yarroll wrote: > > I think we should be able to make a good case for TEI without making > > unnecessarily broad claims. Whatever we do today WILL be inadequate > > for some future user. Let's do what we can to give them a good > > starting point. > > This is an excellent comment. > > The common understandings we have in this set of TEI-related threads > are the following: > > 1) Each text project will use a known source book, and the final > digitized text, in whatever form, will be "accurate" to that source > book, and will include metadata referencing that source book. > (Note that in this statement "accurate" remains undefined.) > > 2) Each text project will always make available the source book > scanset in (at least) sufficient quality for OCR, human proofing, > verifying text accuracy by end-users, and discerning the original > typography. (I believe every scanset should be archival quality > but this is an issue not germane to this particular discussion.) IMO, this is very important. It will always allow someone to refer back to the original source material. It also will allow someone to further refine the TEI file or even to produce a totally different file using a totally different markup method. > > 3) Each text project will produce a "digital master" from which all > user renditions, and other types of uses, will be derived.
> > 4) The "digital master" will be an XML document marked up with some > "flavor" of TEI. > > [Note: There may be a couple other common understandings that I've not > included in the above list, and certainly mention them if you think of > them. But I think this is a good starting point of where I believe > most of us participating in these threads agree with. However, if we > don't have super-majority agreement on the above four items, then the > gap in views is wider than I suspected, and I doubt we can ever get to > any agreement at all on the specifics of implementation if we can't > even agree on the general principles. I'll assume in the comments > below that we have collective majority agreement on the above general > understandings.] > > > What is obvious, though, is that these common understandings are not > of sufficient completeness that the specifics of implementation become > crystal clear -- they just don't fall out. So that's the reason for > our discussions, to clarify each understanding and maybe also add to > the list. And this is proving difficult because we tend to fall into > different "camps" as Josh, and then Lee, so eloquently explained. > > Alright, now to provide maybe a little more on the above from my > perspective... > > Obviously, a dream we all have is that the "master" will have all > that is needed to allow push-button auto-conversion, using today's > technology, for *all conceivable renditions and uses* we can ever > imagine. But the reality is that this is unreasonable and probably > impossible. I think we do agree on this. > > Thus, I see a "master" as a sort of intermediary which captures the > most important information common to all conceivable uses, and maybe > with some added support for some select uses. Thus the "master" > becomes, as La Monte says, a good starting point. 
> > I believe, then, that we come to some agreement about what minimum > information the master should capture, and what form the information > will take in the master. This is where we disagree on the specifics. > > [As an aside, hopefully we can work towards some general set of > requirements to aid in decision making -- this is what any competent > engineering organization does in project development. But we haven't > yet taken an objective requirements approach to this, and if we don't, > then agreement will never be reached. Rather it will become a > Darwinian race between various factions to see whose views prevail, > and contrary to what others may otherwise think, oftentimes such races > do not lead to the best long-term result for the common vision we all > share. The best long-term result might happen, but then it is more > likely not -- it depends upon the views of those pushing their > particular solution.] > > Obviously, I think we can agree that if some information is pretty > trivial (effort-wise) to add to the master during mastering which is > useful to certain *recognized* end-uses, and such information does > not inhibit other important uses, then it makes sense to add it to > the master. We might, too, consider which end-uses are the most > important and make sure we have enough information to allow for full > auto-conversion for those uses, or get it very close to that level. > > So that may be a discussion thread: what are the most important user > renditions/uses we need to support above all others? > > (The one clinker in this last consideration has less to do with > typography and more to do with "accuracy" -- error correction. I > believe the master must, for a couple reasons I've noted previously, > faithfully preserve the original text in the source book, including > author's, publisher's, and typesetter's errors. And, yes, sometimes > decisions have to be made about what exactly to transcribe to be > "accurate" to the original. 
But this does not preclude marking up, in > the "master", corrections to such errors. Conversion systems for > some end-use purpose can then decide whether to use the "original text > warts and all", or use the corrections, or a particular set of > corrections since we should, I believe, allow for different sets of > corrections based on different perspectives or end-uses.) From the P5 specs: If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, both sic and corr are used, wrapped in a choice: ... marginal comments which indicate that the <choice><corr>dates</corr><sic>date's</sic></choice> mentioned in the main body of the text are incorrect. Question: DP proofing guidelines currently call for contractions to be closed up. Are we to mark this up as wouldn't would n't > > Anyway, I could go on, but I'll mercifully end this message. > > Thoughts? Additions to the general agreements list? > > Jon From rolsch at verizon.net Wed Oct 17 13:32:34 2007 From: rolsch at verizon.net (Roland Schlenker) Date: Wed, 17 Oct 2007 16:32:34 -0400 Subject: [gutvol-d] A proposed list of common understandings on the TEI mastering threads In-Reply-To: <371969742.20071017115351@noring.name> References: <20071001081923.GA29575@ark.in-berlin.de> <471641A3.2010704@netronome.com> <371969742.20071017115351@noring.name> Message-ID: <200710171632.34321.rolsch@verizon.net> On Wednesday 17 October 2007 1:53:51 pm Jon Noring wrote: > La Monte H.P. Yarroll wrote: > > David Starner wrote: > >> Eh. There are some volumes for which this is important. But if you > >> really want to go through French and Oriental Love in a Harem > >> (http://www.gutenberg.org/etext/21868) and pick out all the times > >> where the typesetter forgot which way u's go or ran out of b's and > >> started using h's, go ahead.
> > To answer David, one reason is for purposes of aligning the master > with the digital scans for future proofing, OCRing, etc. Plus, there > are scholars who may be interested in this. Certainly for these kinds > of texts which PG/DP is now doing, they will correct it *anyway*, so > we can certainly provide the original *and* the marked up corrections. > > That is, DP gets the original information, and then throws it away > with the corrections. Who says the original information needs to be > thrown away? > > (As an aside, I do believe the "master" must preserve the original > line breaks, even "in-word", since this is important for alignment > plus the event someone didn't correctly join an EOL broken word. In > addition, it will be useful for those who may wish to produce an exact > facsimile reproduction. After all, it is information that the OCR > or key entry gives us, so stripping it out *removes* information we > already have which may prove useful to both the project and some users > of the "master".) If original line breaks are to be retained, then EOL hyphenations are going to be retained. DP proofing guidelines currently call for EOL hyphenations to be closed up to the line above. IMO, I do not see that the DP proofing guidelines are going to be changed for this reason. > > > David brings out two very nice points: > > > > 1) Not every work deserves the same amount of effort. > > Exactly! > > > 2) In a volunteer project, each volunteer gets to decide what deserves > > their effort. > > > > I'm experimenting with TEI because I have been frustrated with getting > > both HTML and text editions that are consistent with each other and make > > me feel good about having produced them. > > This again is the idea of a "master", to capture the critical > information in the most processible manner so others who wish to > produce specific renditions meeting specific requirements have a > reasonable starting point to build upon.
Certainly we should strive > for auto-conversion for the most important renditions which are most > widely used, but other more specialized renditions may require some > human effort, and whether anyone will do the effort depends upon > whether the work is worth it to them. Apparently David Starner feels > that "French and Oriental Love in a Harem" is important enough for him > to produce a very fancy rendition of it, while to others that book > will not be considered important enough and the original page scans > and the "raw accurate text" sufficient for their needs... > > Jon Noring Roland Schlenker From joshua at hutchinson.net Wed Oct 17 14:11:14 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Wed, 17 Oct 2007 21:11:14 +0000 (UTC) Subject: [gutvol-d] pushing the merry-go-round Message-ID: <28703431.1192655474902.JavaMail.?@fh1064.dia.cp.net> >----Original Message---- >From: jon at noring.name > > ** on why no one has provided bowerbird with any list of texts to work from ** > >... they simply don't want to provide *you* with such a >list. If they don't want to, I wonder why? > Ding! Ding! Ding! We have a winner! Josh From marcello at perathoner.de Thu Oct 18 06:11:49 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 18 Oct 2007 15:11:49 +0200 Subject: [gutvol-d] [Fwd: Wiki2Tei converter 1.0!] Message-ID: <47175B95.8090905@perathoner.de> Wow! Cool! This sounds great! Rave on! -------- Original Message -------- Subject: Wiki2Tei converter 1.0! Date: Wed, 10 Oct 2007 20:33:30 +0200 From: Sylvain Loiseau Reply-To: Sylvain Loiseau To: TEI-L at LISTSERV.BROWN.EDU We are pleased to announce the first release of the Wiki2Tei software. Wiki2Tei is a converter from the mediawiki format to XML (TEI vocabulary). The mediawiki format is used by the Wikimedia Foundation wikis (Wikipedia, Wikibooks, Wikisource), and many other wikis using the mediawiki software.
Large amounts of free high-quality structured texts are available in this format. These texts are used more and more often in NLP (natural language processing) projects. However, the mediawiki parser is oriented towards rendition and the mediawiki syntax is complex and hard to parse. The Wiki2Tei converter makes available the information contained in wiki syntax (structure, highlighting, etc.), and allows the plain text to be properly retrieved. This conversion is intended to preserve all the properties of the original text. Wiki2Tei is closely coupled with the mediawiki software, allowing all the features of the mediawiki syntax to be converted. The Wiki2Tei converter provides a rich set of tools for converting mediawiki text from several sources (file, mediawiki database) and managing collections of files to be converted. The TEI vocabulary used is documented, according to the TEI Guidelines, in an ODD document. The code is open source and may be downloaded from the SourceForge download area: http://sourceforge.net/projects/wiki2tei/ http://sourceforge.net/project/showfiles.php?group_id=198407 The web site contains full documentation and a "demo": http://wiki2tei.sourceforge.net/ http://wiki2tei.sourceforge.net/demo/ A mailing list is open: https://lists.sourceforge.net/lists/listinfo/wiki2tei-users Best, Bernard Desgraupes, Sylvain Loiseau -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Thu Oct 18 10:13:31 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 18 Oct 2007 10:13:31 -0700 (PDT) Subject: [gutvol-d] taking my own advice In-Reply-To: References: Message-ID: Follow the instructions in "Alice". . . . On Wed, 17 Oct 2007, Bowerbird at aol.com wrote: > so, i'm now capable of easily churning out the conversion of > a couple of hundred pg-ascii e-texts into z.m.l. at a stretch...
> > so i expect to be releasing 1,000 at a time in the near future, > in the building of my z.m.l. mirror of the p.g. library. game on. > > if anyone has any preferences as to _where_ i should start > in the p.g. library, please say so, either here or backchannel > (e.g., start at 12000, or start at 16,500, or so on)... > > -bowerbird From piggy at netronome.com Thu Oct 18 10:27:36 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 18 Oct 2007 13:27:36 -0400 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <4714C787.3060609@perathoner.de> References: <6410280365.20071015125216@noring.name> <4714C787.3060609@perathoner.de> Message-ID: <47179788.6080801@netronome.com> Marcello Perathoner wrote: > All in all, using non-printing characters as markup tags must be the > most bone-headed design decision ever. > But I LIKE python! :-) From jon at noring.name Thu Oct 18 10:33:34 2007 From: jon at noring.name (Jon Noring) Date: Thu, 18 Oct 2007 11:33:34 -0600 Subject: [gutvol-d] How are block quotes handled in ZML? In-Reply-To: References: Message-ID: <684145262.20071018113334@noring.name> [Josh and others in PG/DP, do you know of specific PG texts which contain block quotes? And I'd love to see some where the block quote itself contains multiple paragraphs and maybe mixed with other structures like verse. Jon] Bowerbird wrote: > so, i'm now capable of easily churning out the conversion of > a couple of hundred pg-ascii e-texts into z.m.l. at a stretch... > > so i expect to be releasing 1,000 at a time in the near future, > in the building of my z.m.l. mirror of the p.g. library. game on. > > if anyone has any preferences as to _where_ i should start > in the p.g. library, please say so, either here or backchannel > (e.g., start at 12000, or start at 16,500, or so on)... Cool! As asked before, Bowerbird, how does ZML handle block quotes?
Many books have block quotes, which themselves usually contain one or
more *ordinary paragraphs* and may also contain pretty much anything
else (i.e., a block quote may be a "mini document" all to itself, such
as paragraphs, verse, and even other block quotes.)

According to Lee, ZML does not have a special rule to recognize a block
quote so it can be properly presented, which is that a paragraph in a
block quote needs to be reflowed upon presentation like any paragraph
would. (In fact, to be recognized as a paragraph and not lines of
verse.)

But I'd rather hear it from the ZML expert: how does ZML specifically
identify block quotes? If it can't, this is one show stopper for the
current ZML spec being a universal mastering format. If it can, then
we'd like to know, and of course maybe mention how in your "11 Rulez."

Jon Noring



From lee at novomail.net  Thu Oct 18 10:56:07 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 18 Oct 2007 11:56:07 -0600
Subject: [gutvol-d] How are block quotes handled in ZML?
In-Reply-To: <684145262.20071018113334@noring.name>
References: <684145262.20071018113334@noring.name>
Message-ID: <47179E37.9040304@novomail.net>

Jon Noring wrote:

> [Josh and others in PG/DP, do you know of specific PG texts which
> contain block quotes? And I'd love to see some where the block quote
> itself contains multiple paragraphs and maybe mixed with other
> structures like verse. Jon]

How about: http://www.gutenberg.org/files/16494/16494-h.zip?

[snip]

> According to Lee, ZML does not have a special rule to recognize a block
> quote so it can be properly presented, which is that a paragraph in a
> block quote needs to be reflowed upon presentation like any paragraph
> would. (In fact, to be recognized as a paragraph and not lines of
> verse.)

Well, I'm not sure I ever said that ZML didn't have a rule to recognize
a block quote (and if I did, I apologize). I only said if one exists
that I couldn't find any documentation of it.
The rule may exist in a local file, or in BB's brain, or somewhere else
I'm not aware of; I just can't find it.

-- 
Nothing of significance below this line.



From jon at noring.name  Thu Oct 18 11:10:40 2007
From: jon at noring.name (Jon Noring)
Date: Thu, 18 Oct 2007 12:10:40 -0600
Subject: [gutvol-d] How are block quotes handled in ZML?
In-Reply-To: <47179E37.9040304@novomail.net>
References: <684145262.20071018113334@noring.name>
	<47179E37.9040304@novomail.net>
Message-ID: <784139522.20071018121040@noring.name>

Lee wrote:
> Jon Noring wrote:

>> [Josh and others in PG/DP, do you know of specific PG texts which
>> contain block quotes? And I'd love to see some where the block quote
>> itself contains multiple paragraphs and maybe mixed with other
>> structures like verse. Jon]

> How about: http://www.gutenberg.org/files/16494/16494-h.zip?

LOL, yes, that's a good one. There you go, Bowerbird...


>> According to Lee, ZML does not have a special rule to recognize a block
>> quote so it can be properly presented, which is that a paragraph in a
>> block quote needs to be reflowed upon presentation like any paragraph
>> would. (In fact, to be recognized as a paragraph and not lines of
>> verse.)

> Well, I'm not sure I ever said that ZML didn't have a rule to recognize
> a block quote (and if I did, I apologize). I only said if one exists
> that I couldn't find any documentation of it. The rule may exist in a
> local file, or in BB's brain, or somewhere else I'm not aware of; I just
> can't find it.

Lee, I apologize for extrapolating what you said. Indeed you were
careful to say that if there is a ZML rule for "marking up" block
quotes, it is not yet documented.

To clarify again to Bowerbird. A "block quote" may itself be a whole
standalone document (which may contain paragraphs, other block quotes,
verse, letters, etc., etc.) which is quoted inside another document.
So ZML, for mastering purposes, must be able to handle this construct
as just described (essentially a "nesting" of documents.)

Of course, it is possible to come up with a rule using an unusual
combination of white space (tabs and spaces) to accomplish this, but
now we are really getting into other difficulties as Lee previously
mentioned. Differentiating tabs and spaces in many text editors is
oftentimes difficult, plus other problems in requiring counting spaces
and tabs.

Jon Noring



From Bowerbird at aol.com  Thu Oct 18 11:34:31 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Oct 2007 14:34:31 EDT
Subject: [gutvol-d] taking my own advice
Message-ID: 

michael said:
> Follow the instructions in "Alice". . . .

i asked michael backchannel what he meant. the reply:
> "Start at the beginning"

yeah, i should've mentioned that.

because the filenaming structures are in flux
with pre-10000 e-texts, i am not dealing with
those at first...

i'm not sure why y'all don't make the library
consistent in its filenames, but until you do,
i can't be bothered with it.

once i've got all the 10000+ files in z.m.l.,
i'll do the rest. but probably not until then.

-bowerbird



From lee at novomail.net  Thu Oct 18 12:30:36 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 18 Oct 2007 13:30:36 -0600
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <3466526.1192625429600.JavaMail.?@fh1064.dia.cp.net>
References: <3466526.1192625429600.JavaMail.?@fh1064.dia.cp.net>
Message-ID: <4717B45C.5050603@novomail.net>

You raise several interesting issues here, some of which I would like
to explore in more depth.

joshua at hutchinson.net wrote:

> Ok, all sarcastic commentary aside ...
> the default render for
> and gets the job done.
>
> ie.
>
> CHAPTER 1
> Missy Goes to Space
>
> renders with a large "CHAPTER 1" and a slightly smaller "Missy Goes to
> Space".

Actually, there really /isn't/ a default rendering for , just as there
is no default rendering for /most/ XML vocabularies. I think what you
are saying is that Mr. Perathoner's XSL script will automatically add
these rendering styles to its output, if that's the conversion method
you choose. I have been generally unsuccessful in discovering
documentation as to just what styles will be automatically added during
an XSLT transformation, unless you count the old unix programmer adage
that "the code is the documentation." Some simple documentation about
just what an end user should expect from the transformation would
probably be helpful.

On the other hand, it would probably be possible to /create/ a default
rendering for a PGTEI file using a CSS file. In this case you would
probably create a couple of rules such as:

div[type~="chapter"] head { font-size:150% }
div[type~="chapter"] head[type~="sub"] { font-size:120% }

> The only rend attribute necessary is a centering attribute
> (the same thing you'd need to add in an HTML document because
> isn't automatically centered). Now, just like HTML, you can center
> each individually () or you can
> put a line in the stylesheet section at the beginning telling it to
> center every element.

Or add

head { text-align:center }

to your own personal CSS file and just add to the beginning of every
TEI file you download (you would probably want to do the same thing for
the default CSS file as well).

> Honestly, for 99% of the stuff we see, the TEI code is no more complex
> than the HTML code equivalent.

This is /so/ true. The notion that TEI is difficult to use is a myth
promulgated by those people who are scared by the sheer size of the TEI
specification (and its reliance on DTDs, which are a truly foreign
language) or by people who have a vested interest in maintaining the
status quo or are promoting an alternative. The biggest stumbling block
to the adoption of TEI is the perception that it is hard. But it's all
a perception problem, because TEI is really no harder to use than any
other markup language, and a good deal simpler and more
straight-forward than ZML.

> It's just that it is DIFFERENT from the
> HTML code equivalent and therefore needs different tools/scripts if you
> want to automate any part of it.

Perhaps more importantly, it requires a different mind set. What You
See is not only /not/ What You Get, What You See Is Mostly Irrelevant.
On the other hand, when people rely on their brains instead of their
eyes things seem to just fall into place.

> Marcello has said, repeatedly, he's not interested nor has the time to
> write such tools and scripts. I've admitted I don't have the ability.
> Lee sounds like he has the ability, but perhaps not the time or
> inclination. If someone was willing to step up and start creating a
> tool/regex scripts/perl scripts/whatever, I'd be happy to work with
> them and I'm positive so would the couple of other people active in
> trying to work with TEI (Ralf and others from DP).
Check out http://www.passkeysoft.com/~lee/te12html. You could also
check out http://www.passkeysoft.com/~lee/antonia.xml to see what TEI +
CSS looks like (use Firefox or Opera, Microsoft is still learning how
to do XML right).

> Right now, the arguments are going to be fairly useless and circular
> in nature simply because we don't have the tools to take the process to
> the next level.

On the other hand, it's difficult to develop tools because there are so
few complex TEI texts available for testing, and so little feedback as
to just what kind of tools might be desired (I'm fairly certain there
are commercial WYSIWYG XML editors available, but you would have to
develop a default CSS file to make them work -- and TEI is not really a
WYSIWYG markup anyway).

One of the first things I learned in law school was that legal
technicalities are not nearly so important as the layman thinks.
Yesterday a colleague suggested that because his credit card says "not
valid unless signed," if a merchant accepted the unsigned card as
payment then he might be able to avoid paying the charge. A court would
blow by that argument so fast it would make your head spin. The fact
is, what matters is what a thing /is/, not what it is called.

Most of what is available from PG as "TEI", isn't. It's HTML with a
different set of tags. It's like the people who slap <pre> tags around
PG degraded text and call the result HTML. It's a legal technicality, 
that just doesn't fly in the real world. These files may validate 
against the TEI DTDs, but they don't contain the Tao of TEI.

We definitely have a chicken and egg problem here, but because TEI can 
be created fairly easily in a simple text editor, I'm inclined to favor 
the creation of TEI documents as the first step, as opposed to the 
creation of TEI manipulation tools. (That and the fact that converting 
presentationally-oriented markup to TEI is virtually impossible without 
a very powerful AI engine). It would be very useful to have some 
examples of real TEI to work with, and not just the pseudo-TEI which is 
so pervasive.
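
The default-CSS idea raised earlier in this message can be sketched in a
few lines of Python. This is only an illustration, not a description of
any existing PG or PGTEI tool: the function name add_stylesheet and the
file name pgtei-default.css are hypothetical, and the processing
instruction used is the standard W3C xml-stylesheet form, assumed here
to be what a browser such as Firefox would need. The three CSS rules
are the ones quoted in the message.

```python
# Hypothetical helper: attach a default CSS file to a downloaded TEI file
# by inserting an xml-stylesheet processing instruction at the top.

# The three rules quoted in the message, collected into one stylesheet.
DEFAULT_CSS = """\
div[type~="chapter"] head { font-size:150% }
div[type~="chapter"] head[type~="sub"] { font-size:120% }
head { text-align:center }
"""


def add_stylesheet(xml_text, href="pgtei-default.css"):
    """Return xml_text with a stylesheet processing instruction added.

    The instruction goes right after the XML declaration when there is
    one (the declaration must stay first), otherwise at the very top.
    The default href is an assumed, illustrative file name.
    """
    pi = f'<?xml-stylesheet type="text/css" href="{href}"?>\n'
    if xml_text.startswith("<?xml "):
        end = xml_text.index("?>") + 2          # end of the XML declaration
        return xml_text[:end] + "\n" + pi + xml_text[end:].lstrip("\n")
    return pi + xml_text


if __name__ == "__main__":
    print(add_stylesheet('<?xml version="1.0"?>\n<TEI.2><text/></TEI.2>'))
```

Saving DEFAULT_CSS under the chosen file name next to the TEI file
should be enough for a CSS-capable XML browser to pick it up.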

-- 
Nothing of significance below this line.



From jon at noring.name  Thu Oct 18 12:38:37 2007
From: jon at noring.name (Jon Noring)
Date: Thu, 18 Oct 2007 13:38:37 -0600
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical
	data
In-Reply-To: <4717B45C.5050603@novomail.net>
References: <3466526.1192625429600.JavaMail.?@fh1064.dia.cp.net>
	<4717B45C.5050603@novomail.net>
Message-ID: <34367422.20071018133837@noring.name>

Lee wrote:

> Most of what is available from PG as "TEI", isn't. It's HTML with a 
> different set of tags. It's like the people who slap <pre> tags around
> PG degraded text and call the result HTML. It's a legal technicality, 
> that just doesn't fly in the real world. These files may validate 
> against the TEI DTDs, but they don't contain the Tao of TEI.

"Tao of TEI"

I love this! I checked Google to see if it appears anywhere else and
it doesn't. It is a Lee Passey original and, again, I love it!

Jon



From joshua at hutchinson.net  Thu Oct 18 14:08:04 2007
From: joshua at hutchinson.net (joshua at hutchinson.net)
Date: Thu, 18 Oct 2007 21:08:04 +0000 (UTC)
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical
 data
Message-ID: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>

>----Original Message----
>From: lee at novomail.net
>
>On the other hand, it's difficult to develop tools because there are so
>few complex TEI texts available for testing, and so little feedback as
>to just what kind of tools might be desired (I'm fairly certain there
>are commercial WYSIWYG XML editors available, but you would have to
>develop a default CSS file to make them work -- and TEI is not really a
>WYSIWYG markup anyway).
>

Here is exactly what is needed to start things moving at DP (which is 
where critical mass *has* to occur):

1 - A script/utility/plug-in/whatever that takes DP formatted input 
and spits out a generic TEI encoded document.  

For instance

- paragraphs need to be enclosed in <p></p>.
- Poetry (/* */ enclosed in DP) needs to be converted to ... type
markup including the indention information (DP uses 2 spaces per indent
level).
- /# #/ gets converted to a blockquote markup.
- Chapter headings need to be used as dividers and enclosed with markup
properly.
- The DP page divider markup needs to be converted into markup so that
the original page break information is retained.
- [Illustration: Caption] markup needs to be converted to a generic
markup with Caption information.
- Convert the [1] type footnote markup (out of line) into inline TEI
markup
- Things that *are* presentational need to be handled: = italics
= small caps. = bold, etc etc. Just because semantic is
important ... doesn't mean you can ignore presentational.

2 - A utility/program/web form that allows easy entry of book meta
data and then spits out a happy little teiHeader. That thing is a
bloody nightmare to encode by hand.

***

We get that much going and DP will start using it. Keep it all open
source and it's much easier to extend it as needed. And it not only
doesn't have to be a WYSIWYG editor ... it doesn't have to be an editor
at all. Just a widget that plugs in DP text in one end and spits out
TEI on the other.

The first utility doesn't have to be perfect. Just good enough so
that a manual second pass can catch the "weird" stuff.

I also don't think we should worry about getting all the heavy lifting
done up front. Get something that covers the "easy" fiction type books
first and then move on from there. People will throw easy stuff at it
at first to get comfortable ... then start adding more and more
difficult stuff. I don't expect someone to throw a scholarly critique
of Middle English poetry on a first go. But the Campfire Girl Take
Another Unsupervised Trip Where They Shouldn't Be ... heck, yeah.
Let's mark it up!

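
To make item 1 of the list above concrete, here is a minimal Python
sketch of such a widget. It is an illustration under stated
assumptions, not DP's or PG's actual converter: the function name
dp_to_tei is hypothetical, only a few DP conventions are handled (blank
lines between paragraphs, /* */ poetry with 2 spaces per indent level,
/# #/ block quotes, [Illustration: Caption], and <i> italics), and the
choice of TEI targets (<p>, <lg>/<l>, <quote>, <figure> with <head>,
<hi rend="italic">, and the rend="indent(n)" convention) is my own
assumption rather than anything PGTEI prescribes.

```python
import re
from xml.sax.saxutils import escape

# Matches a whole-line DP illustration marker, e.g. [Illustration: A ship]
ILLUSTRATION = re.compile(r"^\[Illustration: (.*)\]$")


def inline(text):
    """Escape XML special characters, then map DP <i>...</i> to TEI <hi>."""
    text = escape(text)
    text = text.replace("&lt;i&gt;", '<hi rend="italic">')
    return text.replace("&lt;/i&gt;", "</hi>")


def dp_to_tei(dp_text):
    """Convert a small subset of DP-formatted text into TEI body markup."""
    out, para = [], []
    in_poetry = False

    def flush():
        # Emit the paragraph collected so far, rejoining its wrapped lines.
        if para:
            out.append("<p>" + inline(" ".join(para)) + "</p>")
            para.clear()

    for raw in dp_text.splitlines():
        line = raw.rstrip()
        marker = line.strip()
        if marker == "/#":                      # block quote opens
            flush()
            out.append("<quote>")
        elif marker == "#/":                    # block quote closes
            flush()
            out.append("</quote>")
        elif marker == "/*":                    # poetry / no-wrap opens
            flush()
            out.append("<lg>")
            in_poetry = True
        elif marker == "*/":                    # poetry / no-wrap closes
            out.append("</lg>")
            in_poetry = False
        elif not marker:                        # blank line ends a paragraph
            flush()
        elif in_poetry:
            # DP uses 2 spaces per indent level; keep that as a rend hint.
            level = (len(line) - len(line.lstrip(" "))) // 2
            rend = f' rend="indent({level})"' if level else ""
            out.append(f"<l{rend}>{inline(marker)}</l>")
        elif (match := ILLUSTRATION.match(marker)):
            flush()
            out.append(f"<figure><head>{inline(match.group(1))}</head></figure>")
        else:
            para.append(marker)
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    print(dp_to_tei("A <i>short</i> paragraph.\n\n/#\nQuoted prose.\n#/\n"))
```

Page dividers, footnotes, chapter headings and the teiHeader of item 2
are left out on purpose; in the spirit of the message, anything the
sketch does not recognise simply falls through as paragraph text for a
manual second pass.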
Josh



From traverso at posso.dm.unipi.it  Fri Oct 19 02:30:53 2007
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 19 Oct 2007 11:30:53 +0200 (CEST)
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
	(joshua@hutchinson.net)
References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
Message-ID: <20071019093053.5C406101F0@posso.dm.unipi.it>

>>>>> "josh" == joshua at hutchinson net writes:

    josh> Here is exactly what is needed to start things moving at DP
    josh> (which is were critical mass *has* to occur):

    josh> 1 - A script/utility/plug-in/whatever that takes DP
    josh> formatted input and spits out a generic TEI encoded
    josh> document.

    josh> For instance .....

I warmly agree. I also think that we might SLIGHTLY modify DP markup
to allow better working of such a tool. This should NOT include using
anything other than a blank line for paragraphs (no <p> ... </p> in the
formatting stages) but might e.g. require some light markup for
chapters and sections, speakers and stage notes in drama, standardized
markup for corrections, etc.; of course, this has to be supported by
the DP proofing interface.

BTW, DP code already prepares a skeletal TEI from the txt files,
extremely limited and out of sync with the current guidelines. But it
can be used as a template for a better version.

Because of the need to mix DP coding and TEI coding this can only be
made by a DP team. If Josh is willing to help, I can give my
contribution for a prototype, including modifications to the DP proofing
interface to test in the DP test site.

A problem with this approach is the handling of markup errors; a
markup validation step should be included in the DP page saving
code. And a utility to convert DP-marked txt after the first phases
of post-processing is needed too. Interacting with guiguts might hence
be necessary, complicating the issues.

    josh> 2 - A utility/program/web form that allows easy entry of book meta
    josh> data and then spits out a happy little teiHeader. That thing is a
    josh> bloody nightmare to encode by hand.

Ditto, this can be auto-generated from the project data (and is
present in an insufficient form in the skeletal DP TEI), but the
project data form and project database have to be modified to keep all
the information needed.

I see this as less urgent and more difficult to provide at DP, but it
can be easily made with a simple tool if it is well specified.

Carlo



From ralf at ark.in-berlin.de  Thu Oct 18 09:24:29 2007
From: ralf at ark.in-berlin.de (Ralf Stephan)
Date: Thu, 18 Oct 2007 18:24:29 +0200
Subject: [gutvol-d] taking my own advice
In-Reply-To: 
References: 
Message-ID: <20071018162429.GA20062@ark.in-berlin.de>

You wrote
> if anyone has any preferences as to _where_ i should start
> in the p.g. library, please say so, either here or backchannel
> (e.g., start at 12000, or start at 16,500, or so on)...

In any case, include 22001 and 13006, please.



From ralf at ark.in-berlin.de  Fri Oct 19 03:41:36 2007
From: ralf at ark.in-berlin.de (Ralf Stephan)
Date: Fri, 19 Oct 2007 12:41:36 +0200
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <20071019093053.5C406101F0@posso.dm.unipi.it>
References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
	<20071019093053.5C406101F0@posso.dm.unipi.it>
Message-ID: <20071019104135.GA26101@ark.in-berlin.de>

Carlo wrote
> BTW, DP code already prepares a skeletal TEI from the txt files,
> extremely limited and out of sync with the current guidelines. But can
> be used as a template for a better version.

But very awkwardly, as one can see from my tutorial
http://www.pgdp.net/wiki/Post-Processing_With_PGTEI_0.4

and making the TEI template PGTEI 0.4 conformant would greatly
simplify this document! So count me in with the work.

Where is that script that produces that 0.3 output to be
found, anyway?

> A problem with this approach is the handling of markup errors; a

OTOH, it could make an automated verification of the foofing stage,
make a button for the foofers like the WordCheck one.


ralf



From joshua at hutchinson.net  Fri Oct 19 06:18:32 2007
From: joshua at hutchinson.net (joshua at hutchinson.net)
Date: Fri, 19 Oct 2007 13:18:32 +0000 (UTC)
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
Message-ID: <11915392.1192799912118.JavaMail.?@fh1039.dia.cp.net>

Count me in, Carlos. Tell me what you need and I'll put it together as
fast as I can. Tell me what needs testing and feedback and I'll get on
it.

I am at your disposal, sir.

Josh

>----Original Message----
>From: traverso at posso.dm.unipi.it
>Date: Oct 19, 2007 5:30
>To: ,
>Cc:
>Subj: Re: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
>
>>>>>> "josh" == joshua at hutchinson net writes:
>
>
>    josh> Here is exactly what is needed to start things moving at DP
>    josh> (which is were critical mass *has* to occur):
>
>    josh> 1 - A script/utility/plug-in/whatever that takes DP
>    josh> formatted input and spits out a generic TEI encoded
>    josh> document.
>
>    josh> For instance .....
>
>I warmly agree. I also think that we might SLIGHTLY modify DP markup
>to allow better working of such a tool. This should NOT include using
>anything else than blank line for paragraphs (no <p> ... </p> in the
>formatting stages) but might e.g. require some light markup for
>chapters and sections, speakers and stage notes in drama, standardized
>markup for corrections, etc.; of course, this has to be supported by
>the DP proofing interface.
>
>BTW, DP code already prepares a skeletal TEI from the txt files,
>extremely limited and out of sync with the current guidelines. But can
>be used as a template for a better version.
>
>Because of the need to mix DP coding and TEI coding this can only be
>made by a DP team. If Josh is willing to help, I can give my
>contribution for a prototype, including modifications to the DP proofing
>interface to test in the DP test site.
>
>A problem with this approach is the handling of markup errors; a
>markup validation step should be included in the DP page saving
>code. And an utility to convert DP-marked txt after the first phases
>of post-processing is needed too. Interacting with guiguts might hence
>be necessary, complicating the issues.
>
>
>    josh> 2 - A utility/program/web form that allows easy entry of book meta
>    josh> data and then spits out a happy little teiHeader. That thing is a
>    josh> bloody nightmare to encode by hand.
>
>Ditto, this can be auto-generated from the project data, (and is
>present in an insufficient form in the skeletal DP TEI), but the
>project data form and project database have to be modified to keep all
>the information needed.
>
>I see this as less urgent and more difficult to provide at DP, but can
>be easily made with a simple tool if it is well specified.
>
>
>Carlo



From traverso at posso.dm.unipi.it  Fri Oct 19 07:12:11 2007
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 19 Oct 2007 16:12:11 +0200 (CEST)
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <20071019104135.GA26101@ark.in-berlin.de> (message from Ralf
	Stephan on Fri, 19 Oct 2007 12:41:36 +0200)
References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
	<20071019093053.5C406101F0@posso.dm.unipi.it>
	<20071019104135.GA26101@ark.in-berlin.de>
Message-ID: <20071019141211.3B19E101E6@posso.dm.unipi.it>

>>>>> "Ralf" == Ralf Stephan writes:

    Ralf> Carlo wrote

    >> BTW, DP code already prepares a skeletal TEI from the txt
    >> files, extremely limited and out of sync with the current
    >> guidelines. But can be used as a template for a better version.

    Ralf> But very awkwardly, as one can see from my tutorial
    Ralf> http://www.pgdp.net/wiki/Post-Processing_With_PGTEI_0.4

    Ralf> and making the TEI template PGTEI 0.4 conform would greatly
    Ralf> simplify this document! So count me in with the work.

    Ralf> Where is that script that produces that 0.3 output to be
    Ralf> found, anyway?

The DP code is at http://dproofreaders.sourceforge.net , the file is
tools/project_manager/post_files.inc , and the function is
join_proofed_text_tei

Carlo



From traverso at posso.dm.unipi.it  Fri Oct 19 07:14:43 2007
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 19 Oct 2007 16:14:43 +0200 (CEST)
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <11915392.1192799912118.JavaMail.?@fh1039.dia.cp.net>
	(joshua@hutchinson.net)
References: <11915392.1192799912118.JavaMail.?@fh1039.dia.cp.net>
Message-ID: <20071019141443.339F9101E8@posso.dm.unipi.it>

>>>>> "josh" == joshua at hutchinson net writes:

    josh> Count me in, Carlos. Tell me what you need and I'll put
    josh> together as fast as I can. Tell me what needs testing and
    josh> feedback and I'll get on it.

    josh> I am at your disposal, sir.

    josh> Josh

I'll start a thread in the DP forum.

Carlo



From lee at novomail.net  Fri Oct 19 09:38:04 2007
From: lee at novomail.net (Lee Passey)
Date: Fri, 19 Oct 2007 10:38:04 -0600
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
Message-ID: <4718DD6C.6090207@novomail.net>

joshua at hutchinson.net wrote:

> Here is exactly what is needed to start things moving at DP (which
> is where critical mass *has* to occur):

I don't think so. Given the institutional inertia that DP has so far
exhibited, I don't think DP is any longer the proper venue for
innovation. I would suggest a smaller, nimbler group who can develop
new proofing processes without having to cater to outdated conventions.

> 1 - A script/utility/plug-in/whatever that takes DP formatted input
> and spits out a generic TEI encoded document.

The problem here is that by the time you have DP formatted input you're
already reduced to purely presentational markup. Converting this to TEI
will result in purely presentational TEI, which, while valid, doesn't
buy you anything. If you're going to be satisfied with purely
presentational markup a tool to input DP markup and convert to HTML
would be even better; or we could easily convert from DPML to ZML,
because we /know/ that the process of converting ZML to HTML is already
perfected. ;-)

I would suggest that the best approach would be to do away with DPML
altogether. Instead, we/I could write a script/program that would take
the HTML output from an OCR engine and convert it to (admittedly purely
presentational) TEI. For any markup which is questionable (such as )
the output should be intentionally made invalid, perhaps by the
addition of the attribute "class='invalid'" (there is no "class"
attribute anywhere in TEI). This oversimplified TEI would then go on to
the proofing rounds.

Now, the DP proofing guidelines would be re-written to replace DPML
with TEI. As Mr. Hutchinson has pointed out, TEI really is quite easy
to use. There's no reason a proofer couldn't use instead of /# #/; I,
myself, would find it easier because the TEI markup virtually describes
itself, whereas /# is completely arbitrary, and easy to confuse
mentally with /* (which we all know is how you mark up comments which
are not part of the text). Mr. Hutchinson's list, below, shows just how
easy it would be to re-write the DPML markup rules.

> For instance
>
> - paragraphs need to be enclosed in <p></p>.

And everything that is /not/ a paragraph needs to be enclosed in
something else.

> - Poetry (/* */ enclosed in DP) needs converted to ...
> type markup including the indention information (DP uses 2
> spaces per indent level).
> - /# #/ gets converted to a blockquote markup.
> - Chapter headings need to be used as dividers and enclosed with
> markup properly.
> - The DP page divider markup needs to be converted into markup so
> that the original page break information is retained.
> - [Illustration: Caption] markup needs to be converted to a generic
> markup with Caption information.
> - Convert the [1] type footnote markup (out of line) into inline TEI
> markup
> - Things that *are* presentational need to be handled: = italics
> = small caps. = bold, etc etc.

I wasn't aware that DP used "etc etc." as markup. What do you want it
converted to?

> Just because semantic is
> important ... doesn't mean you can ignore presentational.
>

And just because presentation is important ... doesn't mean you can
ignore semantics, particularly when the target markup language is
inherently semantic in nature. But this is exactly what you would get
if you don't engage a human brain in the process.

BTW, if I wanted to get a hold of a DPML file to experiment with, how
would I get one/several?

> 2 - A utility/program/web form that allows easy entry of book meta
> data and then spits out a happy little teiHeader. That thing is a
> bloody nightmare to encode by hand.
>

OK. Do you want it web based, or a stand-alone application? Please
provide a mockup of what you want it to look like.

> ***
>
> We get that much going and DP will start using it.

Well, get that much going, AND make it work end to end, AND produce a
significant number of "real" TEI texts, then DP MAY adopt it, perhaps
in parallel with its current work flow. It has been said that "big
ships do not turn in tight circles," and I think this is particularly
true of big ships with powerful engines and tiny rudders.

Personally, I'd be happy to get a process going that works end to end,
and let DP worry about whether it can be integrated into its own work
flow.

> Keep it all open
> source and it's much easier to extend it as needed. And it not only
> doesn't have to be a WYSIWYG editor ... it doesn't have to be an editor
> at all. Just a widget that plugs in DP text in one end and spits out
> TEI on the other.
>
> The first utility doesn't have to be perfect. Just good enough so
> that a manual second pass can catch the "weird" stuff.
>

In my mind, the first pass will be /all/ weird stuff, because you're
moving from a purely presentational paradigm to what is at least a
semantic/presentational hybrid. Perhaps the most useful tool would be
an application which would scan a file for everything that is
questionable (and I would rank every instance of and in that category)
and ask for human confirmation that the markup is, in fact, correct,
and offer reasonable alternatives if it is not.

> I also don't think we should worry about getting all the heavy lifting
> done up front. Get something covers the "easy" fiction type books
> first and then move on from there. People will throw easy stuff at it
> at first to get comfortable ... then start adding more and more
> difficult stuff. I don't expect some to throw a scholarly critique of
> Middle English poetry on a first go. But the Campfire Girl Take
> Another Unsupervised Trip Where They Shouldn't Be ... heck, yeah.
> Let's mark it up!
>

I agree. How do you eat an elephant? One bite at a time.

This idea is what prompted my original post in this thread, "The TEI
80/20 rule." What is the 20% of the TEI markup vocabulary that can be
used to cover 80% of the e-books we are trying to create? Let's promote
that. My suspicion is that once someone has become comfortable with the
20% cream, s/he will be able to easily dip into the 80% when it is
needed.

On the other hand, (as Tevye said, there's always an other hand) we
need to remember the admonition of Albert Einstein, to "make everything
as simple as possible, but not simpler." Things can always be made
simpler, but at a certain point the simplification starts to make it
impossible to meet the goal you are trying to achieve. I think that
allowing purely presentational TEI to be distributed outside of the
production chain is making things simpler than is possible. We must be
careful to not allow our quest for simplicity to render the process as
a whole futile.



From grythumn at gmail.com  Fri Oct 19 09:47:50 2007
From: grythumn at gmail.com (Robert Cicconetti)
Date: Fri, 19 Oct 2007 12:47:50 -0400
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
In-Reply-To: <4718DD6C.6090207@novomail.net>
References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net>
	<4718DD6C.6090207@novomail.net>
Message-ID: <15cfa2a50710190947n4fe02d27x233996bca243453a@mail.gmail.com>

On 10/19/07, Lee Passey wrote:
> BTW, if I wanted to get a hold of a DPML file to experiment with, how
> would I get one/several?

Pick any project in PP, click the download zipped text link, or for
that matter, the download zipped TEI link. You can grab stuff in
earlier states by using the download concatenated text button.

http://www.pgdp.net/c/tools/project_manager/projectmgr.php?show=search&state%5B%5D=proj_post_first_available&state%5B%5D=proj_post_first_checked_out&state%5B%5D=proj_post_second_available&state%5B%5D=proj_post_second_checked_out&n_results_per_page=100

You may need a DP account to access it.

R C



From joshua at hutchinson.net  Fri Oct 19 10:51:23 2007
From: joshua at hutchinson.net (joshua at hutchinson.net)
Date: Fri, 19 Oct 2007 17:51:23 +0000 (UTC)
Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data
Message-ID: <14789848.1192816283649.JavaMail.?@fh1039.dia.cp.net>

>----Original Message----
>From: lee at novomail.net
>
>joshua at hutchinson.net wrote:
>
>> Here is exactly what is needed to start things moving at DP (which
>> is where critical mass *has* to occur):
>
>I don't think so. Given the institutional inertia that DP has so far
>exhibited, I don't think DP is any longer the proper venue for
>innovation. I would suggest a smaller, nimbler group who can develop a
>new proofing processes without having to cater to outdated conventions.
>

You've just pretty much guaranteed failure.

If you don't want to work within the DP framework, that's up to you.
But our discussion is pretty much over at this point. Josh PS DP markup is NOT purely presentational. In fact, it is semantic in many ways, but that is an argument for another time. From donovan at abs.net Fri Oct 19 10:51:56 2007 From: donovan at abs.net (D Garcia) Date: Fri, 19 Oct 2007 13:51:56 -0400 Subject: [gutvol-d] =?iso-8859-1?q?Gimme_that_AI_now_Re=3A_The_TEI_80/20_r?= =?iso-8859-1?q?ule_-=09empirical=09data?= In-Reply-To: <20071019141211.3B19E101E6@posso.dm.unipi.it> References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net> <20071019104135.GA26101@ark.in-berlin.de> <20071019141211.3B19E101E6@posso.dm.unipi.it> Message-ID: <200710191351.57120.donovan@abs.net> On Friday 19 October 2007 10:12, Carlo Traverso wrote: > >>>>> "Ralf" == Ralf Stephan writes: > > Ralf> Carlo wrote > > >> BTW, DP code already prepares a skeletal TEI from the txt > >> files, extremely limited and out of sync with the current > >> guidelines. But can be used as a template for a better version. > > Ralf> But very awkwardly, as one can see from my tutorial > Ralf> http://www.pgdp.net/wiki/Post-Processing_With_PGTEI_0.4 > > Ralf> and making the TEI template PGTEI 0.4 conform would greatly > Ralf> simplify this document! So count me in with the work. > > Ralf> Where is that script that produces that 0.3 output to be > Ralf> found, anyway? > > the DP code is in http://dproofreaders.sourceforge.net , the file is > tools/project_manager/post_files.inc , the function join_proofed_text_tei > There is an open task at DP to correct these issues, but it has languished for lack of knowledgeable resources as well as being low-priority. Reference http://www.pgdp.net/c/tasks.php?f=detail&tid=426 for details on the known issues with that code. From Bowerbird at aol.com Fri Oct 19 11:54:31 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 19 Oct 2007 14:54:31 EDT Subject: [gutvol-d] taking my own advice Message-ID: ralf said: > In any case, include 22001 and 13006, please. 
sure thing ralf. it'll be good for me to do a play as an example... if you can make the scan-set for 13006 available -- the image-files seemed to be missing on o.l.s. -- i'd be happy to do both those after this weekend... as for 22001 -- a straightforward book of poetry -- the one obvious shortcoming -- vis a vis z.m.l. -- is that every line of poetry doesn't have a leading space, to indicate that each block of text is _not_ a paragraph -- and thus should not have its first line indented -- _and_ that all those lines should not be unwrapped... and this isn't just a "problem" when it comes to z.m.l. _every_ conversion program that rewraps the lines will have the identical problem with a file like this... so this is one of the worst problems in the library... it also appears, from a quick glance at the text-file without reference to the scans, that the _headings_ (in this case, poem titles) do not have the standard 4-blank-lines-before-and-2-blank-lines-after... -bowerbird > http://www.pgdp.org/ols/tools/display.php?nextpage=001.png& lastpage=120.png&numpages=120&book=4026e97f4b0fc ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071019/b3f26f84/attachment.htm From prosfilaes at gmail.com Fri Oct 19 14:16:44 2007 From: prosfilaes at gmail.com (David Starner) Date: Fri, 19 Oct 2007 17:16:44 -0400 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <4718DD6C.6090207@novomail.net> References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net> <4718DD6C.6090207@novomail.net> Message-ID: <6d99d1fd0710191416g3aa7d342if9ff3eeeb2cb6860@mail.gmail.com> On 10/19/07, Lee Passey wrote: > Converting this to TEI > will result in purely presentational TEI, which, while valid, doesn't > buy you anything. Repeating the claim doesn't make it true. 
Being able to specify page numbers, sidenotes, and footnotes as such is a huge advantage over HTML, where you can, with enough work, define page numbers and sidenotes in an opaque way, but never footnotes. > There's no reason a proofer couldn't use rend="display:block"> instead of /# #/; This shows a lack of experience with the system. We've seen footnote spelled a dozen different ways; the last thing we want to do is replace two characters with 20, especially when the 20 characters are just as confusing to someone who hasn't touched TEI-Lite, especially as most of our users haven't written XML or HTML. > I, myself, would find it > easier because the TEI markup virtually describes itself, where as /# is > completely arbitrary, The choice of " over ' or ? is completely arbitrary (well, ? isn't completely arbitrary, but you can't expect most of our users to know that.) Why rend="display:block" instead of rend=display:block or display=block or even just block, again from the point of view of someone to whom XML means nothing? > But this is exactly what you would get if > you don't engage a human brain in the process. What, insulting of people you want to convince? From lee at novomail.net Mon Oct 22 13:06:30 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 22 Oct 2007 14:06:30 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <6310459816.20071020133137@noring.name> References: <6310459816.20071020133137@noring.name> Message-ID: <471D02C6.9010201@novomail.net> Jon Noring wrote: > I plan to use the following for this: > > > ... > ... > > > (and of course various attributes will be applied...) I think that is the correct way, and, as I understand it, the only acceptable way under P5. > Anyway, one thing that interests me are your reasons for placing the > sole note in My Antonia where you did in the text, at the end of the > chapter it is referenced. 
> > I've been going through the pros and cons of three placements: > > 1) Inline at the point of occurrence, > > 2) At the end of the division they are referenced in (which you did) > > 3) At the end of the document in a stand-alone "note dump" section > (which for books with many notes would collect all the notes in one > place.) > > So what are your thoughts? I think the question implies much broader issues than what you see at face value, specifically in regard to two dichotomies that exist in TEI: those of past presentation versus future presentation, and implicit markup versus explicit markup. When I placed the note at the end of the chapter, I did it primarily to make the file display nicely when using CSS. I'm sure you understand CSS much better than I do, but I couldn't figure out any way, using CSS, to move a note from an inline presentation. The one thing we /don't/ want is for the note to be displayed at the same place it is referenced. Your questions prompted further reflection on my part, however. I asked myself, what is the purpose of footnotes/endnotes, and how are they traditionally presented. It seems to me that generally there are two kinds of footnotes, explanatory footnotes and bibliographic footnotes. Bibliographic footnotes are used to contain a reference to the source of a quotation or viewpoint, or to further useful information pertaining to the subject in the main text. Explanatory footnotes are used for additional information or explanatory notes that might be too digressive for the main text, as an alternative to a parenthetical comment. Footnotes are notes of text placed at the bottom of a page in a book or document. Endnotes are similar to footnotes, but differ in that rather than appearing at the foot of the particular page, they are collected together at the end of the chapter or at the end of the work. 
Rarely do you see notes actually embedded in the text; when you do they are called parenthetical expressions and are delimited from the surrounding text by parentheses. In print documents endnotes are considered more inconvenient than footnotes because of the need to move back and forth between the main text and the endnote section. I think it is for this reason that explanatory notes are typically presented as footnotes, so you can quickly glance at the note while still in the main text, and endnotes are typically bibliographic notes. In electronic documents, the practical distinction between footnotes and endnotes becomes less important. Assuming you have implemented the note with reciprocal pointers, it doesn't matter where in the file the note appears, except that you /don't/ want the note to appear in the main text where it will disrupt the flow of the text. In HTML there are a couple of ways to achieve this. You could collect all the notes at the end of the main file, or in a separate notes file. Each reference in the main text would point to a note using "href='#xxx'", and each note would point back to the reference in the main text. Alternatively, each reference in the main text could contain a "title" attribute which most browsers display when the mouse cursor "flies over" the reference. Or, the two options can be combined. Given the ease of navigating between the main text and the notes in electronic documents, there doesn't seem to be much need to try to force the note text onto the same screen "page" as the referring text, unless the electronic format is simply a precursor to a printed document, as with PDF. So why would you ever place the note inline at the place of the reference? Well, I think this brings us to the past/future dichotomy I mentioned earlier. Most commonly TEI is probably used as a transcription markup; that is, TEI is used to transcribe an existing printed work into an electronic format. 
But it is also possible to use TEI as a text mastering format. How certain TEI elements should be interpreted depends on which of these two uses you have chosen. For example, consider the <lb> element which, according to the P5 guidelines, "marks the start of a new (typographic) line in some edition or version of a text." You have indicated that in your "faithful" edition of My Ántonia you want to maintain a record of the original line breaks. For this you would use the <lb> element. But when presenting your edition you certainly don't want the User Agent to display these line breaks, unless the end user has explicitly declared that this is what s/he wants. On the other hand, in most PGTEI editions of other works the <lb> element is used to indicate where line breaks should occur when presenting the current editions. In other words, in your document <lb> is an indication of where line breaks appeared /in the past/ whereas in PGTEI editions <lb> is an indication of where line breaks should appear /in the future/. This same analysis applies to the presentation of a Title Page, which is one of the bones of contention on the Gutenberg discussion list. The <titlePage> element is used to transcribe how a title page appeared /in a past edition/, whereas the <titlePage> element is used in PGTEI to create a new, standardized title page /in a future edition/. We both know that Project Gutenberg is its own electronic publisher, and is not really concerned about archiving or preserving past editions of any particular work. Thus, the use of <lb> as a forced line break, and the use of <titlePage> to create a new title page, are completely appropriate uses of the markup. The implied "ed" attribute on the <lb> element is, essentially, "that edition which will be created when the PGTEI transformation script is run." Returning to the problem of notes, if your intended use is to master a future edition you may want to embed the note at the point in the main text where the reference occurs.
When you're actually composing the text, it's at that point during composition that the explanation or reference is close at hand. When you're editing or maintaining the text having the actual note embedded in the main text will help to be sure that edits to the text will not invalidate the reference or require alteration of the explanation. When the file is transformed into a presentation format the note can be moved to wherever is most appropriate for that particular format. During the transformation a new intra-document reference will have to be created because the relationship between an in-line note and its context is only implicit. On the other hand, if you are transcribing a work from an existing edition, and alteration of the text is not foreseen, I don't see how the justifications for creating in-line notes apply. One of the downsides to using embedded notes is that if you try to view the document using an appropriate Cascading Style Sheet the note text will remain visible in the middle of the noted text in the displayed document; yet moving text which is too digressive is exactly why an author or publisher used a footnote in the first place. We want our notes to be stored somewhere where they can be easily accessed, but only when we choose to do so, and displayed in a manner which will not disrupt the flow of the main text. Another of the problems of the using in-line notes, at least as you have used them (from my perspective) is that they imply a reference without explicitly creating one. This may just be my irrational bias, but I get really nervous with implied content; everything that can be made explicit, should be. For example, even if you were to leave the note in-line, you should probably include a reference at that point, making the linkage between the main text and the note explicit, e.g.:

I first heard of ÁntoniaÁntonia is strongly accented on the first syllable, like the English name Anthony, ... (As an alternative, you may want to omit a note marker and use the noted text itself, e.g.:

I first heard of Ántonia is strongly accented on the first syllable, like the English name Anthony, ...) Likewise, I feel that anytime there is a risk of confusion between "in the past" uses of the document and "in the future" uses, something needs to be added to the markup to make the use explicit. For example, every <lb> element that is intended to force a line break in all future presentations should have some indication that it is more than the simple description of the presentation of the source text that the guidelines envisioned; perhaps something like or . Given the purposes and use of footnotes, I think I have concluded that for TEI encoding of notes in transcribed texts I would follow these guidelines:
1. Never encode notes inline.
2. Encode explanatory notes immediately after the paragraph in which they are referenced.
3. Encode bibliographic notes in blocks, at the end of the document in the <back> element if they are not extensive, or at the end of each chapter if they are.
4. If bibliographic notes are placed at the end of the text in the source material, place all notes in the <back> element.
5. If the notes are a combination of explanatory notes and bibliographic notes, choose one method that best reflects their use in the text and use it exclusively. Mimicking the presentation in the source text is probably the best option.
6. Notes should include references back to the point in the text that referenced them.
If you want to play with a document that has extensive footnotes, both explanatory and bibliographic, see http://www.passkeysoft.com/~lee/911-Commission-Report.zip.
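The transformation described in this message (an inline note moved out of the running text, with explicit pointers created in both directions) can be sketched in a few lines. This is only an illustration under stated assumptions, not anything taken from PGTEI or the gnutenberg press: the function name move_notes_to_div_end, the bare "id" and "target" attributes, and the "n1"/"r1" numbering scheme are all hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET


def move_notes_to_div_end(div):
    """Replace each inline <note> in the division's paragraphs with a <ref>
    and append the note itself to the end of the division.

    Hypothetical conventions: notes get ids n1, n2, ...; references get
    ids r1, r2, ...; each points at the other through a "target" attribute.
    """
    counter = 0
    moved = []
    for para in div.findall("p"):
        for index, child in enumerate(list(para)):
            if child.tag != "note":
                continue
            counter += 1
            ref = ET.Element("ref", {"id": f"r{counter}",
                                     "target": f"#n{counter}"})
            ref.text = str(counter)
            ref.tail = child.tail          # keep the text that followed the note
            child.tail = None
            child.set("id", f"n{counter}")
            child.set("target", f"#r{counter}")  # pointer back to the reference
            para.remove(child)
            para.insert(index, ref)        # the reference takes the note's place
            moved.append(child)
    div.extend(moved)                      # notes now sit after the last paragraph
    return div


sample = ET.fromstring(
    "<div><p>I first heard of Ántonia<note>Ántonia is strongly accented on "
    "the first syllable, like the English name Anthony.</note> on a long "
    "journey.</p></div>"
)
print(ET.tostring(move_notes_to_div_end(sample), encoding="unicode"))
```

The same loop could append each note directly after its paragraph instead of collecting them at the end of the division; only the last step would change.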
From lee at novomail.net Mon Oct 22 14:14:22 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 22 Oct 2007 15:14:22 -0600 Subject: [gutvol-d] Gimme that AI now Re: The TEI 80/20 rule - empirical data In-Reply-To: <6d99d1fd0710191416g3aa7d342if9ff3eeeb2cb6860@mail.gmail.com> References: <18948644.1192741684598.JavaMail.?@fh1036.dia.cp.net> <4718DD6C.6090207@novomail.net> <6d99d1fd0710191416g3aa7d342if9ff3eeeb2cb6860@mail.gmail.com> Message-ID: <471D12AE.5000801@novomail.net> David Starner wrote: > On 10/19/07, Lee Passey wrote: >> Converting this to TEI >> will result in purely presentational TEI, which, while valid, doesn't >> buy you anything. > > Repeating the claim doesn't make it true. Being able to specify page > numbers, sidenotes, and footnotes as such is a huge advantage over > HTML, where you can, with enough work, define page numbers and > sidenotes in an opaque way, but never footnotes. Rats! I wish you would have told me this 5 years ago. All this time I've been doing it without knowing it was impossible. Just think how much time I could have saved. >> There's no reason a proofer couldn't use > rend="display:block"> instead of /# #/; > > This shows a lack of experience with the system. We've seen footnote > spelled a dozen different ways; the last thing we want to do is > replace two characters with 20, especially when the 20 characters are > just as confusing to someone who hasn't touched TEI-Lite, especially > as most of our users haven't written XML or HTML. Well, you obviously have a lower opinion of the average DPer than I do. I think your average volunteer wouldn't have any trouble at all with a new, and slightly more verbose, markup. And with DTD validation you should be able to get feedback on mistakes much quicker. >> I, myself, would find it >> easier because the TEI markup virtually describes itself, where as /# is >> completely arbitrary, > > The choice of " over ' or ? is completely arbitrary (well, ? 
isn't > completely arbitrary, but you can't expect most of our users to know > that.) Why rend="display:block" instead of rend=display:block or > display=block or even just block, again from the point of view of > someone to whom XML means nothing? /Everything/ is arbitrary from the point of view of someone who doesn't understand the context in which it is meaningful. The most recent version of PGTEI calls for the values of "rend" attributes to be valid CSS rules. Thus, "rend='display:block'" has meaning to everyone who understands CSS, or someone inquisitive enough to question what the underlying meaning might be, yet it is no more arbitrary than "/# #/" to someone who does not understand CSS. It may not have meaning to all people in all contexts, but it does have meaning to some people in some contexts, which is better than having no meaning at all to anyone. >> But this is exactly what you would get if >> you don't engage a human brain in the process. > > What, insulting of people you want to convince? It is regrettable that you have chosen to be offended by my comments. Perhaps I was not clear enough in expressing my opinion that it is impossible to add meaning (semantics) to a file in an automated way. It /requires/ a human in the process. I'm not suggesting that humans are involved, but not engaging their brains, I'm suggesting that any process that excludes human involvement (i.e. automated scripts) cannot succeed in this particular task (although I'm open to being convinced of the contrary; in fact, I would /love/ to be convinced of the contrary). In any event, I'm not trying to convince anyone of anything. I'm just gathering information, and trying to share what I have gathered and concluded with others. I have no illusions about my ability to affect the institutional inertia of DP, or Michael Hart's commitment to anarchy. If someone finds what I am saying interesting, or wants to explore the ideas further, that's great. 
If not, I'm not troubled; I have no expectation that what I say will be wildly popular. -- Nothing of significance below this line. From piggy at netronome.com Mon Oct 22 17:16:10 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 22 Oct 2007 20:16:10 -0400 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D02C6.9010201@novomail.net> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> Message-ID: <471D3D4A.2080909@netronome.com> Lee Passey wrote: > Jon Noring wrote: > > >> I plan to use the following for this: >> >> >> ... >> ... >> >> >> (and of course various attributes will be applied...) >> > > I think that is the correct way, and, as I understand it, the only > acceptable way under P5. > That's a shame as I really like the asymmetry in correction recommended in P4. The P5 solution is so painfully verbose. From jon at noring.name Mon Oct 22 17:35:07 2007 From: jon at noring.name (Jon Noring) Date: Mon, 22 Oct 2007 18:35:07 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D3D4A.2080909@netronome.com> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> Message-ID: <1795766454.20071022183507@noring.name> La Monte H.P. Yarroll wrote: > Lee Passey wrote: >> Jon Noring wrote: >>> I plan to use the following for this: >>> >>> >>> ... >>> ... >>> >>> >>> (and of course various attributes will be applied...) >> I think that is the correct way, and, as I understand it, the only >> acceptable way under P5. > That's a shame as I really like the asymmetry in > >correction recommended in P4. The P5 > solution is so painfully verbose. Hmmm, in actually marking up the error corrections to "My Antonia", I ran into the issue of placing "corrected" text into attribute values, which leads to some problems under certain circumstances which I won't get into here. 
It is much better if the corrected text is also PCDATA in its own markup (and if needed may include TEI tags which is one of the problems with using attribute values.) The P5 solution is better, imho, and very clean to understand and follow. In addition, with the P5 system, we can do the following: ... ... ... ... Thus if we have some differences of opinion as to what a correction should be, we can handle those multiple suggestions using the P5 method. As I look more at the element, the more I like it for mastering purposes. Of course, just imho. Jon Noring From Bowerbird at aol.com Mon Oct 22 19:44:20 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 22 Oct 2007 22:44:20 EDT Subject: [gutvol-d] nice weekend Message-ID: what a nice weekend, blessedly free of t.e.i. posts in my spam folder... i suppose this means the .tei guys have gotten their stuff figured out, which is good, because maybe they'll make pudding instead of spam. :+) meanwhile, i've been working on my .zml-to-.pdf conversion. yeah, .pdf still stinks, for the most part, but if i'm gonna do it, i want to do it right. you might recall that .zml has been able to create a nice-looking .pdf for some time now. indeed, if you go over and look at this page here: > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/ you'll see that i showed you versions of a .pdf of "alice in wonderland" way back in september of 2005. two years ago! seems like yesterday! (ok, maybe not yesterday. but it's hard to believe it was 2 years ago; i can still remember how marcello squealed like a stuck pig about it. then again, it was 3 years ago boston was down 3-0 to the yankees.) anyway, at that time, i had not yet installed the nifty navigational links that i consider to be crucial in an electronic-book. i've now done that. so now the .pdf doesn't just look good, it's also high-powered as well. you can now download a demo .pdf, created using my p.g. 
test-suite: > http://z-m-l.com/go/suite/test-suite-demo.pdf the z.m.l. file -- from which this .pdf was auto-generated -- is here: > http://z-m-l.com/go/suite/test-suite-demo.zml you'll find, on page 2, a table of contents that is thoroughly hotlinked. moreover, every chapter-heading links _back_ to this table of contents, which means it's very easy for a reader to get an overview of the e-book. in addition, every chapter-heading has links in the upper-corners that will conveniently transport you to the previous/next chapter-heading... i found this capability _so_ useful that i put these links on _every_ page. when i read, i often look ahead -- to see where the next chapter starts -- so i know whether to finish up the current chapter, or stop where i am. and the "jump to the next chapter" button comes in very handy for that... the links in the bottom-corners transport you to the previous/next page. so you can quickly and easily navigate a whole document with the mouse. in addition, the "internal" links are installed as well. so -- for instance -- when there is a back-reference in chapter 10 back to chapter 2, it is a hotlink so you can just click on the back-reference to jump to chapter 2. any reference to a chapter-heading automatically becomes a link in .zml. (if you have a 2-part chapter-heading, _either_ part will satisfy this rule, so you can have a reference to "chapter 2" or "the sections of the book") i've also installed links for the footnotes, so you can jump back and forth between the footnote anchors in the text and their footnotes at the end... in addition, i've also used the .pdf "note" capacity to "pop up" the footnote right on the page where it occurs, when you put your cursor over the note. eventually i will offer users the ability to have the footnote actually printed on the page itself, for when they want to print out hard-copy, but for now, i think the double-functionality of links and pop-ups should be sufficient. 
i also install links for each u.r.l. that's mentioned. the adobe reader will do this automatically, but only for an end-user who activates that option, and some might have it turned off, so i have my converter do it as well... i also install links to the u.r.l. for each picture, if that was given in the z.m.l. (i have a few such pictures in the test-suite, so you can see how they work.) which brings up the shortcomings of this demo. the pictures are not yet being positioned in an optimal way, so you will have to forgive that glitch... also, i didn't take the time to figure out how to do text styling in this demo, so the italics are being rendered as bold, and the spacing is kind of weird... also, all my links have visible rectangles drawn around them, which is putrid. (it should be an option that the end-user can turn 'em off or on, as desired.) finally, i used helvetica for this demo. some people _love_ helvetica, but for the life of me, i don't know why. i like sans serif, but helvetica is ugly, so i apologize for offending your sensibilities if you can't stand helvetica. *** for a long time now, the price for anyone wanting to buy the rights to z.m.l. has been "six figures". and i think it was a couple years back that i raised it to $200,000 minimum. now, with fairly solid conversions to .html and .pdf, an offline-standalone authoring tool, and a to-be-announced-quite-soon web-based authoring tool, plus viewer-apps, i'll be raising the price again... as of november 1, 2007, the price for full rights to z.m.l. will be $350,000. since this is just 10% of what amazon paid for mobipocket, it's a fair price... preference will be given to buyers who will make the package open-source, and such buyers can negotiate for a substantial discount, maybe up to 50%... of course, you know, you could just figure it all out for yourself. it's simple... 
-bowerbird From lee at novomail.net Mon Oct 22 20:49:18 2007 From: lee at novomail.net (Lee Passey) Date: Mon, 22 Oct 2007 21:49:18 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D3D4A.2080909@netronome.com> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> Message-ID: <471D6F3E.30004@novomail.net> La Monte H.P. Yarroll wrote: > Lee Passey wrote: >> Jon Noring wrote: >> >>> I plan to use the following for this: >>> >>> <choice> >>> <sic>...</sic> >>> <corr>...</corr> >>> </choice> >>> >>> (and of course various attributes will be applied...) >>> >> I think that is the correct way, and, as I understand it, the only >> acceptable way under P5. > > That's a shame as I really like the asymmetry in <corr sic="error">correction</corr> recommended in P4. The P5 solution is so > painfully verbose. From a purely aesthetic point of view, I tend to agree. But as some people on this list may recall, I believe that it is not only possible, but desirable, to create TEI files which can render natively in CSS-aware browsers, such as Opera and Firefox. I haven't figured out how (if it's even possible) to get an attribute value to be displayed as part of the text using CSS, but with the new P5 structure it should be possible to do: sic { display:none } to see only the corrected text, or corr { display:none } to see only the uncorrected text. What's really kind of fun is that you could get output like A thoroughly modem [sic] Millie from A thoroughly <choice><sic>modem</sic><corr>modern</corr></choice> Millie by using this: corr { display:none } sic:after { content: " [sic]" } so all in all I think the change is positive, even if not quite as elegant as the P4 version.
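The three views discussed in the message above (corrected text only, uncorrected text only, uncorrected text flagged with "[sic]") can also be produced outside a browser. The following is a minimal sketch, assuming the P5-style sic and corr elements named in the thread; the function name render and the view labels "corrected", "original" and "sic" are made up for this example and are not part of any TEI tool.

```python
import xml.etree.ElementTree as ET


def render(elem, view="corrected"):
    """Flatten an element to plain text, hiding <sic> or <corr> the way
    the display:none rules in the thread would.

    view="corrected": show <corr>, hide <sic>
    view="original":  show <sic>, hide <corr>
    view="sic":       show <sic> followed by " [sic]", hide <corr>
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "sic":
            if view == "original":
                parts.append(render(child, view))
            elif view == "sic":
                parts.append(render(child, view) + " [sic]")
        elif child.tag == "corr":
            if view == "corrected":
                parts.append(render(child, view))
        else:
            parts.append(render(child, view))
        # The tail is text of the parent, so it is kept even when the
        # child itself is hidden.
        parts.append(child.tail or "")
    return "".join(parts)


para = ET.fromstring(
    "<p>A thoroughly <choice><sic>modem</sic><corr>modern</corr></choice>"
    " Millie</p>"
)
for view in ("corrected", "original", "sic"):
    print(view, "->", render(para, view))
```

Because the erroneous and corrected readings are both element content rather than attribute values, either one may carry further markup of its own and still pass through the same function.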
From jon at noring.name Mon Oct 22 23:28:14 2007 From: jon at noring.name (Jon Noring) Date: Tue, 23 Oct 2007 00:28:14 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D6F3E.30004@novomail.net> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> Message-ID: <05211275.20071023002814@noring.name> Lee wrote: > I haven't figured out how (if it's even possible) to get an attribute > value to be displayed as part of the text using CSS. This, I believe, is not possible with CSS. It is usually (but not always) a bad idea to place content into an attribute value. And, if I am interpreting the XML spec correctly (see Sec. 3.1), for XML well-formedness we cannot have start or end tags in an attribute value, either direct or indirect: "Well-formedness constraint: No < in Attribute Values "The replacement text of any entity referred to directly or indirectly in an attribute value MUST NOT contain a <." (That means, I believe, one may not use "<", "&lt;", "&amp;lt;", etc., in attribute values. Nor the numeric character entity equivalent to "<". No "<" in any form. Period. Thus, there is no way by XML well-formedness rules to include start/end tags in attribute values.) The <choice> technique is actually very elegant, since it allows a lot more flexibility and power, such as choosing between multiple corrections, including certain markup in the corrected text, and using CSS for desired visualization.
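The well-formedness point is easy to check with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are invented for the illustration. As far as this parser is concerned, a literal "<" inside an attribute value is rejected, while the escaped form is accepted but arrives as ordinary characters rather than as elements. Either way no tag structure can live inside the attribute, which is the practical conclusion drawn above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# P4-style correction with markup placed directly in the attribute value.
raw = '<corr sic="<hi>modem</hi>">modern</corr>'

# Same idea with the angle brackets escaped.
escaped = '<corr sic="&lt;hi&gt;modem&lt;/hi&gt;">modern</corr>'

# P5-style correction: both readings are element content.
p5 = "<choice><sic><hi>modem</hi></sic><corr>modern</corr></choice>"

print("raw accepted:    ", is_well_formed(raw))
print("escaped accepted:", is_well_formed(escaped))

corr = ET.fromstring(escaped)
# The escaped value comes back as a plain string; no <hi> element exists.
print("attribute value: ", corr.get("sic"))
print("child elements:  ", len(list(corr)))

choice = ET.fromstring(p5)
# Here <hi> is a real element that a stylesheet or script can address.
print("hi inside sic:   ", choice.find("sic/hi").text)
```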
Jon Noring From marcello at perathoner.de Tue Oct 23 01:50:31 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 23 Oct 2007 10:50:31 +0200 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D6F3E.30004@novomail.net> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> Message-ID: <471DB5D7.7000304@perathoner.de> Lee Passey wrote: > I haven't figured out how (if it's even possible) to get an attribute > value to be displayed as part of the text using CSS corr:after { content: " " attr(sic); text-decoration: line-through; color: red; } The hard part is to do this using M$ browsers ... -- Marcello Perathoner webmaster at gutenberg.org From piggy at netronome.com Tue Oct 23 04:51:08 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 23 Oct 2007 07:51:08 -0400 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D6F3E.30004@novomail.net> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> Message-ID: <471DE02C.2040905@netronome.com> Lee Passey wrote: > La Monte H.P. Yarroll wrote: > > >> Lee Passey wrote: >> >>> Jon Noring wrote: >>> >>> >>>> I plan to use the following for this: >>>> >>>> >>>> ... >>>> ... >>>> >>>> >>>> (and of course various attributes will be applied...) >>>> >>>> >>> I think that is the correct way, and, as I understand it, the only >>> acceptable way under P5. >>> >> That's a shame as I really like the asymmetry in > sic="error">correction recommended in P4. The P5 solution is so >> painfully verbose. >> > > From a purely aesthetic point of view, I tend to agree. But as some > people on this list may recall, I believe that it is not only possible, > but desirable, to create TEI files which can render natively in > CSS-aware browsers, such as Opera and Firefox. 
> > I haven't figured out how (if it's even possible) to get an attribute > value to be displayed as part of the text using CSS, but with the new P5 > structure it should be possible to do: > > sic { display:none } to see only the corrected text, or > corr { display:none } to see only the uncorrected text. > > What's really kind of fun is that you could get output like > > A thoroughly modem [sic] Millie > > from > > A thoroughly modemmodern Millie > > by using this: > > corr { display:none } > sic:after { content: " [sic]" } > > so all in all I think the change is positive, even if not quite as > elegant as the P4 version. > OK, I'm convinced. Anybody care to create CSS which will show the in a as mouse-over on the ? From jon at noring.name Tue Oct 23 08:03:26 2007 From: jon at noring.name (Jon Noring) Date: Tue, 23 Oct 2007 09:03:26 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471DB5D7.7000304@perathoner.de> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> <471DB5D7.7000304@perathoner.de> Message-ID: <159627520.20071023090326@noring.name> Marcello wrote: > Lee Passey wrote: >> I haven't figured out how (if it's even possible) to get an attribute >> value to be displayed as part of the text using CSS > corr:after { content: " " attr(sic); > text-decoration: line-through; color: red; } Wow, thanks! My prior comment on this was wrong. I should have checked the CSS spec before saying it could not be done. I do know there are some things regarding value transfer between XML and CSS that cannot be done. > The hard part is to do this using M$ browsers ... I tested the above, and indeed IE does not recognize this CSS. However, the CSS works like a charm in Firefox and Opera. 
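For checking what that rule is supposed to display without depending on any particular browser, the same effect can be approximated outside CSS. This is a rough sketch in Python, assuming P4-style markup in which the corr element carries the original reading in a sic attribute; the function name render_corr, the show_sic flag, and the square brackets standing in for the red line-through are all inventions for this illustration, not part of TEI or of any existing tool.

```python
import xml.etree.ElementTree as ET


def render_corr(xml_text, show_sic=True):
    """Flatten a fragment to plain text.

    After the content of each <corr>, optionally append the value of its
    sic attribute, roughly what `corr:after { content: " " attr(sic) }`
    would add in a CSS-aware browser.
    """
    root = ET.fromstring(xml_text)
    parts = []

    def walk(el):
        parts.append(el.text or "")
        for child in el:
            walk(child)
            # Text following the child belongs to the parent's flow.
            parts.append(child.tail or "")
        # Generated content goes after the element's own content.
        if el.tag == "corr" and show_sic and el.get("sic"):
            parts.append(" [" + el.get("sic") + "]")

    walk(root)
    return "".join(parts)


# Hypothetical sample built from the "modem"/"modern" example in this thread.
SAMPLE = '<p>A thoroughly <corr sic="modem">modern</corr> Millie</p>'

if __name__ == "__main__":
    print(render_corr(SAMPLE))
    print(render_corr(SAMPLE, show_sic=False))
```

With show_sic=False the sketch corresponds to leaving the :after rule out of the stylesheet, so only the corrected text is shown.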
If the idea is to use CSS for visualizing TEI documents during the authoring process, where we don't care about end-user presentation, and don't care that we can't enable image embedding and hypertext links, then the fact that we can't use IE does not really matter. Jon Noring From lee at novomail.net Tue Oct 23 08:36:21 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 23 Oct 2007 09:36:21 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471DB5D7.7000304@perathoner.de> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> <471DB5D7.7000304@perathoner.de> Message-ID: <471E14F5.5060409@novomail.net> Marcello Perathoner wrote: > Lee Passey wrote: > >> I haven't figured out how (if it's even possible) to get an attribute >> value to be displayed as part of the text using CSS > > corr:after { content: " " attr(sic); > text-decoration: line-through; color: red; } Thank you. > The hard part is to do this using M$ browsers ... Well, I think we all know what the solution to /this/ problem is ... :-) -- Nothing of significance below this line. From lee at novomail.net Tue Oct 23 10:18:50 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 23 Oct 2007 11:18:50 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <447877789.20071016105253@noring.name> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> <447877789.20071016105253@noring.name> Message-ID: <471E2CFA.5060708@novomail.net> Jon Noring wrote: > Now, granted, I had not thought of this situation, even though I am > aware of it, since in *so many* PG (X)HTML texts I've looked at, >   is rampantly being abused, such as for indentation of > paragraphs and verse lines, etc. It's better to see   being used > only for keeping words together rather than forcing spacing in visual > presentation (since that is its purpose.) 
Yet, &nbsp; is still > something I am not fond of using in virtually any circumstance, > especially in that in most instances there is a markup solution, as > illustrated above. Playing with My Ántonia, I notice you have left the original typography in place with regards to contractions, e.g. "do n't". I find that frequently this will cause the "do" to end one line, and "n't" to start the following line. Don't you think this is an appropriate use for &nbsp; (e.g. "do&nbsp;n't")? After all, the nbsp stands for Non-Breaking SPace, which seems to be exactly what it is being used for here. -- Nothing of significance below this line. From ralf at ark.in-berlin.de Tue Oct 23 09:50:48 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Tue, 23 Oct 2007 18:50:48 +0200 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471DB5D7.7000304@perathoner.de> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> <471DB5D7.7000304@perathoner.de> Message-ID: <20071023165048.GA25910@ark.in-berlin.de> Marcello Perathoner wrote > Lee Passey wrote: > > I haven't figured out how (if it's even possible) to get an attribute > > value to be displayed as part of the text using CSS > > corr:after { content: " " attr(sic); > text-decoration: line-through; color: red; } Would that be acceptable behaviour for a TEI file for PG? I'm just in the process of providing a default stylesheet with the TEI file skeleton that is downloadable on DP project pages, and I'm asking myself if this should go in. 
ralf From jon at noring.name Tue Oct 23 11:03:29 2007 From: jon at noring.name (Jon Noring) Date: Tue, 23 Oct 2007 12:03:29 -0600 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <471D02C6.9010201@novomail.net> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> Message-ID: <1963269351.20071023120329@noring.name> Lee wrote: > Jon Noring wrote: >> Anyway, one thing that interests me are your reasons for placing the >> sole note in My Antonia where you did in the text, at the end of the >> chapter it is referenced. >> >> I've been going through the pros and cons of three placements: >> >> 1) Inline at the point of occurrence, >> >> 2) At the end of the division they are referenced in (which you did) >> >> 3) At the end of the document in a stand-alone "note dump" section >> (which for books with many notes would collect all the notes in one >> place.) > When I placed the note at the end of the chapter, I did it primarily > to make the file display nicely when using CSS. I'm sure you > understand CSS much better than I do, but I couldn't figure out any > way, using CSS, to move a note from an inline presentation. The one > thing we /don't/ want is for the note to be displayed at the same > place it is referenced. I've experimented using CSS in yanking and moving inline text to elsewhere in the document (such as to the left or right of the containing paragraph using the "float" property) during browser presentation, and it's not really worked to my satisfaction. (See CSS 2.1, Section 9: Visual formatting model, for the grab-bag of CSS properties that may be used: http://www.w3.org/TR/CSS21/visuren.html ) One can, of course, present the inline note as its own "block" dividing the main flow of the text (and differentiate it from the main, use "display: block;"). One can even handle multiple levels of notes (e.g., a note to a note.) Lee, I sent you a demo of doing this. 
It works quite nicely in Opera and Firefox, but IE wants to generate a new paragraph after the inline note is presented in its own block. Obviously for end-user presentation this is not a good way to present inline annotations, but for visualization to aid in the document markup process, it works quite well. To me, the only reason to apply CSS to the TEI master is for visualization during document markup and not for end-user purposes since we cannot embed images (other than forcing it by custom CSS which is sort of silly to do, in my opinion) and more importantly for hypertext links for tables of contents and referenced annotations (with Firefox we can enable hypertext links using XLink, but then that's not "vanilla" TEI markup -- we'd have to introduce xlink: namespace stuff into our TEI master documents -- not sure we want to go that direction. And we'd be restricted to Firefox browsers at the moment: Opera nor IE support any XLink -- Firefox support is only for hypertext links and not embedded images.) > Your questions prompted further reflection on my part, however. I > asked myself, what is the purpose of footnotes/endnotes, and how are > they traditionally presented. And this is a good thing to understand. :^) > It seems to me that generally there are two kinds of footnotes, > explanatory footnotes and bibliographic footnotes. Bibliographic > footnotes are used to contain a reference to the source of a quotation > or viewpoint, or to further useful information pertaining to the subject > in the main text. Explanatory footnotes are used for additional > information or explanatory notes that might be too digressive for the > main text, as an alternative to a parenthetical comment. Yes, it does seem like *referenced* annotations come in two basic flavors as Lee noted. An annotation may also mix the two. (I use the word "annotation" loosely here since it may include only a bibliographic reference -- maybe "amplificatory" is a better word?) 
I'm not sure we need to differentiate between the two in mastering, however, unless we plan, upon conversion to an end-user format, to separate the two types from each other so they are presented to the end-user in "different areas". But I'll try not to become religious on this issue until I hear more pros and cons. Hopefully DP folk will let us know of some books which use funky/ mixed/multiple ways to handle referenced annotations. It's the "outliers" that may help in decision-making when we have multiple options. > Footnotes are notes of text placed at the bottom of a page in a book > or document. Endnotes are similar to footnotes, but differ in that > rather than appearing at the foot of the particular page, they are > collected together at the end of the chapter or at the end of the > work. Rarely do you see notes actually embedded in the text; when > you do they are called parenthetical expressions and are delimited > from the surrounding text by parentheses. Summarizing what Lee said, we have essentially three general places where referenced annotations may occur in the original paper source: 1) Inline at the exact point or range of reference. 2) On the page the annotation is first referenced but not at the point of reference (an annotation of course can be referenced more than once.) Two general places we find them: a) Footnotes area. b) Sidebar area (right and/or left) 3) Gathered at the end of some document hierarchical level in which the references occur. Could be the same division, or as high as the top level of the book (the whole book -- here called endnotes.) Of course, some books may use any two or all three of these general locations to place referenced annotations. > In print documents endnotes are considered more inconvenient than > footnotes because of the need to move back and forth between the > main text and the endnote section. 
I think it is for this reason > that explanatory notes are typically presented as footnotes, so you > can quickly glance at the note while still in the main text, and > endnotes are typically bibliographic notes. > > In electronic documents, the practical distinction between footnotes > and endnotes becomes less important. Assuming you have implemented > the note with reciprocal pointers, it doesn't matter where in the > file the note appears, except that you /don't/ want the note to > appear in the main text where it will disrupt the flow of the text. This is an important point. For mastering purposes, one can certainly describe where an annotation was placed (for those wishing to create a facsimile reproduction), but the placement is almost always itself arbitrary and is not part of the content itself. It is usually a decision of the publisher/typesetter as to where to place annotations based on readability/usability for the *particular paper artifact*, the "Manifestation" of the "Expression", being produced (see below for a better explanation and reference to Manifestation and Expression.) > In HTML there are a couple of ways to achieve this. You could > collect all the notes at the end of the main file, or in a separate > notes file. Each reference in the main text would point to a note > using "href='#xxx'", and each note would > point back to the > reference in the main text. Usually an annotation is referenced only once from the main text, but we do find that annotations may be referenced multiple times in a book. In this case, when I've implemented "reference back" in XHTML, I've always pointed back to the first reference occurence. > Alternatively, each reference in the main text could contain a > "title" attribute which most browsers display when the mouse cursor > "flies over" the reference. Or, the two options can be combined. Lots of things can be done in the end-user renditions, but the question is what makes sense for the TEI master. 
Of course, making it easier to produce the most important end-use renditions may factor into what is done for the master (when we have multiple options for what we can do in the master, we then invoke various requirements to decide which among the various options to implement.) > Given the ease of navigating between the main text and the notes in > electronic documents, there doesn't seem to be much need to try to > force the note text onto the same screen "page" as the referring > text, unless the electronic format is simply a precursor to a > printed document, as with PDF. Agreed! The placement of referenced annotations is almost always (if not "always always") totally arbitrary in the original source book. For mastering we can record placement for those who want that information, and let those who produce end-user renditions to decide how to handle them (and whether or not to use the original placement information.) In essence, referenced annotations we find in old books is a primitive form of hypertext linking. > So why would you ever place the note inline at the place of the > reference? Well, I think this brings us to the past/future > dichotomy I mentioned earlier. Since Lee and I seem to agree that the placement of referenced annotations in the TEI master document is pretty much arbitrary, where we place them in the master then depends upon other requirements. > Most commonly TEI is probably used as a transcription markup; that > is, TEI is used to transcribe an existing printed work into an > electronic format. But it is also possible to use TEI as a text > mastering format. How certain TEI elements should be interpreted > depends on which of these two uses you have chosen. Agreed. My philosophical view of where to focus is based upon the Work- Expression-Manifestation-Item FRBR group 1 entities: http://en.wikipedia.org/wiki/FRBR (I use the acronym WEMI as a sort of memory aid. It rhymes with "hemi". 
) I believe for maximum repurposeability and usability, the TEI digital master must capture the Expression, not the Manifestation. Once we free ourselves from the (almost always) arbitrary layout of a particular physical book (the Manifestation), that makes it clearer what is and is not important to capture in the TEI Master, and how to do so. I think this is where many of us have differences of opinion, since several here are interested for the digital master to capture the Manifestation of the source book. From what Lee has said, I think we agree the focus should be on the Expression for maximum usability of the digitized texts. Now, I am NOT hostile to those who wish to produce facsimile reproductions (at any level of exactness), but that must not drive the markup of the Master. So long as we preserve original page scans, those who wish to take the TEI Master and produce some sort of facsimile reproduction are certainly encouraged to do so! I view the facsimile reproduction as not a "master", but an "end-user" rendition. > For example, consider the element which, according to the P5 > guidelines, "marks the start of a new (typographic) line in some > edition or version of a text." At this time I plan to use in test mastering as a mark for a typographic "end of line" in the original, with no statement or inference as to whether there's any meaning to the break -- that's for other markup to flesh out. So this is a deviation from the P5 meaning of . Maybe I should use instead with a custom value if I want to flag typographic EOL ("tEOL"?) > You have indicated that in your "faithful" edition of My ?ntonia you > want to maintain a record of the original line breaks. For this you > would use the element. I preserved the exact spot (including within hyphenation) for typographic EOL. This is done for four primary reasons: 1) For purposes of future OCR and proofing work on the texts -- we'd like to know the exact spot of each EOL. 
2) References found in the "old" literature which may refer not only to a page but to a line number on the page. 3) To aid those who wish to produce an *exact* facsimile reproduction. If we did not have this information, they would have to reinsert these EOL markers, which would be a *lot* of work. Since we will have this info from the OCR and/or key entry stages, why throw it away? 4) To record hyphenated EOL and the decision made on whether the resulting re-joined word includes a hard hyphen or not. (This is one of the few "editorial" decisions that needs to be made in producing the master -- to infer the original text -- and we could get some of these wrong the first time, so being able to know where these EOL hyphenations occur is important for future fixes of the master.) > But when presenting your edition you certainly don't want the User > Agent to display these line breaks, unless the end user has > explicitly declared that this is what s/he wants. On the other hand, > in most PGTEI editions of other works the element is used to > indicate where line breaks should occur when presenting the current > editions. In other words, in your document is an indication of > where line breaks appeared /in the past/ whereas in PGTEI editions > is an indication of where line breaks should appear /in the > future/. Yes! Good observation. My purpose for preserving source EOL points is for the previously mentioned four reasons -- only one of which is for rendition use -- and only provided because it is easy for us to do (since we will preserve it anyway for the other cited reasons.) For repurposeability, for lines which were broken in the source work for reasons *other* than simple typography, it is better to use the appropriate structural/semantic markup to describe why we may want, in most end-user renditions, to break the line at such a point. There is a reason "why", and we should mark it up. 
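The split between recording where lines ended in the source and deciding where lines break in a rendition can be sketched in a few lines. This is only an illustration in Python under stated assumptions: typographic line ends are marked with an empty lb element, the function name flatten and its keep_source_lines flag are made up, the line-end positions in the sample sentence are invented, and hyphenated line ends (reason 4 above) are not handled at all.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, keep_source_lines=False):
    """Return the text of a fragment.

    Each <lb/> is dropped by default, so the text reflows freely; with
    keep_source_lines=True it becomes a newline, as a facsimile-style
    rendition would want.
    """
    root = ET.fromstring(xml_text)
    parts = []

    def walk(el):
        if el.tag == "lb" and keep_source_lines:
            parts.append("\n")
        parts.append(el.text or "")
        for child in el:
            walk(child)
            # Text following the child belongs to the parent's flow.
            parts.append(child.tail or "")

    walk(root)
    text = "".join(parts)
    # Trim the space left in front of each recorded line end.
    return "\n".join(line.rstrip() for line in text.split("\n"))


# Hypothetical encoding; the positions of the line ends are invented.
SAMPLE = (
    "<p>I first heard of Ántonia on what seemed to me <lb/>"
    "an interminable journey across the great <lb/>"
    "midland plain of North America.</p>"
)

if __name__ == "__main__":
    print(flatten(SAMPLE))
    print(flatten(SAMPLE, keep_source_lines=True))
```

The default is the reflowed form because the recorded line ends describe the source edition; a line break that every rendition should honour would need its own structural markup, such as verse lines, rather than this marker.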
In many DP produced texts, they simply force a line break on presentation without saying why, and this is troubling since such line breaks will oftentimes make the presentation needlessly look awful on some platforms. Forcing line breaks in presentation is something that should not be taken for granted. > This same analysis applies to the presentation of a Title Page, > which is one of the bones of contention on the Gutenberg discussion > list. The element is used to transcribe how a title page > appeared /in a past edition/, whereas the element is used > in PGTEI to create a new, standardized title page /in a future > edition/. Yes, I view the Title Page in nearly all books to be part of the Manifestation, but not the Expression. The Title Page includes important metadata which must be preserved, but in a way which is machine processible as metadata. Personally, in TEI mastering, I would not even include a title page nor even a , since we will preserve the original page scan anyway. Rather, for markup visualization purposes only (not end-user use), I'd simply use CSS to present the metadata as a sort of "title page" information. It's when the TEI master is converted to end-user formats that a title page will be created (as necessary), and each target format has its own requirements which we cannot a priori predict. This is why I am not enamored with the extra effort it takes to produce some sort of "title page" in the TEI Master -- it's sort of useless imho. (I really don't like using at all in TEI mastering. Maybe Marcello and Lee can provide some very useful reasons that it may be used for some purposes.) > Returning to the problem of notes, if your intended use is to master > a future edition you may want to embed the note at the point in the > main text where the reference occurs. When you're actually composing > the text, it's at that point during composition that the explanation > or reference is close at hand. 
When you're editing or maintaining > the text having the actual note embedded in the main text will help > to be sure that edits to the text will not invalidate the reference > or require alteration of the explanation. When the file is > transformed into a presentation format the note can be moved to > wherever is most appropriate for that particular format. During the > transformation a new intra-document reference will have to be > created because the relationship between an in-line note and its > context is only implicit. These are good points, and in my opinion I am still intrigued with embedding all referenced annotations at the point in the text they are referenced. But I do see Lee's point that we must add xml:id to each note so when the master is repurposed, there will exist a standardized ID for each note for both intra- and inter-publication linking purposes. Furthermore, we should add markup to declare the point or range in the main text which references the annotation, and such markup would specify the equivalent of IDREF (this would also be used when we have one annotation which is referenced in multiple spots in the text.) Don't know what the appropriate TEI markup would be for this. (upon rereading this, Lee mentioned the element, more below.) > On the other hand, if you are transcribing a work from an existing > edition, and alteration of the text is not foreseen, I don't see > how the justifications for creating in-line notes apply. One of the > downsides to using embedded notes is that if you try to view the > document using an appropriate Cascading Style Sheet the note text > will remain visible in the middle of the noted text in the displayed > document; yet moving text which is too digressive is exactly why an > author or publisher used a footnote in the first place. 
We want our > notes to be stored somewhere where they can be easily accessed, but > only when we choose to do so, and displayed in a manner which will > not disrupt the flow of the main text. My next favorite location in the TEI Master (not end-user renditions necessarily) to place all the book's annotations is to collect them all at the end of the book in a special "end notes" section. We now have them all in one place, and from my work with a few books that contain hundreds of annotations, collecting them in one place has great benefit to the authoring process. > Another of the problems of the using in-line notes, at least as you > have used them (from my perspective) is that they imply a reference > without explicitly creating one. This may just be my irrational > bias, but I get really nervous with implied content; everything that > can be made explicit, should be. I agree. See my comment earlier on the need to add referencing markup. > For example, even if you were to leave the note in-line, you should > probably include a reference at that point, making the linkage > between the main text and the note explicit, e.g. [I redid this > markup a little to make it easier to see]: **********************************************************************

I first heard of Ántonia >Ántonia is strongly accented on the first syllable, like the English name Anthony, ...

blah, blah, blah...

********************************************************************** > (As an alternative, you may want to omit a note marker and use the noted > text itself, e.g.: **********************************************************************

I first heard of >Ántonia is strongly accented on the first syllable, like the English name Anthony, ...

blah, blah, blah...

********************************************************************** I *like* the second above since I believe, for TEI Mastering, the kinds of note markers used in the source book (with a few rare exceptions) is NOT important at the Expression level. We should try to record what the original marker was, but try avoid putting it into content itself since it is NOT content of the Expression. Thus, we should let the conversion system add the appropriate referencing markers, numbers or letters, as needed. > Likewise, I feel that anytime there is a risk of confusion between > "in the past" uses of the document and "in the future" uses, > something needs to be added to the markup to make the use explicit. > For example, every element that is intended to force a line > break in all future presentations should have some indication that > it is more that the simple description of the presentation of the > source text that the guidelines envisioned; perhaps something like > or . Well, as noted above, we should do our absolute best to markup 'why' a line break is to occur there in end-user renditions. I personally believe we can cover 99.9% this way (mostly for verse.) If we still gotta have a forced line break somewhere and can't mark it up semantically, we could consider using the element with a "standardized" value that is ours and unambiguous as to meaning. > Given the purposes and use of footnotes, I think I have concluded > that for TEI encoding of notes in transcribed texts I would follow > these guidelines: Hmmm, so far I am not convinced in separating bibliographic types of annotations from exposition types. Given that I remain unconvinced, and I could be persuaded differently, then here's the order of where I'd place the referenced annotations in the TEI master: 1) At the point of reference. 2) In a special end-notes section which collects them all in one place. 
3) At the end of the division where the annotations are *first* referenced (note that an annotation may be referenced multiple times.) #1 and #2 are pretty close in my mind, and with #2 we can get quite powerful CSS visualization, plus we now have them all collected so conversion is made easier for end-user formats where having all the annotations collected makes sense (e.g., Microsoft LIT -- which I do in my commercial version of Burton's "Kama Sutra".) And I do agree with Lee that no matter what, each annotation must carry an "xml:id" and that we include a at the point or range where the annotation is referenced from, and includes an IDREF. In addition, I would NOT place at the point of reference the original notemarker, or any other notemarker, in the *content* of the TEI master, but do believe we should record the original note marker within an attribute value in , however that may be done (not in the note.) Jon Noring From jon at noring.name Tue Oct 23 11:17:21 2007 From: jon at noring.name (Jon Noring) Date: Tue, 23 Oct 2007 12:17:21 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <471E2CFA.5060708@novomail.net> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> <447877789.20071016105253@noring.name> <471E2CFA.5060708@novomail.net> Message-ID: <1461317943.20071023121721@noring.name> > Jon Noring wrote: >> Now, granted, I had not thought of this situation, even though I am >> aware of it, since in *so many* PG (X)HTML texts I've looked at, >>   is rampantly being abused, such as for indentation of >> paragraphs and verse lines, etc. It's better to see   being used >> only for keeping words together rather than forcing spacing in visual >> presentation (since that is its purpose.) Yet,   is still >> something I am not fond of using in virtually any circumstance, >> especially in that in most instances there is a markup solution, as >> illustrated above. 
> Playing with My Ántonia, I notice you have left the original typography > in place with regards to contractions, e.g. "do n't". I find that > frequently this will cause the "do" to end one line, and "n't" to start > the following line. Don't you think this is an appropriate use for > &nbsp; (e.g. "do&nbsp;n't")? After all, the nbsp stands for Non-Breaking > SPace, which seems to be exactly what it is being used for here. Hmmm, yes, this might be a place to use &nbsp;. I'll see if, in the original source text, an EOL never occurs in the use of these contractions. Anyway, using &nbsp; for purposes other than "requesting" the user agent not put a linebreak at the point, should not be allowed in the TEI Master. Jon Noring From jon at noring.name Tue Oct 23 11:41:09 2007 From: jon at noring.name (Jon Noring) Date: Tue, 23 Oct 2007 12:41:09 -0600 Subject: [gutvol-d] it's good to see the .tei people In-Reply-To: <1461317943.20071023121721@noring.name> References: <3027016.1192552089449.JavaMail.?@fh1035.dia.cp.net> <447877789.20071016105253@noring.name> <471E2CFA.5060708@novomail.net> <1461317943.20071023121721@noring.name> Message-ID: <19210344480.20071023124109@noring.name> Lee wrote: > Playing with My Ántonia, I notice you have left the original typography > in place with regards to contractions, e.g. "do n't". I find that > frequently this will cause the "do" to end one line, and "n't" to start > the following line. Don't you think this is an appropriate use for > &nbsp; (e.g. "do&nbsp;n't")? After all, the nbsp stands for Non-Breaking > SPace, which seems to be exactly what it is being used for here. As a followup to my prior reply, some here may be interested in the summary regarding the use of white space, the Unicode space characters, line breaking and the use of the soft hyphen in XML documents -- a sort of grab-bag of interrelated topics: http://www.openreader.org/spec/bnd10.html#sec3.3.7 Especially note a couple references that add insights into these topics: 1.
Unicode in XML and other Markup Languages: http://www.w3.org/TR/unicode-xml/ 2. Section 6.2 of the Unicode 4.0 Standard (PDF): http://www.unicode.org/versions/Unicode4.0.0/ch06.pdf 3. And the easy-to-understand (NOT) Unicode Standard Annex #14 -- Line Breaking Properties: http://www.unicode.org/unicode/reports/tr14/ Jon From marcello at perathoner.de Tue Oct 23 11:56:19 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 23 Oct 2007 20:56:19 +0200 Subject: [gutvol-d] placement of the note in the TEI of My Antonia In-Reply-To: <20071023165048.GA25910@ark.in-berlin.de> References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> <471DB5D7.7000304@perathoner.de> <20071023165048.GA25910@ark.in-berlin.de> Message-ID: <471E43D3.0@perathoner.de> Ralf Stephan wrote: > Marcello Perathoner wrote >> Lee Passey wrote: >>> I haven't figured out how (if it's even possible) to get an attribute >>> value to be displayed as part of the text using CSS >> corr:after { content: " " attr(sic); >> text-decoration: line-through; color: red; } > > Would that be acceptable behaviour for a TEI file for PG? As already said, no version of IE supports these standard CSS 2 declarations: you won't see anything on IE. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Tue Oct 23 12:13:16 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 23 Oct 2007 15:13:16 EDT Subject: [gutvol-d] early results of radiohead's experiment Message-ID: perhaps you've been wondering about radiohead's experiment? > http://mashable.com/2007/10/19/radiohead-album-sales/ > > Radiohead, which offered its latest album as free downloads last week, > has seen 1.2 million downloads of ?In Rainbows.? With no label, > no promotions, and direct access to fans, Radiohead gave up its music > for free and asked for donations, whatever fans deemed reasonable, > in return. 
> What the band got was an average of $8 per album sold,
> bringing estimates of profit to about $10 million. Not too shabby
> for one week. The number of albums sold in the past week
> exceeded the launch week sales of its three previous albums combined.

even if i'm not sure i believe they got "an average of $8" over _all_ of those 1.2 million downloads, an average of even half that much would be great, when you consider distribution costs were minimal, and their cut is 100%...

especially when you factor in the benefits they derive from the bonding of their fans because of this measure of generosity, a long-term benefit.

ironically, the fans got something over and above a digital download too, the warm feeling that they have supported a band they love, in a way that making a buy of a recording-company product probably never matched...

this is truly the model of the future.

and no, when it's an everyday thing, it won't have the big impact that it had here -- as a novelty occurrence -- but the development of this methodology as "the normal course of music" will indeed exert a tremendous influence on 21st-century human relations.

the virtually-zero cost of reproducing digital goods will allow a generosity of spirit to emerge that would have been impossible to engineer in the age of physical goods, with their comparatively expensive nature of reproduction.

which is not to say that physical goods will disappear. indeed, radiohead is likely to make a good deal of money from sales of the "hard-copy" versions. but that $80 package will rightly be seen as a "souvenir" of the _free_ music...

-bowerbird

From jon at noring.name Tue Oct 23 12:14:13 2007
From: jon at noring.name (Jon Noring)
Date: Tue, 23 Oct 2007 13:14:13 -0600
Subject: [gutvol-d] placement of the note in the TEI of My Antonia
In-Reply-To: <471E43D3.0@perathoner.de>
References: <6310459816.20071020133137@noring.name> <471D02C6.9010201@novomail.net> <471D3D4A.2080909@netronome.com> <471D6F3E.30004@novomail.net> <471DB5D7.7000304@perathoner.de> <20071023165048.GA25910@ark.in-berlin.de> <471E43D3.0@perathoner.de>
Message-ID: <1919890506.20071023131413@noring.name>

Marcello wrote:
> corr:after { content: " " attr(sic);
> text-decoration: line-through; color: red; }
>
> As already said, no version of IE supports these standard CSS 2
> declarations: you won't see anything on IE.

If the purpose of using this CSS is for visualization of a TEI master for authoring and maintenance purposes, then we don't care about IE support so long as one browser supports these properties. It turns out we have both Opera and Firefox that work with this CSS. Good enough for me. LOL.

Jon Noring

From Bowerbird at aol.com Tue Oct 23 14:21:08 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Oct 2007 17:21:08 EDT
Subject: [gutvol-d] nice weekend
Message-ID:

here are a few follow-up thoughts on .zml generation of .pdf files...

.pdf gives z.m.l. entry into the world of handheld reader-machines, including the sony reader, the iliad, and -- of course -- the iphone.

of course, these machines also support .html, but there are people who are more comfortable with the paged presentation of the .pdf. and, realistically, since these machines have non-resizable screens, the frozen-page nature of .pdf is _not_ a deficit; it's even a benefit.

however, my present web approaches work _dandy_ on my iphone, after a few u.i. tweaks (e.g., enlarging the buttons for my fat fingers).
it's even workable to read a _scan-set_, as i show with "my antonia":
> http://www.z-m-l.com/go/iphone/myantf007.html

to go forward to the next scan, click on the right 3/4ths of the page; and to go back one, click the left quarter. ignore the small buttons...

(thank you to jon noring for his good job of scanning and cropping! each page of the book displays clearly and readable on the iphone.)

i'm likely to make a native app for the iphone when the developer kit comes out next february, since i'm an apple/iphone person, but still, it's nice to know i don't _have_ to, that -- already -- "it just works"...

-bowerbird

p.s. you could also download the zipped scan-set of "my antonia" -- http://z-m-l.com/go/myant/myant.zip -- and load 'em to your ipod, then use the nifty flicking motion to thumb through all of the images.

From sly at victoria.tc.ca Tue Oct 23 18:47:29 2007
From: sly at victoria.tc.ca (Andrew Sly)
Date: Tue, 23 Oct 2007 18:47:29 -0700 (PDT)
Subject: [gutvol-d] Canadian Public Domain music scores site taken town
Message-ID:

For those interested in Copyright, cease and desist orders, etc.

Geist, Knopf and Universal Edition
http://www.p2pnet.net/story/13749

Universal Edition AG Forces Public Domain Website Offline
http://www.slyck.com/story1603_Universal_Edition_AG_Forces_Public_Domain_Website_Offline

Andrew

From Bowerbird at aol.com Tue Oct 23 19:01:28 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Oct 2007 22:01:28 EDT
Subject: [gutvol-d] Canadian Public Domain music scores site taken town
Message-ID:

michael hart has already come to the rescue...
-bowerbird

From jon at noring.name Tue Oct 23 22:44:04 2007
From: jon at noring.name (Jon Noring)
Date: Tue, 23 Oct 2007 23:44:04 -0600
Subject: [gutvol-d] Comment on My Antonia scan set
In-Reply-To:
References:
Message-ID: <1493909058.20071023234404@noring.name>

Bowerbird wrote:
> (thank you to jon noring for his good job of scanning and cropping!
> each page of the book displays clearly and readable on the iphone.)

Thank you for the kind words!

Those who have followed the "Distributed Scanners" YahooGroup, which was (and still is) intended to discuss the feasibility of starting a distributed effort at scanning books, know that I tend to fall on the side of rigor when scanning books. If one gets only one chance at scanning a book, best to err on the side of overkill rather than rushing the job through... (see: http://groups.yahoo.com/group/distscan/ )

My scan of an original 1st edition of "My Antonia" was done at 600 dpi and 24-bit color depth (essentially full color). The book was carefully disassembled and the edges cut so each page could be truly flat on the flatbed scanner glass. I was pretty rigorous at keeping the glass clean. Each scanned image was saved in lossless png (typical compression is about 50% compared to uncompressed bitmap.)

After scanning was done, I then used Paint Shop Pro (DO NOT USE ANY IMAGE PROCESSING SOFTWARE USED IN OCR PACKAGES) to deskew each page scan using a *true* image rotation tool (won't explain why this is important, but trust me, not all deskewing algorithms use true image rotation.) I then cropped each image to be the same x-y pixel size (2210x3716) and cropped so as to align the text on all the pages similarly as best I could.
The result was again saved in 600 dpi, 24-bit lossless PNG.

For an example (page 311) of a scanned page image which has been processed as just mentioned, see:

http://www.openreader.org/myantonia/orig-pagescans/311.png
(warning, 14 meg file!)

I learned a lot from scanning a few books, and in the future (when I get a new flatbed scanner) will rescan "My Antonia" following this partial list of practices (I've kept the disassembled book saved in a ziploc bag to keep it from collecting any more dust than it has):

1) All pages with print will be scanned at 600 dpi, 24-bit color.
   Those with graphics/images, and the title page, will be done at
   1200 dpi, 24-bit color.

2) I will scan a color calibration chart at both the beginning and the
   end of the job. These images will become part of the book scan set.

3) Each paper page will be thoroughly cleaned to remove as much dust
   as I can. How I'm not sure, so ideas here are welcome (I'd love a
   machine where one simply inserts the page and it comes out the
   other end thoroughly cleaned of all dust.) Of course, the glass
   will also be religiously kept clean.

4) Each scanned image will be saved in lossless format (likely PNG
   since it is an open standard.)

5) This time I will save the raw page scans before they are deskewed
   and cropped. The raw set will be the "raw masters", and the
   deskewed and cropped set the "normalized masters". Of course,
   deskewing will be done using a true rotation algorithm in a
   professional-grade image processing application.

6) Proper filenaming will be done (won't go into specifics here).

7) All scan sets will be burned to multiple DVD and distributed to
   various archives, including the Internet Archive. Other methods of
   storage will also be experimented with. Redundancy and distribution
   of the raw master is important for long-term preservation.

There's a few other "best practices" I will also do in the scanning, but the above are the major ones.

Now, someone will say why the fuss?
A variety of reasons:

1) For the great Works in the Public Domain, we *should* produce
   archival-quality scan sets of the "canonical" printings of those
   Works. The 1st Edition of "My Antonia" which I possess falls into
   this special category. Of course, I believe all books should be
   scanned with the same rigor, but realistically this is unlikely to
   happen unless I can find an army of people who believe as I do in
   "doing it ueber-right." (Anyone here willing to do this for the
   "canonical" printings of the great Works in the Public Domain, let
   me know.)

2) 600 dpi creates scans which are quite highly readable "as is" and
   for image processing has a lot more information to get good results
   when downsized and processed in various ways.

3) 24-bit provides a lot of useful information. Not only does it
   preserve the natural look of the page, but it provides 3 channels
   (RGB) that we can play with for image processing *and* OCR. I'm
   intrigued by the idea of running OCR on each channel and comparing.
   In addition, for things like rotation and resampling, this higher
   color depth will give more accurate results.

4) For each color channel (which effectively is 256 gray scale), we
   can also convert each page to 2 color using a variety of thresholds,
   OCR each one, and do comparison of the results. Starting with high
   quality, high resolution images should give us better results.

Anyway, thanks again to Bowerbird's kind words on the My Antonia scan set. It looks good on the iPhone because I started with high-quality, high-resolution, high-color depth masters, and was careful in image normalization.
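
For anyone who wants to play with the idea in point 4, here is a minimal sketch of splitting a page into its R, G and B channels and reducing each to 2 color at several thresholds. It assumes the page is already loaded as rows of (R, G, B) tuples; the function names and threshold values are made up for illustration, and a real comparison would hand each bitmap to an OCR engine instead of counting ink pixels.

```python
def split_channels(pixels):
    """Split rows of (r, g, b) tuples into three 256-level gray channels."""
    return {
        name: [[px[i] for px in row] for row in pixels]
        for i, name in enumerate(("red", "green", "blue"))
    }


def binarize(channel, threshold):
    """Convert one channel to 2 color: 1 = ink (darker than threshold), 0 = paper."""
    return [[1 if value < threshold else 0 for value in row] for row in channel]


def ink_counts(pixels, thresholds=(96, 128, 160)):
    """Count ink pixels for every (channel, threshold) pair.

    The count is only a stand-in for "run OCR on this bitmap and score it".
    """
    counts = {}
    for name, channel in split_channels(pixels).items():
        for t in thresholds:
            bitmap = binarize(channel, t)
            counts[(name, t)] = sum(sum(row) for row in bitmap)
    return counts


# Toy 2x3 "page": yellowed paper, dark ink, and one faded mid-tone pixel.
page = [
    [(250, 240, 220), (40, 35, 30), (250, 240, 220)],
    [(120, 110, 90), (250, 240, 220), (30, 30, 30)],
]
counts = ink_counts(page)
for key in sorted(counts):
    print(key, counts[key])
```

On the toy page the faded pixel is classified as ink in the blue channel at a threshold where red and green still treat it as paper, which is the kind of per-channel difference the comparison is meant to surface.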

Jon Noring

From walter.van.holst at xs4all.nl Wed Oct 24 09:18:50 2007
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Wed, 24 Oct 2007 18:18:50 +0200
Subject: [gutvol-d] [Fwd: Hands on with Google's OCRopus open-source scanning software]
Message-ID: <471F706A.3040704@xs4all.nl>

Ars Technica has reviewed Google's OCRopus software:

From hart at pglaf.org Wed Oct 24 12:58:30 2007
From: hart at pglaf.org (Michael Hart)
Date: Wed, 24 Oct 2007 12:58:30 -0700 (PDT)
Subject: [gutvol-d] Canadian Public Domain music scores site taken town
In-Reply-To:
References:
Message-ID:

Well, it wasn't just me directly, it was PG.

On Tue, 23 Oct 2007, Bowerbird at aol.com wrote:
>
> michael hart has already come to the rescue...
>
> -bowerbird

From gbnewby at pglaf.org Wed Oct 24 17:28:08 2007
From: gbnewby at pglaf.org (Greg Newby)
Date: Wed, 24 Oct 2007 17:28:08 -0700
Subject: [gutvol-d] Canadian Public Domain music scores site taken town
In-Reply-To:
References:
Message-ID: <20071025002808.GA27052@mail.pglaf.org>

On Wed, Oct 24, 2007 at 12:58:30PM -0700, Michael Hart wrote:
>
> Well, it wasn't just me directly, it was PG.

As covered on slashdot, based on a note to the BookPeople list:
http://yro.slashdot.org/article.pl?sid=07/10/24/0325256

We have not heard back from the IMSLP fellow in the past day or so, but I do think he'll accept our offer.
-- Greg

> On Tue, 23 Oct 2007, Bowerbird at aol.com wrote:
> >
> > michael hart has already come to the rescue...
> >
> > -bowerbird

From julio.reis at tintazul.com.pt Thu Oct 25 04:08:46 2007
From: julio.reis at tintazul.com.pt (Júlio Reis)
Date: Thu, 25 Oct 2007 12:08:46 +0100
Subject: [gutvol-d] Copyright on international treaties
In-Reply-To:
References:
Message-ID: <4720793E.8020907@tintazul.com.pt>

A quick (?) general copyright clearance question - are international treaties covered by copyright or other restrictions in the USA, which would prevent PG from freely distributing them?

We already have a disclaimer which explicitly says people use the texts at their own risk, so we're not offering legal help of any kind, only informing and distributing legal texts, so we're clear there.

Some such treaties are useful to have around; potentially all? I am thinking right now of the Act of Paris of the Berne Convention (1971), which Portugal signed in 1978. Administrative texts are not subject to copyright in Portugal, so I could find a reliable official source and produce a clear text version of the Portuguese translation for upload to PG; perhaps HTML too.

FYI - PG Europe carries the Human Rights Declaration. The database is down again, so you can't really read it, but... Oh, and if it's the case that Gutenberg (USA) can carry international treaties, then I'd like to know if the Human Rights Declaration can be offered from gutenberg.org also, and what I should do, if anything.

Júlio a.k.a. Tintazul.

From jeroen.mailinglist at bohol.ph Thu Oct 25 14:04:38 2007
From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account))
Date: Thu, 25 Oct 2007 23:04:38 +0200
Subject: [gutvol-d] Copyright on international treaties
In-Reply-To: <4720793E.8020907@tintazul.com.pt>
References: <4720793E.8020907@tintazul.com.pt>
Message-ID: <472104E6.2060808@bohol.ph>

Facts cannot be copyrighted. Since any text that has force of law cannot be paraphrased in any way without losing its legal standing, laws and treaties with force of law, at least in the US, fall under the fact-expression merger doctrine, and thus lose copyright restrictions, certainly when used in a context of law and legal discussion. In other words, laws are uncopyrightable facts.

I didn't find this rule in PG books, though, but there is enough jurisprudence available. Note that this is US only. Some countries have even more insane copyright laws.

Jeroen Hellingman

Júlio Reis wrote:
> A quick (?) general copyright clearance question - are international
> treaties covered by copyright or other restrictions in the USA, which
> would prevent PG from freely distributing them?
>
> We already have a disclaimer which explicitly says people use the texts
> at their own risk, so we're not offering legal help of any kind, only
> informing and distributing legal texts, so we're clear there.
>
> Some such treaties are useful to have around; potentially all? I am
> thinking right now of the Act of Paris of the Berne Convention (1971),
> which Portugal signed in 1978. Administrative texts are not subject to
> copyright in Portugal, so I could find a reliable official source and
> produce a clear text version of the Portuguese translation for upload to
> PG; perhaps HTML too.
>
> FYI - PG Europe carries the Human Rights Declaration. The database is
> down again, so you can't really read it, but...
> Oh, and if it's the case
> that Gutenberg (USA) can carry international treaties, then I'd like to
> know if the Human Rights Declaration can be offered from gutenberg.org
> also, and what I should do, if anything.
>
> Júlio a.k.a. Tintazul.

From Bowerbird at aol.com Thu Oct 25 15:35:59 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Oct 2007 18:35:59 EDT
Subject: [gutvol-d] by by
Message-ID:

> http://www.gutenberg.org/etext/23181
> Thomas Jefferson Brown by By James Oliver Curwood

-bowerbird

From hart at pglaf.org Thu Oct 25 17:39:13 2007
From: hart at pglaf.org (Michael Hart)
Date: Thu, 25 Oct 2007 17:39:13 -0700 (PDT)
Subject: [gutvol-d] Copyright on international treaties
In-Reply-To: <472104E6.2060808@bohol.ph>
References: <4720793E.8020907@tintazul.com.pt> <472104E6.2060808@bohol.ph>
Message-ID:

I think you will find that laws are copyrighted in England.

mh

On Thu, 25 Oct 2007, Jeroen Hellingman (Mailing List Account) wrote:
>
> Facts cannot be copyrighted. Since any text that has force of law cannot
> be paraphrased in any way without losing its legal standing, laws and
> treaties with force of law, at least in the US, fall under the
> fact-expression merger doctrine, and thus lose copyright restrictions,
> certainly when used in a context of law and legal discussion. In other
> words, laws are uncopyrightable facts.
>
> I didn't find this rule in PG books, though, but there is enough
> jurisprudence available. Note that this is US only. Some countries have
> even more insane copyright laws.
>
> Jeroen Hellingman
>
> Júlio Reis wrote:
>> A quick (?) general copyright clearance question - are international
>> treaties covered by copyright or other restrictions in the USA, which
>> would prevent PG from freely distributing them?
>>
>> We already have a disclaimer which explicitly says people use the texts
>> at their own risk, so we're not offering legal help of any kind, only
>> informing and distributing legal texts, so we're clear there.
>>
>> Some such treaties are useful to have around; potentially all? I am
>> thinking right now of the Act of Paris of the Berne Convention (1971),
>> which Portugal signed in 1978. Administrative texts are not subject to
>> copyright in Portugal, so I could find a reliable official source and
>> produce a clear text version of the Portuguese translation for upload to
>> PG; perhaps HTML too.
>>
>> FYI - PG Europe carries the Human Rights Declaration. The database is
>> down again, so you can't really read it, but... Oh, and if it's the case
>> that Gutenberg (USA) can carry international treaties, then I'd like to
>> know if the Human Rights Declaration can be offered from gutenberg.org
>> also, and what I should do, if anything.
>>
>> Júlio a.k.a. Tintazul.

From gbnewby at pglaf.org Thu Oct 25 19:11:30 2007
From: gbnewby at pglaf.org (Greg Newby)
Date: Thu, 25 Oct 2007 19:11:30 -0700
Subject: [gutvol-d] Copyright on international treaties
In-Reply-To: <4720793E.8020907@tintazul.com.pt>
References: <4720793E.8020907@tintazul.com.pt>
Message-ID: <20071026021130.GH18258@mail.pglaf.org>

On Thu, Oct 25, 2007 at 12:08:46PM +0100, Júlio Reis wrote:
> A quick (?)
> general copyright clearance question - are international
> treaties covered by copyright or other restrictions in the USA, which
> would prevent PG from freely distributing them?

Dear Júlio:

For the most part, these will be eligible for Project Gutenberg. The key will be to find a source where the US Government publishes the treaties (such as in the Federal Register), to confirm that no copyright was claimed. Then, we should be able to clear under our Rule 8.

I can think of several variations that might make it more difficult, but I can correspond individually (or just submit at copy.pglaf.org with details).
-- Greg

> We already have a disclaimer which explicitly says people use the texts
> at their own risk, so we're not offering legal help of any kind, only
> informing and distributing legal texts, so we're clear there.
>
> Some such treaties are useful to have around; potentially all? I am
> thinking right now of the Act of Paris of the Berne Convention (1971),
> which Portugal signed in 1978. Administrative texts are not subject to
> copyright in Portugal, so I could find a reliable official source and
> produce a clear text version of the Portuguese translation for upload to
> PG; perhaps HTML too.
>
> FYI - PG Europe carries the Human Rights Declaration. The database is
> down again, so you can't really read it, but... Oh, and if it's the case
> that Gutenberg (USA) can carry international treaties, then I'd like to
> know if the Human Rights Declaration can be offered from gutenberg.org
> also, and what I should do, if anything.
>
> Júlio a.k.a. Tintazul.

From wvholst at xs4all.nl Fri Oct 26 02:40:01 2007
From: wvholst at xs4all.nl (Walter H.
van Holst)
Date: Fri, 26 Oct 2007 11:40:01 +0200 (CEST)
Subject: [gutvol-d] Copyright on international treaties
Message-ID: <14569.80.127.124.230.1193391601.squirrel@webmail.xs4all.nl>

> I think you will find that laws are copyrighted in England.

Article 2.4 of the Berne Convention reads as follows:

"It shall be a matter for legislation in the countries of the Union to determine the protection to be granted to official texts of a legislative, administrative and legal nature, and to official translations of such texts."

Several signatories, including The Netherlands, have made explicit allowances in their copyright laws for legislation and judicial verdicts to be exempted from copyright. The UK is indeed an exception with an especially convoluted set of Crown copyright, Parliamentary copyright and Copyright in Acts and Measures.

Regards,

Walter

From Bowerbird at aol.com Fri Oct 26 10:33:34 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Oct 2007 13:33:34 EDT
Subject: [gutvol-d] web-based wysiwyg word-processors
Message-ID:

the web-based wysiwyg word-processors just keep getting better. this one -- based in flash -- was even recently bought by adobe:
> http://www.buzzword.com

the heavy-markup people will soon be faced with a tough choice: they can align themselves with the light-markup revolution instead, or face a world where structure is abandoned entirely and we have documents formatted in a totally undisciplined wysiwyg manner...

it might help to refresh their memory about the way this battle played out when it was fought on the desktop the last 20 years.

-bowerbird

From Bowerbird at aol.com Fri Oct 26 13:01:11 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Oct 2007 16:01:11 EDT
Subject: [gutvol-d] z.m.l. auto-conversions to .html and .pdf
Message-ID:

ok, i've got z.m.l. auto-converting pretty well to both .html and .pdf now.

so if anyone wants to send me some z.m.l. files to convert for them, i will. this will let me see where i've got problems with my converter-routines, or in my formatting rules, or just where people seem "confused" on the rules (which -- by my philosophy -- means that the rules need to be changed)...

and once i've been reasonably convinced that you can create correct z.m.l., i'll give you a u.r.l. where you can just do the .html conversion by yourself...

-bowerbird

From Bowerbird at aol.com Fri Oct 26 16:46:45 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Oct 2007 19:46:45 EDT
Subject: [gutvol-d] funny and too funny
Message-ID:

funny: is it christmas?
> http://www.isitchristmas.com/

too funny: the site has an r.s.s. feed:
> http://www.isitchristmas.com/rss.xml

-bowerbird

From piggy at netronome.com Fri Oct 26 20:19:02 2007
From: piggy at netronome.com (La Monte H.P.
Yarroll)
Date: Fri, 26 Oct 2007 23:19:02 -0400
Subject: [gutvol-d] Comment on My Antonia scan set
In-Reply-To: <1493909058.20071023234404@noring.name>
References: <1493909058.20071023234404@noring.name>
Message-ID: <4722AE26.5000804@netronome.com>

Jon Noring wrote:
> ...
> I learned a lot from scanning a few books, and in the future (when I
> get a new flatbed scanner) will rescan "My Antonia" following this
> partial list of practices (I've kept the disassembled book saved in a
> ziploc bag to keep it from collecting any more dust than it has):
>
> 1) All pages with print will be scanned at 600 dpi, 24-bit color.
> Those with graphics/images, and the title page, will be done at
> 1200 dpi, 24-bit color.
>

As much as I enjoy the texture of nice paper, 24-bit color seems excessive for text. I find for most books that 8-bit gray actually produces a more visually pleasing result at 1/3rd the storage. For most text I find 200 dpi more than sufficient, but I do agree that certain well-printed works benefit from 600 dpi.

My general rule for deciding if I've used a high enough resolution to scan artwork is to ask if I have captured any artifacts which the artist did not intend. If I see bubbles in the ink, evidence of aberrations in the pen, or even small defects in printing registration, then I have scanned at high enough resolution.

I generally agree with your choice of 1200 dpi X 24 bit for illos, but there are some caveats. My scanner will not stream at 1200 dpi on a full-width page. It stops every few inches. The places it stops show up as half-pixel deviations across the full page. Sometimes these artifacts are sufficiently annoying that I am happier with a 600 dpi scan which WILL stream. I have run across a handful of very high-quality etchings for which I have been pleased to have 2400 dpi available. Unfortunately, the pause artifacts are even more frequent, but I don't have a good way around that.
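
To put rough numbers on the storage trade-off, here is a minimal back-of-the-envelope sketch. It uses the 2210x3716 pixel page size reported earlier in the thread for the 600 dpi scans and only computes uncompressed bitmap sizes; real PNG files will be smaller, and the function names are made up for illustration.

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Size of an uncompressed bitmap, ignoring file headers and row padding."""
    return width_px * height_px * bits_per_pixel // 8


def rescale(pixels, from_dpi, to_dpi):
    """Pixel count along one edge if the same page were scanned at another dpi."""
    return round(pixels * to_dpi / from_dpi)


# Cropped page size quoted in the thread for the 600 dpi scans.
W600, H600 = 2210, 3716

color_600 = uncompressed_bytes(W600, H600, 24)   # 600 dpi, 24-bit color
gray_600 = uncompressed_bytes(W600, H600, 8)     # 600 dpi, 8-bit gray

# The same page area at 200 dpi has one third the pixels along each edge.
W200, H200 = rescale(W600, 600, 200), rescale(H600, 600, 200)
gray_200 = uncompressed_bytes(W200, H200, 8)     # 200 dpi, 8-bit gray

for label, size in [
    ("600 dpi, 24-bit color", color_600),
    ("600 dpi, 8-bit gray", gray_600),
    ("200 dpi, 8-bit gray", gray_200),
]:
    print(f"{label}: {size / 2**20:.1f} MiB per page")
```

Dropping from 24-bit color to 8-bit gray cuts the raw size to a third, and dropping from 600 dpi to 200 dpi cuts it by a further factor of about nine, so the two choices together differ by roughly 27x per page before compression.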

> 2) I will scan a color calibration chart at both the beginning and the
> end of the job. These images will become part of the book scan set.
>

I have been very pleased with my target from http://www.targets.coloraid.de/.

I only use the color target with books where the color is critical such as illustrated natural history books or text books on color. Other than those, the calculated color scanning profile for the scanner is quite sufficient.

In the course of scanning a single book, your scanner will not shift enough to make a measurable difference between the first and last page to justify using the calibration chart twice. One scan every few weeks is more than sufficient. Once per book is even overkill, but it does simplify record keeping.

> 3) Each paper page will be thoroughly cleaned to remove as much dust
> as I can. How I'm not sure, so ideas here are welcome (I'd love a
> machine where one simply inserts the page and it comes out the
> other end thoroughly cleaned of all dust.) Of course, the glass
> will also be religiously kept clean.
>

I have had success with canned air. With some books though, the paper is actively turning into dust and it quickly becomes an issue of seriously diminishing returns. I have also used scotch tape to pick up particles. The scotch tape is more of a touch-up tool than a full cleaning.

For the glass I use Windex wipes followed by a dry lintless cloth. Depending on the book I clean anywhere from once per page to once every 20 or even 30 pages. I learned the hard way to use the canned air on the glass first. Occasionally you come across a particle which will scratch glass.

> 4) Each scanned image will be saved in lossless format (likely PNG
> since it is an open standard.)
>

I use png's for almost everything.

> 5) This time I will save the raw page scans before they are deskewed
> and cropped. The raw set will be the "raw masters", and the
> deskewed and cropped set the "normalized masters".
> Of course,
> deskewing will be done using a true rotation algorithm in a
> professional-grade image processing application.
>

Yes, keeping raw originals is VERY useful. I have them for all of my books.

The deskew capability in leptonica is very good and uses the fastest-known algorithm. The patent expired a couple years ago.

By your reference to "true rotation" I gather that you object to rotation by successive shears? I have been unable to differentiate sheared pages from so-called "true rotations" by visual inspection--even very close visual inspection. After 50 books, the amount of time you spend waiting for page rotations really adds up. Archive your raw scans and future generations have the opportunity to redo your rotations if they aren't happy with your work.

> 6) Proper filenaming will be done (won't go into specifics here).
>

Filenames which match page numbers are very handy, but I've lately found that a hand-built HTML TOC is even more useful for about the same amount of effort. Combine that with a good page-turning interface and exactly correct page number filenames are not all that critical.

> 7) All scan sets will be burned to multiple DVD and distributed to
> various archives, including the Internet Archive. Other methods of
> storage will also be experimented with. Redundancy and distribution
> of the raw master is important for long-term preservation.
>

I highly recommend a simple page-turning interface on top of your raw pages. The program "curator" produces a simple HTML hierarchy which increases the utility of a CD or DVD of page images quite a bit with very little extra overhead.

> There's a few other "best practices" I will also do in the scanning,
> but the above are the major ones.
>
>
> Now, someone will say why the fuss?
>
> A variety of reasons:
>
> 1) For the great Works in the Public Domain, we *should* produce
> archival-quality scan sets of the "canonical" printings of those
> Works.
> The 1st Edition of "My Antonia" which I possess falls into
> this special category. Of course, I believe all books should be
> scanned with the same rigor, but realistically this is unlikely to
> happen unless I can find an army of people who believe as I do in
> "doing it ueber-right." (Anyone here willing to do this for the
> "canonical" printings of the great Works in the Public Domain, let
> me know.)
>

I feel that we have until 2019 to catch up. Only certain works are worth archival-grade preservation. For the bulk of human authorship, simple preservation of the content is sufficient. I have even come to feel that there are works which are *gasp* not worth preserving.

> 2) 600 dpi creates scans which are quite highly readable "as is" and
> for image processing has a lot more information to get good results
> when downsized and processed in various ways.
>

I find 200 dpi very readable for most works. Interestingly, OCR does work better on larger images, but scaling 200 dpi scans to 600 dpi (with geometric interpolation) actually works just as well as scanning at 600 dpi to start with.

> 3) 24-bit provides a lot of useful information. Not only does it
> preserve the natural look of the page, but it provides 3 channels
> (RGB) that we can play with for image processing *and* OCR. I'm
> intrigued by the idea of running OCR on each channel and comparing.
> In addition, for things like rotation and resampling, this higher
> color depth will give more accurate results.
>

Unless the book is foxed or badly oxidized, I've found no value to scanning text in color. I HAVE tried the multiple channel OCR trick and the differences are insignificant. The exception is books with lots of foxing. Generally the green channel gives the best results, but occasionally the blue channel is better.

> 4) For each color channel (which effectively is 256 gray scale), we
> can also convert each page to 2 color using a variety of thresholds,
> OCR each one, and do comparison of the results.
Starting with high > quality, high resolution images should give us better results. > > I haven't tried this. I have observed that most text scanning yields nearly gray results--i.e. the color channels tend to have very close numeric values. Heavily oxidized paper skews the results, but even then we're talking about a few percent difference among channels. Variations across a page (especially if we have gutter noise) could be an order of magnitude larger than the inter-channel differences. Now, a good and fast localized thresholding algorithm would be very handy. > > Anyway, thanks again to Bowerbird's kind words on the My Antonia scan > set. It looks good on the iPhone because I started with high-quality, > high-resolution, high-color depth masters, and was careful in image > normalization. > > Jon Noring > I would add my kudos. From piggy at netronome.com Fri Oct 26 20:30:05 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Fri, 26 Oct 2007 23:30:05 -0400 Subject: [gutvol-d] Copyright on international treaties In-Reply-To: <472104E6.2060808@bohol.ph> References: <4720793E.8020907@tintazul.com.pt> <472104E6.2060808@bohol.ph> Message-ID: <4722B0BD.7090802@netronome.com> Jeroen Hellingman (Mailing List Account) wrote: > Facts cannot be copyrighted. Since any text that has force of law cannot > be paraphrased in any way without loosing its legal standing, laws and > treaties with force of law, at least in the US fall under the > fact-expression merger doctrine, and thus loose copyright restrictions, > certainly when use in a context of law and legal discussion. In other > words, laws are uncopyrightable facts. > > The case law in the US is not so clear: http://www.g4tv.com/techtvvault/features/32238/Who_Owns_the_Law.html > I didn't find this rule in PG books, though, but there is enough > jurisprudence available. Note that this is US only. Some countries have > even more insane copyright laws. 
> > Jeroen Hellingman > From dbalexander2 at comcast.net Fri Oct 26 20:59:31 2007 From: dbalexander2 at comcast.net (David Alexander) Date: Fri, 26 Oct 2007 22:59:31 -0500 Subject: [gutvol-d] Copyright on international treaties In-Reply-To: <4722B0BD.7090802@netronome.com> Message-ID: <004c01c8184d$cb2ef5e0$640fa8c0@youru3ef4ouuir> http://www.ca5.uscourts.gov:8081/isysquery/irl346f/1/doc The full 5th circuit overturned that ruling. Unless the Supreme Court has said otherwise, or the 5th circuit has since changed its mind, city laws are public domain in the 5th circuit. -----Original Message----- From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of La Monte H.P. Yarroll Sent: Friday, October 26, 2007 10:30 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Copyright on international treaties Jeroen Hellingman (Mailing List Account) wrote: > Facts cannot be copyrighted. Since any text that has force of law > cannot be paraphrased in any way without loosing its legal standing, > laws and treaties with force of law, at least in the US fall under the > fact-expression merger doctrine, and thus loose copyright > restrictions, certainly when use in a context of law and legal > discussion. In other words, laws are uncopyrightable facts. > > The case law in the US is not so clear: http://www.g4tv.com/techtvvault/features/32238/Who_Owns_the_Law.html > I didn't find this rule in PG books, though, but there is enough > jurisprudence available. Note that this is US only. Some countries > have even more insane copyright laws. 
> > Jeroen Hellingman > _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d From Bowerbird at aol.com Sat Oct 27 00:35:51 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 27 Oct 2007 03:35:51 EDT Subject: [gutvol-d] Comment on My Antonia scan set Message-ID: piggy said: > 24-bit color seems excessive for text. oh geez, is noring going on again about high-resolution scanning? like i said, he did a nice job on the "my antonia" scan-set, what with the cropping he did. he also regularized all scans to the same size. those are the things that need to be done to make a scan-set nice... but yeah, he scanned them at 600dpi, which just bloated their size... and he saved 'em as .png files, probably because .jpg is "lossy", but that just meant they were bigger than they needed to be, which has a negative impact on both storage requirements _and_ bandwidth... plus, of course, scanning at 600dpi is 4 times slower than 300dpi, unless you're using a camera-based setup like the big boys have, so it's really a waste of most people's time to scan at that resolution. and 24-bit color? time-consuming! it'd take days to scan one book. which -- by the way -- is how long it took noring to scan "my antonia". which is probably why he hasn't scanned more of them, i would think... > I find for most books that 8-bit gray actually produces > a more visually pleasing result at 1/3rd the storage. my opinion is that, if the printer considered a page to be black ink on white paper, then that is exactly how _we_ should consider the page... if it's something different than that, fine. otherwise, it should be that. scan in grayscale if it improves the o.c.r. but when we make the scans available to the masses, make 'em bandwidth-kind black-and-white... 
> Filenames which match page numbers are very handy, but > I've lately found that a hand-built HTML TOC is even more useful > for about the same amount of effort. wrong. but i'm tired of arguing this one. > Combine that with a good page-turning interface and > exactly correct page number filenames are not all that critical. "a good page-turning interface" is just a basic start on a good tool. it's even better, though, when you can just type in any pagenumber, hit , and instantly be at that page; that's how my tools work; that's how they've worked for _years_, and because of my experience, over the course of years, i know that it's stupid to work any other way. and once you've worked for years with tools that have this capability, you'll agree with me, and see the need for using _proper_ filenames... > I highly recommend a simple page-turning interface on top of > your raw pages. The program "curator" produces a simple HTML > hierarchy which increases the utility of a CD or DVD of page images > quite a bit with very little extra overhead. in the very exact same way that a "simple" interface can "increase utility" by "quite a bit", an interface just a little bit less crude also gives a huge jump... but like i said, i'm tired of arguing this. try it sometime, and you'll see... *** aside from these few points where you are off badly, however, the rest of your post was very informative on a number of topics. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071027/448be772/attachment-0001.htm From grythumn at gmail.com Sat Oct 27 05:46:26 2007 From: grythumn at gmail.com (Robert Cicconetti) Date: Sat, 27 Oct 2007 08:46:26 -0400 Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: <4722AE26.5000804@netronome.com> References: <1493909058.20071023234404@noring.name> <4722AE26.5000804@netronome.com> Message-ID: <15cfa2a50710270546r6b04b76fx6f589fde268c3d5c@mail.gmail.com> On 10/26/07, La Monte H.P. Yarroll wrote: > By your reference to "true rotation" I gather that you object to > rotation by successive shears? I have been unable to differentiate > sheared pages from so-called "true rotations" by visual inspection--even It eats line art for breakfast: http://home.comcast.net/~grythumn/abbyy/abbyy_shearing.png R C From piggy at netronome.com Sat Oct 27 05:50:42 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Sat, 27 Oct 2007 08:50:42 -0400 Subject: [gutvol-d] Copyright on international treaties In-Reply-To: <004c01c8184d$cb2ef5e0$640fa8c0@youru3ef4ouuir> References: <004c01c8184d$cb2ef5e0$640fa8c0@youru3ef4ouuir> Message-ID: <47233422.5040508@netronome.com> David Alexander wrote: > http://www.ca5.uscourts.gov:8081/isysquery/irl346f/1/doc > > The full 5th circuit overturned that ruling. Unless the Supreme Court has > said otherwise, or the 5th circuit has since changed its mind, city laws are > public domain in the 5th circuit. > Hooray! I was greatly troubled by the original decision. > -----Original Message----- > From: gutvol-d-bounces at lists.pglaf.org > [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of La Monte H.P. Yarroll > Sent: Friday, October 26, 2007 10:30 PM > To: Project Gutenberg Volunteer Discussion > Subject: Re: [gutvol-d] Copyright on international treaties > > > Jeroen Hellingman (Mailing List Account) wrote: > >> Facts cannot be copyrighted. 
Since any text that has force of law >> cannot be paraphrased in any way without loosing its legal standing, >> laws and treaties with force of law, at least in the US fall under the >> fact-expression merger doctrine, and thus loose copyright >> restrictions, certainly when use in a context of law and legal >> discussion. In other words, laws are uncopyrightable facts. >> >> >> > The case law in the US is not so clear: > > http://www.g4tv.com/techtvvault/features/32238/Who_Owns_the_Law.html > > >> I didn't find this rule in PG books, though, but there is enough >> jurisprudence available. Note that this is US only. Some countries >> have even more insane copyright laws. >> >> Jeroen Hellingman >> >> > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From jon at noring.name Sat Oct 27 09:36:58 2007 From: jon at noring.name (Jon Noring) Date: Sat, 27 Oct 2007 10:36:58 -0600 Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: References: Message-ID: <16264944.20071027103658@noring.name> Bowerbird wrote: > oh geez, is noring going on again about high-resolution scanning? Oh geez, is Bowerbird still pushing ITF (impoverished text format)? O.k., to get to a couple of his points. > but yeah, he scanned them at 600dpi, which just bloated their size... No, I scanned them at 600 dpi, full color, and they are the size they are. I don't consider file size to be that important for *mastering* purposes. > and he saved 'em as .png files, probably because .jpg is "lossy", but > that just meant they were bigger than they needed to be, which has > a negative impact on both storage requirements _and_ bandwidth... 
Yes, I saved them as PNG, since "masters" should not be saved in lossy
formats which add *visible* artifacts to the scans. When music studios
record a performer, do they convert the audio to MP3 right off the bat,
and use the MP3 as their "master"?

(Most professional grade audio recording equipment today, including that
used by amateurs in their basement, samples at 96K and stores the audio in
lossless WAV. For iPod use, this is overkill. So why should they bother to
"master" their music at audiophile quality if most people only listen to
the low quality, audibly lossy formats on their iPods? There are some
similarities between music recording and book scanning.)

> plus, of course, scanning at 600dpi is 4 times slower than 300dpi,
> unless you're using a camera-based setup like the big boys have,
> so it's really a waste of most people's time to scan at that
> resolution

If something is to be done, it should be done right.

> and 24-bit color? time-consuming! it'd take days to scan one book.
> which -- by the way -- is how long it took noring to scan "my antonia".
> which is probably why he hasn't scanned more of them, i would think...

First, there's a lot of potentially useful information in those three
color channels. Plus, when there's image processing to be done, that
information is actually quite important to ensure the most accurate
results, particularly deskewing by image rotation (not shearing, thanks
for the correct word, La Monte.)

Now, it took a few days because I had other things to do in that period
of time. Each page did take a while, however, about one minute from start
to start (I used the wait time for some online work in post-processing
the images.) So, yes, it took time, but it is time well spent for that
particular book, which was a first edition (which in this case is
considered the canonical version) of one of the great works of American
fiction.
And another part of the issue is the speed/quality of the scanner, and
the time to push the data to the computer. Consumer-grade scanning
equipment is improving in speed and quality. Disk space is becoming dirt
cheap, and even DVD-ROM drives and media are getting cheap.

And the reason I haven't done any more books is that my scanner went on
the blink right after that. I'm about ready to buy a new scanner, and I
am looking at the Plustek. I plan to be scanning a few books pretty soon,
some of which I am not at liberty to "chop". As an aside, since I haven't
done scanner product research lately, have other scanner manufacturers
put out models competitive with the Plustek OpticBook?

> my opinion is that, if the printer considered a page to be black ink on
> white paper, then that is exactly how _we_ should consider the page...
> if it's something different than that, fine. otherwise, it should be that.
> scan in grayscale if it improves the o.c.r. but when we make the scans
> available to the masses, make 'em bandwidth-kind black-and-white...

Do note that I downsampled the My Antonia master scans for public
dissemination. Again, you have to differentiate between master scans and
distributable scans. Here's the link to the various scan set options:

http://www.openreader.org/myantonia/index.htm

Notice I have 600 dpi bitonal, and 120 dpi anti-aliased gray scale. I
could distribute pretty much anything I want (<= 600 dpi, <= 24 bit
color), each of which will be optimal because I have color masters from
which to generate them. But if one has low quality masters, repurposing
leads to visibly bad results.

Btw, when one is dealing with books with smaller print, down in the 4-5
point size, one must do 600 dpi to get reasonable results for both OCR
and direct viewing. So even if one finds 300 dpi sufficient for most
books *for their immediate needs*, there will be some books that have to
be scanned at 600 dpi, even if only for OCR purposes.
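To put rough numbers on the small-print argument, here is a minimal sketch of how many pixels tall a line of type ends up at a given scan resolution. It assumes 72 points per inch (older printers' points differ slightly), and the helper name `glyph_height_px` is made up for this illustration. Under that assumption, 5-point type at 600 dpi gets about the same pixel height as 10-point type at 300 dpi.

```python
# Assumption: 72 points per inch (the DTP point); older type used
# slightly different point sizes, so treat the results as approximate.
POINTS_PER_INCH = 72


def glyph_height_px(point_size, dpi):
    """Nominal height, in pixels, of type set at point_size and scanned at dpi."""
    return point_size / POINTS_PER_INCH * dpi


if __name__ == "__main__":
    for size in (5, 10):
        for dpi in (300, 600):
            height = glyph_height_px(size, dpi)
            print(f"{size} pt at {dpi} dpi: about {height:.1f} px tall")
```

This covers only nominal body height; actual OCR results also depend on the typeface, the paper and the OCR package used.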
(Part of the hesitancy of many to do higher quality scans is because they are essentially scanning for a particular process in mind, usually OCR for use in DP or similar project. In essence, to them the scans are "throw away" -- simply an intermediary to some particular end goal. So if they are throw-away, why put in the effort and time to make them archival quality? -- they'll probably end up disappearing in some black hole never to be seen by the public. For example, PG distributes very few scan sets associated with the texts. Yes, I know DP intends to make its scan sets available someday... But I am talking about the present time and the message that sends.) > aside from these few points where you are off badly, however, > the rest of your post was very informative on a number of topics. Let me rewrite what you said, Bowerbird: "aside from a few points where *I believe* you are off badly, however," You forgot to add "I believe". Anyway, all the points I bring up in this message are imho, and other than my comment on ZML, I've avoided focusing on an individual. Jon Noring From jon at noring.name Sat Oct 27 09:50:04 2007 From: jon at noring.name (Jon Noring) Date: Sat, 27 Oct 2007 10:50:04 -0600 Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: <15cfa2a50710270546r6b04b76fx6f589fde268c3d5c@mail.gmail.com> References: <1493909058.20071023234404@noring.name> <4722AE26.5000804@netronome.com> <15cfa2a50710270546r6b04b76fx6f589fde268c3d5c@mail.gmail.com> Message-ID: <1083089549.20071027105004@noring.name> Robert wrote: > La Monte H.P. Yarroll wrote: >> By your reference to "true rotation" I gather that you object to >> rotation by successive shears? 
I have been unable to differentiate >> sheared pages from so-called "true rotations" by visual inspection--even > It eats line art for breakfast: > > http://home.comcast.net/~grythumn/abbyy/abbyy_shearing.png This was discussed in the "Distributed Scanners" YahooGroup: http://groups.yahoo.com/group/distscan/ And of course, the shearing method introduces its own distortion to the characters, which gets worse as the skewing angle increases (both narrowing and angling of the characters backwards or forwards -- and not to mention line art as Robert brought up.) In the "My Antonia" scanning project, a number of pages had skewing as bad as 2-3 degrees, even though I was careful to align the pages (after all, I had the time -- imagine those who *hurry* their scanning.) Part of the problem is that the print block on pages itself can be skewed a little during printing, and the binding can also introduce some skewing. And the effects of shearing on 2-3 degree skews is visibly noticeable in the characters, and I can't help but think it might even affect OCR a little bit. With true rotation, the characters do not get distorted. And in order to get good results with true rotation, it is best to apply it to high resolution, high color depth images since one has a lot more "information" the rotation algorithm can use to good effect. Jon Noring From Bowerbird at aol.com Sat Oct 27 11:47:19 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 27 Oct 2007 14:47:19 EDT Subject: [gutvol-d] Comment on My Antonia scan set Message-ID: geez, i think i figured out why my spam folder is always _so_ overflowing; is jon sending the same message to the listserve _and_ to me personally? if so, jon, don't bother, because they both end up in the same spam folder. and no, i am _not_ gonna get on your merry-go-round for the exact same "discussion" that we've already gone through 89 times (at least) in the past. if you wanna be totally anal-compulsive and scan at 87000dpi, be my guest. 
if you wanna tell everyone else they should be as anal-compulsive as you are,
be my guest; if they're stupid enough to listen, they will deserve what they get.

but for anyone who wants to listen to common-sense, then listen up.

if you're generating page-images with a camera, where it takes the same time
to shoot higher-resolution as lower-resolution, fine, shoot higher-resolution.

but when you're dragging a scan-head over a page, and it takes 4 times as long
to do 600dpi as 300dpi, and another 4 times as long to jump it up to 1200dpi,
and another 4 times as long to go 2400dpi, high-resolution is a waste of time...

we've got millions of books needing to be scanned, and 300dpi is good enough.
the only exception is the rare and fragile book that can only be scanned _once_.

and, as long as you're not saving and resaving continually, .jpg will be just fine...
want to see for yourself? then take a look and compare a .jpg with a .png here:
> http://z-m-l.com/misc/myantf009.html

-bowerbird

p.s. now, if by some _miracle_, noring has come up with a new point, please,
won't someone share it? because i'm not gonna read his same old shit again.

p.p.s. the funniest thing about jon's "my antonia" project is the _cover_ scan.
it's a 50-megabyte file. that's right, 50 megabytes! for one scan -- the cover!
if you want a summary of jon noring's philosophy, that's the best one there is.
boy, did i ever feel like a bloomin' idiot after i downloaded _that_...

From jon at noring.name Sat Oct 27 14:10:00 2007
From: jon at noring.name (Jon Noring)
Date: Sat, 27 Oct 2007 15:10:00 -0600
Subject: [gutvol-d] Comment on My Antonia scan set
In-Reply-To:
References:
Message-ID: <1581359295.20071027151000@noring.name>

Bowerbird wrote:

> geez, i think i figured out why my spam folder is always _so_ overflowing;
> is jon sending the same message to the listserve _and_ to me personally?

Nope, just to gutvol-d. So your comment on "spam folder" overflowing is
simply hyperbole ("hype"). Oops, maybe I should not use the word "hype"?

> and no, i am _not_ gonna get on your merry-go-round for the exact same
> "discussion" that we've already gone through 89 times (at least) in the past.

Just like the ZML merry-go-round?

It is clear that nobody on gutvol-d or at DP is interested in your ZML.
If anyone has been the number one *proponent* of it (not for mastering,
but for a normalized end-user plain text rendition), it is ME. It must
irk you to no end that I am the number one supporter of ZML (again as a
distribution format, to make that clear.)

I even asked the DP folk to provide a list of a few representative texts
for you to master in ZML to show them what you can do. So you could strut
your stuff to show ZML could also be a viable mastering format. They
haven't even bothered to do that (well, actually a couple were mentioned
-- have you converted those yet?) I wonder why the PG/DP communities have
not worked with you on ZML?

And you can demonstrate all you want about converting ZML to crappy HTML
(which you've intentionally made crappy for reasons that are totally
irrational and violate several of *your own principles*) and probably
crappy PDF, too... Making pudding, and making pudding that is the best
tasting in the world, are not one and the same. You certainly are making
pudding. But how will the world view its taste? Will it be edible?
> if you wanna be totally anal-compulsive and scan at 87000dpi, be my guest. > if you wanna tell everyone else they should be as anal-compulsive as you are, > be my guest; if they're stupid enough to listen, they will deserve what they > get. If I were truly anal-compulsive, I'd be advocating reproduction quality, which is a step above archival quality. Jim Weiler, posting to DistScan, proposed 5 levels of quality that seems to have been embraced by pretty much everyone on DistScan: http://groups.yahoo.com/group/distscan/message/6 They are, from highest quality to lowest quality: 5) Reproduction quality. (extreme requirements) 4) Archival quality. 3) Recognition quality. 2) Reference quality. 1) Poor quality. (I term it "unusable for any purpose.") A lot of the stuff OCA lately produces seems to be a pretty solid Level 3 (they might claim it is archival quality -- some of their scans don't reach that level, imho.) Google's quality is quite variable but seems to also be getting better. (And again, for OCA at least, they may have internal "masters" of higher quality, but of that I'm not sure of.) > but for anyone who wants to listen to common-sense, then listen up. 600 dpi, 24-bit for text is reasonable and common sense for *mastering* page scans usable for *pretty much all uses except for facsimile reproduction.* And I've noted *reasons* why which you don't address because you can't (you call it a merry-go-round, I call it rational debate.) Certainly, your concern is quantity, to hurry up and scan the books. Fine, I share the same concern. But the key is that OCA is doing this at a rate that outstrips all of the PG/DP folk. (Not to mention Google.) In that case, unless one finds books that OCA will not scan for a number of years (how does one really know?), then maybe it is better that the DP/PG folk concentrate on the books which are available through OCA and Google? 
That is, if one were to apply "common-sense", then let the "pros" do the scanning using their camera scanners so DP's work can be focused on the OCR/proofing side. (Yes, there will be arguments now as to why the PG/DP folk still need to scan, but I am citing "common-sense" which is not necessarily the best advice. That is, using the phrase "common-sense" is oftentimes simply an empty phrase used to stifle rational debate.) Btw, it is interesting that I don't see the PG/DP folk promoting an archive of book scan sets, or stressing the need to submit a copy to the Internet Archive or something. I believe part of the reason is the prevailing viewpoint that page scans are simply a "throw away" intermediary to structured and proofed digital text, which is the real product. This "meme" persists today even though some are now beginning to see value in distributing the scan sets online. While the world applauds the production and archiving of book scans, we don't see much interest in that here in the PG world, except to use the scans that OCA/Google produces. Yet maybe over 10,000 book scan sets have been produced over the years by the PG/DP folk, yet very few of them are available to the world. (To be fair, DP plans to make its scan sets available, but has not done so because of some programming needs. Maybe Juliet can give us an update on the status of this...) So, yes, my views on scan quality are in the small minority here in the PG/DP communities. Does holding such an opinion make it automatically wrong? No. It depends upon what one considers the purpose/requirements of the scan sets. And certainly I can get a little extreme in my views that others should do archival quality scanning. I do not mean to give offense, but I would like people when they scan that book to ask themselves if the scan set they produce could have value in and of itself, if they should make their scans available to the world, and would they be proud of the quality of their work? 
So my purpose here is to provide a perspective, and various reasons, why they should consider taking that extra time to produce an archival quality scan set. If all they see is that they submit their scan set to DP and it then disappears from sight (like the scan set of the Kama Sutra I submitted to DP), they certainly would NOT be interested in making that extra effort. Maybe that's all DP needs to do: ask all those who produce scan sets to not only submit them into their "system" but to also submit them to the Internet Archive. Brewster has a standing invitation to receive such scan sets, and it would not take long to write up exact step-by-step instructions as to how PG/DP volunteers can submit their book scan sets to the Internet Archive: just gather up the minimum metadata they require, how to name the file(s), where to upload, etc. Since it is voluntary, some won't, but I think many will gladly do so since I think most people will see the need to preserve scan sets in and of themselves. In turn, the PG archive of DP texts will now be able to provide a link to the associated scan sets at IA, and if PG wants, can even download that and offer the same scan set from its own server. > if you're generating page-images with a camera, where it takes the same time > to shoot higher-resolution as lower-resolution, fine, shoot higher-resolution. > > but when you're dragging a scan-head over a page, and it takes 4 times as long > to do 600dpi as 300dpi, and another 4 times as long to jump it up to 1200dpi, > and another 4 times as long to go 2400dpi, high-resolution is a waste of time... By your argument, let's go to 150 dpi. LOL. The thing you are ignoring in all this discussion is the purpose/reason for scanning. If the sole purpose of the scans is simply to be fodder for OCR, then that establishes the scanning requirements. And note that flatbed scanners continue to improve in both mechanical and data transfer speeds. 
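As a back-of-the-envelope check on the resolution and bit-depth trade-off being argued here, the following sketch computes the uncompressed size of one page scan. The 5 x 7.5 inch page and the function name `raw_scan_bytes` are assumptions chosen purely for illustration, and real files will be smaller after PNG or JPEG compression. The point is only that doubling the dpi quadruples the pixel count, and that 24-bit color carries three times the raw data of 8-bit gray; it says nothing certain about scan time, which depends on the scanner.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed size in bytes of one scanned page.

    Ignores file headers and row padding, so treat it as an estimate.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # Hypothetical page size, not a measurement of any particular book.
    width_in, height_in = 5, 7.5
    for dpi in (300, 600):
        for label, bits in (("bitonal", 1), ("8-bit gray", 8), ("24-bit color", 24)):
            size_mb = raw_scan_bytes(width_in, height_in, dpi, bits) / 1_000_000
            print(f"{dpi} dpi, {label}: about {size_mb:.1f} MB uncompressed")
```

Which of these sizes is acceptable depends, as argued above, on whether the file is meant to be a master or a distribution copy.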
> we've got millions of books needing to be scanned, and 300dpi is good enough.
> the only exception is the rare and fragile book that can only be scanned _once_.

Even if one were to start a "Distributed Scanners" project which is all
volunteer driven and lets everyone decide the quality level they want to
scan, it will still not keep up with OCA and Google. (Now I may be wrong,
but we've had a number of years for such a project to arise, and we don't
see it yet. Are there some dynamics that keep such a project from getting
started and reaching some sort of critical size to challenge OCA in daily
output?)

> and, as long as you're not saving and resaving continually, .jpg will be just fine...
> want to see for yourself? then take a look and compare a .jpg with a .png here:
>
> http://z-m-l.com/misc/myantf009.html

And you are mixing up "mastering" with "distribution". I'm not advocating
distribution "to the poverty stricken in the third world" using the high
rez masters.

In fact, for bitonal scans of text-only pages, JPG is *also* wasteful --
there are much better "lossless" and "lossy" algorithms than PNG and JPG,
respectively. DjVu comes to mind. (Yes, there are some issues regarding
proprietary formats, browser viewability, and such... but if you are
talking about "waste" then one has to put it into proper perspective...)

(And also note that for bitonal images, the difference in size between
PNG and JPG is rather small. In your example we are talking about 73.8k
for lossless PNG, which these days is already reasonable, and 47.1k for
lossy JPG, which is not that much smaller. Not a big difference. Hmm, I
wonder how the two images would OCR using different OCR packages? Anyone
wanting to take Bowerbird's examples and run OCR on them?)

By your argument, you'd say amateur musicians should convert all their
"recorded masters" (at 96K, lossless WAV) to MP3 since that's all the
world is interested in these days?
No, you'd say they should take their "master" and generate an MP3 for
"distribution" and keep the master lying around. Now with storage media
getting so dirt cheap, and (hopefully) the Internet getting faster and
faster, they can now even consider distributing the "master" as-is so
everyone can experience the highest fidelity in the original recording,
if they have the equipment to bring out the full fidelity.

In fact, I follow online music distribution (most of which is
"unauthorized"), since I am interested in this area. I'm noticing lately
a huge rise in the distribution of lossless FLAC files rather than lossy
MP3. More and more people are ripping CD audio tracks and converting them
to lossless FLAC for distribution -- FLAC exactly preserves the original
digital audio bit-for-bit, while MP3 actually alters the digital audio
and introduces artifacts which, even at 192 kbit, I can hear on my audio
system in blind tests I've done -- and looking at the frequency/time
domain maps one really sees the major distortion MP3 does to the wave.

Imagine a future 100 years from now when the studio masters are gone and
all we have are MP3's and similar lossy formats floating around which
have been discombobulated such that even remastering them is more
difficult because the waveform has been so fundamentally altered. There
is hope the MP3 and other lossy formats will be pushed out of the picture
to be replaced by lossless audio formats. I hope so...

(Btw, typical lossless compression of digital stereo audio averages
around 50%, very similar to lossless compression of continuous tone, full
color digital images -- also at about 50%.)

> p.s. now, if by some _miracle_, noring has come up with a new point, please,
> won't someone share it? because i'm not gonna read his same old shit again.

It would not surprise me if over time new points come up as to why we
should have "mastered" book page scans at archival quality.
The thing is that we *don't know the future.* We already discovered how
decisions made in the mid 90's are now rearing their ugly heads today in
the PG world. Those decisions were made for expediency, granted, given
the constraints of the time, but in nearly all cases the decisions were
not based on a rational requirements analysis, nor any consideration of
future needs and opportunities -- the decisions I saw were pretty ad hoc
and did not consider intermediary alternatives that would have worked
then (e.g., preservation of accented characters found in many English
texts.)

> p.p.s. the funniest thing about jon's "my antonia" project is the _cover_ scan.
> it's a 50-megabyte file. that's right, 50 megabytes! for one scan -- the cover!
> if you want a summary of jon noring's philosophy, that's the best one there is.
> boy, did i ever feel like a bloomin' idiot after i downloaded _that_...

Yes, so what's your point? That cover image is the *master*. I can easily
generate lower resolution and JPEG versions if I want to, and, come to
think of it, should provide a link to a browser viewable image, probably
in JPEG.

The cover to My Antonia is not that interesting anyway, but it is the
book cover to the First Edition (considered the canonical PD version) of
this great classic of American fiction. Subsequent editions (which are
not PD) are considered inferior since some stuff in the introduction was
removed.

Jon Noring

From Bowerbird at aol.com Sat Oct 27 16:33:19 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 27 Oct 2007 19:33:19 EDT
Subject: [gutvol-d] Comment on My Antonia scan set
Message-ID:

i'm not biting at your bait, jon. i had enough of your merry-go-round years ago.

like i said, you did a good job on my antonia, but you've scanned _2_ darn books!
even the george w. bush library has 5 books. (coloring books, but _5_ of them.)
your own (lack of) behavior points out that your "ideals" are _not_ cost-effective.
i'd much rather have the thousands of crappy scan-sets from d.p. than your _2_: > http://www.pgdp.org/ols also excellent are the 400 scan-sets produced by nicholas hodson, which people can locate quite easily by searching the o.c.a. text-file archives for "athelstane"... high-enough resolution, carefully cropped and size-standardized, great work... good enough to read, if we must. but most importantly, good enough for o.c.r., so as to make small-footprint mini-storage narrow-bandwidth light-markup text. and, while i'm on the topic, in this regard, the "digital reprint" that jose menendez made of "my antonia", at 2.2 megs, is 50 times better than your 30-meg scan-set: > http://www.ibiblio.org/ebooks/Cather/ faster, cheaper, _and_ higher-quality. that combo breaks several laws of physics. and that, in the long run, is why high-resolution scan-sets are a waste of time: because once we verify we digitized 'em correctly, we will have little use for 'em. the digital reprints we've created will be _better_ for every conceivable purpose. as michael hart has been telling us all along, "a picture of a book is not a book"... -bowerbird From jon at noring.name Sat Oct 27 19:02:08 2007 From: jon at noring.name (Jon Noring) Date: Sat, 27 Oct 2007 20:02:08 -0600 Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: References: Message-ID: <686943385.20071027200208@noring.name> Bowerbird wrote: > like i said, you did a good job on my antonia, but you've scanned _2_ darn books! > even the george w. bush library has 5 books. (coloring books, but _5_ of them.) > your own (lack of) behavior points out that your "ideals" are _not_ cost-effective. So how many books have you scanned and made available online?
Your comment again reminds me of a schoolyard taunt as to whose is bigger. I won't even dignify your comment with any kind of explanation. > also excellent are the 400 scan-sets produced by nicholas hodson, which people > can locate quite easily by searching the o.c.a. text-file archives for "athelstane"... > high-enough resolution, carefully cropped and size-standardized, great work... > good enough to read, if we must. but most importantly, good enough for o.c.r., > so as to make small-footprint mini-storage narrow-bandwidth light-markup text. Yes, Nicholas does laudable work. He participated in a lot of the discussion on "Distscan", and he and I had a lot of personal email exchanges. Even though he and I disagree on several things, I have the greatest respect for him and his work. And I'm glad his work is being contributed to OCA. He is very passionate about what he does and what he believes, and I always admire passionate people who are also polite towards everyone, as he is. So, don't bring up individuals and try to say that my comments about scan quality are a slap in the face of these individuals. That's the same tired psychological warfare you've used in the past. You don't really care for these people, in my estimation at least, because you are *clearly* misusing them as a "ploy" to avoid rational debate. It is, to be blunt, an ad hominem type of attack. > and, while i'm on the topic, in this regard, the "digital reprint" that jose menendez > made of "my antonia", at 2.2 megs, is 50 times better than your 30-meg scan-set: > http://www.ibiblio.org/ebooks/Cather/ > faster, cheaper, _and_ higher-quality. that combo breaks several laws of physics. Since some may misread your comment (which you make clearer below), let me note that Jose's 2.2 meg PDF is a modern-typeset "reproduction" of the original, while my 30-meg scan set is simply a set of page scan images of the original sampled down to whatever they are.
Without the following clarification, what you said would be an apples/oranges comparison. > and that, in the long run, is why high-resolution scan-sets are a waste of time: > because once we verify we digitized 'em correctly, we will have little use for 'em. > the digital reprints we've created will be _better_ for every conceivable purpose. > as michael hart has been telling us all along, "a picture of a book is not a book"... Of course, structured and proofed digital text is where it's at, but scanned images play a role. There are some very smart people behind what OCA is doing (especially Brewster), and they feel a need to achieve almost archival quality in what they do. Many of their book scans actually exceed 600 dpi (typically they are 400-500 dpi -- using a fixed camera the resolution actually varies) and are at 24-bit color depth. And note that they actually distribute these as JPEG2000 images at this high image quality -- I just grabbed a whole set of a particular book as JP2 images -- 550 megs -- took one hour while I was doing something else.) So, in a sense what you are saying is that the quality they produce *and* distribute at is pointless. So if I were to decide on who to listen to, who would I tend to ascribe greater authority? Maybe we should be asking why OCA itself has chosen this quality? And don't bring up that tired "well, since they use a camera they may as well capture it all" argument again -- it is fallacious argument (for reasons I won't delve into here), plus it doesn't cover the *distribution* of the hi-rez scan sets which are beyond the needs of just OCR (and in your estimation are overkill for all purposes.) (In a shameless name drop, I have had the fortune of personally talking with Brewster in the past about scan quality, and he is clearly passionate about this. His comments and arguments to me actually form a part of what I am advocating here. 
If he could master scan all books at 600 dpi/24-bit resolution and save in lossless format, I believe he would. What mostly keeps him from the lossless is simply disk space in doing massive quantities of books at a site -- but he won't compromise any further than that. His choice for JPEG2000 rather than JPG revolves, I would guess, around it having better image quality for the same compression, and does not introduce the same kind of chunky artifacts that JPG does. See the Wikipedia article on JPEG 2000. As a result, I plan in the future of still mastering at 600/24/lossless (and pages with graphics at 1200 dpi), but will create JPEG2000 of the scan sets for online distribution in zip archives -- for most books the zip archives will fit on a CD-ROM, and will be a slightly lower quality "master" backup. This does not preclude *also* distributing derivative scan sets at lower resolution and color depth, including derivatives optimized for OCR.) And about Michael's statement, a picture of a book (a set of scanned pages) is definitely a *book* since there's really no difference (other than maybe quality) between viewing the original page and a digital facsimile of it (and such a digital facsimile can itself be printed out onto paper.) Now, we know what Michael really means by his statement (and I agree structured and proofed digital text is far superior), but you are apparently misusing his statement to further your argument. Maybe Michael will explain what he means by his statement in regards to this discussion? ***** Finally, to interject a personal and, yes, emotional note directed to Bowerbird: Feel very fortunate that you have found a home in gutvol-d where you can be "yourself." Be thankful that Greg and Michael cut you a whole lot of slack. On nearly all other groups I've participated in you would have been thrown out a long time ago for what I term fostering a hostile discussion environment. 
For example, calling whatever someone writes a "merry-go-round" is exactly an ad-hominem attack on their person. It is, in my opinion, a form of hate speech and has no place in rational discourse. Jon Noring From jon at noring.name Sat Oct 27 20:20:16 2007 From: jon at noring.name (Jon Noring) Date: Sat, 27 Oct 2007 21:20:16 -0600 Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: <686943385.20071027200208@noring.name> References: <686943385.20071027200208@noring.name> Message-ID: <1789330137.20071027212016@noring.name> I previously wrote: > Finally, to interject a personal and, yes, emotional note directed > to Bowerbird: Feel very fortunate that you have found a home in > gutvol-d where you can be "yourself." Be thankful that Greg and > Michael cut you a whole lot of slack. On nearly all other groups > I've participated in you would have been thrown out a long time ago > for what I term fostering a hostile discussion environment. For > example, calling whatever someone writes a "merry-go-round" is > exactly an ad-hominem attack on their person. It is, in my opinion, > a form of hate speech and has no place in rational discourse. And to add an addendum. Clearly, everything above is "in my opinion" based on observation. And since I run quite a few mail-based forums myself (all YahooGroups), including The eBook Community, I am supportive of those who administer any forum, even when I may disagree with their policies. Administering public groups like this is a thankless job since one can never please all the people all the time -- and sometimes we group administrators/ moderators have to make tough, often-times no-win decisions. In effect, Greg and Michael are the ones who have defacto control of this group simply by running the software that administers it. Thus, they ultimately decide who gets to post and what they allow to be posted here. 
They may even deny they have this power, but they do have this power by default -- if they voluntarily give up this power, they do so because they have the power. And I may say that such-and-such is "hate speech", or so-and-so is creating a "hostile discussion environment." But it is not what I say, but what they say that matters. If many of us don't like how this group is run to the point where we get nothing out of it, we simply vote with our feet and leave, maybe even starting a new discussion group if we find value in the discussion but want a different set of group policies. So with that said, I still believe what I wrote previously and reproduced above. But ultimately all that matters is what Greg and Michael think. If they decide that Bowerbird can pretty much write and say what he wants to gutvol-*, then that's the way it is. Jon Noring From Bowerbird at aol.com Sun Oct 28 05:47:31 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 28 Oct 2007 08:47:31 EDT Subject: [gutvol-d] Comment on My Antonia scan set Message-ID: more posts with this header in my spam folder. i don't know if noring is attempting to bait me. but i do know that i'm not biting. end of story. -bowerbird From jon at noring.name Sun Oct 28 06:42:31 2007 From: jon at noring.name (Jon Noring) Date: Sun, 28 Oct 2007 07:42:31 -0600 Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: References: Message-ID: <419590423.20071028074231@noring.name> Bowerbird wrote: > more posts with this header in my spam folder. > > i don't know if noring is attempting to bait me. > > but i do know that i'm not biting. end of story. Quote the fisherman. Yes, end of story. Now, back to real discussion.
Jon From Bowerbird at aol.com Sun Oct 28 09:14:47 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 28 Oct 2007 12:14:47 EDT Subject: [gutvol-d] z.m.l. and youtube Message-ID: i'm now supporting embedding of youtube videos into the .html versions that i auto-convert from a .zml file... i'm not sure if i can plug them into the .pdf versions or handle them in my offline viewer, but time will tell me... i'm not sure if this is a bad thing or a good thing... :+) i guess it is what it is... -bowerbird From hart at pglaf.org Sun Oct 28 10:20:57 2007 From: hart at pglaf.org (Michael Hart) Date: Sun, 28 Oct 2007 10:20:57 -0700 (PDT) Subject: [gutvol-d] Comment on My Antonia scan set In-Reply-To: <419590423.20071028074231@noring.name> References: <419590423.20071028074231@noring.name> Message-ID: Jon Noring makes us wonder about reviving moderation, just for him. . . . His folder is full of this sort of thing. . . . Michael S. Hart Founder Project Gutenberg On Sun, 28 Oct 2007, Jon Noring wrote: > Bowerbird wrote: > >> more posts with this header in my spam folder. >> >> i don't know if noring is attempting to bait me. >> >> but i do know that i'm not biting. end of story. > > Quote the fisherman. > > > Yes, end of story. > > Now, back to real discussion.
> > > Jon > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Sun Oct 28 10:24:25 2007 From: hart at pglaf.org (Michael Hart) Date: Sun, 28 Oct 2007 10:24:25 -0700 (PDT) Subject: [gutvol-d] !@!Re: Comment on My Antonia scan set In-Reply-To: <1789330137.20071027212016@noring.name> References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> Message-ID: On Sat, 27 Oct 2007, Jon Noring wrote: > I previously wrote: > > >> Finally, to interject a personal and, yes, emotional note directed >> to Bowerbird: Feel very fortunate that you have found a home in >> gutvol-d where you can be "yourself." Be thankful that Greg and >> Michael cut you a whole lot of slack. On nearly all other groups >> I've participated in you would have been thrown out a long time ago >> for what I term fostering a hostile discussion environment. For >> example, calling whatever someone writes a "merry-go-round" is >> exactly an ad-hominem attack on their person. It is, in my opinion, >> a form of hate speech and has no place in rational discourse. Jon's Noring's remarks are as close to "a form of hate speech" as anyone's I have read here. Michael S. Hart Founder Project Gutenberg > > And to add an addendum. > > Clearly, everything above is "in my opinion" based on observation. And > since I run quite a few mail-based forums myself (all YahooGroups), > including The eBook Community, I am supportive of those who administer > any forum, even when I may disagree with their policies. Administering > public groups like this is a thankless job since one can never please > all the people all the time -- and sometimes we group administrators/ > moderators have to make tough, often-times no-win decisions. > > In effect, Greg and Michael are the ones who have defacto control of > this group simply by running the software that administers it. 
Thus, > they ultimately decide who gets to post and what they allow to be > posted here. They may even deny they have this power, but they do have > this power by default -- if they voluntarily give up this power, they > do so because they have the power. > > And I may say that such-and-such is "hate speech", or so-and-so is > creating a "hostile discussion environment." But it is not what I say, > but what they say that matters. If many of us don't like how this > group is run to the point where we get nothing out of it, we simply > vote with our feet and leave, maybe even starting a new discussion > group if we find value in the discussion but want a different set of > group policies. > > So with that said, I still believe what I wrote previously and > reproduced above. But ultimately all that matters is what Greg and > Michael think. If they decide that Bowerbird can pretty much write and > say what he wants to gutvol-*, then that's the way it is. > > Jon Noring From hart at pglaf.org Sun Oct 28 10:38:11 2007 From: hart at pglaf.org (Michael Hart) Date: Sun, 28 Oct 2007 10:38:11 -0700 (PDT) Subject: [gutvol-d] !@! Re: Comment on My Antonia scan set In-Reply-To: <686943385.20071027200208@noring.name> References: <686943385.20071027200208@noring.name> Message-ID: I have gotten to the point where I would prefer Jon Noring NOT quote me, or pretend to quote me, as correctly below, and I am going to ask him now, here, in public, never to quote me again, on or off this list, as is my right by copyright. I would prefer Mr. Noring actually never mention me, directly, or indirectly, or forward my messages without permission. Sorry Jon, but you've just overdone it this week. Thanks!!! Michael S.
Hart Founder Project Gutenberg On Sat, 27 Oct 2007, Jon Noring wrote: > (In a shameless name drop, I have had the fortune of personally > talking with Brewster in the past about scan quality, and he is > clearly passionate about this. His comments and arguments to me > actually form a part of what I am advocating here. If he could > master scan all books at 600 dpi/24-bit resolution and save in > lossless format, I believe he would. What mostly keeps him from > the lossless is simply disk space in doing massive quantities > of books at a site -- but he won't compromise any further than > that. His choice for JPEG2000 rather than JPG revolves, I would > guess, around it having better image quality for the same > compression, and does not introduce the same kind of chunky > artifacts that JPG does. See the Wikipedia article on JPEG > 2000. As a result, I plan in the future of still mastering at > 600/24/lossless (and pages with graphics at 1200 dpi), but will > create JPEG2000 of the scan sets for online distribution in zip > archives -- for most books the zip archives will fit on a > CD-ROM, and will be a slightly lower quality "master" backup. > This does not preclude *also* distributing derivative scan sets > at lower resolution and color depth, including derivatives > optimized for OCR.) > > And about Michael's statement, a picture of a book (a set of > scanned pages) is definitely a *book* since there's really no > difference (other than maybe quality) between viewing the > original page and a digital facsimile of it (and such a digital > facsimile can itself be printed out onto paper.) Now, we know > what Michael really means by his statement (and I agree > structured and proofed digital text is far superior), but you > are apparently misusing his statement to further your argument. > Maybe Michael will explain what he means by his statement in > regards to this discussion? 
> > ***** > > Finally, to interject a personal and, yes, emotional note > directed to Bowerbird: Feel very fortunate that you have found > a home in gutvol-d where you can be "yourself." Be thankful > that Greg and Michael cut you a whole lot of slack. On nearly > all other groups I've participated in you would have been > thrown out a long time ago for what I term fostering a hostile > discussion environment. For example, calling whatever someone > writes a "merry-go-round" is exactly an ad-hominem attack on > their person. It is, in my opinion, a form of hate speech and > has no place in rational discourse. > > Jon Noring From hart at pglaf.org Sun Oct 28 11:11:55 2007 From: hart at pglaf.org (Michael Hart) Date: Sun, 28 Oct 2007 11:11:55 -0700 (PDT) Subject: [gutvol-d] !@! RESEND Re: Comment on My Antonia scan set In-Reply-To: <1789330137.20071027212016@noring.name> References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> Message-ID: My apologies, my previous attempt at replying didn't work out. Trying again below, perhaps with more patience. Thanks!!! Michael S. Hart Founder Project Gutenberg On Sat, 27 Oct 2007, Jon Noring wrote: > I previously wrote: > > >> Finally, to interject a personal and, yes, emotional note >> directed to Bowerbird: Feel very fortunate that you have found >> a home in gutvol-d where you can be "yourself." Be thankful >> that Greg and Michael cut you a whole lot of slack. On nearly >> all other groups I've participated in you would have been >> thrown out a long time ago for what I term fostering a hostile >> discussion environment. For example, calling whatever someone >> writes a "merry-go-round" is exactly an ad-hominem attack on >> their person. It is, in my opinion, a form of hate speech and >> has no place in rational discourse. As we have commented before, Mr. Noring is certainly one of the top people whom "Greg and Michael cut you a whole lot of slack." Mr. 
Noring seems as guilty of writing merry-go-rounds and/or hate speech as much as anyone. If anyone is going to be left off this list, you can be sure Mr. Noring will be among them. His baiting of Mr. Bowerbird, and of myself, is the mere work of an apprentice baiter. . .not even up to journeyman level. Yet I am sure Mr. Noring would elevate his words beyond that. Enough said, obviously more than enough words to the wise and thus not expected to reach Mr. Noring or his loyal opposition as it were. > And to add an addendum. > > Clearly, everything above is "in my opinion" based on > observation. And since I run quite a few mail-based forums > myself (all YahooGroups), including The eBook Community, I am > supportive of those who administer any forum, even when I may > disagree with their policies. Administering public groups like > this is a thankless job since one can never please all the > people all the time -- and sometimes we group administrators/ > moderators have to make tough, often-times no-win decisions. This is one reason why we don't do moderation here. Another is the obvious misuse of political powers-- as so often requested by Mr. Noring--both in front, where people can see it, and behind the scenes from perspectives hidden from the normal list members. I am sure all the list personnel know Mr. Noring is a person with an agenda who wishes other agendas to be silenced while his own goes forwards. > In effect, Greg and Michael are the ones who have defacto > control of this group simply by running the software that > administers it. Thus, they ultimately decide who gets to post > and what they allow to be posted here. They may even deny they > have this power, but they do have this power by default -- if > the voluntarily give up this power, they do so because they > have the power. It is only those who desire such power who consider this sort of thing. Gladly we have plenty of support to keep moderators out of the fray, unlike the other lists Mr. 
Noring, and others, have lobbied for such power. Even before these recent outbursts we have noticed, and commented on, Mr. Noring's appalling behavior. > And I may say that such-and-such is "hate speech", or so-and-so > is creating a "hostile discussion environment." But it is not > what I say, but what they say that matters. If many of us don't > like how this group is run to the point where we get nothing > out of it, we simply vote with our feet and leave, maybe even > starting a new discussion group if we find value in the > discussion but want a different set of group policies. Mr. Noring is striking out at himself here, as much as anyone. As most of us here are aware, and also other servers, Noring's speech is as much "hate speech" as anyone's, largely ignored-- but saved in the archives for future reference. Anyone should be able to trace Mr. Noring's comments for themselves, and see the trends, over the years. A "hostile discussion environment" usually follows Mr. Noring's agenda, rather than the opposite. > So with that said, I still believe what I wrote previously and > reproduced above. But ultimately all that matters is what Greg > and Michael think. If they decide that Bowerbird can pretty > much write and say what he wants to gutvol-*, then that's the > way it is. Mr. Noring appears to be saying, here and elsewhere, that his right to free speech trumps everyone else's rights. It is all too obvious that Mr. Noring and a few others baited, and continue to bait, Mr. Bowerbird and others, to create an environment in which he could claim hostility. This is not working here, and it should not work elsewhere. > Jon Noring Thanks!!! Michael S. Hart Founder Project Gutenberg From marcello at perathoner.de Sun Oct 28 11:47:20 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun, 28 Oct 2007 19:47:20 +0100 Subject: [gutvol-d] !@!
RESEND Re: Comment on My Antonia scan set In-Reply-To: References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> Message-ID: <4724D938.2090009@perathoner.de> Michael Hart wrote: > It is all to obvious that Mr. Noring and a few others baited, > and continute to bait, Mr. Bowerbird and others, to create an > environment in which he could claim hostility. There are three persons on this list you should not take seriously. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Sun Oct 28 13:01:40 2007 From: jon at noring.name (Jon Noring) Date: Sun, 28 Oct 2007 14:01:40 -0600 Subject: [gutvol-d] Amazing! In-Reply-To: References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> Message-ID: <1253347043.20071028140140@noring.name> In reply to Michael's four detailed messages this morning where I am the topic of discussion: Michael and I certainly have quite different perspectives regarding general world-view (politics, economics, etc.) as well as specific issues in the realm of digitizing the Public Domain. For example, we recently contributed differing viewpoints on an ongoing discussion at Book People regarding book pricing ("where does the money go"). Some gutvol-ers here who do not subscribe to the Book People mailing list may be interested in that discussion -- refer to the BP archive at: http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2007 and look for the various messages which include "Book Price Inflation" in the subject header field. Quite a few people have contributed to that discussion. Despite our very major differences, I greatly respect Michael for what he has observably accomplished over the years. He will be written up in the history books for his true accomplishments and for various predictions of his which eventually come true. I've said this many times because it is a fact -- not because I'm trying to curry favor. 
So of course being specially singled out by Michael, who along with Greg runs this group, for a couple messages I posted last night, whether deserved or not as others here will decide for themselves, definitely did not make my day!... ...I spent some time writing up something to go in this spot, to continue with the above train of thought, and of course to provide my own perspective. But it would have ended up restating what I've said before, and made this message overly long. Those here who are even following this discussion have pretty much already made up their minds on a number of issues and people. So for each reader what I write would either be a futile exercise at convincing, or a preaching to the choir. And of course, a diversion from rational, respectful, and cordial discussion on topics of interest to the PG community. Jon Noring From hart at pglaf.org Sun Oct 28 13:19:36 2007 From: hart at pglaf.org (Michael Hart) Date: Sun, 28 Oct 2007 13:19:36 -0700 (PDT) Subject: [gutvol-d] Amazing! In-Reply-To: <1253347043.20071028140140@noring.name> References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> <1253347043.20071028140140@noring.name> Message-ID: Of course, Mr. Noring is leaving out that many of my replies to his ranting and raving were not passed by the moderator-- and I can't understand why Mr. Noring's were passed. Hence, I presume that is why he recommends your reading there because his words were passed and mine were not. I should perhaps also mention that a private message sent in a private manner to Mr. Noring and not to the moderator, was still forwarded by Mr. Noring to the moderator. Mr. Noring will probably tell you he didn't see the header-- indicating it was NOT a listserver message. . .an error even the moderator was quick to point out.
On Sun, 28 Oct 2007, Jon Noring wrote: > In reply to Michael's four detailed messages this morning where > I am the topic of discussion: > > > Michael and I certainly have quite different perspectives > regarding general world-view (politics, economics, etc.) as > well as specific issues in the realm of digitizing the Public > Domain. > > For example, we recently contributed differing viewpoints on an > ongoing discussion at Book People regarding book pricing > ("where does the money go"). Some gutvol-ers here who do not > subscribe to the Book People mailing list may be interested in > that discussion -- refer to the BP archive at: > > > http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2007 > > and look for the various messages which include "Book Price > Inflation" in the subject header field. Quite a few people have > contributed to that discussion. > > Despite our very major differences, I greatly respect Michael > for what he has observably accomplished over the years. He will > be written up in the history books for his true accomplishments > and for various predictions of his which eventually come true. > I've said this many times because it is a fact -- not because > I'm trying to curry favor. With respect such as received from Mr. Noring, no disrespect is ever going to be needed. > So of course being specially singled out by Michael, who along > with Greg runs this group, for a couple messages I posted last > night, whether deserved or not as others here will decide for > themselves, definitely did not make my day!... Mr. Noring simply cannot admit this group has no moderation, that no one runs it at all, and never has. Ooops!!! Other than one time I can remember when, at Mr. Noring's request with a few others, one person was "moderated" for a time. That would make Mr. Noring the one who ran it the most. . . . In my own personal opinion, since no one else ever got anyone blacklisted. . .and if Mr. 
Noring gets himself blacklisted in the same respect. . .it will again be his own doing, as he is in receipt of plenty of warning. > ...I spent some time writing up something to go in this spot, > to continue with the above train of thought, and of course to > provide my own perspective. But it would have ended up resaying > what I've said before, and make this message overly long. Those > here who are even following this discussion have pretty much > already made up their minds on a number of issues and people. > So for each reader what I write would either be a futile > exercise at convincing, or a preaching to the choir. And of > course, a diversion from rational, respectful, and cordial > discussion on topics of interest to the PG community. If only Mr. Noring would say and do the same everywhere. . . . > Jon Noring Michael PS I am hoping to be too busy to reply to Mr. Noring again for some time, so he is welcome to dig his hole further. From hart at pglaf.org Sun Oct 28 13:31:48 2007 From: hart at pglaf.org (Michael Hart) Date: Sun, 28 Oct 2007 13:31:48 -0700 (PDT) Subject: [gutvol-d] z.m.l. and youtube In-Reply-To: References: Message-ID: What will happen when YouTube kills the vids? On Sun, 28 Oct 2007, Bowerbird at aol.com wrote: > i'm now supporting embedding of youtube videos into > the .html versions that i auto-convert from a .zml file... > > i'm not sure if i can plug them into the .pdf versions or > handle them in my offline viewer, but time will tell me... > > i'm not sure if this is a bad thing or a good thing... :+) > i guess it is what it is... > > -bowerbird > > > > ************************************** > See what's new at http://www.aol.com > From jon at noring.name Sun Oct 28 13:52:08 2007 From: jon at noring.name (Jon Noring) Date: Sun, 28 Oct 2007 14:52:08 -0600 Subject: [gutvol-d] Amazing! 
In-Reply-To: References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> <1253347043.20071028140140@noring.name> Message-ID: <1711258186.20071028145208@noring.name> Michael wrote: > Of course, Mr. Noring is leaving out that many of my replies > to his ranting and raving were not passed by the moderator-- > and I can't understand why Mr. Noring's were passed. Well, here's the URL to the message in that thread which I believe has the highest level of "ranting and raving". I'll let the others here on gutvol-d decide on the R&R level on a scale from 0 to 10: http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2007&post=2007-10-04,5 Does it qualify as a 10? I posted other rants and raves, too, as reference to the 2007 archive messages will show: http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2007 Yes, I'm just a no good ranter and raver. > Hence, I presume that is why he recommends you reading there > because his words were passed and mine were not. Actually, I had two or three messages on this topic filtered out by John Mark Ockerbloom as well. You may not realize this, Michael, but John has disallowed a number of my messages, too. I thank him for his moderation of the list. Am I sometimes disappointed? Certainly. But that's how he runs his group. Overall discussion is pretty good there. Would I run BP that way? Probably not. Just as Michael and Greg set the policy and guidelines for gutvol-*, so Mr. Ockerbloom has the right to set the guidelines for his group. We can complain, and say it is short-sighted, but that's the way it goes. The final point is that Bowerbird (I believe) started the YahooGroup "bpsuper" to be a place where messages rejected for Book People could be posted. When the group started, John Mark Ockerbloom actually allowed an announcement of that group, and it would not surprise me if he'd allow another announcement. This is a way that rejected messages could still be seen, and archived at Google. 
Here's that group's URL: http://groups.yahoo.com/group/bpsuper/ Btw, this brings up the issue that the archives for gutvol-d are not open to the public, so they are not archived by Google. Is this the intent? Jon Noring From Bowerbird at aol.com Sun Oct 28 14:36:01 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 28 Oct 2007 17:36:01 EDT Subject: [gutvol-d] z.m.l. and youtube Message-ID: michael said: > What will happen when YouTube kills the vids? i'm not sure, since i don't know exactly how they'll do it. in general, my inclination as to what i'll do when a file is requested from the internet and not delivered will be to inform the user that the requested file was not delivered. but it could vary. photobucket, for example, sends out a graphic that says "this account has exceeded its bandwidth" when that happens, instead of sending the requested photo. a bigger problem is when the file which _was_ at a u.r.l. is replaced by another file at the same u.r.l., which means your readers won't get the file you intended them to get... this is an insolvable problem with the internet as a whole. ted nelson is right that such shifting sands are quicksand. for this reason, and also because "hotlinking" pisses off some people, i recommend only using files in your control, either because you (a) bundle them with your document, or (b) store them at a web-location which _you_ fully control... http://www.zamzar.com is a free file-conversion website which will convert youtube videos into your chosen format. whether it's legal to repurpose the vids is another question. -bowerbird
From Bowerbird at aol.com Sun Oct 28 15:48:51 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 28 Oct 2007 18:48:51 EDT Subject: [gutvol-d] academic digitization Message-ID: let's see, first duguid -- the academic who wrote the "first monday" piece that, among other things, was critical of the project gutenberg version of "tristram shandy" -- later wrote another one equally critical of google's scan of that same book, and their operations in general... then there was a post about the second article over on the o'reilly blogs: > http://radar.oreilly.com/archives/2007/08/the_google_exch.html#comments i made several comments there, including one critical of this white-paper: > http://www.clir.org/activities/details/lsdi.pdf which someone had recommended. i found it clueless, in content and form. i reformatted the white-paper, into z.m.l., as a demonstration... i have now auto-converted the z.m.l. file into .pdf format... the z.m.l. file is here: > http://z-m-l.com/oyayr/oyayr.zml the auto-generated .pdf file is here: > http://z-m-l.com/oyayr/oya-sunday.pdf over 120 footnotes in this baby. i even added a couple of footnotes onto existing footnotes, to show david starner that it's no big deal... (as is often the case with him, i couldn't figure out for the life of me why he thought it would be hard.) you'll find them at the very end. i also put in some .pdf links in my table-of-contents, so if you click the little boxes at the extreme right, you'll open up the original .pdf to the appropriate section, assuming you name it "oya-lsdi.pdf" and put it in the same folder as my .pdf. just a little thing to amuse me... oh yeah, and once again, the footnotes are in pop-up boxes that are displayed when you mouseover the "note" icon in the right-hand margin. (if you click the footnote number, you'll jump to the end-note section, to the exact page that contains that note. 
clicking the footnote number there jumps back to the referent in the body of the text, as you'd expect.) notes in pop-ups have been one of jon noring's _favorite_horses_ on his merry-go-round, so if he downloaded my earlier "test-suite" .pdf and checked it out and made a post on it, i'm sure he's already mentioned it... yep, jon, i did pop-ups _just_for_you_... (not really, but it sounds good.) anyway, if anyone has any comments on this .pdf -- either specific to it or more general reactions to the auto-conversion as a whole -- i welcome it. i'm quite satisfied all the functionality that needs to be included _is_ there, but if anyone has any requests, i'd love to entertain them. plus if anyone has any suggestions about enhancing the _beauty_ of the beast, speak up. as with the earlier test-suite .pdf, i used the butt-ugly helvetica font and all of the links have an obnoxious black rectangle so you can't miss 'em... so you don't have to comment on those things. but anything else is open. -bowerbird From gbnewby at pglaf.org Sun Oct 28 17:23:56 2007 From: gbnewby at pglaf.org (Greg Newby) Date: Sun, 28 Oct 2007 17:23:56 -0700 Subject: [gutvol-d] Metadiscussion (Re: Amazing! ) In-Reply-To: <1711258186.20071028145208@noring.name> References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> <1253347043.20071028140140@noring.name> <1711258186.20071028145208@noring.name> Message-ID: <20071029002356.GA6330@mail.pglaf.org> On Sun, Oct 28, 2007 at 02:52:08PM -0600, Jon Noring wrote: > Michael wrote: > ... > Would I run BP that way? Probably not. Just as Michael and Greg set > the policy and guidelines for gutvol-*, so Mr. Ockerbloom has the > right to set the guidelines for his group. 
We can complain, and say it > is short-sighted, but that's the way it goes. I wasn't really following this thread, but saw Michael's responses and see the thread has partially turned into a meta-discussion concerning gutvol-d. A few thoughts on this... As many people on the list can recall, the list is non-moderated by choice. Everyone is encouraged to use their own judgement about which threads to follow, which email addresses to filter, etc. Because there is a generally high level of technical competency on the list (and willingness of list members to offer help!), we expect people can handle their own email preferences, filtering, etc. Casting non-moderation as a policy is accurate. Seeking deeper meaning or symbolism in non-moderation of the PGLAF-hosted lists is more hazardous. This is because Project Gutenberg as an effort is much more than the gutvol-d list and other lists (contrary to the BP list, which to my knowledge isn't in place to support any particular centralized effort). It's also because this list isn't policy-making for PG, though of course it often provides wonderful input for policy. -- Greg From joshua at hutchinson.net Mon Oct 29 06:24:22 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Mon, 29 Oct 2007 13:24:22 +0000 (UTC) Subject: [gutvol-d] !@! RESEND Re: Comment on My Antonia scan set Message-ID: <4734168.1193664262993.JavaMail.?@fh1064.dia.cp.net> NOTE: I apologize for this post in advance. It is completely off-topic for the list, but I don't feel I can stand by without calling foul. This will be my last comment on the matter. *** Ok, this is getting ridiculous. Noring, while often painfully verbose and long-winded (even he admits that at times), is almost always unfailingly polite. The only person he ever argues with is BB. And BB does everything he can to push Noring's buttons (and quite a few other people's). 
I long ago learned to shunt BB's message straight to a kill file because my blood pressure just couldn't take him. Michael, you've made no effort to hide the fact that you flat out don't like Noring. Fine. You've made the same thing clear with me in the past, too. Again fine. But don't try to say Noring is using hate speech. Aside to Jon: I strongly recommend putting BB in your kill file. Your frustration level will thank you, the rest of the list will thank you and perhaps we can have more constructive conversations around here because of it. Again, to everyone else, I'm sorry for the off-topic post and have a nice day. Josh >----Original Message---- >From: hart at pglaf.org >Date: Oct 28, 2007 13:11 >To: "Project Gutenberg Volunteer Discussion" >Subj: [gutvol-d] !@! RESEND Re: Comment on My Antonia scan set > > >My apologies, my previous attempt at replying didn't work out. > >Trying again below, perhaps with more patience. > > >Thanks!!! > >Michael S. Hart >Founder >Project Gutenberg > > >On Sat, 27 Oct 2007, Jon Noring wrote: > >> I previously wrote: >> >> >>> Finally, to interject a personal and, yes, emotional note >>> directed to Bowerbird: Feel very fortunate that you have found >>> a home in gutvol-d where you can be "yourself." Be thankful >>> that Greg and Michael cut you a whole lot of slack. On nearly >>> all other groups I've participated in you would have been >>> thrown out a long time ago for what I term fostering a hostile >>> discussion environment. For example, calling whatever someone >>> writes a "merry-go-round" is exactly an ad-hominem attack on >>> their person. It is, in my opinion, a form of hate speech and >>> has no place in rational discourse. > >As we have commented before, Mr. Noring is certainly one of the >top people whom "Greg and Michael cut you a whole lot of slack." > >Mr. Noring seems as guilty of writing merry-go-rounds and/or >hate speech as much as anyone. If anyone is going to be left >off this list, you can be sure Mr. 
Noring will be among them. > >His baiting of Mr. Bowerbird, and of myself, is the mere work >of an apprentice baiter. . .not even up to journeyman level. > >Yet I am sure Mr. Noring would elevate his words beyond that. > >Enough said, obviously more than enough words to the wise and >thus not expected to reach Mr. Noring or his loyal opposition >as it were. > > >> And to add an addendum. >> >> Clearly, everything above is "in my opinion" based on >> observation. And since I run quite a few mail-based forums >> myself (all YahooGroups), including The eBook Community, I am >> supportive of those who administer any forum, even when I may >> disagree with their policies. Administering public groups like >> this is a thankless job since one can never please all the >> people all the time -- and sometimes we group administrators/ >> moderators have to make tough, often-times no-win decisions. > >This is one reason why we don't do moderation here. > >Another is the obvious misuse of political powers-- >as so often requested by Mr. Noring--both in front, >where people can see it, and behind the scenes from >perspectives hidden from the normal list members. > >I am sure all the list personnel know Mr. Noring is >a person with an agenda who wishes other agendas to >be silenced while his own goes forwards. > > >> In effect, Greg and Michael are the ones who have de facto >> control of this group simply by running the software that >> administers it. Thus, they ultimately decide who gets to post >> and what they allow to be posted here. They may even deny they >> have this power, but they do have this power by default -- if >> they voluntarily give up this power, they do so because they >> have the power. > >It is only those who desire such power who consider >this sort of thing. > >Gladly we have plenty of support to keep moderators >out of the fray, unlike the other lists where Mr. Noring, >and others, have lobbied for such power. 
> >Even before these recent outbursts we have noticed, >and commented on, Mr. Noring's appalling behavior. > > >> And I may say that such-and-such is "hate speech", or so-and-so >> is creating a "hostile discussion environment." But it is not >> what I say, but what they say that matters. If many of us don't >> like how this group is run to the point where we get nothing >> out of it, we simply vote with our feet and leave, maybe even >> starting a new discussion group if we find value in the >> discussion but want a different set of group policies. > >Mr. Noring is striking out at himself here, as much as anyone. > >As most of us here are aware, and also other servers, Noring's >speech is as much "hate speech" as anyone's, largely ignored-- >but saved in the archives for future reference. Anyone should >be able to trace Mr. Noring's comments for themselves, and see >the trends, over the years. A "hostile discussion environment" >usually follows Mr. Noring's agenda, rather than the opposite. > > >> So with that said, I still believe what I wrote previously and >> reproduced above. But ultimately all that matters is what Greg >> and Michael think. If they decide that Bowerbird can pretty >> much write and say what he wants to gutvol-*, then that's the >> way it is. > >Mr. Noring appears to be saying, here and elsewhere, that his >right to free speech trumps everyone else's rights. > >It is all too obvious that Mr. Noring and a few others baited, >and continue to bait, Mr. Bowerbird and others, to create an >environment in which he could claim hostility. > >This is not working here, and it should not work elsewhere. > >> Jon Noring > > >Thanks!!! > >Michael S. 
Hart >Founder >Project Gutenberg From hart at pglaf.org Mon Oct 29 09:40:13 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 09:40:13 -0700 (PDT) Subject: [gutvol-d] !@! RESEND Re: Comment on My Antonia scan set In-Reply-To: <4734168.1193664262993.JavaMail.?@fh1064.dia.cp.net> References: <4734168.1193664262993.JavaMail.?@fh1064.dia.cp.net> Message-ID: On Mon, 29 Oct 2007, joshua at hutchinson.net wrote: > NOTE: I apologize for this post in advance. It is completely > off-topic for the list, but I don't feel I can stand by without > calling foul. This will be my last comment on the matter. > > *** > > Ok, this is getting ridiculous. > > Noring, while often painfully verbose and long-winded (even he > admits that at times), is almost always unfailingly polite. > The only person he ever argues with is BB. And BB does > everything he can to push Noring's buttons (and quite a few > other people's). I long ago learned to shunt BB's message > straight to a kill file because my blood pressure just couldn't > take him. > > Michael, you've made no effort to hide the fact that you flat > out don't like Noring. Fine. You've made the same thing clear > with me in the past, too. Again fine. But don't try to say > Noring is using hate speech. Mr. Noring pushes more buttons here and elsewhere than anyone, except, as you say, perhaps Mr. Bowerbird. Just because they are not YOUR buttons is not the criterion. Mr. Noring claims the protection of free speech for himself, while denying it to the world at large. . .out of bounds. Mr. Noring claims Dr. Newby and myself run things to a kind of advantage for Mr. Bowerbird and disadvantage to himself, again. . .out of bounds. What Mr. 
Noring really has wanted is to take over Gutenberg by creating a Board of Directors to his own liking, and the top priority of his proposed administration was money, even to the point of accusing me of being in it for the money. This has not been forgotten. I should perhaps add here that I still have not been paid a single salary check for the last 4.5 years, or so, but I do get some office expenses, perhaps half of the entitled in a similar period of time. I seriously doubt anyone else here would work as hard for a similar period on a career that was not paying any salary. Mr. Noring's points are rarely, if ever, about getting more eBooks to more people out there in the world without books, at least to the degree most of us are used to. One reason we keep up the freedom of speech is to find out, in this case, whether the policies of Mr. Noring or those of Mr. Bowerbird will work better, impartially. I warn Mr. Bowerbird as often as I warn Mr. Noring, with an apparently greater effect, as I don't have to keep warning, eventually in public, to calm him down. Mr. Noring has had some success in getting Mr. Bowerbird or others censored, here and/or on other listservers, and he understands that certain semantic tricks usually work on an ordinary Moderator. He is frustrated that neither Dr. Newby nor myself are some of those his tricks work on. By the way, I think you will find a pattern to Mr. Noring's and others' attempts to start flame wars, if you look. > Aside to Jon: I strongly recommend putting BB in your kill > file. Your frustration level will thank you, the rest of the > list will thank you and perhaps we can have more constructive > conversations around here because of it. Josh, I hate to put it this way but Mr. Noring likes his own voice more than anyone else's, and will use Mr. Bowerbird or anyone else as an excuse to voice the same lines over again. He doesn't WANT an excuse to avoid Mr. 
Bowerbird or others a trolling for flame wars might work on. Mr. Noring's comments cause flame wars, incite censorship to a greater degree, etc. . .just not censorship here. As I said before, if Mr. Noring's attempts do cause any kind of censorship, he will be among the first batch to go. However, in regards to his previous years of such incitement I have noticed a tone that leads me to think that censorship of himself and Mr. Bowerbird would be regarded as exchanging queens, to employ a game theory analogy, to his benefit. Personally, I think of Mr. Noring as a bellwether warning a few of us when his ranting and raving get much more support, warning us of his potential to sway the minds of others. Mr. Bowerbird, on the other hand, cannot be accused of such; he is obviously not trying to curry favor with anyone. I am including myself, because I find his attitude as annoying as the rest of you, I am just more of the kind to take that old and new advice just given here, to ignore him. If everyone here ignored Mr. Bowerbird and Mr. Noring we all would be much calmer. Mr. Bowerbird at least has a product, one we could possibly take advantage of without taking those attitudinal qualities along with. . .a product that might be used to get a lot more eBooks to a lot more people, if those claims he makes are true. On the other hand, I'm not sure a wide implementation of Mr. Noring's suggestions would have a similar possible effect. However, I don't silence either one of them, even when these events take place as they have this week, which I see as the ploy of Mr. Noring more than of Mr. Bowerbird. Nevertheless, Mr. Noring's emails will continue to be passed through to our membership, though it has been suggested that his messages be put into a weekly digest to calm the waters, and thus avoid his attempts to incite flames. Thanks!!! Michael S. Hart Founder Project Gutenberg > Again, to everyone else, I'm sorry for the off-topic post and > have a nice day. 
> > Josh From piggy at netronome.com Mon Oct 29 09:54:38 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 29 Oct 2007 12:54:38 -0400 Subject: [gutvol-d] !@! 
RESEND Re: Comment on My Antonia scan set In-Reply-To: <4724D938.2090009@perathoner.de> References: <686943385.20071027200208@noring.name> <1789330137.20071027212016@noring.name> <4724D938.2090009@perathoner.de> Message-ID: <4726104E.7060904@netronome.com> Marcello Perathoner wrote: > Michael Hart wrote: > > >> It is all to obvious that Mr. Noring and a few others baited, >> and continute to bait, Mr. Bowerbird and others, to create an >> environment in which he could claim hostility. >> > > There are three persons on this list you should not take seriously. > > But I'm grateful that most of you read my postings anyway :-). From Bowerbird at aol.com Mon Oct 29 10:20:02 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 29 Oct 2007 13:20:02 EDT Subject: [gutvol-d] another tempest in the gutvol-d teapot Message-ID: wow, the spin has churned up to the point where the term "hate speech" is being used? c'mon, folks, get a sense of proportion here. there are people in the world who're _really_ being victimized by hate speech, and it is a shame when their experience is compared to something so trivial... it's a listserve. easy to ignore. just press delete. unsubscribe if you must. as for the term "merry-go-round", well, isn't it patently obvious to _all_ that jon has a tendency to repeat himself? when given any opportunity? i mean, geez, i talk a lot about the same old set of subjects, it's true, but at least i try to bring something new to the table with every message... sometimes it takes me two frickin' years to bring that "something new", as was the case with the .pdf improvements i recently discussed here, but i've always been in this e-book game for the long haul, ever since i started doing it 25-plus years ago. i'm tenacious. but i bore easily too. indeed, a main reason i quit responding to the noring merry-go-round was that i got tired of making the same old replies time after time after... 
gotta keep stuff fresh, especially if you want people to keep reading you. but yeah, i _chuckle_ when people accuse me of "pushing their buttons". what a classic way to blame _me_ for _their_behavior_. it's very amusing. i disagree with noring on a lot of points. even on the issues where _he_ thinks he's in _agreement _with me, half the time he's misunderstanding. but hey, i'd be _extremely_ happy to put our opposing positions on a wiki somewhere, one time, and point people to it when the question came up. jon seems to prefer to do the same old little dance over and over and over. in the long run, though, people aren't interested in _discussions_ of things. they want the _proof_ in the _pudding_. if you can't deliver it, you're done. so don't be too hard on "mr. noring". he's just frustrated. he wants other people to mark up books in the manner he's prescribed, but they don't seem willing to do that. he wants other people to write converters that'll create beautiful books, but they don't seem willing to do that. he wants open-source programmers to write viewer-apps for his format, but they don't seem willing to do that. he wants them to program innovative tools that will make authoring easy, but they don't seem willing to do that. he wants all the other format advocates to give up to his one true format, but they don't seem willing to do that. he wants to "win friends and influence people" and even change the world, but the world doesn't seem willing to do that. so he's frustrated. and it's totally understandable. if i were in his shoes, i would be frustrated too... but i'm doing my markup myself, and coding my own programs, and i don't give a flying frog if the world listens up or not, i'm just doing it to have fun... and if that "pushes your buttons", well, i'm sorry about that, but i'm not likely to stop anytime soon... 
-bowerbird From f.fuchs at gmx.net Mon Oct 29 10:50:28 2007 From: f.fuchs at gmx.net (Franz Fuchs) Date: Mon, 29 Oct 2007 18:50:28 +0100 Subject: [gutvol-d] New Yorker: Anthony Grafton: Digitization and its discontents In-Reply-To: Message-ID: --- Future Reading Digitization and its discontents. by Anthony Grafton --- http://www.newyorker.com/reporting/2007/11/05/071105fa_fact_grafton (ca. 4 300 words) From jon at noring.name Mon Oct 29 11:36:35 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 12:36:35 -0600 Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board Message-ID: <913787777.20071029123635@noring.name> In order to constructively focus on some of Michael's comments about me, let's redirect discussion to focus on what is good for PG. Why he has spent an incredible amount of time focusing on me, my motives, etc., is sort of puzzling. ***** One issue Michael brought up: The PGLAF Board and the general operation of the organization. It is very true that a few years ago I proposed that PGLAF reorganize to improve its organization, potential for fund-raising, and to work more cooperatively and tangibly with other organizations digitizing the public domain. I continue to suggest two actions, which were based upon talking with experts in non-profit organizations, including one who is a very well noted attorney in that area and who has advised me on a couple projects I was involved with. 
And both of these suggestions were supported in whole, or in part, by several other notable people who are involved with PG, DP and other projects to digitize the public domain texts. My two suggestions are: 1) Setting up a real Board of Trustees that would include notables from the public domain digitization arena. 2) Transferring the "Project Gutenberg" trademark to PGLAF. There are reasons for these recommendations. I've noted them before. Now Michael has taken the above proposals as a sort of "power play grab". He is, to be frank, ascribing certain motives on my part for having even proposed them. And I know that my motives are NOT what he believes them to be. My motives are what I believe is best for the movement, just as I believe that Michael's motives are what he believes is best for the movement. I hope Michael will accept on face-value what I just said my motives are. Those who know me know that I am not interested in power nor fame. Now, obviously, Michael, who still holds the defacto reins of power in PGLAF, has vehemently opposed my two proposals, as one can see by his last few messages where *he* brings these up. Why? Well, again, I will only ascribe pure motives on his part and that he is afraid that embracing the above suggestions will harm the mission of the Project Gutenberg "movement." I very much hope that Michael will provide us the exact, detailed reasons why PGLAF should continue to be organized as it is, and why he should personally hold the trademark, which is universally "not recommended." He has not yet done so. What makes this even more ironic is that my two suggestions were offered at a time when Greg and Michael officially asked for ideas as to how to improve the movement. So, I offered mine, and others offered theirs. Funny thing though that Michael has repeatedly and viciously attacked mine, and has not offered objective reasons why the two suggestions should not even be considered in some form. 
Regarding #1, I mentioned that the current PGLAF Board is a rump board. No one on that Board has real experience with digitizing the Public Domain, nor are notable in any sense in the arena. Three of the four Board members work at UIUC in Illinois. No doubt they are nice people, and competent at what they do (e.g., one has a Ph.D. in aeronautical engineering), but they are not the kind of people one would want to completely fill the Board of Trustees. I do not think I need to go into the reasons why the right people on the Board will greatly benefit PG. For example, I have mentioned the kind of people who should be asked to serve on the PGLAF Board, and over time have proposed names, a sort of "dream team" list: a) Charles Franks b) Juliet Sutherland c) Brewster Kahle d) John Mark Ockerbloom e) Dr. Widger f) Steve Harris g) Peter Brantley h) Dr. Allen Renear (who is at UIUC) (Btw, for the current PGLAF Board, refer to: http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Literary_Archive_Foundation ) Regarding #2, well no need to explain that in any detail. It is never recommended that a trademark be held by an individual. Individuals do funny things at times -- like die. And organizations can be more aggressive at defending their trademark. There are other benefits, too. Again, Michael has NOT explained WHY he must hold on to the Project Gutenberg trademark. I hope he offers an explanation. If not, people will begin to think of their own reasons, some of which are not flattering to Michael nor the PG movement. And PGLAF is taken less seriously by others. 
Jon Noring From jon at noring.name Mon Oct 29 12:06:07 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 13:06:07 -0600 Subject: [gutvol-d] Focus on the ideas, not the person In-Reply-To: References: Message-ID: <308293095.20071029130607@noring.name> Bowerbird wrote: > [an amazing list of what he says I want] What is important to see in both Michael's and Bowerbird's replies is that they strictly focus on me: my motives, my wants, my deficiencies (I have many, Josh pointed out one), etc., etc. One has to ask the question: What's the point? How does this benefit Project Gutenberg? And how do such messages *harm* the PG Community? Whatever happened to the healthy debate of thoughts and ideas? To view them as completely independent of the person proposing them? To let the thoughts and ideas and debate points stand or fall on their own merits? I've run a lot of mailing lists the last 14 years, and one thing I've observed is that when a group focuses on the thoughts and ideas, and treats the proposer almost as "anonymous" in a cordial manner, the group thrives. Discussion is robust and meaningful, and sometimes leads to some new group-level insight. I also notice greater participation over time because the group is "safe" for participation. Cordiality and *full respect* for others (and yes I've probably not been as respectful as I should have been) actually leads to more fruitful discussion -- eventually leading to new ideas and ways of doing things. As soon as discussion turns toward the person proposing an idea or debate point -- to focus on personality or motives -- the group rapidly devolves into chaos and the topic is never explored in sufficient depth to provide information for everyone to make up their own mind. I wonder how many here who follow gutvol-d are interested in sharing their ideas, but have not out of fear that doing so might lead to a personal attack? 
(Feel free to email me in private if indeed you are put off by the recent "tar and feathering" messages.) Jon Noring From johnson.leonard at gmail.com Mon Oct 29 12:25:32 2007 From: johnson.leonard at gmail.com (Leonard Johnson) Date: Mon, 29 Oct 2007 15:25:32 -0400 Subject: [gutvol-d] New Yorker: Anthony Grafton: Digitization and its discontents In-Reply-To: References: Message-ID: <748ba8e50710291225y151d805bj3c4cafc659717963@mail.gmail.com> On 10/29/07, Franz Fuchs wrote: > > --- > Future Reading > > Digitization and its discontents. > by Anthony Grafton > --- > > http://www.newyorker.com/reporting/2007/11/05/071105fa_fact_grafton > (ca. 4 300 words) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > I seldom get involved with the discussions here, but I wish to thank Franz Fuchs for the link. I found the article very interesting. -- http://members.cox.net/leaonarddjohnson/ From jon at noring.name Mon Oct 29 12:36:47 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 13:36:47 -0600 Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: <913787777.20071029123635@noring.name> References: <913787777.20071029123635@noring.name> Message-ID: <1103954005.20071029133647@noring.name> Oops, need to clarify a couple points in my prior message. I wrote: > I very much hope that Michael will provide us the exact, detailed > reasons why PGLAF should continue to be organized as it is, and why > he should personally hold the trademark, which is universally "not > recommended." He has not yet done so. Now Michael has given reasons here and there, but in my opinion they are not yet sufficient, nor cogently and objectively organized, to form a final answer on his part. Hopefully he will fully clarify his thoughts to the level or organization where he can simply repost it whenever someone brings it up again. 
Of course, he can refuse to answer this request, but that does not help for transparency of an organization/movement whose whole philosophy is built around transparency and openness, and relies upon thousands of volunteers. To summarize, I'd like to hear Michael's beliefs on: 1) Why is the PGLAF Board made up the way it is -- what is Michael's philosophy towards the purpose and role of the Board, and what kind of people should not serve on the PGLAF Board (besides me, of course. ) I also continue to be mystified why at least Juliet Sutherland or someone else from DP is not on the PGLAF Board (maybe she was asked and turned it down, but this is important to know given the importance of DP to the PG "movement".) 2) What value he sees in himself personally holding the PG trademark rather than turning it over to PGLAF. How does personal ownership benefit the long-term mission of the Project Gutenberg "movement"? > Regarding #1, I mentioned that the current PGLAF Board is a rump > board. No one on that Board has real experience with digitizing the > Public Domain, nor are notable in any sense in the arena. Obviously, one of the four Board members listed at the URL I gave, and I assume that is an updated list, is Dr. Greg Newby, who is the Board Chair and also the CEO. I was referring to the other three members. Refer to: http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Literary_Archive_Foundation Jon Noring From bowerbird at aol.com Mon Oct 29 13:44:03 2007 From: bowerbird at aol.com (bowerbird at aol.com) Date: Mon, 29 Oct 2007 16:44:03 -0400 Subject: [gutvol-d] i've asked before, but i'll ask again` Message-ID: <8C9E88393776E53-554-31ED@FWM-D08.sysops.aol.com> jon- i've asked before, but i'll ask again. when you send a message to the list, it comes to me. so there's no need to send a second copy to me too... so please stop doing that. and please don't make me have to ask you a third time. 
i'm not even reading messages that come from you anyway -- and i mostly succeed not reading even the fragments that people include when they make a reply to you -- so it's really unnecessary to double up your messages... now, i'm gonna say some things to you here, jon, but i'm not going to read your reply if you make one, so if that bothers you, then don't even read the rest... *** you make a big deal about "respect". but the thing is, you long ago sacrificed the _modicum_ of respect that _everyone_ deserves from me, by virtue of being human. so the only "respect" you have left is what you _earn_, by your _ideas_, and the little bit of that which you _once_ had has evaporated, because you've proven that your ideas generally don't pass the cost-benefit test, even in terms of _your_very_own_behavior_. and, quite frankly, i don't feel any "respect" coming in the opposite direction, from you toward me, which doesn't bother me, since i'm not hung up on "respect". plus i simply don't ascribe much value to your judgment. but this hypocrisy in your position always humors me... finally, i don't see you have much "respect" for the institution of _dialog_and_discussion_, because you never seem to learn much of anything at all from it. it's merely a way for you to reiterate your opinions. like the rest of the "win friends and influence people" crowd, it seems you want others to bend to you, but yet the notion that you might bend to them is inconceivable. i think my positions are correct too, but that's because i'm willing to shift 'em immediately if a stronger argument emerges for any another position. so there. reply if you want to, but i won't even read it. i spent way too many years already reading your messages, and way too much time replying to them, until i finally decided that you no longer had anything of value to me... and that you _hadn't_ had anything of value for years... 
and it took several years to wean myself off of replying, because it had just become a bad habit. and also because i thought it was necessary to counter many of your ideas, just in case some newbies believed you, but now i realize that's not even a problem any more, so i am fully clean. or maybe not, because look at me, here i am once again, wasting my time writing a post to jon noring... geez! go waste other people's time, and leave me alone. goodbye. -bowerbird From hart at pglaf.org Mon Oct 29 14:22:42 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 14:22:42 -0700 (PDT) Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: <1103954005.20071029133647@noring.name> References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> Message-ID: Jon has already had all the answers to these questions, and he knows it, he is just trying once again to move a certain conversation to the point where he can expound, again, at length, how he thinks things should be run. The most obvious and complete answer is, of course, the one you have heard the most often, that he is welcome-- nay, ENCOURAGED, to stop talking and start DOING and as with all suggested courses of ACTION, Project Gutenberg will provide as much assistance as possible. HOWEVER, as long as Jon is igNoring the call to ACTION, his words remain just that, words, and it is obvious-- thank goodness--that his words never carry any weight-- and no one has tried to elect him to anything. DOUBLY HOWEVER, if Mr. Noring SHOULD eventually amass a group who want him to lead them, Project Gutenberg will be only too glad to offer all possible assistance. As it has been, as it is, and we can only hope. . . . 
Distributed Proofreaders is a perfect example, and this example should be more than enough to provide Jon with, we hope, all the encouragement needed. DP is its own entity, has its own leaders, and gets the support it requests from Project Gutenberg, just as Mr. Noring would/could/should have had if he were a willing worker as much as he is a willing talker. Now you understand why we don't censor what he says, he is just too good an example, we'd never find better. Thanks!!! Michael S. Hart Founder Project Gutenberg On Mon, 29 Oct 2007, Jon Noring wrote: > Oops, need to clarify a couple points in my prior message. I > wrote: > >> I very much hope that Michael will provide us the exact, >> detailed reasons why PGLAF should continue to be organized as >> it is, and why he should personally hold the trademark, which >> is universally "not recommended." He has not yet done so. > > Now Michael has given reasons here and there, but in my opinion > they are not yet sufficient, nor cogently and objectively > organized, to form a final answer on his part. Hopefully he > will fully clarify his thoughts to the level or organization > where he can simply repost it whenever someone brings it up > again. Of course, he can refuse to answer this request, but > that does not help for transparency of an organization/movement > whose whole philosophy is built around transparency and > openness, and relies upon thousands of volunteers. > > To summarize, I'd like to hear Michael's beliefs on: > > 1) Why is the PGLAF Board made up the way it is -- what is > Michael's > philosophy towards the purpose and role of the Board, and > what > kind of people should not serve on the PGLAF Board (besides > me, of > course. ) > > I also continue to be mystified why at least Juliet > Sutherland or > someone else from DP is not on the PGLAF Board (maybe she was > asked > and turned it down, but this is important to know given the > importance of DP to the PG "movement".) 
> > 2) What value he sees in himself personally holding the PG > trademark > rather than turning it over to PGLAF. How does personal > ownership > benefit the long-term mission of the Project Gutenberg > "movement"? > > >> Regarding #1, I mentioned that the current PGLAF Board is a >> rump board. No one on that Board has real experience with >> digitizing the Public Domain, nor are notable in any sense in >> the arena. > > Obviously, one of the four Board members listed at the URL I > gave, and I assume that is an updated list, is Dr. Greg Newby, > who is the Board Chair and also the CEO. I was referring to the > other three members. Refer to: > > http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Literary_Archive_Foundation > > > > Jon Noring > From marcello at perathoner.de Mon Oct 29 14:47:22 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 29 Oct 2007 22:47:22 +0100 Subject: [gutvol-d] i've asked before, but i'll ask again` In-Reply-To: <8C9E88393776E53-554-31ED@FWM-D08.sysops.aol.com> References: <8C9E88393776E53-554-31ED@FWM-D08.sysops.aol.com> Message-ID: <472654EA.8000701@perathoner.de> bowerbird at aol.com wrote: > when you send a message to the list, it comes to me. > so there's no need to send a second copy to me too... Stop bugging people. *Your* mail agent is misconfigured. You are sending a copy of all messages to yourself: > From: Bowerbird at aol.com > Message-ID: > Date: Mon, 29 Oct 2007 13:20:02 EDT > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com Now, if people hit 'reply', each former recipient gets an answer. It's *your* fault, not Jon's. Solution: stop sending a copy of your messages to yourself. Or, if you absolutely need a copy, send it as BCC. While you're reconfiguring, stop sending your post as text *and* HTML. 
That will save you even more bandwidth than if you get answers twice. -- Marcello Perathoner webmaster at gutenberg.org From jon at noring.name Mon Oct 29 14:52:09 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 15:52:09 -0600 Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> Message-ID: <1035758925.20071029155209@noring.name> Michael wrote: > Jon has already had all the answers to these questions, > and he knows it, he is just trying once again to move a > certain conversation to the point where he can expound, > again, at length, how he thinks things should be run. These answers have NOT been given, and if they have, I'd be happy to have a link to the message in the archive, or a summary of your reasons. For the defacto leader of the PG movement, these are legitimate questions to ask: 1) The makeup of the PGLAF Board, why it is as it is, and future plans, and 2) The personal ownership of the trademark. People will begin to wonder why you continue to avoid answering these simple yet important questions. The governance of any non-profit organization is a legitimate thing to ask when it asks thousands of people to volunteer for it. So are you saying that asking these questions is out of bounds? You can keep trying to divert attention back to me and my foibles (real and imagined). But the questions asked are legitimate, and they have not yet been cogently answered in any public forum. If you have, then point a link to your answer in the archives and I'll be happy to blog the link somewhere. Better yet, provide a link to it on the Gutenberg site. 
Jon Noring From hart at pglaf.org Mon Oct 29 14:56:24 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 14:56:24 -0700 (PDT) Subject: [gutvol-d] Focus on the ideas, not the person In-Reply-To: <308293095.20071029130607@noring.name> References: <308293095.20071029130607@noring.name> Message-ID: On Mon, 29 Oct 2007, Jon Noring wrote: > Bowerbird wrote: > >> [an amazing list of what he says I want] > > What is important to see in both Michael's and Bowerbird's > replies is that they strictly focus on me: my motives, my > wants, my deficiencies (I have many, Josh pointed out one), > etc., etc. No one I know has any idea of Mr. Noring's "motives," "wants," etc., for the simple reason that he never presents any goal to consider other than that he and his cabinet should control PG. PG has been created, from the very start, up to today, simply, and completely, by DOING. . .not just TALKING. Mr. Noring has continually been invited to run anything he is interested in up the flagpole, with or without example works, though many presume it would work better if he had at least a small handful of examples, and then just see who salutes. Mr. Noring is not getting the kinds of salutes he wants so he has asked for some kind of knighthood process again and again in the hopes that if we acknowledge him BEFORE his action the action will then prove worthy of this knighthood. We continually offer him all the support we possibly can. The results are there for all to see. > One has to ask the question: What's the point? How does this > benefit Project Gutenberg? And how do such messages *harm* the > PG Community? Just wasting our time and energy, seems to be all for the moment, but I always wonder what else Mr. Noring has in mind. > Whatever happened to the healthy debate of thoughts and ideas? > To view them as completely independent of the person proposing > them? To let the thoughts and ideas and debate points stand or > fall on their own merits? 
I'm sorry, did I miss Mr. Noring's presentation of some project? Perhaps I am just not able to read between the lines to get some inner meaning to something that will change Project Gutenberg or perhaps even help change the world. If Mr. Noring has a way to get more books to more people, I have nothing but interest. However, this is not what I seem to have been receiving. And I most sincerely apologize to all concerned, and Mr. Noring, a dozen times over, for any such lack. > I've run a lot of mailing lists the last 14 years, and one > thing I've observed is that when a group focuses on the > thoughts and ideas, and treats the proposer almost as > "anonymous" in a cordial manner, the group thrives. Discussion > is robust and meaningful, and sometimes leads to some new > group-level insight. I also notice greater participation over > time because the group is "safe" for participation. This seems to be just the opposite of Mr. Noring's approach with "The Book People" mailing list, where my responses to him are censored at double the rate he claimed for his own, and I strongly suspect this is not accidental on his part. > Cordiality and *full respect* for others (and yes I've probably > not been as respectful as I should have been) actually leads to > more fruitful discussion -- eventually leading to new ideas and > ways of doing things. Mr. Noring claims to respect me, and perhaps Project Gutenberg as well, but, again, I have trouble with this kind of respect. Mr. Noring seems to take the same strategic stand every year at a similar time, around the equinoxes, has anyone noticed? I strongly suspect the entire world has a certain susceptibility, if you will pardon me, to Seasonal Affective Disorder, not just a certain individual, but enough to flavor the world at large. Comments? I have asked "The Book People" Moderator, but he instantly denied any such susceptibility to Seasonal Affective Disorder, even tho he seems to censor me more at these times. . 
.perhaps only due to the conversations with Mr. Noring, and the world at large. > As soon as discussion turns toward the person proposing an idea > or debate point -- to focus on personality or motives -- the > group rapidly devolves into chaos and the topic is never > explored in sufficient depth to provide information for > everyone to make up their own mind. Then make a real proposal, which is what we always ask. . . . Just what is it that you have in mind that no one is accepting? > I wonder how many here who follow gutvol-d are interested in > sharing their ideas, but have not out of fear that doing so > might lead to a personal attack? (Feel free to email me in > private if indeed you are put off by the recent "tar and > feathering" messages.) Jon, you have had me "tarred and feathered" far more than any such treatment you have received, I think most will agree tho with the exception of the few you always bring with you and a perhaps new recruit this year, if you are aware enough to get such a recruit firmly on your side. If you feel "tarred and feathered" by me, I don't see why, as it appears you have simply made your usual fray into politics at the usual time, and you never seemed to feel the responses from me, or others, were out of place. Again, and again, and again, it always comes down to the one, simple, unavoidable question: "What do you want to do?" If you want to take over some ACTION, you must first ACT. I'm not at all sure just what ACTIONS you are proposing other than your continual efforts to stack the Board of Directors-- but for what PURPOSE, other than your own personal power. What is the Project Gutenberg of YOUR dreams??? If you will just give us that handful of examples we ask for, year after year, we will as always offered, give you your own directory, with all permissions, your own newsletter slot and all the publicity we can to promote you and your project. > Jon Noring Thanks!!! Michael S. 
Hart Founder Project Gutenberg From jon at noring.name Mon Oct 29 15:02:24 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 16:02:24 -0600 Subject: [gutvol-d] Focus on the ideas, not the person In-Reply-To: References: <308293095.20071029130607@noring.name> Message-ID: <391113694.20071029160224@noring.name> Michael Hart wrote: > No one I know has any idea of Mr. Noring's "motives," "wants," > etc., for the simple reason that he never presents any goal to > consider other than that he and his cabinet should control PG. Ah, this is the crux. You have no basis to say this Michael, because I *never* said I wanted control, nor advocated any structure where *I* and my "cabinet" would have control. (whoever my "cabinet" is) Do you have any evidence to back up this ridiculous charge? Or did you just imagine I'm part of the Trilateral commission or something. Please Michael, return to reality. Jon Noring From hart at pglaf.org Mon Oct 29 15:11:30 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 15:11:30 -0700 (PDT) Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: <1035758925.20071029155209@noring.name> References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <1035758925.20071029155209@noring.name> Message-ID: Jon, over all the years I've done eBooks, only one other person than yourself has ever mentioned who should be on the Board and we had already offered that person a board position. The fact is, as we have told you before, that there is not much direction from anyone, Board or not, as to what YOU should do-- you are free to do whatever you think best to do eBooks. The trouble, it seems, is that everyone one else is free, too-- and they don't seem to want to do it your way. I would certainly be only too happy if they would. Then perhaps you would stop this. No one seems to want to take over Project Gutenberg but you. 
And we have rolled out the red carpet for any project you might want to try. . .you can have your own project. . .take all of a whole world of credit for it. . .and more power to you!!! Just Do It! Michael PS I seriously doubt there will be any "meat" on this "bone of contention" again this year, if this follows the pattern of Jon's past attempts, so I will be passing over Mr. Noring's emails at least for a few days, as I have a presentation to do for the UI Library Mortenson Center for visiting librarians from at least, so I am told, 14 countries. Jon, please do not take offense. . .only possibly you should be happier than I, if you get the volunteers you want to do what's in your mind, as long as it's designed to further eBooks in the sense we have been doing, or perhaps even in a better way. On Mon, 29 Oct 2007, Jon Noring wrote: > Michael wrote: > >> Jon has already had all the answers to these questions, >> and he knows it, he is just trying once again to move a >> certain conversation to the point where he can expound, >> again, at length, how he thinks things should be run. > > These answers have NOT been given, and if they have, I'd be happy to > have a link to the message in the archive, or a summary of your > reasons. > > For the defacto leader of the PG movement, these are legitimate > questions to ask: > > 1) The makeup of the PGLAF Board, why it is as it is, and future > plans, and > > 2) The personal ownership of the trademark. > > People will begin to wonder why you continue to avoid answering these > simple yet important questions. The governance of any non-profit > organization is a legitimate thing to ask when it asks thousands of > people to volunteer for it. > > So are you saying that asking these questions is out of bounds? > > You can keep trying to divert attention back to me and my foibles > (real and imagined). But the questions asked are legitimate, and they > have not yet been cogently answered in any public forum. 
If you have, > then point a link to your answer in the archives and I'll be happy to > blog the link somewhere. Better yet, provide a link to it on the > Gutenberg site. > > Jon Noring > > > From hart at pglaf.org Mon Oct 29 15:23:03 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 15:23:03 -0700 (PDT) Subject: [gutvol-d] i've asked before, but i'll ask again` In-Reply-To: <472654EA.8000701@perathoner.de> References: <8C9E88393776E53-554-31ED@FWM-D08.sysops.aol.com> <472654EA.8000701@perathoner.de> Message-ID: Marcello may not be quite correct, at least for my emailer, and some others I have tried. "reply" and "reply to all" many times are different commands, which each generate different results. "reply" goes only to the sender, or their assigned "reply to" [which might be a different address than the sending address] Depending on your own default settings, you might get one or the other of these two commands when you reply to an email. Thanks!!! Michael S. Hart Founder Project Gutenberg On Mon, 29 Oct 2007, Marcello Perathoner wrote: > bowerbird at aol.com wrote: > >> when you send a message to the list, it comes to me. >> so there's no need to send a second copy to me too... > > Stop bugging people. *Your* mail agent is misconfigured. You are sending > a copy of all messages to yourself: > >> From: Bowerbird at aol.com >> Message-ID: >> Date: Mon, 29 Oct 2007 13:20:02 EDT >> To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > > Now, if people hit 'reply', each former recipient gets an answer. It's > *your* fault, not Jon's. > > Solution: stop sending a copy of your messages to yourself. Or, if you > absolutely need a copy, send it as BCC. > > > While you're reconfiguring, stop sending your post as text *and* HTML. > That will save you even more bandwidth than if you get answers twice. 
> > > > -- > Marcello Perathoner > webmaster at gutenberg.org > From jon at noring.name Mon Oct 29 15:26:19 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 16:26:19 -0600 Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <1035758925.20071029155209@noring.name> Message-ID: <1373634448.20071029162619@noring.name> Michael wrote: > Jon, over all the years I've done eBooks, only one other person > than yourself has ever mentioned who should be on the Board and > we had already offered that person a board position. O.k., thanks. I think my suggestion is for PGLAF to look at the importance of remaking its board. I believe it will provide positive value in several ways I've talked about before. This does not mean that PG can't continue to foster all kinds of independent projects (which is actually good). But having a good core Board may not only better help proposed projects but possibly foster more. One should not underestimate the value and potential of a good Board. Jon From hart at pglaf.org Mon Oct 29 15:28:23 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 15:28:23 -0700 (PDT) Subject: [gutvol-d] Focus on the ideas, not the person In-Reply-To: <391113694.20071029160224@noring.name> References: <308293095.20071029130607@noring.name> <391113694.20071029160224@noring.name> Message-ID: OK, Jon has suggested returning to reality, and I agree. Nothing more need be said on the unreal. On Mon, 29 Oct 2007, Jon Noring wrote: > Michael Hart wrote: > >> No one I know has any idea of Mr. Noring's "motives," "wants," >> etc., for the simple reason that he never presents any goal to >> consider other than that he and his cabinet should control PG. 
> > Ah, this is the crux. > > You have no basis to say this Michael, because I *never* said I wanted > control, nor advocated any structure where *I* and my "cabinet" would > have control. (whoever my "cabinet" is) > > Do you have any evidence to back up this ridiculous charge? Or did you > just imagine I'm part of the Trilateral commission or something. > > Please Michael, return to reality. > > Jon Noring > From hart at pglaf.org Mon Oct 29 15:34:08 2007 From: hart at pglaf.org (Michael Hart) Date: Mon, 29 Oct 2007 15:34:08 -0700 (PDT) Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: <1373634448.20071029162619@noring.name> References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <1035758925.20071029155209@noring.name> <1373634448.20071029162619@noring.name> Message-ID: The idea of the proposed projects is still lacking, that's apparently where our "realities" differ. Jon want the political power of the Board before an assortment of projects is proposed. However, this is putting the cart before the horse. In Project Gutenberg the political power is ignored in favor of a "Just Do It!" kind of attitude. What Jon wants is the political power without works on the projects that would earn it. Because we offer that kind of power free to all who ask for it, Jon doesn't seem to want it. He seems to want the other kind of power. Power over others without. . . . enough said. . .I hope. . . mh On Mon, 29 Oct 2007, Jon Noring wrote: > Michael wrote: > >> Jon, over all the years I've done eBooks, only one other person >> than yourself has ever mentioned who should be on the Board and >> we had already offered that person a board position. > > O.k., thanks. > > I think my suggestion is for PGLAF to look at the importance of > remaking its board. I believe it will provide positive value in > several ways I've talked about before. 
This does not mean that PG > can't continue to foster all kinds of independent projects (which is > actually good). But having a good core Board may not only better help > proposed projects but possibly foster more. One should not > underestimate the value and potential of a good Board. > > Jon > > > From jon at noring.name Mon Oct 29 15:52:32 2007 From: jon at noring.name (Jon Noring) Date: Mon, 29 Oct 2007 16:52:32 -0600 Subject: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board In-Reply-To: References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <1035758925.20071029155209@noring.name> <1373634448.20071029162619@noring.name> Message-ID: <1786059042.20071029165232@noring.name> Michael wrote: > enough said. . .I hope. . . Well, you've had your say, and I've had mine. So we'll leave it to the others who are following this (if anyone!) to decide for themselves. Jon From klofstrom at gmail.com Mon Oct 29 16:03:28 2007 From: klofstrom at gmail.com (Karen Lofstrom) Date: Mon, 29 Oct 2007 13:03:28 -1000 Subject: [gutvol-d] Founder's syndrome Message-ID: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> If Jon is to be ignored because he hasn't done enough books, perhaps my voice will be heard. I've proofed some 39,000 pages at DP, over the course of four years, and post-processed several books. PG has a bad case of Founder's Syndrome: http://www.help4nonprofits.com/NP_Bd_FoundersSyndrome_Art.htm It's frequent, it's predictable, it's the common cold of non-profits. Jon has raised some sensible questions about governance and ownership of the trademark and they shouldn't be dismissed with accusations that Jon is trying to take over PG. Alas, I don't expect that I *will* be heard. So I won't belabor the point. -- Karen Lofstrom From Bowerbird at aol.com Mon Oct 29 17:34:07 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 29 Oct 2007 20:34:07 EDT Subject: [gutvol-d] the z.m.l. 
dingus Message-ID: ok, i'm not really sure if i'm ready to make this public yet, but let's give this "reality" a whirl and see what happens... i've mentioned here before my zml-to-html converter, at: > http://z-m-l.com/go/vl3.pl this takes pre-formatted .zml texts and auto-converts them to an .html version that is displayed right there on the page. that gives you a _taste_ of pudding, so it is a nice demo, but it doesn't let you stick your finger in the pudding and swirl it. so here's the z.m.l. dingus: > http://z-m-l.com/go/zmldingus093.pl it's live. like a wiki. you edit the field, click the "do it" button, and boom, whatever you edited gets converted _from_ z.m.l. into .html. you can click in the same preformatted texts, if you want, and confirm they still work. but you can also enter your own stuff. to get you started, click "skeleton" to pull in a bare-bones file. of course, if what you enter is not "correct" z.m.l., then it won't get converted right. indeed, since the dingus is "in-progress", you might even do "correct" z.m.l. and have it come out wrong. in such a case, i'd like to know about it. if the output isn't right, there is a chance your input is wrong... so please do make sure that your input is "correct" z.m.l. first... because if i get a bunch of people saying "it doesn't work right" when the _real_ problem is that they just fed it some bad input, i'll just shut the free-entry dingus off again, and continue with the preformatted stuff i _know_ is right, to prove my pudding... -bowerbird p.s. it has display glitches in internet explorer -- imagine that! -- where you'll find the generated .html _underneath_ the editfield. on most other browsers, you should find them to be side-by-side. the "w=##" and "h=##" fields let you adjust the height and width of the edit-field, since c.s.s. seems not to affect an .html editfield. 
************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071029/31fdcf47/attachment.htm From editor at pg-news.org Mon Oct 29 17:55:22 2007 From: editor at pg-news.org (Mike Cook) Date: Tue, 30 Oct 2007 00:55:22 -0000 Subject: [gutvol-d] Since MIchael brought up some points, e.g. the PGLAF board In-Reply-To: References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> Message-ID: <001901c81a8f$8e2034e0$aa609ea0$@org> >> Jon has already had all the answers to these questions, >> and he knows it Perhaps he has...and perhaps he hasn't...but I would be very interested in hearing a response to those questions put forward by Jon. Mike -----Original Message----- From: Michael Hart [mailto:hart at pglaf.org] Sent: 29 October 2007 21:23 To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Since MIchael brought up some points, e.g., the PGLAF board Jon has already had all the answers to these questions, and he knows it, he is just trying once again to move a certain conversation to the point where he can expound, again, at length, how he thinks things should be run. The most obvious and complete answer is, of course, the one you have heard the most often, that he is welcome-- nay, ENCOURAGED, to stop talking and start DOING and as with all suggested courses of ACTION, Project Gutenberg will provide as much assistance as possible. HOWEVER, as long as Jon is igNoring the call to ACTION, his words remain just that, words, and it is obvious-- thank goodness--that his words never carry any weight-- and no one has tried to elect him to anything. DOUBLY HOWEVER, if Mr. Noring SHOULD eventually amass a group who want him to lead them, Project Gutenberg will be only too glad to offer all possible assistance. As it has been, as it is, and we can only hope. . . .
Distributed Proofreaders is a perfect example, and this example should be more than enough to provide Jon with, we hope, all the encouragement needed. DP is its own entity, has its own leaders, and gets the support it requests from Project Gutenberg, just as Mr. Noring would/could/should have had if he were a willing worker as much as he is a willing talker. Now you understand why we don't censor what he says, he is just too good an example, we'd never find better. Thanks!!! Michael S. Hart Founder Project Gutenberg On Mon, 29 Oct 2007, Jon Noring wrote: > Oops, need to clarify a couple points in my prior message. I > wrote: > >> I very much hope that Michael will provide us the exact, >> detailed reasons why PGLAF should continue to be organized as >> it is, and why he should personally hold the trademark, which >> is universally "not recommended." He has not yet done so. > > Now Michael has given reasons here and there, but in my opinion > they are not yet sufficient, nor cogently and objectively > organized, to form a final answer on his part. Hopefully he > will fully clarify his thoughts to the level or organization > where he can simply repost it whenever someone brings it up > again. Of course, he can refuse to answer this request, but > that does not help for transparency of an organization/movement > whose whole philosophy is built around transparency and > openness, and relies upon thousands of volunteers. > > To summarize, I'd like to hear Michael's beliefs on: > > 1) Why is the PGLAF Board made up the way it is -- what is > Michael's > philosophy towards the purpose and role of the Board, and > what > kind of people should not serve on the PGLAF Board (besides > me, of > course. ) > > I also continue to be mystified why at least Juliet > Sutherland or > someone else from DP is not on the PGLAF Board (maybe she was > asked > and turned it down, but this is important to know given the > importance of DP to the PG "movement".) 
> > 2) What value he sees in himself personally holding the PG > trademark > rather than turning it over to PGLAF. How does personal > ownership > benefit the long-term mission of the Project Gutenberg > "movement"? > > >> Regarding #1, I mentioned that the current PGLAF Board is a >> rump board. No one on that Board has real experience with >> digitizing the Public Domain, nor are notable in any sense in >> the arena. > > Obviously, one of the four Board members listed at the URL I > gave, and I assume that is an updated list, is Dr. Greg Newby, > who is the Board Chair and also the CEO. I was referring to the > other three members. Refer to: > > http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Literary_Archive_Found ation > > > > Jon Noring > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Mon Oct 29 18:23:56 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 29 Oct 2007 21:23:56 EDT Subject: [gutvol-d] some responses for mike Message-ID: mike said: > I would be very interested in hearing a response > to those questions put forward by Jon. since anyone seems to be able to put forward the questions, maybe anyone is able to put forward some responses for mike. so here are mine: 1. the project gutenberg board has decided the current constitution of the p.g. board is just fine. they'll let you know if they change their mind. 2. michael hart, who owns the project gutenberg trademark likely because he's the person who ordered and paid for it, not to mention who nurtured the project through its first decades without much support from _anyone_, thinks his ownership is just fine. he'll let you know if he changes his mind. if you don't like those, try these: 1. 
actually, it never occurred to the people on the board that someone would object to the volunteer service they rendered for many years, so they haven't even really thought about putting anyone else on the board, but if they were questioned, they'd wonder why you brought the issue up. 2. michael sleeps better at night knowing that he owns the trademark. you know how parents worry about their children when they're out late. i'm on a roll now: 1. we don't need no stinkeen' board. 2. i sleep better at night knowing that michael owns the trademark. or, if you don't like any of those, how 'bout these? 1. what's it to you? 2. what's it to you? perhaps you'll get the "flavor" of these last two responses if you remember that project gutenberg was conceived and nurtured not far from chicago... so if you will put a chicago construction worker "accent" on those answers, you might grok them a bit better... anyway, let me know if you need any more, and i'll do my best... :+) -bowerbird From Bowerbird at aol.com Mon Oct 29 18:42:55 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 29 Oct 2007 21:42:55 EDT Subject: [gutvol-d] "founder's syndrome" Message-ID: i'll have a response to that silly "founder's syndrome" message, too... indeed, i hope michael doesn't even feel the need to explain himself. i'll go into it in more detail later, but i'm going out to dinner now... in a nutshell, though, the answer is that michael -- _intentionally_ -- set out to build a different kind of "non-profit" organization than the "typical" one _some_ people here now seem to want him to have built. jimmy wales now has the same kind of "problem" over at wikipedia...
(indeed, his is even _more_ pronounced, because he has to deal with all the people who want him to "capitalize" on all of his "page-views". at least michael doesn't have to appear to "give up" a bunch of cash.) make no mistake about it, the "networking" power of the internet can create a lot of money. google is on its way to being the richest business in the history of the planet, eclipsing a good many nations of the world. but all that "networking" power can _also_ be used for _collaboration_... and when it _is_ used for that purpose, it will _transform_the_world_... so it depends on if you wanna live in the old world, where greed rules, or the new world, where people live together in peace and harmony... once you've really "tuned in" to the idea of "unlimited distribution", you will get it. that's a notion that makes no sense in the old world. but really, it's not about "distribution" at all, as that implies "product". the essence of this new world is that it is about generous spirituality. but, you know, people still have to eat. and now i'm late for dinner... :+) -bowerbird From ksclarke at gmail.com Mon Oct 29 18:44:28 2007 From: ksclarke at gmail.com (Kevin S. Clarke) Date: Mon, 29 Oct 2007 21:44:28 -0400 Subject: [gutvol-d] Since MIchael brought up some points, e.g.
the PGLAF board In-Reply-To: <001901c81a8f$8e2034e0$aa609ea0$@org> References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <001901c81a8f$8e2034e0$aa609ea0$@org> Message-ID: <3557b8d0710291844k1c2868e0s262005d0bdf792c8@mail.gmail.com> I'm curious too, but I think the answer might be gleaned from reading: http://www.gutenberg.org/wiki/Gutenberg:Administrivia_by_Michael_Hart It seems the purpose of the board is to do as little as possible. The organization seems to encourage "being the change" they'd like to see (it offers no incentives to bring about any one given change). That makes Michael Hart's responses for Jon Noring to DO something make a little more sense (though it didn't seem like much of an answer to the questions asked to me before I read the above URL). I think what it boils down to is that the questions Noring is asking aren't of much interest to PG as an organization. I think a good example of this "be the change" perspective is seen in PG moving towards XML as a format. When I've seen it discussed over the years, the answer always seems to be, "If you are interested do it." There isn't any movement on from the PG organization; they aren't interested in it, it seems... the question always falls back to "What standard? We can't reach agreement." So, Noring, why try to make PG something other than what it is? It doesn't seem like the contributors are clamouring for something different. On a related note, anyone attempted lately to do autoconversion from the plain text formats to XML (any format)? Any luck with it? Just curious, Kevin From marcello at perathoner.de Tue Oct 30 04:12:07 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 30 Oct 2007 12:12:07 +0100 Subject: [gutvol-d] Since MIchael brought up some points, e.g. the PGLAF board In-Reply-To: <001901c81a8f$8e2034e0$aa609ea0$@org> References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <001901c81a8f$8e2034e0$aa609ea0$@org> Message-ID: <47271187.5090206@perathoner.de> Mike Cook wrote: >>> Jon has already had all the answers to these questions, >>> and he knows it > > Perhaps he has...and perhaps he hasn't...but I would be very interested in > hearing a response to those questions put forward by Jon. Me too! Those were the very questions that emerged on this list during the aftermath of the "PG II incident". And even if the answers were already given, a short summary would not hurt at this place. -- Marcello Perathoner webmaster at gutenberg.org From richfield at telkomsa.net Tue Oct 30 04:54:38 2007 From: richfield at telkomsa.net (Jon Richfield) Date: Tue, 30 Oct 2007 13:54:38 +0200 Subject: [gutvol-d] Harmless monsters Message-ID: <47271B7E.5000408@telkomsa.net> You know folks, much of the tone of this forum is frustrating.
In many on-line forums mutual satire, extending to outright abuse, is appropriate, widely enjoyed, and even admired; their regulars frequent them for just such performances, a sort of verbal all-in-wrestling to please those still callow enough to be impressed by the delusion that a flaming amounts to a flaying, and that conveying an insult in long words or capitals will cow an opponent and thrill the groupies. However, in a forum of literate people, where there is work to be done, the appropriate emblem might be the bitten tongue. Unfortunately, the more strongly anyone feels about the superiority of his own ideas or products, the more passionately he is likely to resent rival ideas or slighting responses, and accordingly, the more spitefully he is likely to retaliate for any offence, real or fancied. The problem is that in the resulting fuss and bother, sound points deserving fair consideration, or weighing against each other in appropriate contexts, are lost or distorted, without compensatory benefit to anyone. They are hardly worth even a smirk from a competitor who imagines that he has administered a well-deserved gob-smacking. The plain fact is that this little playpen is no heavyweight boxing ring. In terms of literary intimidation it has yielded neither an Ali to cower before, nor a Bierce to enjoy, nor yet a Swift to respect, just a few intrusive Donald Ducks to ignore. I ask you: on reading the most vituperative exchanges during, say, the last few months, was there a solitary one that, if its like had occurred in a kindergarten, you would have dignified with special attention? Is there one quip or insight that you were tempted to frame for your desk or memorise for your next literary dinner? As for putting anyone's name on a kill list, suit yourself of course, but it amounts to sulking and is about as effective.
The participants might be dead losses as polemicists, and not all their ideas worth the paper that one hopes they are not printed upon, but sifted from the dreck, some of the actual substance of their material is professional and may be rewarding. But, you insist, you are too thin-skinned to put up with the nonsense or malice pervading the writings of certain parties? Your blood pressure cannot take the nastiness or the stupidity? Well, bad luck yer 'avin! You will just have to put up with missing (unfortunately) something like half the substance of the forum, and console yourself with my overflowing sympathy, and no doubt, that of some correspondents with no time to waste on all that nonsense. If that is your view, please be very careful not to reflect on the altogether higher blood pressure attendant on contemplating the joys you missed by ignoring the bonnest of their mots and leaving them to die in silence in the empty house. When one pays things attention commensurate to their sources, it can be quite startling to find how soon one forgets to give a damn, or even fails to notice anything to damn. Winnowing the relevant material from the chaffing becomes fully automatic. It calls to mind one of Pope's more pungent observations (which seems not to be anthologised in any material that I have seen in PG). He was embroiled in an ink-and-spittle match with one John Dennis who matched him in smallness of spirit, if not of person, but not in largeness of talent. Should Dennis print how once you robb'd your brother, Traduc'd your monarch and debauched your mother; Say what revenge on Dennis can be had, Too dull for laughter, for reply too mad? Of one so poor you cannot take the law; On one so old your sword you cannot draw. Uncag'd, then, let the harmless monster rage, Secure in dullness, madness, want and age.
Alexander Pope (From Cohen, "More Comic and curious verse" Penguin 1956) In practice Pope did not follow his own advice (knowing anything about his nature, how surprising do you find that?) but in this forum it works for me. My blood pressure is fine, thanks for asking, and skimming the digested input is far less problematic than tuning kill criteria to keep protecting oneself adequately and harmlessly from correspondence from undesirable sources. I live in hopes of some day stumbling across some really worth-while squelch or insult from the munchkins, but faced with their barrenness so far, it is just as well that I have other reasons for such skimming as I do. Meanwhile, can everyone else please refrain from taking prisoners, refuse to show any mercy, and concentrate on matters in hand, instead of rewarding undeserving tantrums? Cheers, Jon From nwolcott2ster at gmail.com Tue Oct 30 08:33:51 2007 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Tue, 30 Oct 2007 10:33:51 -0500 Subject: [gutvol-d] Harmless monsters References: <47271B7E.5000408@telkomsa.net> Message-ID: <008401c81b0a$527ce980$660fa8c0@atlanticbb.net> As I suppose many I use the Outlook Express to move certain author's emails to a junk file where I can later peruse them or delete them. (28 on Oct 28th) Unfortunately the list of offending names seems to grow and grow, and the amount of good information reduces in proportion. . nwolcott2 at post.harvard.edu From lee at novomail.net Tue Oct 30 10:01:14 2007 From: lee at novomail.net (Lee Passey) Date: Tue, 30 Oct 2007 10:01:14 -0700 Subject: [gutvol-d] Since MIchael brought up some points, e.g. the PGLAF board In-Reply-To: <3557b8d0710291844k1c2868e0s262005d0bdf792c8@mail.gmail.com> References: <913787777.20071029123635@noring.name> <1103954005.20071029133647@noring.name> <001901c81a8f$8e2034e0$aa609ea0$@org> <3557b8d0710291844k1c2868e0s262005d0bdf792c8@mail.gmail.com> Message-ID: <4727635A.9030305@novomail.net> Kevin S. Clarke wrote: [snip] > On a related note, anyone attempted lately to do autoconversion from > the plain text formats to XML (any format)? Any luck with it? Just > curious, > > Kevin I gave up on trying to convert PG's Impoverished Text Format to any XML vocabulary several years ago, when I concluded that it would be impossible to do it in any meaningful way. Most of the work in this area has been done by BowerBird.
Five years ago he was claiming that he would 'soon' be able to write a program that would 'intuit' markup from a PG file, much like a human being can. I think that he realized that this was too great a task for him alone, so he started developing a new markup language which, to the uninitiated, would look like a PG ITF file, but which had subtle, almost imperceptible markup that a computer could detect. He has since written a perl script which would take a file written in his markup language, which he calls z.m.l., and convert it to HTML. I haven't looked at the output carefully enough to determine if it is XHTML, which is the direct answer to your question, but in any case it would always be possible to take BB's HTML (assuming it is valid) and run it through Tidy to get valid XHTML. As of today I don't think an automated conversion process from PG ITF to an XML vocabulary exists. You could, of course, do a hand conversion of PG ITF to z.m.l., and from there use BB's perl script and Tidy to get XHTML to automate at least a portion of it. Of course, there are any number of ways to autoconvert PG ITF to non-meaningful XML vocabularies, but I don't think that's what you were asking. -- Nothing of significance below this line. From Bowerbird at aol.com Tue Oct 30 10:53:47 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 30 Oct 2007 13:53:47 EDT Subject: [gutvol-d] autoconversion from the plain text formats to XML Message-ID: kevin said: > anyone attempted lately to do autoconversion > from the plain text formats to XML (any format)?? > Any luck with it?? Just curious, i'm here on a regular basis giving advice on that... :+) there are any number of light-markup formats that _could_ be used to generate x.m.l. output, from markdown (the frontrunner i always plug) to ascii-doc (the newest contender i mentioned): > http://www.methods.co.nz/asciidoc/index.html and there's no question that these methodologies _can_ generate the complexity of markup required. 
markdown is being used all over cyberspace now. restructured text is used by the python community as the light-markup for all of their documentation. ascii-doc is used for a number of technoid things. then, of course, there is my zen markup language, which is actually _based_ on the pg-ascii format... this means that "conversion" of the e-texts to .zml is almost fully automatic, with the main exception being the front-matter, which is typically a mess. (the largest offenders on this are the title-pages. while i believe i can code routines to fix them too, interacting with my clean version of p.g.'s catalog, the unfortunate fact is i have not yet _done_ that. once it is done, though, conversion of the entire project gutenberg library is just one click away.) although the other light-markup methodologies make x.m.l. by default, and mine could do x.m.l., i don't usually stress x.m.l. as an output format, since i don't wanna give ideas to my antagonists. (they still cling to the idea that "it's impossible".) but if you want the fast-track to an x.m.l. library, light-markup is the solution you're looking for... of course, the _real_ problem is the _inconsistency_ of the library in its current state, which would need to be remedied before you can do any conversions. (i am using routines i developed over several years, but i'm not sharing them with my opponents here.) but if you do _that_, followed by auto-conversion, then what's the purpose of x.m.l. in the first place? it makes more sense to use the light-markup files as your "master", especially since they're infinitely more malleable than crusty bloatware x.m.l. files... (because even when they aren't, a conversion to the x.m.l. variant is just one button-click away...) that's the existential conundrum of heavy-markup. until it can be applied _automatically_, it's too costly. but if it can be auto-applied, it becomes unnecessary. the best it can hope for is to be a transitory middleman. 
and once you understand this conundrum, it's funny... -bowerbird From jon at noring.name Tue Oct 30 11:29:15 2007 From: jon at noring.name (Jon Noring) Date: Tue, 30 Oct 2007 12:29:15 -0600 Subject: [gutvol-d] autoconversion from the plain text formats to XML In-Reply-To: References: Message-ID: <907090513.20071030122915@noring.name> Bowerbird wrote: > although the other light-markup methodologies > make x.m.l. by default, and mine could do x.m.l., > i don't usually stress x.m.l. as an output format, > since i don't wanna give ideas to my antagonists. > (they still cling to the idea that "it's impossible".) Well, since you've added me to the "it's impossible" list without explaining what you mean by "impossible," I've noted for a while now that one can go from normalized plain text (where the rules for normalization identify document structures and stuff), into XML of any compatible vocabulary. In fact, it's intrigued me to talk with a script friend of mine and see how long it would take him to write a script to take ZML (assuming we have the latest rule set) and convert it into a pre-defined XML vocabulary, probably XHTML with a couple pre-defined classes. What would be more interesting is to reverse the process from that exactly defined XML vocabulary back into ZML. Again, I know it can be done. (Yes, I know you've already written a ZML to crappy SGML-based HTML script, but again I knew this could be done before you did it.) The thing is it can be done. Just like we know when you drop a bowling ball from a ten story building it will not fly up but rather down to the ground, we know that when one has ZML one can convert it to XML. No need to perform the experiment since we know the result.
The key is: is it worth it to spend the time writing the scripts? There are other things requiring experimentation to see if our speculations hold true or not -- this is not one of them. > but if you want the fast-track to an x.m.l. library, > light-markup is the solution you're looking for... If you replace "is" by "may be", then what you say will be correct, in my opinion. Using the word "is" is sales jargon. > of course, the _real_ problem is the _inconsistency_ > of the library in its current state, which would need > to be remedied before you can do any conversions. > (i am using routines i developed over several years, > but i'm not sharing them with my opponents here.) Funny thing how you continue to treat this as a game, a competition. Whatever happened to peace and harmony and working with your fellow man and all of that? Open source collaboration, etc. > but if you do _that_, followed by auto-conversion, > then what's the purpose of x.m.l. in the first place? This is a loaded question since it presupposes that ZML is sufficient, and we've noted time and again that that remains to be seen, and many of us believe it is not sufficient. Such as how to handle blockquotes (which themselves can be documents -- I even suggested a tweak to ZML to make it work). Unless you can demonstrate how to properly identify a block quote from verse, ZML is insufficient. (Now maybe you've fixed this deficiency in ZML, but I've not seen it in your online rule set, or are you hiding the latest ZML rules as proprietary?) Jon Noring From hart at pglaf.org Tue Oct 30 12:30:48 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 12:30:48 -0700 (PDT) Subject: [gutvol-d] !@! 
Re: Founder's syndrome In-Reply-To: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> References: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> Message-ID: The real issue about Jon's frequent efforts to rewrite PG in his own image is that Jon really hasn't presented this image for anyone to consider. . .something we have all in a number of ways given him permission and encouragements, again and again, to do. Whatever Jon wants to do, we will gladly support, as will we support any number of other such parties. However, it seems to come down to the idea that Jon wants to control the means of such support not just receive it, and I haven't seen anyone support that idea/ideal. Jon wants not only to control what he does, what his own, should they arise, army of volunteers does, but what some others would be encouraged or discouraged to do, at least that's what it has always look to me like, over the years as Jon has tried various suggestions to stack the Boards, get control of the trademark, make money the top priority and the rest of his agenda, none of which seems to be the actual process of getting more eBook to more people. I would be only too happy to see Jon raise his army, have some use for the support we continually offer him, and do something really truly great in the world of eBooks. Very little would make me happier. . . . However, as also said before, this is not to be at a huge expense of giving Jon control over PG at large, either in budgetary considerations [not that we have a real budget, since we don't have any real money], or in terms of Board membership, placement as CEO, etc. We will help Jon do whatever project he has in mind. We will not help Jon or anyone else gain political power, simply because we don't believe in political power. 
Jon WANTS there to be political power, financial power or even more kinds of power in Project Gutenberg, but, those who could have had such power do NOT want it in their own hands or in anyone else's. That, in a nutshell, is the reply to Mr. Noring's effort. And, it has all been said pretty much that way before. I, personally, do not like referring people to archives-- or just pulling out pieces of archives and reprinting the ones I find most relevant. . .it's too much of the past-- and even though Mr. Noring's arguments are of the past in very large part, I find it more worthwhile to consider it in the present, especially since he does. As for "Founder's Syndrome," I continue to encourage each of you to found her/his own projects and to expect a full order of all support we can muster, then you can each say whatever you like about "Founder's Syndrome" from the one point of view you haven't had. . .as Founder. No one here is upset about "Founder's Syndrome" rather it is just the opposite, someone straining to attain powers, so great, they COULD create their own Founder's Syndrome. Project Gutenberg has never relied on political power and financial power, and that is real issue here, to create a power system than can then be manipulated or taken over. Not going to happen. . . . "Freedom, as knowledge, is best served when all have it." And you can quote me on that. Thanks!!! Michael S. Hart Founder Project Gutenberg On Mon, 29 Oct 2007, Karen Lofstrom wrote: > If Jon is to be ignored because he hasn't done enough books, perhaps > my voice will be heard. I've proofed some 39,000 pages at DP, over the > course of four years, and post-processed several books. > > PG has a bad case of Founder's Syndrome: > > http://www.help4nonprofits.com/NP_Bd_FoundersSyndrome_Art.htm > > It's frequent, it's predictable, it's the common cold of non-profits. 
> > Jon has raised some sensible questions about governance and ownership > of the trademark and they shouldn't be dismissed with accusations that > Jon is trying to take over PG. > > Alas, I don't expect that I *will* be heard. So I won't belabor the point. > > -- > Karen Lofstrom > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From joshua at hutchinson.net Tue Oct 30 12:50:28 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Tue, 30 Oct 2007 19:50:28 +0000 (UTC) Subject: [gutvol-d] !@! Re: Founder's syndrome Message-ID: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> >----Original Message---- >From: hart at pglaf.org > >as Jon has tried various suggestions to stack the Boards, >get control of the trademark, make money the top priority >and the rest of his agenda, none of which seems to be the >actual process of getting more eBook to more people. > Jon has not done any of the above things in any of the MANY notes I've read. You've accused him of it often enough, true, but that isn't the same thing. He has asked about the makeup of the board. Even suggested folks he thought would be good folks to have on the board. Never once has he suggested himself OR any of the people you've accused of being his "groupies" (your term from a couple messages back). You've never answered how the board was constituted, but then again that may not really be your area. Jon should probably aim that question to Greg. He has asked why you own the trademark and not the PGLAF foundation. In fact, LOTS of people have asked this question. To be honest, I kind of remember Greg saying it was at the direction of a lawyer at some point, but I don't remember for certain. An actual answer to this would be appreciated. He has said making money would be a good thing. 
But he never said it was making money for profit, but as a way to increase the amount of good work PG puts out there. Whether you agree with this (and I don't know that I do), it is most definitely not an evil plot, as you insinuate. Now, as far as personally working to get more books in more hands ... you probably have a point as far as his PG efforts are concerned. Other than the My Antonia scans, I don't know of a whole lot he has done. But that doesn't stop folks from having opinions and shouldn't stop him from expressing them. Provided he is polite and all ... which he usually is. More so than I am, oftimes. Josh From hart at pglaf.org Tue Oct 30 12:54:42 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 12:54:42 -0700 (PDT) Subject: [gutvol-d] Founder's syndrome In-Reply-To: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> References: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> Message-ID: A bit about the specifics of "Founder's Syndrome" Here are the basics from the article mentioned: http://www.help4nonprofits.com/NP_Bd_FoundersSyndrome_Art.htm 1. Ideas of the Founder are "rubber stamped" With Project Gutenberg it is just the opposite: What Mr. Noring is really upset about is that ALL ideas in PG are "rubber stamped" to a nearly 100% degree. Mr. Noring would prefer the kind of political power that gets its real power from saying "NO!" Not going to happen. . . . Project Gutenberg will continue to encourage all efforts. 2. The current leadership took over in "tough times" or as "a start-up" or through "a growth spurt" or surviving types of "financial collapse," etc. Sorry, but none of this has happened here. No one took over in "tough times." No one took us through "financial collapse." Everyone is encouraged to found "a start-up" of their own. 
Greg Newby, Harry Hilton, Mark Zinzow, the board members of Project Gutenberg, have all been here for 20 years, and all encourage anyone who would like to start their own projects in every way they possibly can. The Distributed Proofreaders is a perfect example. Do you think they bought their own super-scanners? 3. "$12 million community powerhouse" Sorry, PG hasn't had even $1 million in all 37 years. . . . Yes, we ARE a powerhouse, but not through financials. . . . We are a powerhouse because we are all volunteers, and thus no one can stop us by stopping our funding. 4. "The Founder might lose his/her total control of the organization. Boards of these organizations usually don't govern, but instead 'approve' what the founder suggests." What Mr. Noring hates the most is that I do NOT exercise "control of the organization" much less "total control," and that neither does the Board of Directors. Instead, everyone is encouraged to try their projects to make the world of eBooks a better place. Mr. Noring would prefer a system that had such control. At least that's the way I hear his proposals for Boards of Directors and money as the top priority. He is welcome to that kind of control in whatever types of project he creates, over however many volunteers his army is made up of, but that will never be 100%, which, truly, is as it should be. Thanks!!! Michael S. Hart Founder Project Gutenberg From Bowerbird at aol.com Tue Oct 30 12:54:44 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 30 Oct 2007 15:54:44 EDT Subject: [gutvol-d] !@! Re: Founder's syndrome Message-ID: michael said: > Project Gutenberg has never relied on political power and > financial power, and that is real issue here, to create a > power system than can then be manipulated or taken over. and what's _most_ interesting about this is the _meta_ level... anyone can build an organization that rejects a power system.
a few people can even make such an organization _work_well_. but michael has gone beyond even that. michael says, "if you want to create your own organization that relies on a power system, you can use my p.g. content to do it!" wow. that's impressive. michael is so confident in the superiority of his methodology that he is willing to _give_other_people_ the fruit of its labor so they can build upon it to create a competing organization. he'll even give that competing organization _webspace_ and pay for their _bandwidth_. and do it all with a happy heart... that takes _balls_. -bowerbird From jon at noring.name Tue Oct 30 13:03:53 2007 From: jon at noring.name (Jon Noring) Date: Tue, 30 Oct 2007 14:03:53 -0600 Subject: [gutvol-d] Wow! Re: !@! Re: Founder's syndrome In-Reply-To: References: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> Message-ID: <76647410.20071030140353@noring.name> Michael wrote: > The real issue about Jon's frequent efforts to rewrite PG > in his own image is that Jon really hasn't presented this > image for anyone to consider. . .something we have all in > a number of ways given him permission and encouragements, > again and again, to do. > Whatever Jon wants to do, we will gladly support, as will > we support any number of other such parties. > > > [and more things about me that I didn't know about myself.] Wow! Well, I certainly now know what it feels like to be tarred and feathered in a public forum! (I've experienced similar over the years, but this is probably the most profound.) Anyway, I will not reciprocate in kind since that serves no good purpose. What is interesting is that Michael's message was in response to a message posted by Karen Lofstrom, and which I never replied.
For those still reading this, three questions to ponder: 1) How appropriate was Michael's answer to the mission of PG? 2) Was it important for Michael, the founder of the PG movement, to write what he did, focusing on my character and his perception of my nefarious designs towards PG? (Yes, he made some general philosophical points, but I appeared to be the "example" he wanted to use.) 3) Will this group, gutvol-d, now be a better place because of that message? Jon Noring From hart at pglaf.org Tue Oct 30 13:15:38 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 13:15:38 -0700 (PDT) Subject: [gutvol-d] !@! Re: Founder's syndrome In-Reply-To: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> Message-ID: On Tue, 30 Oct 2007, joshua at hutchinson.net wrote: >> ----Original Message---- >> From: hart at pglaf.org >> >> as Jon has tried various suggestions to stack the Boards, >> get control of the trademark, make money the top priority >> and the rest of his agenda, none of which seems to be the >> actual process of getting more eBook to more people. >> > > Jon has not done any of the above things in any of the MANY notes I've > read. > > You've accused him of it often enough, true, but that isn't the same > thing. Just go back to where he proposed duplicating Stanford's Board. Then when he nominated Brewster Kahle. Did he not also nominate Richard Stallman around that time? Am I the only one who remembers? What need has PG of a Board structure of a billion dollar a year major world university? If you have me listed as using the term "Groupies" then I do not doubt that you may also have mixed up everything else. As for Kahle and Stallman, I have probably spent more time with them than anyone here, both in person, on the phone or email. I think they do a fine job with their own organizations, but it is not the same in either case as the goals of Project Gutenberg. 
If you want your own Board structures, or places on them, just do what Distributed Proofreaders did. . .you don't even have to make yourselves a corporation, they didn't for a long time. If you are successful, as we hope you are, you can either stay as a part of Project Gutenberg or become an affiliate; as suggested below, some of this is Greg's turf, I'm not up on all the legals, as to what constitutes an "affiliate" or "partner" or other. I have answered the trademark question before, and will again, to your never to be had satisfaction. . .I believe in the separation of powers. . .I don't want someone to be able to take over Board of Directors positions and thence all claim to everything that is named "Project Gutenberg." I give everyone who asks permission to open their own sites in an approved "Project Gutenberg" legal fashion: again you might want to ask Greg about all the legal details, but as soon as they will be accomplished there is no other impediment. The only thing I see here is more of the power to say "NO!" Project Gutenberg is a resounding "YES!" And I think THAT is what bothers Mr. Noring and the others whom I think would NOT share the extremely "Open Door Policies" of Greg, Harry, Mark, myself, and the rest of the people who just want the eBooks to get out there to as many people as possible. Most Boards, CEO's and Founders, say "NO!" most of the time. . .! Project Gutenberg says "YES!" nearly all of the time. . . ! THAT is what you see me fighting so hard to preserve here!!!!!!! Thank You!!! Give the world eBooks in 2007!!! Michael S. Hart Founder Project Gutenberg 100,000 eBooks easy to download at: http://www.gutenberg.org [coming up on 25,000 eBooks] http://www/gutenberg.cc [already passed 75,000 eBooks] http://gutenberg.net.au Project Gutenberg of Australia 1500+ http://pge.rastko.net 65 languages PG of Europe ~500 http://gutenberg.ca Project Gutenberg of Canada Blog at http://hart.pglaf.org > He has asked about the makeup of the board.
Even suggested folks he > thought would be good folks to have on the board. Never once has he > suggested himself OR any of the people you've accused of being his > "groupies" (your term from a couple messages back). You've never > answered how the board was constituted, but then again that may not > really be your area. Jon should probably aim that question to Greg. > > He has asked why you own the trademark and not the PGLAF foundation. > In fact, LOTS of people have asked this question. To be honest, I kind > of remember Greg saying it was at the direction of a lawyer at some > point, but I don't remember for certain. An actual answer to this > would be appreciated. > > He has said making money would be a good thing. But he never said it > was making money for profit, but as a way to increase the amount of > good work PG puts out there. Whether you agree with this (and I don't > know that I do), it is most definitely not an evil plot, as you > insinuate. > > Now, as far as personally working to get more books in more hands ... > you probably have a point as far as his PG efforts are concerned. > Other than the My Antonia scans, I don't know of a whole lot he has > done. But that doesn't stop folks from having opinions and shouldn't > stop him from expressing them. Provided he is polite and all ... which > he usually is. More so than I am, oftimes. > > Josh > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Tue Oct 30 13:55:25 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 13:55:25 -0700 (PDT) Subject: [gutvol-d] Wow! Re: !@! 
Re: Founder's syndrome In-Reply-To: <76647410.20071030140353@noring.name> References: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> <76647410.20071030140353@noring.name> Message-ID: On Tue, 30 Oct 2007, Jon Noring wrote: > Michael wrote: > >> The real issue about Jon's frequent efforts to rewrite PG >> in his own image is that Jon really hasn't presented this >> image for anyone to consider. . .something we have all in >> a number of ways given him permission and encouragements, >> again and again, to do. > >> Whatever Jon wants to do, we will gladly support, as will >> we support any number of other such parties. >> >> >> [and more things about me that I didn't know about myself.] > > Wow! > > > Well, I certainly now know what it feels like to be tarred and > feathered in a public forum! (I've experienced similar over > the years, but this is probably the most profound.) > > Anyway, I will not reciprocate in kind since that serves no good > purpose. > > What is interesting is that Michael's message was in response to a > message posted by Karen Lofstrom, and which I never replied. > > > For those still reading this, three questions to ponder: > > 1) How appropriate was Michael's answer to the mission of PG? > > 2) Was it important for Michael, the founder of the PG movement, to > write what he did, focusing on my character and his perception of > my nefarious designs towards PG? > > (Yes, he made some general philosophical points, but I appeared > to be the "example" he wanted to use.) > > 3) Will this group, gutvol-d, now be a better place because of that > message? > > > Jon Noring Jon brings up at least one point here we should really consider, how will this conversation be recorded in history, will Project Gutenberg "now be a better place because of that message?" As for his other comments, Jon has insisted on being the focal- point of this conversation, year after year. Should I just ignore him if he tries again next year? 
Should I have ignored all this this year? Now my reply: The reason I take so much time answering Jon's yearly efforts, such as they are, is to make sure people know I answer all the email I receive, other than the most obvious of trolls, spams, and the like, and to find out just how seriously persons might be taking these sorts of things. I answered again for what I considered obvious reasons, even a day after hoping the conversation was over for another year. That reason being that someone seemed sincerely concerned. However, the question still comes down to what would Jon do in the future with the power he wants that he can't do now? Personally, I can't think of anything, other than saying he is the whatever position he would like of Project Gutenberg as an entire entity, as opposed to his own team of volunteers. The real question to be considered in response to this message is "what will all this look like years from now?" I can only presume that it will continue to look as if Noring, over and over and over again, has tried to remake a Gutenberg, in his own image of what he thinks it should be like, but with political power to establish no policy. Let us not forget that policy is the root of politics, and the resulting term, political power. Jon has no policy. This is why my answers always include questions as to what Jon would like to DO. . .what ACTION he would like to take. . . . It seems all too obvious that Jon wants the power for itself on no other basis or he would have put forth any number of plans, projects, proposals, etc., over the years that would have made him the position he seems to desire so longingly. Mr. Noring is welcome, once again, to all the supports Project Gutenberg has offered to everyone else who asks, and to name a title to his own liking within whatever projects he does. He is more than free to do what he wants to do. Does anyone know what that is? Other than change Project Gutenberg as a whole to whatever? Why does Mr.
Noring feel this great need to Project Gutenberg, as a whole, when he has not even staked out his own portion of it as anyone and everyone is encouraged to do? Is it really not that completely obvious? This is why I answer. . .for the future. . . . So when anyone looks back on what has happened before, they're going to hopefully find they are not alone in the future. The question is: Will it be the future Mr./Ms. Noring's who look? Or will it be those s/he wishes to manipulate? Thank You For Your Time And Attention, and hopefully, for your support. . .in the years, decades, centuries to come. Michael S. Hart Founder Project Gutenberg From jon at noring.name Tue Oct 30 14:03:46 2007 From: jon at noring.name (Jon Noring) Date: Tue, 30 Oct 2007 15:03:46 -0600 Subject: [gutvol-d] !@! Re: Founder's syndrome In-Reply-To: References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> Message-ID: <1952325838.20071030150346@noring.name> Michael Hart wrote: > Just go back to where he proposed duplicating Stanford's Board. > > Then when he nominated Brewster Kahle. > > Did he not also nominate Richard Stallman around that time? So what is wrong with having a Board include a number of notables in the PD text digitization arena, and related areas such as open source software, who themselves are making things happen? Brewster would be a *great* addition to the PGLAF Board, as would one or two from DP. And Richard Stallman -- maybe (I don't have any feelings one way or the other for him, but suggested him as an example.) And remember, my proposals were part of an early 2004 call at the time, which followed the one and only PG "face-to-face" meeting at Brewster's place, for ideas on strengthening the PG organization. And I remember that part of the call was to strengthen the organization so as to increase donations from various sources, as well as other potential sources of revenue. 
PG may be austere and live on very little, but it still needs the good graces of money or its equivalent to keep going even at its current level of organization. So in good faith I made suggestions. I guess no good deed goes unpunished. > Am I the only one who remembers? No, I remembered, too. In fact, I remember a lot of things said at the 2003 PG "face-to-face" get together.... > What need has PG of a Board structure of a billion dollar a year > major world university? That's not the point. The point is good governance, and that includes people on the Board who are active in the field and with a proven track record in making things happen. It has several benefits. E.g., this helps build bridges to other organizations. The one thing I know everyone notices is how "isolated" PG is. Now some might say that is a good thing, while others say it is not a good thing. There are good reasons on both sides. I tend to fall on the side that PG is best served by coming out of its isolation *some*. I reject the notion that doing so will somehow taint the organization and lead to losing its "mojo". ***** Anyway, I think the point is that it is a stretch to claim I was trying to "take over" PG by my simple suggestions. Anyone with a rational mind would know I had and still have no power base to effect anything like that. Thus it is a ridiculous suggestion. In fact, I'm somewhat flattered that anyone would even think I have that level of power that scares them so. Geez, if I had that level of power, maybe I *should* use it? One good thing to come out of this is that at least, in the noise, the makeup of the PGLAF Board is being discussed, that the philosophy of the organization is being discussed, etc. This is healthy no matter what results. Jon Noring From grythumn at gmail.com Tue Oct 30 14:06:59 2007 From: grythumn at gmail.com (Robert Cicconetti) Date: Tue, 30 Oct 2007 17:06:59 -0400 Subject: [gutvol-d] !@! 
Re: Founder's syndrome In-Reply-To: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> Message-ID: <15cfa2a50710301406t31e57f13t2aa62c2636aaf3da@mail.gmail.com> On 10/30/07, joshua at hutchinson.net wrote: > suggested himself OR any of the people you've accused of being his > "groupies" (your term from a couple messages back). You've never Josh, I think you'll find that was Jon Richfield that first mentioned the word "groupies"... unless you are referring to a message that I have not received. I don't want to get entangled in the rest of it, but I would like to hear the reason why the PG trademark is set up the way it is. I also think there is a difference between a) mandating something, b) recommending something but accepting virtually anything, and c) accepting virtually everything and recommending very little. There seems to be little discussion of the middle path. R C (Who is not going to read gutvol-d for a few days until he finishes up some Rule 6 research.) From joshua at hutchinson.net Tue Oct 30 14:14:48 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Tue, 30 Oct 2007 21:14:48 +0000 (UTC) Subject: [gutvol-d] !@! Re: Founder's syndrome Message-ID: <22297019.1193778888173.JavaMail.?@fh1038.dia.cp.net> >----Original Message---- >From: grythumn at gmail.com > >On 10/30/07, joshua at hutchinson.net wrote: >> suggested himself OR any of the people you've accused of being his >> "groupies" (your term from a couple messages back). You've never > >Josh, I think you'll find that was Jon Richfield that first mentioned >the word "groupies"... unless you are referring to a message that I >have not received. > I apologize. I should have double-checked that attribution. Josh From hart at pglaf.org Tue Oct 30 14:37:41 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 14:37:41 -0700 (PDT) Subject: [gutvol-d] !@! 
Re: Founder's syndrome In-Reply-To: <1952325838.20071030150346@noring.name> References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <1952325838.20071030150346@noring.name> Message-ID: On Tue, 30 Oct 2007, Jon Noring wrote: > Michael Hart wrote: > >> Just go back to where he proposed duplicating Stanford's Board. >> >> Then when he nominated Brewster Kahle. >> >> Did he not also nominate Richard Stallman around that time? > > So what is wrong with having a Board include a number of notables in > the PD text digitization arena, and related areas such as open source > software, who themselves are making things happen? > > Brewster would be a *great* addition to the PGLAF Board, as would one > or two from DP. And Richard Stallman -- maybe (I don't have any feelings > one way or the other for him, but suggested him as an example.) What exactly would we be gaining from Messrs Kahle and Stallman that we can't get from them simply by asking? > And remember, my proposals were part of an early 2004 call at > the time, which followed the one and only PG "face-to-face" > meeting at Brewster's place, for ideas on strengthening the PG > organization. And I remember that part of the call was to > strengthen the organization so as to increase donations from > various sources, as well as other potential sources of revenue. The First Rule of Journalism: "Follow The Money" So. . it still comes down to the fact that Mr. Noring wants PG's top priority to be money. However, Mr. Noring, very obviously, still refuses to outline an array of projects he would enact if he were able to get APPROVAL from Project Gutenberg in some way he can't already get. WHAT does Mr. Noring want? He won't say. . . . Other than what appears to be simple political/financial power. > PG may be austere and live on very little, but it still needs > the good graces of money or its equivalent to keep going even > at its current level of organization. Not in any real sense.
We've never had a project we couldn't pay for. We've never had a bill received we couldn't pay for. And there is no reason Mr. Noring couldn't go out fundraising-- in the name of Project Gutenberg--for whatever projects. > So in good faith I made suggestions. I guess no good deed goes > unpunished. That's just the problem. . .WHAT are Mr. Noring's suggestions??? What are the "good deeds" Mr. Noring wants to accomplish? >> Am I the only one who remembers? > > No, I remembered, too. > > In fact, I remember a lot of things said at the 2003 PG "face-to-face" > get together.... > > >> What need has PG of a Board structure of a billion dollar a year >> major world university? > > That's not the point. The point is good governance, and that includes > people on the Board who are active in the field and with a proven > track record in making things happen. It has several benefits. Mr. Noring never says just what he would do with "good governance." > E.g., this helps build bridges to other organizations. The one > thing I know everyone notices is how "isolated" PG is. Everyone notices how "isolated" PG is??? Then how does Mr. Noring explain the hundreds of eLibraries eBook donations come to us from, or the thousands of people from world- wide geographic locations, inside and outside academia, etc??? What kind of "isolation" is this??? > Now some might say that is a good thing, while others say it is > not a good thing. There are good reasons on both sides. I tend > to fall on the side that PG is best served by coming out of its > isolation *some*. I reject the notion that doing so will > somehow taint the organization and lead to losing its "mojo". Project Gutenberg allows everyone to use our eBook collection. Perhaps Mr. Noring's response to this is on the same order as the reaction to the fact that everyone in Project Gutenberg gets YES! as a response to nearly every project they want to try. Still no answer to what Mr. 
Noring's version of the Board may get on the agenda that would not be there under the current board.... It still comes down to the fact that Mr. Noring has no ACTION in a long, long, long series of TALKING. . .no proposal of ACTION. The reason is quite literally that any ACTION he would propose is likely to get an IMMEDIATE "YES!" And then he would have nothing to complain about. As it is, the only thing he can complain about not getting is the political and financial power that is not Project Gutenberg. If you want that kind of political/financial power, it will have to be found elsewhere than in Project Gutenberg. > > ***** > > Anyway, I think the point is that it is a stretch to claim I > was trying to "take over" PG by my simple suggestions. Anyone > with a rational mind would know I had and still have no power > base to effect anything like that. Thus it is a ridiculous > suggestion. Still only the requests for money and political power. Certainly "a ridiculous suggestion" for Project Gutenberg. > In fact, I'm somewhat flattered that anyone would even think I > have that level of power that scares them so. Geez, if I had > that level of power, maybe I *should* use it? You have managed to get more attention focused on you than any other person has ever had on this list, not enough for you? > One good thing to come out of this is that at least, in the > noise, the makeup of the PGLAF Board is being discussed, that > the philosophy of the organization is being discussed, etc. > This is healthy no matter what results. Except that this conversation has been all about TALK. . . . It would be nice if you put the same attention into bringing eBooks to a greater and greater portion of the world.
> > Jon Noring Michael Hart > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Tue Oct 30 14:40:41 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 14:40:41 -0700 (PDT) Subject: [gutvol-d] !@! Re: Founder's syndrome In-Reply-To: <15cfa2a50710301406t31e57f13t2aa62c2636aaf3da@mail.gmail.com> References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <15cfa2a50710301406t31e57f13t2aa62c2636aaf3da@mail.gmail.com> Message-ID: On Tue, 30 Oct 2007, Robert Cicconetti wrote: > On 10/30/07, joshua at hutchinson.net wrote: >> suggested himself OR any of the people you've accused of being his >> "groupies" (your term from a couple messages back). You've never > > Josh, I think you'll find that was Jon Richfield that first mentioned > the word "groupies"... unless you are referring to a message that I > have not received. > > > I don't want to get entangled in the rest of it, but I would like to > hear the reason why the PG trademark is set up the way it is. > > I also think there is a difference between a) mandating something, b) > recommending something but accepting virtually anything, and c) > accepting virtually everything and recommending very little. There > seems to be little discussion of the middle path. The simple reason is the saying "YES!" gets you more than "NO!" michael > > R C > (Who is not going to read gutvol-d for a few days until he finishes up > some Rule 6 research.) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From jon at noring.name Tue Oct 30 14:53:29 2007 From: jon at noring.name (Jon Noring) Date: Tue, 30 Oct 2007 15:53:29 -0600 Subject: [gutvol-d] !@! 
Re: Founder's syndrome In-Reply-To: References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <1952325838.20071030150346@noring.name> Message-ID: <892078063.20071030155329@noring.name> Michael wrote: > WHAT does Mr. Noring want? > > He won't say. . . . > > Other than what appears to be simple political/financial power. Michael, I've had it with your flights of fantasies and delusions. I do not plan to continue any conversation with you under the current circumstances of the irrational hostility you are showing me. If you want to believe you've won this "debate", go right ahead. It's not a debate, it's an irrational spewing of delusions, and my dad told me a long time ago there's no use arguing with a crazy person. I actually feel sorry for you. There's enough messages out there that the few others who are even continuing to follow this exchange (and I don't blame them if they gave up a long time ago) will be able to form their own opinions. Jon Noring From hart at pglaf.org Tue Oct 30 15:00:56 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 15:00:56 -0700 (PDT) Subject: [gutvol-d] !@! Re: Founder's syndrome In-Reply-To: <892078063.20071030155329@noring.name> References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <1952325838.20071030150346@noring.name> <892078063.20071030155329@noring.name> Message-ID: I must say I agree wholeheartedly with Jon on this: Jon and I had agreed to let this all go yesterday, and I can only wish everyone had felt the same. I can only offer Jon my most sincere apologies for what has transpired since. . . . I honestly wish it hadn't happened. I feel Jon did get more support, but obviously not with the results he and I both intended yesterday. I feel I should perhaps have asked his permission, and perhaps everyone's, before answering the other messages that continued things. 
I feel even MORE strongly that I should have asked all concerned if they would mind if I NOT answer-- out of respect for Jon's and my decision. . . . Hopefully I will do better next time. . .presuming there IS a next time. . . . Michael On Tue, 30 Oct 2007, Jon Noring wrote: > Michael wrote: > >> WHAT does Mr. Noring want? >> >> He won't say. . . . >> >> Other than what appears to be simple political/financial power. > > Michael, I've had it with your flights of fantasies and delusions. > > I do not plan to continue any conversation with you under the current > circumstances of the irrational hostility you are showing me. If you > want to believe you've won this "debate", go right ahead. It's not a > debate, it's an irrational spewing of delusions, and my dad told me a > long time ago there's no use arguing with a crazy person. > > I actually feel sorry for you. > > There's enough messages out there that the few others who are even > continuing to follow this exchange (and I don't blame them if they > gave up a long time ago) will be able to form their own opinions. > > Jon Noring > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Tue Oct 30 15:03:50 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 30 Oct 2007 18:03:50 EDT Subject: [gutvol-d] "founder's syndrome" as it relates to project gutenberg Message-ID: here's the webpage on "founder's syndrome" that was "recommended": > http://www.help4nonprofits.com/NP_Bd_FoundersSyndrome_Art.htm so let's see how it relates to project gutenberg... it says: > Founder's Syndrome occurs when a single individual > or a small group of individuals bring an organization > through tough times (a start-up, a growth spurt, > a financial collapse, etc.).
Often these sorts of situations > require a strong passionate personality - someone who > can make fast decisions and motivate people to action. well, it _kinda_ sounds like project gutenberg. except for that "fast decisions" part, that is... for its first 20 years, p.g. moved _very_ slowly. and i'm sure that, back in those early days, when a person worked on a book, it went slowly, and michael wasn't there "motivating them to action". > Once those rough times are over, however, > the decision-making needs of the organization change, > requiring mechanisms for shared responsibility and authority. it's not at all clear to me that "the decision-making needs" of project gutenberg have changed... it's "same as it ever was"... > It is when those decision-making mechanisms don't change, > regardless of growth and changes on the program side, > that Founder's Syndrome becomes an issue. there have been no "growth and changes on the program side". if the staff of this non-profit organization had grown immensely, or the modus operandi was significantly different than it had been, then there might be some need to change. but that hasn't happened. a lot more books are coming in, yeah, but they're still treated the same. > We see this most frequently with organizations that have grown from > a mom-and-pop operation to a $12 million community powerhouse, > while decisions are still made as if the founders are gathered around > someone's living room, desperately trying to hold things together. so, now we see where this guy is coming from. if project gutenberg had grown to a $12-million organization, maybe michael's laissez-faire style would no longer be the best one. or if michael was clinging to a style of "desperately holding things together" when that was not appropriate, then there might be some validity to a call for change. but not only is michael not now "desperately holding things together", he never was...
there's very little money in the budget now, and there never has been, but i've never heard project gutenberg described as being "desperate". michael consciously and _intentionally_ created a thing which is able to survive and even thrive _without_much_cash_, a remarkable achievement. project gutenberg is a _weed_, one that cannot be killed, no matter what. and this is _precisely_ the magic of project gutenberg. (and wikipedia too.) to create a no-budget organization that goes on to produce something of _immense_value_ is an extraordinary accomplishment. it's truly _amazing_. instead of criticizing michael's style, people should be _emulating_ it, and trying (and failing) to find the words to express its _brilliance_... you don't do a weed a favor trying to transform it into a hot-house flower. > the main symptom of Founder's Syndrome is that > decisions are not made collectively. Most decisions > are simply made by the "founder." All other parties > merely rubber stamp what the founder suggests. now, do you see why this whole topic is just _silly_ in regard to p.g.? michael doesn't "make all the decisions". he makes _no_ decisions! there isn't any "rubber-stamping" happening at project gutenberg... the _only_ rubber-stamp at project gutenberg is one that says "yes!" > There is generally strong resistance to any change in that decision-making, > where the Founder might lose his/her total control of the organization. ha! like michael has "total control of the organization". what a laugh! there's no organization to control; it's everybody do whatever you want. want to make your own library? fine! take our books! free of charge! if you don't want to go through the hassle of downloading all of them, we'll even send you a d.v.d. with them. we'll even pay for the postage! and if the organization you create -- with the board that you want and the decision-making structure you think is best -- out-performs p.g., then _more_power_to_you_. p.g.
will be happy to have helped you out. that's an amazing philosophy. one that is _confident_ in its strength... > Boards of these organizations usually don't govern, > but instead "approve" what the founder suggests. > Planning isn't done collectively, but by the founder. > And plans / ideas that do NOT come from the founder > usually don't go very far. this whole "founder's syndrome" thing is about autocratic leaders and their refusal to give up power. michael never took any power in the first place... so this notion is simply _not_applicable_. but let me address one last point... > Some may ask, "So what's wrong with that?" And the answer is simple: > If the "founder" is hit by a bus tomorrow, the organization is not sustainable, > and all the good work the organization has done over the years is > in danger of screeching to a halt. That's because organizations facing > Founder's Syndrome usually have little infrastructure in place, because it > simply hasn't been needed. In these situations, the founder IS the infrastructure! let me tell you something that i don't need to tell you. when michael dies -- let's hope it's due to old age after he's just celebrated birthday #108 -- project gutenberg _will_ live on... michael's organization _is_ sustainable... and _none_ of the good work it has done over the years is in _any_ danger of "screeching to a halt". the infrastructure of project gutenberg _is_ "in place"... indeed, like the good weed that it is, it has reached out to millions of cracks in the internet sidewalk, and we couldn't get rid of it no matter how hard we tried. even if we wanted to. and _most_ of us here don't even want to. end of post... -bowerbird p.s. well... i was _going_ to end the post right there. but you know what? there _was_ some stuff on that webpage that might, ironically, be applicable. the advice was to prepare your organization for the time when you pass away.
michael, i think you need to take a good hard look at the community right here. among them, you will see some -- not many, it's true, but it's the _loud_ ones -- who will try to sabotage the organization that _you_ built. they want a _different_ type of organization, the kind you _intentionally_ set out to set yourself apart from. you need to think about putting into place some protection against these saboteurs. maybe, after considering it, you'll decide your organization is sufficiently resistant to repulse their takeover attempts. and if that's your decision, i will bow to your wisdom. i believe in the strength of your organization, because i've seen the power of the weed. but please, please, please do consider what i have said here. your contribution has been far too valuable to the future of the world-at-large to let small-minded people destroy it. From jon at noring.name Tue Oct 30 15:10:40 2007 From: jon at noring.name (Jon Noring) Date: Tue, 30 Oct 2007 16:10:40 -0600 Subject: [gutvol-d] Re[: !@! Re: Founder's syndrome In-Reply-To: References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <1952325838.20071030150346@noring.name> <892078063.20071030155329@noring.name> Message-ID: <27986341.20071030161040@noring.name> Michael, I appreciate your apology, and accept it. And likewise I offer an apology for things that I said which were hurtful towards you. After all, you and I are only human. And it takes a big person to offer the first apology. We certainly disagree on a whole lot of things, and we are driven both by passion in what we believe is right. That's where we agree.
We also agree in that we need to digitize the public domain and get it out there for preservation and for free use by the public who is the owner of the public domain. So it is good to focus on where we agree. Again, thanks, Michael. And I will do my best to always reply to your messages in a cordial and respectful manner, even when I disagree with you (which I guess is somewhat often ). Jon From hart at pglaf.org Tue Oct 30 15:53:10 2007 From: hart at pglaf.org (Michael Hart) Date: Tue, 30 Oct 2007 15:53:10 -0700 (PDT) Subject: [gutvol-d] Re[: !@! Re: Founder's syndrome In-Reply-To: <27986341.20071030161040@noring.name> References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <1952325838.20071030150346@noring.name> <892078063.20071030155329@noring.name> <27986341.20071030161040@noring.name> Message-ID: On Tue, 30 Oct 2007, Jon Noring wrote: > Michael, > > I appreciate your apology, and accept it. > > And likewise I offer an apology for things that I said which > were hurtful towards you. Thank you! > After all, you and I are only human. And it takes a big person > to offer the first apology. I never wanted to "win" in a manner that was hurtful to you, just to keep an open balance for all concerned. I feel your desires are to upset that balance, hence I defend it. No one you accuse of wanting power wants it. We don't want anyone to have that kind of power at PG. Everyone should be equally empowered. Thus there is no need for "Board Approval" on anything, only the rarest requests are even brought to the board beforehand as we just presume everything but the oddest requests are to be approved, and everyone already knows that. In the cases where things ARE brought to the Board I am very pleased to report that they are even more open-minded than I would have had any expectation of. . . . We would like to keep it this way. . . . > We certainly disagree on a whole lot of things, and we are > driven both by passion in what we believe is right. 
That's > where we agree. We also agree in that we need to digitize the > public domain and get it out there for preservation and for > free use by the public who is the owner of the public domain. > So it is good to focus on where we agree. As I said earlier today, and I hope you can/did see that part, it is my greatest hope for you that you come to manage project armies that are even beyond your own expectations, just not to the exclusion of other such projects, if you understand. . . . > Again, thanks, Michael. And I will do my best to always reply > to your messages in a cordial and respectful manner, even when > I disagree with you (which I guess is somewhat often ). I think it is obvious to all concerned that we disagree on any number of things; I fear you, and they, may not understand the desire I have NOT to have the kind of power aforementioned and for NO ONE to have that kind of power over Project Gutenberg. I think very highly of Brewster's work, but he also has a kind of power over his projects I wouldn't want here, but he has an entirely different protocol and system, and he pays for it, in more ways than one, that I could never do. Richard Stallman and I perhaps share equal credit for starting "The Open Source Movement," but, again, he has a kind of power I would never want to have, nor would I pay his price, either. I am what I am, and worst of all, for some, I like who I am. I hope who I am, and what I have done, and hope to do, bring a whole lifetime of opportunities your way, just not the kind of opportunities I mentioned earlier, but would not now. I hope we can achieve some kind of balance, rather than balance of having this contesting for control every year. > Jon Michael From jon at noring.name Tue Oct 30 16:18:55 2007 From: jon at noring.name (Jon Noring) Date: Tue, 30 Oct 2007 17:18:55 -0600 Subject: [gutvol-d] Re[: !@!
Re: Founder's syndrome In-Reply-To: References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> <1952325838.20071030150346@noring.name> <892078063.20071030155329@noring.name> <27986341.20071030161040@noring.name> Message-ID: <1646571185.20071030171855@noring.name> Michael wrote: > [a lot of good things, even if some things I disagree with.] > > I hope we can achieve some kind of balance, rather than balance > of having this contesting for control every year. I appreciate your reply here, and yes, we will always disagree on a number of things, but then are there any two people who ever think exactly alike (other than maybe identical twins)? The key is for both of us to disagree the right way, and I am guilty of sometimes not doing it the right way. If we do, then everyone benefits by how the differing views seed the "idea commons". Hopefully when we meet again we won't need to punch each other out (and you are a little bigger than me), but rather sit down with beers or sodas in hand and verbally argue with each other with smiles on our faces. Jon From piggy at netronome.com Wed Oct 31 05:32:48 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 31 Oct 2007 08:32:48 -0400 Subject: [gutvol-d] Wow! Re: !@! Re: Founder's syndrome In-Reply-To: References: <1e8e65080710291603y5bbe76a4s72b2bc516dfea73d@mail.gmail.com> <76647410.20071030140353@noring.name> Message-ID: <472875F0.4040606@netronome.com> Michael Hart wrote: > Now my reply: > > The reason I take so much time answering Jon's yearly efforts, > such as they are, is to make sure people know I answer all the > email I receive, other than the most obvious of trolls, spams, > and the like, and to find out just how seriously persons might > be taking these sorts of things. > Please forgive me for addressing a Noring topic. There is one point he brought up which I for one take seriously. I have some long-term concerns about the Project Gutenberg trademark.
I am quite content with how Michael is managing the trademark and have every expectation that his sensible policies will continue throughout his lifetime. May I suggest that Michael designate PGLAF as the heir of the trademark? I can tell you at close second hand that dealing with heirs and estates over intellectual property rights is at least an order of magnitude more difficult than dealing with the original creators. Heirs tend to look at most intellectual property as financial assets rather than ideas which need to be disseminated. We have a similar problem in the Linux community over Linus' ownership of the Linux trademark. We have an obvious heir for the trademark, the Linux Foundation. To the best of my knowledge, Linus has made no public statement about arrangements to safeguard the trademark for future generations. Obviously there are many other issues for both Linux and PG to address about surviving their founders. There is no real rush in either case--both Linus and Michael are in good health and fully in command of their faculties. Each has an excellent set of capable lieutenants. At some point in the next decade or two, it would be reassuring to hear that Michael has made long-term arrangements for the stability of the trademark. From piggy at netronome.com Wed Oct 31 05:42:47 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 31 Oct 2007 08:42:47 -0400 Subject: [gutvol-d] !@! Re: Founder's syndrome In-Reply-To: References: <24187201.1193773828225.JavaMail.?@fh1038.dia.cp.net> Message-ID: <47287847.1030509@netronome.com> Michael Hart wrote: > I have answered the trademark question before, and will again, to > your never to be had satisfaction. . .I believe in the separation > of powers. . .I don't want someone to be able to take over Board > of Directors positions and thence all claim to everything that is > named "Project Gutenberg." > Thanks. This makes good sense.
I amend my earlier suggestion to designate PGLAF as the heir to the PG trademark to encourage you to designate a specific heir sometime in the next decade or two. From Bowerbird at aol.com Wed Oct 31 09:55:08 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 31 Oct 2007 12:55:08 EDT Subject: [gutvol-d] harmless monsters, and the dogs that have stopped barking Message-ID: what a delightful post from jon richfield! :+) i believe mr. richfield has been here for a while, but i'm not sure how long... and there might be other people here who have been subscribed for less than a period of many years, for whom i can provide some background information. because one of the most interesting aspects of this listserve _these_days_ is the dogs who are not barking. in other words, posts you are _not_ seeing... (i assume you're familiar with the sherlock holmes story where he solves the mystery by noting that a dog did _not_ bark, when he usually _would_have_, thus indicating the murderer was someone with whom the dog was familiar.) and there is a lot of "non-barking" going on here these days... a ton of it... to those who know what the noise level _could_ be, what it actually has been, this silence is deafening. the dogs have stopped barking... *** so here's the backstory... it's kinda long, if you prefer to save it for when you have some time... if you want to cut right to the chase, and read the rest later (or never), skip toward the end, where you'll find a header saying "the message"... *** i subscribed almost exactly 4 years ago. i believe the date was 2003/11/11. that was what i intended anyway, but i mighta jumped the gun by a few days. i read all the archives -- even going back to previous web-based forums -- before posting, so i was thoroughly immersed in the issues being discussed. 
but the one issue that i was most interested in was "the planned move to .tei", this transition had been "in the works" for several years even at that time, but nothing had actually happened. that didn't surprise me, because .tei is _hard_. way too difficult for the ordinary people that were doing volunteering for p.g. the reason i appeared at this time was because i'd been tracking p.g. progress for well over 20 years by then. to tell you the truth, i was skeptical back then... oh sure, i loved the idea. heck, it was _my_ idea. i'd been speaking incessantly since 1980 myself about "putting all the books in the world on the computer". "and then all the pictures," i would add. "and all the music, and all the movies." i figured it would go in that order because of the bandwidth required by each. obviously, with college dorms wired with fiber, i was wrong. music led the way. so i knew about project gutenberg way back when it was almost entirely vapor. and i called it "almost entirely vapor" _then_, as that was the truth of the matter. still, there was something i admired about michael hart. not only did he have a very good idea -- i.e., my idea -- but also a punk d.i.y. attitude that i loved. (d.i.y. stands for "do it yourself", for all you young'uns out there.) i was a graduate student at the time, spending way too much time in libraries, surrounded by books, so i envisioned some big effort to digitize those books. (the university had also just debuted a newfangled thing called "text editing" on the mainframe, which i loved, as i was one of those people who write once and then rewrite a thousand times. but all my scribbling and lines and arrows and inserts and crossouts and stuff drove me crazy. the computer gave me an absolutely clean copy any time i wanted it, which i found quite heavenly. 
i harangued my fellow graduate students constantly about word-processing, and went out immediately and bought an osborne when it became available, so that i could do my writing at home. and the rest, as they say, is history... the point is, i knew that very soon, _all_ documents would be digital in form. so the only thing we needed to do was to digitize the paper-book "backlog".) anyway, back to project gutenberg. i loved michael's spirit, and the fact that he was willing to _type_in_ a darn book, if that's the only way to get it online. the other thing that i loved was that he was very good about getting press... oh sure, i was jealous that _he_ was getting all the credit for _my_ idea, but since _i_ wasn't putting the idea out into the big world, i was glad _he_ was... and he was doing a good job. i especially remembered an article in "wired". but the fact of the matter is, for some 20 years, he had almost nothing done. a few high-profile _documents_. the bible. shakespeare. that was about it. anyone who was _determined_ could have caught up to his total quite easily. even _years_ after the emergence of the web, which i saw as the _beginning_ of the time when "all of the books in the world would be on the computer" -- except starting with new born-digital documents rather than old paper books, which (in my opinion) was better than the public-domain of project gutenberg -- the p.g. "library" was almost laughably small. again, anyone with _stamina_ and a good scanner could have matched and surpassed michael's total easily... but then, michael's "doubling cube" started to give us some impressive output. i mean, when you're doubling your output from 100 to 200 books, no big deal. even going from 1000 to 2000 isn't all that overwhelming. but when you go from 2000 to 4000, and then double to 8000, all of a sudden it _is_ a big deal. because now you're starting to talk about a rather large number of books...
so, along about 2002, i started planning to engage project gutenberg soon. i'm remembering it was around 8000 books then, so it was increasingly clear michael had brought his project to critical-mass. so he deserved the credit. i'd been writing viewer-programs all along, so i decided to make one for p.g., as a present for michael hart. for the first time in my work, i decided to use a 2-up facing-pages interface. for more than a decade, it was my orientation that "an electronic-book doesn't have to look like a physical-book", a mantra that was shared by virtually all of the other e-book programmers of that time. ordinary people, however, pushed back against e-book viewer-applications, and one of the things they said was that they liked the "look" of a paper-book. so, more or less as a lark, i said, "well heck, why don't i just give 'em that look?" still, i was of the opinion that, once people "got used to" an electronic-book, they would give up this silliness about wanting it to "look like" a "real" book... boy, was i ever wrong. what _really_ happened was that _i_ got convinced that the 2-up facing-pages display has many natural advantages, and that it was indeed the best interface. and, no, _not_ because it "looks" like a "real" book. although that doesn't hurt. (and i want to stress that it _doesn't_ hurt for an e-book to look like a p-book; indeed, i'm beginning to think it's actually a good thing. but save that for later.) it's good because the right-side page can "drop away" and be a _workspace_. this _workspace_ might hold a table of contents, for example, or a list of hits from a user's search operation, or a full-size version of a picture in the book, or any number of other things. but what's important is the _left-side_ page is undisturbed by all this. so you always maintain your "contact" with the text. compare this with adobe's acrobat viewer. 
when you summon up "bookmarks", the page of text from the book is resized, which makes you "lose contact" with it. and when you dismiss the "bookmarks", the text-page is resized once again, so you "lose contact" again. all this mental effort required to re-establish "contact" is unnecessarily draining, and it detracts from the overall reading experience... and the clincher was that monitors, which had been getting bigger all along, were now wide enough that the most _readable_ line-length only used up _half_ of the monitor, meaning there was finally _room_ for a 2-up display. so i was quite happy with this new viewer-program i had made. i had also developed my own methodology for compressing an e-text -- the 4-meg bible i was using in my demo compressed to 1.2 megs -- so i figured i could make good use of the p.g. library as a demo corpus. by this time, with 9000+ e-texts in it, it was far away from vapor-land... when i happened to be in chicago in august of 2003, i decided to visit michael. by then, i'd been talking with him on various e-book listserves for 8 years, but i'd never met him. it was a pleasure to see him, buy him dinner, and have him talk my ear off about shakespeare. i told him about my viewer-program and my compression method, and asked him if i could use his library as a demo corpus. he said yes, of course, and offered disk-space. he said he could not guarantee me any volunteers, that i'd have to get those myself, but he'd help with that too... i told him i intended to do it all myself, and i wouldn't know how to use any help. i was on my way... *** the message... *** in the fall of 2003, as p.g. zeroed in on 10,000 e-texts, a victory celebration was scheduled for december in san francisco, so i was able to attend. as precursor to all of that, i decided to get on the project gutenberg listserves for the first time... 
my work with the plain-ascii e-texts from project gutenberg had convinced me that -- with just a little bit of work -- they could be modified to make e-books that (a) have high-powered functionality, and (b) are typographically beautiful. as i did more and more of this work, i became more and more impressed with the power of plain-text. i'd never been fond of heavy-markup, but didn't have any _allergies_ to it either. but as i came to realize it is completely unnecessary, my aversion to it grew. so i approached the p.g. listserves to share a message: you don't need heavy-markup, which is costly, to get the benefits you desire... and boy, did my message get flak in return. as the archives will clearly reveal to anyone who reads them, the markup-crowd responded with a vengeance. there were days with literally dozens of messages, all of them _hostile_, all of them saying "your methodologies will never work..." (there was a lot of ad hominem crap as well, but i have an extremely thick skin.) this went on for months. for months and months. for _years_, in fact. i was dumbfounded. i had a methodology that worked. i _knew_ it worked. i watched it work, on my machine, every day, day in and day out, thank you. but these people insisted they knew better, that my stuff _couldn't_ work. and they did it loudly, and incessantly, and with entirely too much venom. from my vantage-point, it was quite easy to see that they were fools. because i knew that when i revealed my evidence, they would lose every single ounce of their credibility. why did they squander it so? i still can't give you a good answer to that question. why would you bet against someone -- anyone -- about what was in their pocket? think about it -- you have no idea what's in their pocket, not really, and they probably do. so why even consider betting against them? heck, even if i told you something outrageous, like that i can _fly_ without an airplane, i could see you being very skeptical about it.
i could see you saying, "until you really show me, i don't believe it." but why would you assert -- flat out -- that "that is _impossible_", and even _bet_your_entire_credibility_ that what i said was untrue? doesn't make any sense to me. yet that is _exactly_ what they did... so i just kept the poker game running, made them bet their wad... perhaps it was "cruel" to make them lose _all_ of their credibility, when i could've ended it right away and let them retain _some_, but i figured if they were going to treat the credibility they had in such a cavalier manner, then they didn't _deserve_ to have it. by the way, i was on vacation a few weeks back, and we went to the glider-port at torrey pines, and saw humans fly _without_airplanes_. one was using a hang-glider, and the others had those sailing-seats. i was reminded of all of those old short-films that showed old-timers trying to take flight with "wings" they'd built, or various contraptions. they all crashed hard, of course, and watching those films is _funny_. but still, here was the hang-glider, stepping off a cliff into an updraft, and sailing up into the sky. somehow, the juxtaposition was poignant. anyway, back to all that flak i was getting... in a phrase, whenever i made a post, the dogs would start barking. (actually, since they ganged up on me, it was more like a wolf-pack, but that doesn't fit the metaphor, so we'll stick to calling 'em dogs.) now, at the time, i even said that, once i started revealing my proof, my antagonists would clam up, pretending they never said otherwise. i said this many times. you can go back in the archives and confirm it. sure enough, over the past few years, that is exactly what happened. for a while, they attempted to do some back-pedaling, but their pride just wouldn't let them. even in the past year, when i pressed him into specifying a percentage of the books in the p.g. library that i would be _unable_ to handle with z.m.l., jon noring put his estimate at _50%_... 
yet, when i've asked jon and others to point to some examples, i get stone cold silence in return, since they know they simply cannot do it. if they could, the animosity they've shown me assures me they would. in the _abstract_, they love saying what z.m.l. cannot do. but when i boil things down to the _reality_, then they "can't be bothered" with it. especially since the pudding i've revealed so far proves they're wrong. so these days, when i make a post that reveals the latest bit of proof in my z.m.l. pudding, there is no longer any flurry of hostile replies... the dogs have stopped barking... a while back, i announced an offline authoring-tool to make z.m.l. no response. it's not like people "just aren't interested" in the topic. in the past, they've gone on and on and on about how _difficult_ it would be to author z.m.l. my word, you'd have to _manually_count_ the linebreaks to tell if something was a header. or so they claimed... yet when i present an authoring-tool that gives a _formatted_ display -- no linebreak counting necessary, thank you very much -- _silence_. the dogs have stopped barking... heck, the other day i posted a message about a demo .pdf conversion, and there wasn't a single reply. four years ago, there was no shortage of barking dogs that would respond to every single one of my posts... as long as z.m.l. was "theoretical", they thought they could destroy me, and they showed a keen interest in doing that. now that it's _real_... even two years ago, when i posted the precursor of my .pdf conversion, the thread ran on and on for days, trying to pick little holes in my work. but this week? crickets... the dogs have stopped barking... when they thought they could strangle this little baby in its crib, they were positively _vicious_ in their attacks. now that it's big, big enough to kick their ass, they've got absolutely nothing to say. (except maybe to whine about how i'm making this "a competition".
they didn't mind that at all when they thought _they_ were winning.) monday, for the first time, i revealed a "live" zml-to-html converter. it demonstrates rather clearly that z.m.l. works, and works fairly well. it'll work better as i improve it, but it's absolutely clear that it _works_. a year ago, when i put up this same converter that used z.m.l. files that i had preformatted, the dogs barked that it had been "rigged", that it couldn't handle anything _except_ those preformatted files... back then, they _dared_ me to make it live, thinking they "had me"... but this year -- this week -- yesterday and today -- when given the evidence that their claims couldn't hold water, you heard _nothing_. the dogs have stopped barking... and as i continue to reveal more and more evidence that z.m.l. works, it will be increasingly clear that i can handle almost _all_ of the library... (i'm unsure because i don't know if i want to clutter z.m.l. with latex.) so even now, it's obvious that my antagonists have lost their credibility. lost it completely. and all of their "good arguments" for heavy-markup are quickly disintegrating into dust. nobody will listen to 'em any more. not without laughing. and that's _why_ the dogs have stopped barking... it's easy to get distracted by the flash of the flames happening here. some arguments can still be _loud_, and go on for far, far too long. (especially when certain people keep bringing them up repeatedly.) but the most interesting sound on this listserve these days is the silence... and the silence that you hear is because the dogs have stopped barking... and you don't have to be sherlock holmes to figure that out, or know why. -bowerbird p.s. there stretches out in front of me a long string of technologies that i will continue to debut over the course of some time to come... indeed, as my tool-chain is quickly beginning to achieve coherence across the entire work-flow, my pace will be picking up very shortly.
further, the long-term cost-effectiveness of a z.m.l. library will take... well... it will take a "long-term" amount of time in order to prove it... so that deafening silence you now hear will get louder and louder. louder and louder after every post i make nailing z.m.l. down more. and, of course, the assiduous revolution toward light-markup will continue to exert itself in a great many arenas across cyberspace... so, at this point, the silence of my critics on the usefulness of z.m.l. is just a humorous curiosity, maybe even a welcome respite after years and years of yapping. however, at some point down the line, when it's clear to all heavy-markup was an evolutionary dead-end -- since it makes more sense to put smarts in apps, not formats -- they will either continue to resist it (and look stupid), or embrace it (in which case you know i'll say "i told you so", even from the grave), or they'll fade away (with tail between legs), probably best for us all. for those interested in the long-game, those will be interesting times. when you're predicting the future, it's not just "a difference of opinion" if people disagree. one might be right, at most. the other _is_ wrong. _accuracy_matters_. accuracy is _all_ that matters. as alan kay said, "the best way to predict the future is to invent it". so i invented z.m.l. but i'm under no illusions. i'm very confident that, in five or ten years, some people will be trying to rip it out of my hands, pontificating how "of course you invented it, but _now_ it belongs to _the_community_", as they try their hardest to change it from what it is to what they want.
From jon at noring.name Wed Oct 31 10:48:15 2007 From: jon at noring.name (Jon Noring) Date: Wed, 31 Oct 2007 11:48:15 -0600 Subject: [gutvol-d] Note the definition of "work" (was harmless monsters, and the dogs that have stopped barking) In-Reply-To: References: Message-ID: <754722244.20071031114815@noring.name> To summarize my longer message below, let's look at the use of the phrase "it works": Bowerbird: "See, I take this ZML document and convert it to HTML and PDF, and doesn't it look pretty? ZML works as a mastering format!" Most of the rest of us studying the issue of a PG "master" format": "A universal mastering format must adequately represent all book types, and meet several other requirements regarding archiving, retrieval, etc. Certainly ZML documents may be converted to other formats. But as a universal mastering format ZML is insufficient, and therefore will not work *as a universal mastering format.*" Notice the differing definitions of "works"? Words can be very powerful as Bowerbird knows being a performance poet, and so it is important to understand the nuances of the underlying definitions that are being asserted when the words are used. Bowerbird wrote: > [overall a good summary of his historical perspective of things.] > > as the archives will clearly reveal to anyone who reads them, the markup-crowd > responded with a vengeance. there were days with literally dozens of messages, > all of them _hostile_, all of them saying "your methodologies will never work..." > (there was a lot of ad hominem crap as well, but i have an extremely thick skin.) Well, Bowerbird continues to rewrite history and intentions with this "it won't work" message. Do you think if you say it long enough it will become true? Most of the "markup crowd" never said "it would never work" as you have defined "work".
I proposed back in the early days of ebook-list (now TeBC) that PG should normalize its plain texts so as to make reliable conversion/repurposing possible, which shows I understood that regularized/normalized plain text can certainly be used in the role of a "master". So I knew from the start that normalized plain text (which ZML is one flavor) can be converted to other formats for presentation. And anyone who works with document conversion understands this. I believe the "light markup revolution" as you call it has been around since the dawn of the computer era, since there are applications where "light markup" to plain text *is sufficient*. Heck, in the early days of PG, PG deployed "light markup" so as to identify highlighted text, for example. (And this message uses "light markup".) What the "markup crowd" here essentially said, collectively (with some individual differences) was that "oh, that's quaint -- will it be able to properly identify this, and that, and this other thing?". As well as meet certain other requirements for a "master" rendition *for the entire PG collection, present and future* from which everything else is derived? That is, the "markup crowd" believed, and overall still believes today, that normalized plain text is "not sufficient". This is NOT the same as "it won't work" per Bowerbird's definition of "it works". Of course it will "work" by Bowerbird's definition of work, which is simply "here's a ZML document, and here's the HTML or PDF derivative of it -- doesn't it look pretty in this viewer or browser? See, it works!" Even if you fix the "block quote" deficiency of the current published ZML spec (and I suspect you are since you've not addressed my comment even in messages where you did reply, making me believe you've privately added support for that for some grand splash to come soon), there are several other things where putting all our eggs into the "let's master all our books in ZML" will lead to a host of problems. 
Now I do like normalized plain text for plain text end-user renditions (not as a master), and ZML is a viable candidate for this role (fix the block quote thing as I described before and it becomes an even stronger candidate for this role.) As I've said before, of all the people who've commented on ZML, I'm the only one, I believe, who has been intrigued with ZML for the role it can play in the "digital text ecology", and that role is as the preferred plain text end-user rendition, not the master. Jon Noring From marcello at perathoner.de Wed Oct 31 12:12:18 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 31 Oct 2007 20:12:18 +0100 Subject: [gutvol-d] Note the definition of "work" (was harmless monsters, and the dogs that have stopped barking) In-Reply-To: <754722244.20071031114815@noring.name> References: <754722244.20071031114815@noring.name> Message-ID: <4728D392.6040105@perathoner.de> Jon Noring wrote: > Bowerbird: > > "See, I take this ZML document and convert it to HTML and PDF, and > doesn't it look pretty? ZML works as a mastering format!" I would rather take a still of my dog and convert it to a movie. In the movie the dog sits still for 2 hours, but hey! with format changed to "movie" now I can post it to youtube! All ZML will ever do is produce an ascii text with the ending changed to .html. -- Marcello Perathoner webmaster at gutenberg.org From joshua at hutchinson.net Wed Oct 31 12:21:45 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Wed, 31 Oct 2007 19:21:45 +0000 (UTC) Subject: [gutvol-d] Librivox Hits 1000! Message-ID: <4643470.1193858505675.JavaMail.?@fh1064.dia.cp.net> I just got this press release from the Librivox folks! We've got a little over 300 of their audio books so far and more are being added. (See www.librivox.org) *** LibriVox makes it to 1,000! LibriVox, the free audio book project has just cataloged its 1,000th book: "The Murders in the Rue Morgue," by Edgar Allan Poe (read by Reynard T.
Fox). LibriVox.org started in August 2005 with a simple objective: "to make all public domain books available as free audio books." Thirteen people collaborated to make the first recording, Joseph Conrad's "Secret Agent." Two years later, LibriVox has become the most prolific audiobook publisher in the world - we are now putting out 60-70 books a month, we have a catalog of 1,000 works, which represents a little over 6 months of *continuous* audio; we have some 1,500 volunteers who have contributed audio to the project; and a catalog that includes Jane Austen's "Pride and Prejudice," "Moby Dick," Darwin's "On the Origin of Species," "Alice's Adventures in Wonderland," Einstein's "Relativity: The Special and General Theory," Kant's "Critique of Pure Reason," and other less well-known gems such as "Romance of Rubber" edited by John Martin. We have recordings in 21 languages, and about half of our recordings are solo efforts by one reader, while the other half are collaborations among many readers. We are always looking for new volunteers! Come join us. From creeva at gmail.com Wed Oct 31 12:35:14 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 31 Oct 2007 15:35:14 -0400 Subject: [gutvol-d] Note the definition of "work" (was harmless monsters, and the dogs that have stopped barking) In-Reply-To: <754722244.20071031114815@noring.name> References: <754722244.20071031114815@noring.name> Message-ID: <2510ddab0710311235r77ad9e97tdfaaa3b96c510019@mail.gmail.com> You know I first posted on the gutenberg mailing list 2 years ago on this very subject. Somehow no one seems to be able to get around this issue. Part of the issue back then stemmed out of getting rid of plaintext version and moving to XML - which I defended and said there should always be a plain text version. As long as a plaintext version is always created and hosted by PG of every work I have no complaints. That being said - PG has source editors - what format works best for these people?
The end user and normal PG contributor really doesn't need to know much about whatever type of format it is - whether it be plaintext or xml or html version 423.5 - some people like myself have popped on this - and since the argument has turned slightly (may I mention again this has been TWO YEARS this discussion has been going on) who are the parties specifically responsible for handling and passing out the masters? What do they want? If they come up with a consensus on a format they can agree on AND provide the output tools to convert it to plain text or PDF or whatever - what does it truly matter since other editors can work on those editions? I think the focus on this should be the merits of the formats - what conversion tools exist - what are the limitations of the conversion tools - and who is going to create adequate conversion tools for the sub par ones? Which formats is PG going to be supplying to the public - regardless of the master markup language? Since PG is now trying to include original images that were in the manuscripts - the master should faithfully render the following:
PDF with embedded images?
HTML - with identical output as the PDF - if it's not identical we need to look at the tools?
TXT - that is well formatted with the pointers to the images stripped?
JPG - images that output should look the same as the pdf
Any pure text based conversion tool whether TXT or DOC - should look identical when everything is said and done. Everything that includes images should look identical when everything is said and done (i.e. all images should be centered the same way and bold should be added where needed identically in TXT leave out codes to say bold - no markups in plain text). When someone is trumping a master format make them responsible to show its flexibility at being able to convert the master format to every other format identically.
If they can not prove that - chair the discussion on that format until progress of the conversion is deemed acceptable by all members. I don't think any of us really cares about the master format as long as the people standing behind them have the proper tools or a team of people working on the tools to faithfully render these identically across any format.
From jon at noring.name Wed Oct 31 13:26:59 2007 From: jon at noring.name (Jon Noring) Date: Wed, 31 Oct 2007 14:26:59 -0600 Subject: [gutvol-d] Note the definition of "work" (was harmless monsters, and the dogs that have stopped barking) In-Reply-To: <2510ddab0710311235r77ad9e97tdfaaa3b96c510019@mail.gmail.com> References: <754722244.20071031114815@noring.name> <2510ddab0710311235r77ad9e97tdfaaa3b96c510019@mail.gmail.com> Message-ID: <1419906406.20071031142659@noring.name> Brent wrote: > When someone is trumping a master format make them responsible to > show it's flexibility at being able to convert the master format to > every other format indentically.
If they can not prove that - chair > the discussion on that format until progress of the conversion is > deemed acceptable by all members. You bring up several good points. Let's look at this from a different angle, at least from the plain text markup approach (which encompasses XML-vocabularies, like TEI, and ZML): Bowerbird's ZML is actually no different from XML with respect to conversion to other formats. There is, for example, an XML equivalent to ZML where it is possible to build ZML <---> XML converters, and I believe they are almost trivial, that can perfectly round-trip between the two. (I'll call this ZML-equivalent vocabulary "zXML".) Thus, ZML essentially defines a vocabulary to markup various structures and inline text semantics. Although the conversion tools may be a little different, in terms of programming effort they are about the same (one approach to use with ZML is conversion to zXML, then use standard XSLT to convert to XHTML or whatever else -- this may actually be the easiest approach and now allows ZML to plugin to the myriad existing XML processing tools.) Thus the question *has* to go back to the "vocabulary" used. (Plus other benefits that XML confers that the present ZML does not, and probably cannot.) Is the ZML "vocabulary" sufficient for representing all the books in the PG Corpus? Many of us believe it is not. And not only for representing structures/semantics, but for metadata, referencing/ citation, and text durability (in ZML, white space normalization is *critical*, in XML white space is totally flexible.) (ZML does not yet even include something inside of it which will say to a machine: "I am ZML". For philosophical reasons I think Bowerbird is opposed to adding any identifier to the text file which identifies the content as conforming to ZML. THIS IS CRITICAL, among other critical things. 
And don't ask him about machine readable metadata -- I've never seen anyone who is so opposed to even flagging simple metadata/catalog information in a machine-readable manner. He has this belief we can write programs today which will extract all the needed metadata right from the plain ZML content. Of course, he has decided what metadata is meaningful and what is not.) Thus, building some sort of conversion toolset does NOT demonstrate that ZML is the mastering format PG is looking for. Bowerbird is trying to convince us it is -- it is not. In fact, he is hoping it will fool people since people like to see "results", so he is going to show them "results". Notice that Bowerbird chooses the example texts he normalizes in ZML. And he does not address the other various issues brought up by many of us -- I think I've only scratched the surface -- others will have their own. He continues to say (over and over and over again like a broken record -- who's on the merry-go-round?): "Look at my conversion and reading toolz -- See! ZML works!" It's like someone standing on the 10th floor of a building and dropping a bowling ball -- it hits the ground, and then they say "see, gravity works as I told you so! If you want to crack nuts I now have the simple solution! Who needs a complicated mechanism to crack nuts?" 
Jon Noring

From bowerbird at aol.com Wed Oct 31 14:12:07 2007
From: bowerbird at aol.com (bowerbird at aol.com)
Date: Wed, 31 Oct 2007 17:12:07 -0400
Subject: [gutvol-d] back to the basics
In-Reply-To: <2510ddab0710311235r77ad9e97tdfaaa3b96c510019@mail.gmail.com>
References: <2510ddab0710311235r77ad9e97tdfaaa3b96c510019@mail.gmail.com>
Message-ID: <8C9EA19D44675AC-4FC-5175@FWM-M15.sysops.aol.com>

-----Original Message-----
From: Brent Gueth
To: Jon Noring; Project Gutenberg Volunteer Discussion
Sent: Wed, Oct 31 3:35 PM
Subject: Re: [gutvol-d] Note the definition of "work" (was harmless monsters, and the dogs that have stopped barking)

brent said:

> Since PG is now trying to include original images that were in the manuscripts -
> the master faithfully render the following:
> PDF with embedded images
> HTML - with identical output as the PDF - if it's not identical we need to look at the tools
> TXT - that is well formatted with the pointers to the images stripped
> JPG - images that output should look the same as the pdf

ok, i can see it's time for a refresher on the basics, because some people are confused, because other people are trying their darndest to _make_ us confused, the better to snow us. first of all, let me make it clear that i am here precisely because my message has a great deal of relevance to project gutenberg. further, were it not for michael hart's insistence on plain-text, which meant his library was structured that way, it's possible -- perhaps even likely -- i wouldn't have learned the value of it... so i came here to share, in an effort to repay michael for that... having said all that, though, i have no intention of letting z.m.l. fade away just because of "enemies" here at project gutenberg. i'd always intended on creating my own mirror of the p.g. library.
originally, i thought i would have to, because i intended on using my compression format on the e-texts, and i just assumed that i would be the only person interested in doing such a boring job. and since it's just a massive file-handling job -- the compression itself is button-click automatic -- it's not a job that can be shared. so heck, i was jazzed michael offered me webspace/bandwidth... anyway, when i realized my routines were fast enough to handle a regular file -- without having it be compressed first -- i was freed from the compression task. but i decided i'd still mount a mirror. so it matters to me not one tiny little whit if p.g. doesn't use z.m.l. because i'm going to prove its efficacy all by myself with my mirror. indeed, i sincerely hope that p.g. does _not_ use z.m.l. at first... i hope the markup freaks continue their campaign and drive p.g. into a state of pure stagnation, followed by a complete collapse when the complexities of their system result in total confusion... because that will lead to a total housecleaning, and the new staff will understand deeply the need for a system based on simplicity. and my mirror will be the model letting them know that is possible. so they'll just copy over my z.m.l. files and call it the p.g. library. (which is only fitting, because i got the files from p.g. originally.) so how will my mirror differ from the p.g. library that exists now? first, it will have one file only for each e-text -- the z.m.l. version... you can call it the "master" if you like, but since there will be no "slaves" around, it's really unnecessary. there's just one version. so this is one arena in which you've been confused, namely this focus on "master" files. the best way to get the answer you want is to frame the question in your target's head. the x.m.l. crowd asks you about "a master file" so your answer becomes "x.m.l." 
(go read the "win friends and influence people" manifestos, and you'll see how this strategy is laid out and completely explained.) the z.m.l. version will look very much like a pg-ascii version now, with the big difference (aside from the structured nature of z.m.l.) being that the z.m.l. file will contain references to _illustrations_. (such references are now unceremoniously stripped from pg-ascii.) end-users can read the z.m.l. versions on the web if they want. for an example of that, see the "babelfish" script located here: > http://z-m-l.com/go/babelfish19.pl (my website seems to be down today, probably due to some perl hacking i was doing on it yesterday, so check back later.) but i expect most people will download the z.m.l. files instead, either to their desktop/laptop machines, or to wireless readers. z.m.l. viewer-programs -- similar to babelfish, but even better -- will display the z.m.l. files offline. (these viewer-programs will also download the entire library, including newly-posted books, in the background, to _duplicate_ books as widely as possible, while at the same time constantly growing the person's library.) z.m.l. viewer-programs are amazingly easy to code, and to port to other platforms. (i have written them in 3 languages already.) and -- importantly -- they kick other viewer-programs to the curb. eventually, there'll be little need to "produce derivative formats". indeed, i would say "no need", but i'm not dumb enough to think that companies with deep pockets -- like adobe and amazon -- are simply gonna roll over and play dead and let z.m.l. be king. and the publishing companies will play a role as well, because the possibility for authors to easily create e-books scares them, so they'll attempt to impose a complex system on publishing... so the upshot is we're gonna be dealing with .pdf and .mobi for a long time to come. furthermore, the web is a factor as well, so .html is something that must be brought into the picture too. 
that is why i decided to demonstrate that z.m.l. converts well... well, that plus -- at least at the outset -- the so-called ability of x.m.l.-based methodologies to "produce derivative formats" was one of the big selling-points. indeed, go over to the d.p. forums and you'll see that this is _still_ one of the selling-points for .tei. it is, of course, hype. we've looked at the "conversions" produced, and found them to be wanting. they fall far short of what's desired. sometimes they fall _so_ short that "vapor" is a more honest label. plus, in spite of the fact that there were supposedly "all kinds of" x.s.l.t. scripts capable of auto-generating a _plethora_ of formats, when push came to shove, it ended up that marcello is the go-to, and he made it clear that he didn't really care much about output... so the .tei folks are stuck with some pretty crummy-looking .pdfs. meanwhile, i've managed to turn out some pretty respectable .pdfs. they're not yet as beautiful as i'd like them to be, not by a long shot, but they've got a ton of functionality built into them, and that's good... given the fact that .pdf is "supported" on a lot of the reader-machines, i think it's probably important that z.m.l. can auto-generate .pdf now... but again, long-term, my viewer-program can kick the acrobat's ass... same thing with .html. with the web, it's obviously an important format. at least until web-browsers support light-markup internally, which will be not too long from now, at least if light-markup continues its steady march. in the meantime, though, it's very good that .zml can convert to .html... .html is also good for reader-machines, since they all support that format. furthermore, something the heavy-markup crowd wants you not to notice is that the .html output that is auto-generated from a .zml file is basically _the_exact_same_thing_ as the .html output generated from an .xml file. 
indeed, i believe if you compare my "my antonia" .html version with jon's, you'll find that mine is actually _more_ capable. if i remember correctly, the only thing his had that mine didn't was an i.d. on _every_ paragraph. and it would take me less than 10 minutes to add that to my routines... but again, a focus on "conversions" to "other formats" is beside the point. yeah, i talk about it, but only because i want to show clearly that i can undermine even their very best selling point. same benefits. lower price. so don't get hung up on "conversions". it's better not to have to do any. people who own a rocketbook will tell you they can now do conversions in their sleep. but they'll also tell you they would rather be dreaming... with the easy-to-author, easy-to-edit, easy-to-remix zen markup language, people will soon come to see that there's just no need for another format... especially since, if they _do_ ever need one, it's just a button-click away. but it will surprise you how quickly they'll decide to jettison other formats. and yeah, yeah, it will be easy for my opponents to label that "ridiculous". so that might set them off to yapping for a little while again, in an attempt to distract you from the fact they have no credibility left. but believe me, in a couple of years from now, when i have proven it, with tasty pudding, they will again want you to forget they ever said anything to the contrary...

-bowerbird

From hart at pglaf.org Mon Oct 29 08:49:59 2007
From: hart at pglaf.org (Michael Hart)
Date: Mon, 29 Oct 2007 08:49:59 -0700 (PDT)
Subject: [gutvol-d] Oct. 29, 1991 "Internet" First Appears
Message-ID:

16 years ago today the word "Internet" first appeared on a front page or cover of the major media. Wall Street Journal, Page One, the story about eBooks.

Thanks!!! Michael S.
Hart Founder Project Gutenberg

From hart at pglaf.org Mon Oct 29 11:31:54 2007
From: hart at pglaf.org (Michael Hart)
Date: Mon, 29 Oct 2007 11:31:54 -0700 (PDT)
Subject: [gutvol-d] CORRECTION: Oct. 29, 1991 "Internet" First Appears
Message-ID:

Perhaps I should have included a reference to an earlier article in the Washington Post about the "worm" that did some serious slowing down of the Internet in 1988. I don't know why the article I was referring to did not, in any way, mention this. . .perhaps they don't think of a political paper such as The Washington Post as "major" world media, or perhaps it just wasn't in their index.

Michael

16 years ago today the word "Internet" first appeared on a front page or cover of the major media. Wall Street Journal, Page One, the story about eBooks.

Thanks!!!

Michael S. Hart
Founder
Project Gutenberg