From bill at truthdb.org Tue Feb 1 01:01:19 2005
From: bill at truthdb.org (bill jenness)
Date: Tue Feb 1 01:01:51 2005
Subject: [gutvol-d] Top 100 EBooks this week (and other stories)
Message-ID: <1350.134.117.141.69.1107248479.squirrel@134.117.141.69>

The Picture of Dorian Gray (http://www.gutenberg.org/etext/174) is 47th and The Picture of Dorian Gray (http://www.gutenberg.org/etext/4078) is 99th. I guess these are different editions, but it seems a little odd nonetheless. The bibrec of the more recent edition lacks two subject entries that the earlier one lists.

The other story I want to mention has to do with two books that I had cleared but am not able to proceed with at this time:

gbn0307140503: Unknown, The Reason Why--Natural History. bill jenness . 1860c. 7/24/2003. ok.

and

gbn0307140507: warren colburn, Arithmetic upon the inductive method of instruction.... bill jenness . 1856p1826c. 7/24/2003. ok.

These books are no longer in my possession and my scanner is toast; I am limping along with a p166 until I can afford to pick up some new equipment. I have "The Reason Why" partially scanned, but that won't do anyone much good, as the bulb got progressively more discolored as I went along. If there is someone in Ottawa (Canada) who could scan them in, I could probably get my hands back on them, but they do not belong to me and I would need them returned.

From maitriv at yahoo.com Tue Feb 1 06:32:51 2005
From: maitriv at yahoo.com (maitri venkat-ramani)
Date: Tue Feb 1 06:32:56 2005
Subject: [gutvol-d] Arabic eTexts
In-Reply-To: <8d.1f9e780f.2f2fa784@aol.com>
Message-ID: <20050201143251.56694.qmail@web52302.mail.yahoo.com>

Some of these books may be in the public domain and worth looking into. Anyone particularly interested in developing an Arabic language partnership with the project mentioned below?
Maitri

SOFTWARE FOR SCANNING ARABIC DOCUMENTS

Noting that "the whole Internet is skewed toward people who speak English," computer scientist Venu Govindaraju of the University of Buffalo says his research group is developing software to scan Arabic printed and handwritten documents. Without optical character recognition software developed for a particular language, Govindaraju fears that "all the classic texts in that language will disappear into oblivion." The project's Arabic software will take into account the fact that characters may take different forms depending on where within a word they appear, and that Arabic vowels are pronounced but often not written. (AP 27 Jan 2005)

From nwolcott at dsdial.net Tue Feb 1 10:03:43 2005
From: nwolcott at dsdial.net (N Wolcott)
Date: Tue Feb 1 10:04:23 2005
Subject: [gutvol-d] Arabic eTexts
References: <20050201143251.56694.qmail@web52302.mail.yahoo.com>
Message-ID: <001e01c50888$64ec2820$2b9495ce@gw98>

A scientist at the Univ. of Washington developed an "Arabic printed text" after digitizing handwritten scripts by expert Arabic calligraphers. This was done because of the poor quality of the Arabic used in modern printed Arabic books. I remember a sample from Diocles' "On Burning Mirrors" which he put on the internet. Unfortunately, I believe his characters were kept private, although he had also developed a program that would write the script correctly, accounting for accents, position in word, etc. He developed outline fonts which could be the basis for something new if they are available. He did give me a deck of cards for the numerals, which has faded into punch card history.
----- Original Message ----- From: "maitri venkat-ramani" To: "Project Gutenberg Volunteer Discussion" Sent: Tuesday, February 01, 2005 9:32 AM Subject: [gutvol-d] Arabic eTexts > Some of these books may be in the public domain and worth looking into. > Anyone particularly interested in developing an Arabic language > partnership with the project mentioned below? > > Maitri > > SOFTWARE FOR SCANNING ARABIC DOCUMENTS > > Noting that "the whole Internet is skewed toward people who speak > English," computer scientist Venu Govindaraju of the University of > Buffalo says his research group is developing software to scan Arabic > printed and handwritten documents. Without optical character > recognition software developed for a particular language, Govindaraju > fears that "all the classic texts in that language will disappear into > oblivion." The project's Arabic software will take into account the > fact that characters may take different forms depending on where within > a word they appear, and that Arabic vowels are pronounced but often not > written. (AP 27 Jan 2005) > > > > > > > > __________________________________ > Do you Yahoo!? > Yahoo! Mail - You care about security. So do we. > http://promotions.yahoo.com/new_mail > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From bubblegirl at optusnet.com.au Tue Feb 1 14:27:18 2005 From: bubblegirl at optusnet.com.au (Season BubbleGirl) Date: Tue Feb 1 14:27:22 2005 Subject: [gutvol-d] Question. Message-ID: <200502012227.j11MR2ke026835@mail28.syd.optusnet.com.au> Hi, I'm compiling a clean jokes book to be included in free book archives. I was just wondering if jokes are copyrighted. Not comedian-specific jokes - jokes such as, Why did the chicken cross the road? I found all jokes on different webpages. Because the ebook is free, aren't I doing the same as the webpages? 
Season BubbleGirl: Writer, poet, Pocket PC enthusiast bubblegirl@bubblegirl.net www.bubblegirl.net Did you know ROM of PC POWERPLAY moved? He's now an AUSSIE PLAYING UP at www.bubblegirl.net/playingup.php -----Original Message----- From: "Gutenberg9443@aol.com" Sent: 02/01/2005 1:53:48 AM To: "gutvol-d@lists.pglaf.org" Subject: Re: [gutvol-d] date-sensitive info about ebook purchase In a message dated 1/30/2005 4:46:21 PM Mountain Standard Time, gbnewby@pglaf.org writes: Evidently, the mainstream publishers are not putting their mainstream works onto the Fictionwise site - maybe they're elsewhere. My strong suspicion is that many the works on the Fictionwise site are those that are owned by authors, not publishers. So, right now, this device doesn't replace bn.com or whatever for my reading of contemporary works. There is a lot of new stuff at FictionWise. It's in RB format and can be dumped straight into the ebook. Probably most of the stuff on the sites is older and the author has gotten copyright revision, but more and more publishers are getting the idea and putting new works up. For example, THE DA VINCI CODE went up on FictionWise about the same time it was released in hardback. Its success in eformat has certainly caught the eyes of other mainstream publishers. It's a good beginning, but it IS a beginning. Anne From shalesller at writeme.com Tue Feb 1 15:35:01 2005 From: shalesller at writeme.com (D. Starner) Date: Tue Feb 1 15:35:18 2005 Subject: [gutvol-d] Arabic eTexts Message-ID: <20050201233501.A9C654BDAB@ws1-1.us4.outblaze.com> "maitri venkat-ramani" writes: > Some of these books may be in the public domain and worth looking into. > Anyone particularly interested in developing an Arabic language > partnership with the project mentioned below? It sounds like they're writing software, not transcribing books. I'm not sure why this is news that everyone's carrying; is a ten-year old review of Arabic OCRs. 
is a commercially available Arabic OCR program, even if it's a touch expensive. is a nice list of OCR programs if you're looking for something beyond what ABBYY supports. -- ___________________________________________________________ Sign-up for Ads Free at Mail.com http://promo.mail.com/adsfreejump.htm From maitriv at yahoo.com Tue Feb 1 20:52:14 2005 From: maitriv at yahoo.com (maitri venkat-ramani) Date: Tue Feb 1 20:52:33 2005 Subject: [gutvol-d] Arabic eTexts In-Reply-To: <20050201233501.A9C654BDAB@ws1-1.us4.outblaze.com> Message-ID: <20050202045215.27164.qmail@web52310.mail.yahoo.com> >From the article I read, I got the impression that the lead researcher is passionate about certain texts which will be lost if his reader is not developed. I'll email him and find out if he has any particular eBook intentions and forward him some of our questions. Maitri --- "D. Starner" wrote: > "maitri venkat-ramani" writes: > > > Some of these books may be in the public domain and worth looking > into. > > Anyone particularly interested in developing an Arabic language > > partnership with the project mentioned below? > > It sounds like they're writing software, not transcribing books. I'm > not sure why this is news that everyone's carrying; > is a ten-year old review > of Arabic OCRs. > > is a commercially available Arabic OCR program, even if it's a touch > expensive. is a nice list of > OCR > programs if you're looking for something beyond what ABBYY supports. > -- __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From j.hagerson at comcast.net Wed Feb 2 17:04:12 2005 From: j.hagerson at comcast.net (John Hagerson) Date: Wed Feb 2 17:04:37 2005 Subject: [gutvol-d] GREG NEWBY: Please check your e-mail! Message-ID: <004301c5098c$48aa7810$6401a8c0@sarek> Sorry for the broadcast, but other methods to reach Greg have been unsuccessful. 
Greg: Please look for messages from Aaron Cannon and John Hagerson. Thank you. From gbnewby at pglaf.org Wed Feb 2 19:53:27 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Wed Feb 2 19:53:29 2005 Subject: [gutvol-d] GREG NEWBY: Please check your e-mail! In-Reply-To: <004301c5098c$48aa7810$6401a8c0@sarek> References: <004301c5098c$48aa7810$6401a8c0@sarek> Message-ID: <20050203035327.GB7603@pglaf.org> On Wed, Feb 02, 2005 at 07:04:12PM -0600, John Hagerson wrote: > Sorry for the broadcast, but other methods to reach Greg have been > unsuccessful. > > Greg: Please look for messages from Aaron Cannon and John Hagerson. Thank > you. Ok: soon. I've been a little busy. Life, job, flu, that sort of thing. -- Greg From cannona at fireantproductions.com Wed Feb 2 20:10:52 2005 From: cannona at fireantproductions.com (Aaron Cannon) Date: Wed Feb 2 20:12:38 2005 Subject: [gutvol-d] GREG NEWBY: Please check your e-mail! In-Reply-To: <20050203035327.GB7603@pglaf.org> References: <004301c5098c$48aa7810$6401a8c0@sarek> <20050203035327.GB7603@pglaf.org> Message-ID: <6.1.2.0.0.20050202220817.01c48840@mail.fireantproductions.com> Sorry to bother. It just appeared that messages just weren't getting through. You being occupied with other matters changes everything, and is completely understandable. Sorry again and take your time. Sincerely Aaron Cannon At 09:53 PM 2/2/2005, you wrote: >On Wed, Feb 02, 2005 at 07:04:12PM -0600, John Hagerson wrote: > > Sorry for the broadcast, but other methods to reach Greg have been > > unsuccessful. > > > > Greg: Please look for messages from Aaron Cannon and John Hagerson. Thank > > you. > >Ok: soon. > >I've been a little busy. Life, job, flu, that sort of thing. 
> -- Greg >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) From j.hagerson at comcast.net Wed Feb 2 20:12:56 2005 From: j.hagerson at comcast.net (John Hagerson) Date: Wed Feb 2 20:13:18 2005 Subject: [gutvol-d] A question raised by Part 2 of this week's weekly newsletter... Message-ID: <004f01c509a6$a3e4a920$6401a8c0@sarek> Quoth the newsletter: >And yes I said yes today is the 83rd anniversary of the first >publication of Ulysses. I'm probably missing something. Was Foghorn Leghorn involved with the publication of Ulysses? From gbnewby at pglaf.org Wed Feb 2 23:16:20 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Wed Feb 2 23:16:22 2005 Subject: [gutvol-d] GREG NEWBY: Please check your e-mail! In-Reply-To: <6.1.2.0.0.20050202220817.01c48840@mail.fireantproductions.com> References: <004301c5098c$48aa7810$6401a8c0@sarek> <20050203035327.GB7603@pglaf.org> <6.1.2.0.0.20050202220817.01c48840@mail.fireantproductions.com> Message-ID: <20050203071620.GC11085@pglaf.org> On Wed, Feb 02, 2005 at 10:10:52PM -0600, Aaron Cannon wrote: > Sorry to bother. It just appeared that messages just weren't getting > through. You being occupied with other matters changes everything, and is > completely understandable. > > Sorry again and take your time. De nada - I'm sorry for not responding sooner. It's always fine to re-send an email after a few days, since sometimes things get lost, deleted or filtered by mistake. -- Greg > At 09:53 PM 2/2/2005, you wrote: > >On Wed, Feb 02, 2005 at 07:04:12PM -0600, John Hagerson wrote: > >> Sorry for the broadcast, but other methods to reach Greg have been > >> unsuccessful. > >> > >> Greg: Please look for messages from Aaron Cannon and John Hagerson. Thank > >> you. > > > >Ok: soon. > > > >I've been a little busy. 
> >Life, job, flu, that sort of thing.
> > -- Greg

From gbnewby at pglaf.org Wed Feb 2 23:23:48 2005
From: gbnewby at pglaf.org (Greg Newby)
Date: Wed Feb 2 23:23:50 2005
Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available for review
In-Reply-To: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com>
References: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com>
Message-ID: <20050203072348.GC11403@pglaf.org>

On Fri, Jan 28, 2005 at 11:24:03AM -0600, Aaron Cannon wrote:
>
> >From: "John Hagerson"
> >To: "'Aaron Cannon'"
> >Subject: Proposed CD Navigation Files now available for review
> >Date: Fri, 28 Jan 2005 08:04:16 -0600
> >X-Mailer: Microsoft Outlook, Build 10.0.6626
> >
> >Four navigation files built for a new Project Gutenberg CD-ROM which
> >contains primarily non-English electronic books are now available for
> >review at http://www.aaronandgabby.com/pgcd/ The files allow one to
> >browse the CD by Author, Language and Author, Language and Title, or
> >Title.

This is great stuff! Once it's ready (or ready enough), I'd like to go ahead and make an .iso image to add to our collection.
  -- Greg

> >The files were developed from the Project Gutenberg production prior to
> >book 14700. The Distributed Proofreaders have been especially prolific in
> >non-English books recently, so it seems that a number of books of recent
> >production will be omitted regardless of where we draw the line.
> > > >I believe I have included every non-English book produced prior to 14700 > >with the exception of three books (7216, 7337, and 12407) where the title > >and author were both in Unicode characters that most fonts do not support. > >Each of the omitted works is in Chinese. If someone could help me obtain > >more information on these works, there is ample space to include them. > > > >Please respond to the list or directly to mailto:j.hagerson@comcast.net > >with > >your comments regarding the files. > > > >Thank you. > > > >Aaron: Before you forward this to the list, please make sure that the http > >download works. My attempts to view the directory were met with a 403 > >error. > >Thank you. > > > > -- > E-mail: cannona@fireantproductions.com > Skype: cannona > MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail > address.) > > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From jlinden at projectgutenberg.ca Wed Feb 2 23:41:09 2005 From: jlinden at projectgutenberg.ca (James Linden) Date: Wed Feb 2 23:44:29 2005 Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available for review In-Reply-To: <20050203072348.GC11403@pglaf.org> References: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com> <20050203072348.GC11403@pglaf.org> Message-ID: <4201D595.2050604@projectgutenberg.ca> >>>Four navigation files built for a new Project Gutenberg CD-ROM which >>>contains primarily non-English electronic books are now available for >>>review >>>at http://www.aaronandgabby.com/pgcd/ The files allow one to browse the CD >>>by Author, Language and Author, Language and Title, or Title. >>> > This is great stuff! Once it's raedy (or ready enough), > I'd like to go ahead and make an .iso image to add to our > collection. > -- Greg Why aren't we generating navigation pages from the catalog DB, instead of making HUGE single files? 
We should be providing indexes by Language, Subject Matter, Alphabetical Author, Alphabetical Title, etc -- all in a nicely paged manner. Other than that, I like the format of each record -- easily readable. Nice work Aaron! -- James From gbnewby at pglaf.org Wed Feb 2 23:48:18 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Wed Feb 2 23:48:19 2005 Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available for review In-Reply-To: <4201D595.2050604@projectgutenberg.ca> References: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com> <20050203072348.GC11403@pglaf.org> <4201D595.2050604@projectgutenberg.ca> Message-ID: <20050203074818.GA11955@pglaf.org> On Thu, Feb 03, 2005 at 02:41:09AM -0500, James Linden wrote: > >>>Four navigation files built for a new Project Gutenberg CD-ROM which > >>>contains primarily non-English electronic books are now available for > >>>review > >>>at http://www.aaronandgabby.com/pgcd/ The files allow one to browse the > >>>CD > >>>by Author, Language and Author, Language and Title, or Title. > >>> > >This is great stuff! Once it's raedy (or ready enough), > >I'd like to go ahead and make an .iso image to add to our > >collection. > > -- Greg > > Why aren't we generating navigation pages from the catalog DB, instead > of making HUGE single files? We should be providing indexes by Language, > Subject Matter, Alphabetical Author, Alphabetical Title, etc -- all in a > nicely paged manner. People using a CD or DVD directly won't necessarily have access to the catalog DB, so some sort of built-in file-based navigation seems necessary. Providing a CD-based program + database for Win or Mac or Lin or whatever would be fine, too (in addition to file-based), but we don't have one. > Other than that, I like the format of each record -- easily readable. > Nice work Aaron! Related: I'm finally making moves (thanks to the XML/RDF file) on generating ISO files on the fly, based on a list of eBook #s. Stay tuned... 
-- Greg From jlinden at projectgutenberg.ca Wed Feb 2 23:52:51 2005 From: jlinden at projectgutenberg.ca (James Linden) Date: Wed Feb 2 23:56:10 2005 Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available for review In-Reply-To: <20050203074818.GA11955@pglaf.org> References: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com> <20050203072348.GC11403@pglaf.org> <4201D595.2050604@projectgutenberg.ca> <20050203074818.GA11955@pglaf.org> Message-ID: <4201D853.504@projectgutenberg.ca> Greg Newby wrote: > On Thu, Feb 03, 2005 at 02:41:09AM -0500, James Linden wrote: > >>>>>Four navigation files built for a new Project Gutenberg CD-ROM which >>>>>contains primarily non-English electronic books are now available for >>>>>review >>>>>at http://www.aaronandgabby.com/pgcd/ The files allow one to browse the >>>>>CD >>>>>by Author, Language and Author, Language and Title, or Title. >>>>> >>> >>>This is great stuff! Once it's raedy (or ready enough), >>>I'd like to go ahead and make an .iso image to add to our >>>collection. >>> -- Greg >> >> Why aren't we generating navigation pages from the catalog DB, instead >>of making HUGE single files? We should be providing indexes by Language, >>Subject Matter, Alphabetical Author, Alphabetical Title, etc -- all in a >>nicely paged manner. > > > People using a CD or DVD directly won't necessarily have access > to the catalog DB, so some sort of built-in file-based navigation > seems necessary. > > Providing a CD-based program + database for Win or Mac or Lin or > whatever would be fine, too (in addition to file-based), but we don't > have one. The idea of _generating_ the navigation files is that we can burn static files on the CD, but generate them for each CD image version using paging, various sort options, etc. This does not require users to have access to the DB, only a simple script that creates the HTML files. 
-- James From cannona at fireantproductions.com Thu Feb 3 00:24:17 2005 From: cannona at fireantproductions.com (Aaron Cannon) Date: Thu Feb 3 00:25:40 2005 Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available for review In-Reply-To: <4201D595.2050604@projectgutenberg.ca> References: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com> <20050203072348.GC11403@pglaf.org> <4201D595.2050604@projectgutenberg.ca> Message-ID: <6.1.2.0.0.20050203022013.01af7838@mail.fireantproductions.com> At 01:41 AM 2/3/2005, you wrote: > Other than that, I like the format of each record -- easily readable. > Nice work Aaron! Thanks. I only wish I could take credit. :) Actually the majority of the work on the CD came from John Hagerson. I've just been doing some very light assisting. Nevertheless, your feedback is appreciated by both of us. Sincerely Aaron Cannon -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) From cannona at fireantproductions.com Thu Feb 3 00:28:21 2005 From: cannona at fireantproductions.com (Aaron Cannon) Date: Thu Feb 3 00:29:46 2005 Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available for review In-Reply-To: <20050203072348.GC11403@pglaf.org> References: <6.1.2.0.0.20050128112358.01c90da0@mail.fireantproductions.com> <20050203072348.GC11403@pglaf.org> Message-ID: <6.1.2.0.0.20050203022635.01c81178@mail.fireantproductions.com> At 01:23 AM 2/3/2005, you wrote: >This is great stuff! Once it's raedy (or ready enough), >I'd like to go ahead and make an .iso image to add to our >collection. > -- Greg Indeed. We'll be sure to build it under linux, so as to avoid any problems with capitalization of file names. -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) 
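The static-generation idea James describes in this thread (burn plain HTML files on the CD, but produce them with a script so they can be paged and sorted) can be sketched roughly as follows. This is only an illustration in Python, not the script anyone on the list actually wrote: the record fields (id, title, author, language), the page size, the lang-*.html file names and the sample records are all assumptions made for the example, and a real build would read its records from the catalog DB or the XML/RDF file.

```python
from html import escape


def paginate(items, page_size):
    """Split a list into consecutive chunks of at most page_size items."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


def render_page(heading, records, page_no, page_count, stem):
    """Render one static HTML page with previous/next links."""
    rows = "\n".join(
        "<li>{} - {} (eBook #{})</li>".format(
            escape(r["author"]), escape(r["title"]), r["id"]
        )
        for r in records
    )
    nav = []
    if page_no > 1:
        nav.append('<a href="{}-{}.html">previous</a>'.format(stem, page_no - 1))
    if page_no < page_count:
        nav.append('<a href="{}-{}.html">next</a>'.format(stem, page_no + 1))
    return (
        "<html><head><title>{h}</title></head><body>\n"
        "<h1>{h} (page {n} of {c})</h1>\n<ul>\n{rows}\n</ul>\n<p>{nav}</p>\n"
        "</body></html>\n"
    ).format(h=escape(heading), n=page_no, c=page_count,
             rows=rows, nav=" | ".join(nav))


def build_language_index(records, page_size=50):
    """Return {filename: html}: one paged index per language,
    sorted by author and then title."""
    by_language = {}
    for r in records:
        by_language.setdefault(r["language"], []).append(r)

    files = {}
    for language, recs in sorted(by_language.items()):
        recs.sort(key=lambda r: (r["author"].lower(), r["title"].lower()))
        stem = "lang-" + language.lower().replace(" ", "-")
        pages = paginate(recs, page_size)
        for n, page in enumerate(pages, start=1):
            files["{}-{}.html".format(stem, n)] = render_page(
                language, page, n, len(pages), stem
            )
    return files


# Hypothetical placeholder records, not real catalog entries.
SAMPLE = [
    {"id": 1, "title": "Example Title B", "author": "Author, Second", "language": "Welsh"},
    {"id": 2, "title": "Example Title A", "author": "Author, First", "language": "Welsh"},
    {"id": 3, "title": "Example Title C", "author": "Author, Third", "language": "German"},
]

if __name__ == "__main__":
    for name in sorted(build_language_index(SAMPLE, page_size=1)):
        print(name)
```

The function returns the pages as strings rather than writing them, so the same code could feed a CD image build or a quick review on a web server. Indexes by title or by author would presumably be the same loop with a different grouping key.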
From ke at gnu.franken.de Wed Feb 2 20:48:24 2005
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Thu Feb 3 07:39:09 2005
Subject: [gutvol-d] Re: Error Correction Data Needed
In-Reply-To: (Michael Hart's message of "Fri, 28 Jan 2005 10:06:10 -0800 (PST)")
References:
Message-ID:

Michael Hart writes:

> However, my most recent research, in conjunction with the head
> of error correction at a major publisher, leads me to think 1/3
> of errors might be found per pass, instead of the previous 1/2.

What about a proper case study? I'd say you had better give up on talking about numbers ;)

If you are interested in catching errors, print the text out and read a paper copy. Or, even better, if you are interested in 1:1 accuracy between the original and the copy, have one person read the text aloud, with all diacritical marks, while a second person looks at the text of the copy.

-- 
http://www.gnu.franken.de/ke/                            |      ,__o
                                                         |    _-\_<,
                                                         |   (*)/'(*)
Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C

From ag737 at freenet.carleton.ca Thu Feb 3 08:17:23 2005
From: ag737 at freenet.carleton.ca (Wallace J.McLean)
Date: Thu Feb 3 08:17:32 2005
Subject: [gutvol-d] Re Error Correction Data Needed
Message-ID: <4114d14113c9.4113c94114d1@ncf.ca>

I'm inclined to think that the 1/3 figure, AT MOST, may be closer to the truth.

I've been working on a massive (300,000 word) publication for a number of years now. (It's FINALLY in pre-press, hooray!)

My workflow was:

Handkey text (except for 16 pages I OCRd as a test)
Proofread 1
Attestation* 1
Proofread 2
Attestation* 2
Skimread (very superficial, but often you find stupid errors that way. Like the one on PAGE 1!!!)
Software spellcheck
Proofread 3
Readback** 1
Readback 2
And IIRC, Readback 3

* Attestation: Comparing my typescript to the original, word-by-word, phrase-by-phrase.
** Readback: After the HUGE error rates I was still getting after each previous pass, I bought voice synthesis software and had the work read back to me while I followed along in the original.

I've kept stats somewhere on the error catch-rate at each stage; I'll dig them up later.

The caveat, of course, is that the only way for me to get "fresh eyes" on the project was to put it aside for a few weeks or months; I can't afford to hire someone else.

The error rate on the last pass was so small that, even if I had only caught 30% of the remaining errors, the few that can statistically be expected to remain are no longer worth chasing under the law of diminishing returns.

----- Original Message -----
From: Michael Hart
Date: Fri, 28 Jan 2005 10:06:10 -0800 (PST)
Subject: [gutvol-d] Error Correction Data Needed

[Please excuse cross-posting.]

However, my most recent research, in conjunction with the head of error correction at a major publisher, leads me to think 1/3 of errors might be found per pass, instead of the previous 1/2.

If any of you have any suggestions as to what these figures are, please let me know.

From sly at victoria.tc.ca Thu Feb 3 09:21:17 2005
From: sly at victoria.tc.ca (Andrew Sly)
Date: Thu Feb 3 09:21:42 2005
Subject: [gutvol-d] Re Error Correction Data Needed
In-Reply-To: <4114d14113c9.4113c94114d1@ncf.ca>
References: <4114d14113c9.4113c94114d1@ncf.ca>
Message-ID:

On Thu, 3 Feb 2005, Wallace J.McLean wrote:

> Skimread (very superficial, but often you find stupid errors that way.
> Like the one on PAGE 1!!!)
>

That reminds me of a book I have called "Indian Myths and Legends" which, in the introduction, details how the whole text was carefully translated from the German and double-checked many times over the course of 30 years.

And of course, I see an obvious error on page 1.
:)

Andrew

From gbuchana at rogers.com Sat Feb 5 08:15:54 2005
From: gbuchana at rogers.com (Gardner Buchanan)
Date: Sat Feb 5 08:16:10 2005
Subject: [gutvol-d] Fwd: Proposed CD Navigation Files now available f
In-Reply-To: <6.1.2.0.0.20050203022635.01c81178@mail.fireantproductions.com>
Message-ID:

Hi Aaron,

On 08:28:21 Aaron Cannon wrote:
>
> Indeed. We'll be sure to build it under linux, so as to avoid any problems
> with capitalization of file names.
>

I noticed that in the BrowsebyLanguageandTitle page there is some funny business at the end, with the Welsh section appearing more than once.

In general, my comment is that the pages are too large. I think a page with just the languages, each linked to a page with the titles for that language, would, for example, be more manageable.

============================================================
Gardner Buchanan
Ottawa, ON                      FreeBSD: Where you want to go. Today.

From ag737 at freenet.carleton.ca Sat Feb 5 10:16:34 2005
From: ag737 at freenet.carleton.ca (Wallace J.McLean)
Date: Sat Feb 5 10:16:41 2005
Subject: [gutvol-d] Error rate statistics
Message-ID: <442d4e44a4fa.44a4fa442d4e@ncf.ca>

As I previously discussed, these are the figures from a project I've been working on for several years. It's a massive, three-volume job, for print publication.

After each round, described below, I would reprint the text, verify the correction of the previous round's errors, and then do another round. I also did a batch-verify of ALL previous rounds' corrections, finding one or two that I had missed along the way.

Round   Type    Errors
1/1a    p/a     944
2       p       415
3       a       454
4&5     p/a     154
        sc      35-40
6       rb      170
7       sr      0
8       rb      64
"9"     rb      0

Explanation: Round = round, Type = type of reading: p/roofreading, a/ttestation, s/pell c/heck, s/kim r/ead, r/ead b/ack using voice synthesis.

I keyed most of the text, apart from a small sample (about 15-20pp) which I OCRd near the end of the text-entry phase.
This made it imperative that I not only proofread the typescript in the conventional sense, but also attest it: compare it back to the original, para by para, line by line, word by word. There were many errors that I introduced to the text that would pass spellcheck or proofread; they weren't "errors", but they weren't faithful to the text, either. They had to be exterminated.

Some - many, actually - of the errors were native to the original. As I keyed the text, I retained them, but corrected them afterwards. Thus, the error stats are somewhat inflated, in that a good number of them, probably 10 or 12 percent, weren't my fault.

Rounds 1 and 1a, my first attestation and proof, I did on the same copy, so I couldn't do separate stats.

After rounds 4-5, I did a spellcheck, which returned about 35-40 spelling errors which my eyes hadn't caught. This was a bit of a shock to my own esteem of my proofing skills, so I went out and got some speech synthesis software to do readbacks. I'd clip a few hundred words at a time, and follow in the original, highlighting discrepancies as I went along.

7 was a skimread of the whole thing.

8 was a second full readback. I know I did a third full readback, but didn't seem to keep stats on it.

"9" was a partial readback. At 64 errors in round 8, that works out to about one discrepancy every 15 pages or 4500 words. I did a bunch of batches of 15 pages and 4500 words, and also did a complete readback of several of the most error-prone sections of the book. Even with the long breaks I took in between rounds, round "9", with no moments of sheer "d'uh" to break up the monotony, was where the law of diminishing returns kicked in. I re-did perhaps 15% of the entire text without finding any further errors.

At that point, I estimated the number of remaining typos or text discrepancies in the entire book to be somewhere between 6 and 20, and I'll be damned if I'm going to spend another three months of evenings hunting the buggers down.
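The round-by-round counts in this message can be turned into rough per-round catch rates with a few lines of Python. This is a sketch for illustration, not the author's own calculation. It assumes the spellcheck figure is 38 (the message gives 35-40), and it treats the sum of all errors found as the true total, which flatters the later rounds because any errors still in the text are not counted.

```python
# Error counts per round, copied from the table in this message.
# The spellcheck figure is reported as 35-40; 38 is an assumed value.
ROUNDS = [
    ("1/1a p/a", 944),
    ("2 p", 415),
    ("3 a", 454),
    ("4&5 p/a", 154),
    ("sc", 38),
    ("6 rb", 170),
    ("7 sr", 0),
    ("8 rb", 64),
    ('"9" rb', 0),
]


def catch_fractions(rounds):
    """For each round, return (label, found, known_before, fraction).

    known_before is the number of errors known, with hindsight, to have
    been in the text when the round started; fraction is
    found / known_before (0.0 when nothing was known to remain).
    """
    remaining = sum(found for _, found in rounds)
    rows = []
    for label, found in rounds:
        fraction = found / remaining if remaining else 0.0
        rows.append((label, found, remaining, fraction))
        remaining -= found
    return rows


def share_found_by(rounds, labels):
    """Fraction of all found errors that came from the named rounds."""
    total = sum(found for _, found in rounds)
    return sum(found for label, found in rounds if label in labels) / total


if __name__ == "__main__":
    for label, found, known_before, fraction in catch_fractions(ROUNDS):
        print("{:10s} found {:4d} of {:4d} known ({:.0%})".format(
            label, found, known_before, fraction))
    machine = share_found_by(ROUNDS, {"sc", "6 rb", "8 rb", '"9" rb'})
    print("found only by spellcheck or readback: {:.1%}".format(machine))
```

Under those assumptions the first round catches roughly 42% of the errors eventually found, and spellcheck plus readback account for roughly 12%, which agrees with the "under 50%" and "about 12%" figures quoted in this message.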
(At the same time, in my second readback pass, I at times would go 100 pages without finding ANY errors, then hit three or four on the same page.)

The total number of native typos, my typos, and my transcription errors worked out to about 2 per 300-word page. Not great, but not bad. It was probably actually higher, but in my early eyeball rounds, if I came across an error that I thought I had repeatedly made, I would do a global search, attest, and replace on it when I did the corrections at the end of that round.

However, I only caught under 50% by eye on my first round, and fewer than 90% by eye, overall, on subsequent rounds. About 12% I would not have caught at all, but for speech synthesis and spellcheck.

From miranda_vandeheijning at blueyonder.co.uk Sun Feb 6 02:36:59 2005
From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning)
Date: Sun Feb 6 02:37:26 2005
Subject: [gutvol-d] Surge in users?
In-Reply-To: <20050202045215.27164.qmail@web52310.mail.yahoo.com>
References: <20050202045215.27164.qmail@web52310.mail.yahoo.com>
Message-ID: <4205F34B.2010606@blueyonder.co.uk>

Just wondering, I was looking through the PG Top 100 and realised the figures for all the books are a lot higher than usual. Do we have a surge in visitors this week?

Secondly, mainly because I can't get enough of stats, would it be possible to have a 'total number of books downloaded' somewhere, so we can compare week on week how we are doing?

Miranda

From marcello at perathoner.de Sun Feb 6 10:25:22 2005
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun Feb 6 11:33:11 2005
Subject: [gutvol-d] Surge in users?
In-Reply-To: <4205F34B.2010606@blueyonder.co.uk>
References: <20050202045215.27164.qmail@web52310.mail.yahoo.com> <4205F34B.2010606@blueyonder.co.uk>
Message-ID: <42066112.1020206@perathoner.de>

Miranda van de Heijning wrote:

> Just wondering, I was looking through the PG Top 100 and realised the
> figures for all the books are a lot higher than usual.
Do we have a > surge in visitors this week? Due to problems at ibiblios file servers we didn't get the log files for some days and so the script couldn't count the downloads. If you want to see the global numbers go to: http://www.gutenberg.org/internal/stats/2005/02/ user: internal pass: books and look at month-files.html To see an independent stat about gutenberg.org's popularity go to: http://www.alexa.com/data/details/traffic_details?&range=3m&size=large&compare_sites=gutenberg.net,promo.net&y=t&url=gutenberg.org > Secondly, mainly because I can't get enough of stats, would it be > possible to have a 'total number of books downloaded' somewhere, so we > can compare week on week how we are doing? I could add that figure quite easily on the top 100 page but it will be misleading. We just count the downloads from ibiblio's servers. We don't know how many books get downloaded from our mirrors. And, at the rate we are going, we are far below the numbers Michael likes to put in his newsletter (billions, trillions, gazillions) and most likely that'll start another war about how to count downloaded ebooks. -- Marcello Perathoner webmaster@gutenberg.org From hyphen at hyphenologist.co.uk Sun Feb 6 12:53:02 2005 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sun Feb 6 12:53:29 2005 Subject: [gutvol-d] Surge in users? In-Reply-To: <4205F34B.2010606@blueyonder.co.uk> References: <20050202045215.27164.qmail@web52310.mail.yahoo.com> <4205F34B.2010606@blueyonder.co.uk> Message-ID: <0m0d01tu4u2243uknmr00t3q4j7mhqebci@4ax.com> On Sun, 06 Feb 2005 10:36:59 +0000, Miranda van de Heijning wrote: | | Just wondering, I was looking through the PG Top 100 and realised the | figures for all the books are a lot higher than usual. Do we have a | surge in visitors this week? | | Secondly, mainly because I can't get enough of stats, would it be | possible to have a 'total number of books downloaded' somewhere, so we | can compare week on week how we are doing? 
Beware the numbers of books downloaded will vary drastically on a weekly basis, because of Christmas and other public holidays, University and school holidays etc. IMO monthly running averages would give a better idea of what is happening. -- Dave F From krooger at debian.org Sun Feb 6 14:16:43 2005 From: krooger at debian.org (Jonathan Walther) Date: Sun Feb 6 14:16:58 2005 Subject: [gutvol-d] Can project use legally encumbered scans? In-Reply-To: <00a001c5048d$5371e480$f69495ce@gw98> References: <20050126210155.GA8093@reactor-core.org> <00a001c5048d$5371e480$f69495ce@gw98> Message-ID: <20050206221643.GA22130@reactor-core.org> On Thu, Jan 27, 2005 at 11:20:55AM -0500, N Wolcott wrote: >If you have a a valuable collection, if the scans are high quality >tiff's or tiff's and jpegs you might enquire about space on ibiblio >where they can be accessed as a collection. Many PG tiff's are just >high enought quality to "get the job done", you might want yours to be >separated from the dross. I know of a situation. Let's say that it's hypothetical. Someone got access to some extremely old and rare books, and photographed them. The photos were scanned and distributed on CDROM by a company. The owners of the photos say the scans constitute stolen property, and after years of legal action, stopped the company from distributing the scans. The books in question are up to 500 years old and unlikely to ever come back into print. What is PG's position? The books themselves are clearly not in copyright; the few remaing copies are heirlooms tucked away in a few select private libraries. PG would not be distributing the scans themselves. If PG could get access to the scans, would it be ethical to use them? Please let me know the official answer. Jonathan -- It's not true unless it makes you laugh, but you don't understand it until it makes you weep. 
Eukleia: Jonathan Walther Address: 12706 99 Ave, Surrey, BC V3V2P8 (Canada) Contact: 604-684-1319 (daytime) Contact: 604-582-9308 (morning and evening) Puritan: Purity of faith, Purity of doctrine. Sola Scriptura! From gbnewby at pglaf.org Sun Feb 6 14:31:44 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Feb 6 14:31:44 2005 Subject: [gutvol-d] Can project use legally encumbered scans? In-Reply-To: <20050206221643.GA22130@reactor-core.org> References: <20050126210155.GA8093@reactor-core.org> <00a001c5048d$5371e480$f69495ce@gw98> <20050206221643.GA22130@reactor-core.org> Message-ID: <20050206223144.GA30756@pglaf.org> On Sun, Feb 06, 2005 at 02:16:43PM -0800, Jonathan Walther wrote: > On Thu, Jan 27, 2005 at 11:20:55AM -0500, N Wolcott wrote: > >If you have a a valuable collection, if the scans are high quality > >tiff's or tiff's and jpegs you might enquire about space on ibiblio > >where they can be accessed as a collection. Many PG tiff's are just > >high enought quality to "get the job done", you might want yours to be > >separated from the dross. > > I know of a situation. Let's say that it's hypothetical. Someone got > access to some extremely old and rare books, and photographed them. The > photos were scanned and distributed on CDROM by a company. The owners > of the photos say the scans constitute stolen property, and after years > of legal action, stopped the company from distributing the scans. The > books in question are up to 500 years old and unlikely to ever come back > into print. > > What is PG's position? The books themselves are clearly not in > copyright; the few remaing copies are heirlooms tucked away in a few > select private libraries. PG would not be distributing the scans > themselves. If PG could get access to the scans, would it be ethical to > use them? (Are you talking about scans of photos, from CDs? Were there any other value-added processes involved in creating the scans/photos? 
Are these entire books, or some sort of collection of items, which might have a compilation copyright?) > Please let me know the official answer. This is an official answer, but doesn't quite meet your needs. The short answer is that it's hard to deal with hypotheticals, since there are a few issues that could mitigate. The main one is if there's a relevant court case that was decided that could impact our decision. The other is if the books could count as unpublished manuscripts, which get a separate copyright period of modern-day protection, regardless of when they were published (http://gutenberg.org/howto/copyright-howto). But our basic answer is that IF the source is verifiably public domain in the US, using our clearance procedures, then scans or pictures of the source, as well as OCR, proofreading, markup, and completed eBooks, are also public domain. This is a position that has been vetted by several lawyers who help PG, but has not yet been tested in court as far as we know. The closest counter-example I can think of is the dead sea scrolls, which (IIRC) did end up with some sort of copyright protection despite their age. In other words, there *might* be a risk. When we get such requests, we sometimes need to look at the risk of getting sued, as well as our own procedures. We're definitely willing to take risks, but in a thoughtful manner. Feel free to send me further details, or just upload the request via http://copy.pglaf.org, along with details. -- Greg Dr. Gregory B. Newby Chief Executive and Director Project Gutenberg Literary Archive Foundation http://gutenberg.net A 501(c)(3) not-for-profit organization with EIN 64-6221541 gbnewby@pglaf.org From kouhia at nic.funet.fi Mon Feb 7 06:21:16 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Mon Feb 7 06:21:26 2005 Subject: [gutvol-d] Re: Arabic eTexts Message-ID: Are those Arabic OCR software open source and free? 
Lacking Arabic OCR software has not prevented us from digitizing Arabic texts before. If buying $$$$ software is what it takes to get you motivated to digitize Arabic texts, then that is fine by me. However, I feel the Arabic texts should be digitized first as image files, especially if the text is written by hand. This approach will be cheaper and faster as well. Please don't make the mistake of failing to archive and make available the images if you choose the OCR approach. I'm pleased to archive any Arabic digitizations as image files for now and for future use. Only image files can preserve the text as close to the original as possible. Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From hart at pglaf.org Mon Feb 7 07:52:05 2005 From: hart at pglaf.org (Michael Hart) Date: Mon Feb 7 07:52:06 2005 Subject: [gutvol-d] Can project use legally encumbered scans? In-Reply-To: <20050206223144.GA30756@pglaf.org> References: <20050126210155.GA8093@reactor-core.org> <00a001c5048d$5371e480$f69495ce@gw98> <20050206221643.GA22130@reactor-core.org> <20050206223144.GA30756@pglaf.org> Message-ID: Photographs, even of public domain materials, can be copyrighted, though I doubt a similar photograph would infringe. However, this has not been established for photocopies, scans, etc. As for the WORDS on the pages, in the photographs, etc., those are still in the public domain, and you could legally type/scan them in to create an eBook, probably even if the license says you cannot. This would be similar to the case of someone owning a painting in the public domain, and you take a picture of it. You could either copyright the picture or put it in the public domain. Some people, such as museums, claim all rights to reproduction of certain public domain materials, but I don't know if that can be enforced outside of certain contracts with the museums.
Perhaps just walking into the museums is regarded in some places like ye olde "shrinkwrap" licenses that are no longer legally enforceable. I am not a lawyer. . .this is NOT a legal opinion or legal advice. IANAL = I am not a lawyer. mh On Sun, 6 Feb 2005, Greg Newby wrote: > On Sun, Feb 06, 2005 at 02:16:43PM -0800, Jonathan Walther wrote: >> On Thu, Jan 27, 2005 at 11:20:55AM -0500, N Wolcott wrote: >>> If you have a a valuable collection, if the scans are high quality >>> tiff's or tiff's and jpegs you might enquire about space on ibiblio >>> where they can be accessed as a collection. Many PG tiff's are just >>> high enought quality to "get the job done", you might want yours to be >>> separated from the dross. >> >> I know of a situation. Let's say that it's hypothetical. Someone got >> access to some extremely old and rare books, and photographed them. The >> photos were scanned and distributed on CDROM by a company. The owners >> of the photos say the scans constitute stolen property, and after years >> of legal action, stopped the company from distributing the scans. The >> books in question are up to 500 years old and unlikely to ever come back >> into print. >> >> What is PG's position? The books themselves are clearly not in >> copyright; the few remaing copies are heirlooms tucked away in a few >> select private libraries. PG would not be distributing the scans >> themselves. If PG could get access to the scans, would it be ethical to >> use them? > > (Are you talking about scans of photos, from CDs? Were there any > other value-added processes involved in creating the scans/photos? > Are these entire books, or some sort of collection of items, which > might have a compilation copyright?) > >> Please let me know the official answer. > > This is an official answer, but doesn't quite meet your needs. > > The short answer is that it's hard to deal with hypotheticals, > since there are a few issues that could mitigate.
The main > one is if there's a relevant court case that was decided that > could impact our decision. The other is if the books could count > as unpublished manuscripts, which get a separate copyright > period of modern-day protection, regardless of when they > were published (http://gutenberg.org/howto/copyright-howto). > > But our basic answer is that IF the source is verifiably > public domain in the US, using our clearance procedures, > then scans or pictures of the source, as well as OCR, > proofreading, markup, and completed eBooks, are also public > domain. > > This is a position that has been vetted by several > lawyers who help PG, but has not yet been tested in court > as far as we know. The closest counter-example I can > think of is the dead sea scrolls, which (IIRC) did end > up with some sort of copyright protection despite their age. > > In other words, there *might* be a risk. When we get such > requests, we sometimes need to look at the risk of getting sued, > as well as our own procedures. We're definitely willing > to take risks, but in a thoughtful manner. > > Feel free to send me further details, or just upload > the request via http://copy.pglaf.org, along with details. > -- Greg > > Dr. Gregory B. Newby > Chief Executive and Director > Project Gutenberg Literary Archive Foundation http://gutenberg.net > A 501(c)(3) not-for-profit organization with EIN 64-6221541 > gbnewby@pglaf.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From shimmin at uiuc.edu Mon Feb 7 08:16:00 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Mon Feb 7 08:16:05 2005 Subject: [gutvol-d] Can project use legally encumbered scans? 
In-Reply-To: References: <20050126210155.GA8093@reactor-core.org> <00a001c5048d$5371e480$f69495ce@gw98> <20050206221643.GA22130@reactor-core.org> <20050206223144.GA30756@pglaf.org> Message-ID: <42079440.4080106@uiuc.edu> The closest U.S. case law I know of is Bridgeman Art Library Ltd. v. Corel Corporation (1999). There, a U.S. District Court ruled that photographic reproductions of two-dimensional works of art, where the goal is to make as accurate a reproduction of the work as possible, were not 'original works,' and therefore not copyrightable. By no means does this apply to all photographs of artwork, but only those where the artistic capacity of the photographer in choosing angle, composing the subject matter, selecting lighting, etc., has been subjugated to the overarching goal of reproducing the artwork as accurately as possible. -- RS From hart at pglaf.org Mon Feb 7 08:38:17 2005 From: hart at pglaf.org (Michael Hart) Date: Mon Feb 7 08:38:18 2005 Subject: [gutvol-d] Surge in users? In-Reply-To: <0m0d01tu4u2243uknmr00t3q4j7mhqebci@4ax.com> References: <20050202045215.27164.qmail@web52310.mail.yahoo.com> <4205F34B.2010606@blueyonder.co.uk> <0m0d01tu4u2243uknmr00t3q4j7mhqebci@4ax.com> Message-ID: This could have been from some press we got in the UK. And don't forget that every once in a while some big outfit like Yahoo or Google just grabs everything, likely if we get lots more hits, but over all the eBooks in general. . . . mh On Sun, 6 Feb 2005, Dave Fawthrop wrote: > On Sun, 06 Feb 2005 10:36:59 +0000, Miranda van de Heijning > wrote: > > | > | Just wondering, I was looking through the PG Top 100 and realised the > | figures for all the books are a lot higher than usual. Do we have a > | surge in visitors this week? > | > | Secondly, mainly because I can't get enough of stats, would it be > | possible to have a 'total number of books downloaded' somewhere, so we > | can compare week on week how we are doing? 
> > Beware the numbers of books downloaded will vary drastically on a weekly > basis, because of Christmas and other public holidays, University and > school holidays etc. > > IMO monthly running averages would give a better idea of what is happening. > > > > -- > Dave F > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From cannona at fireantproductions.com Mon Feb 7 10:32:46 2005 From: cannona at fireantproductions.com (Aaron Cannon) Date: Mon Feb 7 11:09:19 2005 Subject: [gutvol-d] Surge in users? In-Reply-To: References: <20050202045215.27164.qmail@web52310.mail.yahoo.com> <4205F34B.2010606@blueyonder.co.uk> <0m0d01tu4u2243uknmr00t3q4j7mhqebci@4ax.com> Message-ID: <6.1.2.0.0.20050207123100.01beaea0@mail.fireantproductions.com> At 10:38 AM 2/7/2005, you wrote: >This could have been from some press we got in the UK. I believe this first explanation to be more likely, as the requests for DVDs have gone through the roof, and 90% of them were from the UK. Fortunately, things have slowed down a lot over the last few days. Aaron >And don't forget that every once in a while some big >outfit like Yahoo or Google just grabs everything, >likely if we get lots more hits, but over all the >eBooks in general. . . . > >mh > > >On Sun, 6 Feb 2005, Dave Fawthrop wrote: > >>On Sun, 06 Feb 2005 10:36:59 +0000, Miranda van de Heijning >> wrote: >> >>| >>| Just wondering, I was looking through the PG Top 100 and realised the >>| figures for all the books are a lot higher than usual. Do we have a >>| surge in visitors this week? >>| >>| Secondly, mainly because I can't get enough of stats, would it be >>| possible to have a 'total number of books downloaded' somewhere, so we >>| can compare week on week how we are doing? 
>> >>Beware the numbers of books downloaded will vary drastically on a weekly >>basis, because of Christmas and other public holidays, University and >>school holidays etc. >> >>IMO monthly running averages would give a better idea of what is happening. >> >> >> >>-- >>Dave F >> >>_______________________________________________ >>gutvol-d mailing list >>gutvol-d@lists.pglaf.org >>http://lists.pglaf.org/listinfo.cgi/gutvol-d >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) From servalan at ar.com.au Tue Feb 8 16:47:16 2005 From: servalan at ar.com.au (Pauline) Date: Tue Feb 8 16:48:12 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 Message-ID: <42095D94.1020509@ar.com.au> Hi All, I don't think I am the only one with this problem, but I have yet to see this being discussed here... In Firefox 1.0 links to recently posted extexts from the posted mailing list such as: http://www.gutenberg.net/1/4/9/8/14980 fail. I see an error: Files Lookup I see no such file here! (1/4/9/8/14980) The links work perfectly well in IE6. The links start to work in Firefox a few days after a project is posted, but it's very frustrating to not be able to send links to others without knowing whether they will fail or not. There is a discussion at DP on this issue here: http://www.pgdp.net/phpBB2/viewtopic.php?p=109131#109131 Some users say if they change the skin back to the default, links work OK again. No such luck for me. Thanks in advance, P -- Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." From kouhia at nic.funet.fi Wed Feb 9 08:54:17 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Wed Feb 9 08:54:28 2005 Subject: [gutvol-d] Re: Can project use legally encumbered scans? 
Message-ID: >From: Jonathan Walther > >I know of a situation. Let's say that it's hypothetical. Someone got >access to some extremely old and rare books, and photographed them. The >photos were scanned and distributed on CDROM by a company. The owners >of the photos say the scans constitute stolen property, and after years >of legal action, stopped the company from distributing the scans. We can and should process the scans privately. Please make them (if the scans were not entirely hypothetical) available privately, only to people who are willing to help. Remember, Hershey fonts were in public domain but the file format was not. The solution was to convert the data to another file format. How that info could be exploited here? I would like to have a copy of every scan. If I develop a solution, the scans may not be available when the solution is ready. That would only waste our time and give false hopes. Best regards, Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From maitriv at yahoo.com Wed Feb 9 09:41:21 2005 From: maitriv at yahoo.com (maitri venkat-ramani) Date: Wed Feb 9 09:41:28 2005 Subject: [gutvol-d] Re: Arabic eTexts In-Reply-To: Message-ID: <20050209174122.75408.qmail@web52306.mail.yahoo.com> --- Juhana Sadeharju wrote: > Are those Arabic OCR software open source and free? I don't know, I merely pointed you all to the research that is going on in this area. My hope was that one/several of our volunteers who are interested in and have previous experience in PG Arabic etexts would get in touch with the project and find out what they are doing. > Having no Arabic OCR software has not prevented us from > digitizing Arabic texts earlier. If only buying a $$$$ software > gets you motivated to digitize arabic texts, then it is fine > by me. It's not the digitization method, but the books that come out of the process that are my concern. 
Who cares how he does it - if that guy does all the scanning and work, what harm is there in asking if he will share his collection with PG? If not, fine. Cheers, Maitri From marcello at perathoner.de Wed Feb 9 10:24:03 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Feb 9 13:47:59 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 In-Reply-To: <42095D94.1020509@ar.com.au> References: <42095D94.1020509@ar.com.au> Message-ID: <420A5543.7060709@perathoner.de> Pauline wrote: > In Firefox 1.0 links to recently posted extexts from the posted mailing > list such as: > http://www.gutenberg.net/1/4/9/8/14980 > > fail. I see an error: > > Files Lookup > > I see no such file here! (1/4/9/8/14980) > > The links work perfectly well in IE6. I'm using Firefox 1.0 (Linux) for development. I don't have any such problems. Are you using some web filtering proxy? Did you try http://www.gutenberg.net/dirs/1/4/9/8/14980/ with a trailing slash? (which, by the way, is the correct url for a directory) Can you install the HTTP Live Headers Plugin and send me a dump of the request your browser is generating? > There is a discussion at DP on this issue here: > http://www.pgdp.net/phpBB2/viewtopic.php?p=109131#109131 That's fine, because my DP account somehow got removed. > Some users say if they change the skin back to the default, links work > OK again. No such luck for me. Skins have nothing to do with that.
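The directory URLs quoted in this thread follow a visible pattern: every digit of the eBook number except the last becomes one directory level, the full number is the final directory, and the URL ends in a slash (14980 gives /dirs/1/4/9/8/14980/). What follows is only a minimal sketch of that pattern, assuming it holds for any multi-digit eBook number. The function name and the default host are hypothetical choices made for illustration, and single-digit numbers are rejected because no example in the thread covers them.

```python
def ebook_directory_url(ebook_number, host="www.gutenberg.org"):
    """Build a directory URL for an eBook number.

    Hypothetical helper: it only mirrors the pattern seen in the
    thread (14980 -> /dirs/1/4/9/8/14980/), it is not PG's own code.
    """
    number = int(ebook_number)
    if number < 10:
        # The thread shows no single-digit example, so do not guess.
        raise ValueError("pattern only illustrated for multi-digit numbers")
    digits = str(number)
    # Each digit except the last is one directory level,
    # followed by the complete number as the final directory.
    parts = list(digits[:-1]) + [digits]
    # The trailing slash marks the URL as a directory.
    return "http://%s/dirs/%s/" % (host, "/".join(parts))
```

Calling ebook_directory_url(14980) under these assumptions yields the same form as the corrected URL given above, trailing slash included.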
-- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Wed Feb 9 14:16:19 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Feb 9 14:16:23 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 In-Reply-To: <42095D94.1020509@ar.com.au> References: <42095D94.1020509@ar.com.au> Message-ID: <420A8BB3.6090201@perathoner.de> Pauline wrote: > In Firefox 1.0 links to recently posted extexts from the posted mailing > list such as: > http://www.gutenberg.net/1/4/9/8/14980 > > fail. I see an error: > > Files Lookup > > I see no such file here! (1/4/9/8/14980) Also, I noticed that the posting note gets sent *before* the files are posted. In that case the error message you get is quite appropriate. -- Marcello Perathoner webmaster@gutenberg.org From servalan at ar.com.au Wed Feb 9 14:34:48 2005 From: servalan at ar.com.au (Pauline) Date: Wed Feb 9 14:35:25 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 In-Reply-To: <420A5543.7060709@perathoner.de> References: <42095D94.1020509@ar.com.au> <420A5543.7060709@perathoner.de> Message-ID: <420A9008.6060608@ar.com.au> Hiya Marcello, Marcello Perathoner wrote: > I'm using Firefox 1.0 (Linux) for development. I don't have any such > problems. > > Are you using some web filtering proxy? Nope. > Did you try > > http://www.gutenberg.net/dirs/1/4/9/8/14980/ > > with a trailing slash? (which, by the way, is the correct url for a > directory) I'm using the urls which appear in the PG posted notices e.g. Friedrich v. Schiller's Biographie, by H. Doering 14997 [Language: German] [Link: http://www.gutenberg.net/1/4/9/9/14997 ] [Files: 14997-8.txt] > Can you install the HTTP Live Headers Plugin and send me a dump of > the request your browser is generating? I can, but whatever the problem was I cannot reproduce it today. 
Looks like it has been fixed & interestingly enough, when I click on a link without a trailing /, the redirected URL has one tacked on the end. e.g. http://www.gutenberg.net/1/4/9/9/14994 becomes: http://www.gutenberg.org/dirs/1/4/9/9/14994/ in the browser address bar. >> There is a discussion at DP on this issue here: >> http://www.pgdp.net/phpBB2/viewtopic.php?p=109131#109131 > > > That's fine, because my DP account somehow got removed. I see an account for you (I'm a DP site admin). If you have hassles logging in, email dphelp@pgdp.net. I'll post to that thread & see if this issue is now resolved for the other users. >> Some users say if they change the skin back to the default, links >> work OK again. No such luck for me. > > > Skins have nothing to do with that. One of those weird coincidence things then. :) Thanks, P From servalan at ar.com.au Wed Feb 9 14:37:38 2005 From: servalan at ar.com.au (Pauline) Date: Wed Feb 9 14:38:13 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 In-Reply-To: <420A8BB3.6090201@perathoner.de> References: <42095D94.1020509@ar.com.au> <420A8BB3.6090201@perathoner.de> Message-ID: <420A90B2.5040305@ar.com.au> Marcello Perathoner wrote: > Pauline wrote: > >> In Firefox 1.0 links to recently posted extexts from the posted >> mailing list such as: >> http://www.gutenberg.net/1/4/9/8/14980 >> >> fail. I see an error: >> >> Files Lookup >> >> I see no such file here! (1/4/9/8/14980) > > > Also, I noticed that the posting note gets sent *before* the files are > posted. In that case the error message you get is quite appropriate. I realise that. But the links were working fine in IE, just not in Firefox. The error message for a pre-emptive posting note is different & fails in all browsers. Whatever the problem was, it's fixed today for me. 
Thanks, P From fvandrog at scripps.edu Wed Feb 9 17:58:44 2005 From: fvandrog at scripps.edu (Frank van Drogen) Date: Wed Feb 9 17:58:48 2005 Subject: [gutvol-d] What about 15000?? In-Reply-To: <420A90B2.5040305@ar.com.au> References: <42095D94.1020509@ar.com.au> <420A8BB3.6090201@perathoner.de> <420A90B2.5040305@ar.com.au> Message-ID: <6.2.0.8.0.20050209175751.029e5da0@mail.scripps.edu> eBook 14999 was posted today, as was eBook 15001 :) What about the one in between?? Frank From servalan at ar.com.au Wed Feb 9 18:21:11 2005 From: servalan at ar.com.au (Pauline) Date: Wed Feb 9 18:21:52 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 In-Reply-To: <420A90B2.5040305@ar.com.au> References: <42095D94.1020509@ar.com.au> <420A8BB3.6090201@perathoner.de> <420A90B2.5040305@ar.com.au> Message-ID: <420AC517.4040306@ar.com.au> Pauline wrote: > Marcello Perathoner wrote: >> Also, I noticed that the posting note gets sent *before* the files >> are posted. In that case the error message you get is quite >> appropriate. > > > I realise that. But the links were working fine in IE, just not in > Firefox. The error message for a pre-emptive posting note is > different & fails in all browsers. > > Whatever the problem was, it's fixed today for me. I spoke too soon. The links work OK when followed from an email, but they fail when opened as a link from another web page, e.g. from within a DP Forum post. They also seem to work OK if you cut & paste the URL directly into a browser window, as I am doing for IE. Also, I see recursive redirect behaviour in Firefox for links like: http://www.gutenberg.org/dirs/1/4/9/0/14908/14908-h/14908-h.htm In IE I am presented with the HTML version of the text; in Firefox I get sent to: http://www.gutenberg.org/etext/14908 & clicking on the HTML link from the catalogue page in Firefox just winds up back at the catalogue page.
There is an explanation of this behaviour on the DP Forums here: http://www.pgdp.net/phpBB2/viewtopic.php?p=109273#109273 I'm happy to help debug off-list if needed; I'll go & install that Firefox extension. Contact me if you want my help, or you can track the discussion on the DP Forums. Needless to say, navigating PG as a Firefox user is v. frustrating at the moment. Thanks, P From phil at thalasson.com Thu Feb 10 17:28:28 2005 From: phil at thalasson.com (Philip Baker) Date: Thu Feb 10 17:30:27 2005 Subject: [gutvol-d] Issues with links from posted notices failing in Firefox 1.0 In-Reply-To: <420AC517.4040306@ar.com.au> Message-ID: Pauline writes > >I spoke too soon. The links work ok from a link in email, they fail when >opened as a link from another web page. e.g. from within a DP Forum post. >They also seem to work ok if you cut & paste the URL directly into a browser >window as I am doing for IE. > >Also - I see recursive redirects behaviour in Firefox for this links like: >http://www.gutenberg.org/dirs/1/4/9/0/14908/14908-h/14908-h.htm > >in IE I am presented with the HTML version of the text, in Firefox I get >sent to: >http://www.gutenberg.org/etext/14908 > >& clicking on the HTML link from the catalogue page which results in >Firefox just winds up back at the catalogue page. There is an >explanation of this behaviour on the DP Forums here: >http://www.pgdp.net/phpBB2/viewtopic.php?p=109273#109273 > >I'm happy to help debug off-list if needed, I'll go & install that >Firefox extension installed. Contact me if you want my help, or you can >track the discussion on the DP Forums. > >Needless to say navigating PG as a Firefox user is v. frustrating at the >moment. > Check what kind of HTTP_REFERER value, if any, your Firefox is configured to send to a web server. I believe Firefox uses the term 'network.http.sendRefererHeader' for this in its configuration options.
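One way to test the referer theory outside the browser is to request the same file twice, once with and once without a Referer header, and compare where each request ends up. The sketch below is only an illustration using Python's standard urllib.request. The helper names build_request and final_url are hypothetical, and whether the server still treats the two cases differently is an assumption that would have to be checked against the live site.

```python
import urllib.request


def build_request(url, referer=None):
    """Prepare a GET request, optionally carrying a Referer header.

    Hypothetical helper for illustration; nothing is sent here.
    """
    headers = {}
    if referer is not None:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def final_url(request, timeout=10):
    """Send the request and return the URL reached after any redirects."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


# Example comparison (needs network access, so it is left as a comment):
#   target = "http://www.gutenberg.org/dirs/1/4/9/0/14908/14908-h/14908-h.htm"
#   forum = "http://www.pgdp.net/phpBB2/viewtopic.php?p=109273"
#   print(final_url(build_request(target)))
#   print(final_url(build_request(target, referer=forum)))
```

If the two printed URLs differ, the redirect depends on the Referer value rather than on the browser itself.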
-- Philip Baker From sly at victoria.tc.ca Tue Feb 15 21:47:53 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Tue Feb 15 21:48:14 2005 Subject: [gutvol-d] Tamil eBooks site Message-ID: Doing a little bit of browsing of ebooks in other languages, I found a project working on Tamil texts that appears to be using the DP software to process their texts. See: http://www.tamil.net/projectmadurai/dppm.html It's nice to see more efforts out there digitizing old literature, but as usual, they are not as free about having their texts redistributed as PG is. Andrew From shalesller at writeme.com Tue Feb 15 23:41:50 2005 From: shalesller at writeme.com (D. Starner) Date: Tue Feb 15 23:42:11 2005 Subject: Canceling Clearances (Was: [gutvol-d] Top 100 EBooks this week (and other stories)) Message-ID: <20050216074150.6430F4BDAA@ws1-1.us4.outblaze.com> > The other story I want to mention has to do with 2 books that I had > cleared that I am not able to proceed with at this time. > > gbn0307140503: Unknown, The Reason Why--Natural History. bill jenness > . 1860c. 7/24/2003. ok. > > and > > gbn0307140507: warren colburn, Arithmetic upon the inductive method of > instruction.... bill jenness . 1856p1826c. > 7/24/2003. > ok. > > These books are no longer in my possesion and my scanner is toast, I am > limping along with a p166 until I can afford to pickup some new equipment. There ought to be a standard way of canceling clearances, at least for the new system. I've got clearances that turned out to be done already, and ones that should be done, but the copy I have isn't in a condition I can get usable scans out of, which sometimes turns up only when I start scanning.
--
___________________________________________________________
Sign-up for Ads Free at Mail.com
http://promo.mail.com/adsfreejump.htm

From nwolcott at dsdial.net Wed Feb 16 06:03:02 2005
From: nwolcott at dsdial.net (N Wolcott)
Date: Wed Feb 16 06:24:59 2005
Subject: Canceling Clearances (Was: [gutvol-d] Top 100 EBooks this week(and other stories))
References: <20050216074150.6430F4BDAA@ws1-1.us4.outblaze.com>
Message-ID: <007c01c51433$3397c840$a99495ce@gw98>

I often submit clearances for books I think I might get around to if only so
that PG knows they are available PD and others can make use of the
clearance. Not a dog in the manger thing.

----- Original Message -----
From: "D. Starner"
To: "Project Gutenberg Volunteer Discussion"
Sent: Wednesday, February 16, 2005 2:41 AM
Subject: Canceling Clearances (Was: [gutvol-d] Top 100 EBooks this week(and other stories))

> There ought to be a standard way of canceling clearances, at least for
> the new system. I've got clearances that turned out to be done already,
> and ones that should be done, but the copy I have isn't in a condition
> I can get usable scans out of, which sometimes turns up only when I start
> scanning.
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d@lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d

From joshua at hutchinson.net Wed Feb 16 06:29:05 2005
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed Feb 16 06:29:09 2005
Subject: Canceling Clearances (Was: [gutvol-d] Top 100 EBooks thisweek(and other stor
Message-ID: <20050216142905.35F174F4CB@ws6-5.us4.outblaze.com>

That's probably not a good idea. The only "list" we have of things people
are supposedly working on is the clearance list. I know a lot of people
will skip a book if they see someone else has cleared it recently. You
should probably only clear something if you plan on working on it... and
probably only if you plan on working on it soon. DP churns through a lot of
books and if you clear something but don't work on it for a year, you're
blocking that content from DP.

Josh

----- Original Message -----
From: "N Wolcott"
To: "Project Gutenberg Volunteer Discussion"
Subject: Re: Canceling Clearances (Was: [gutvol-d] Top 100 EBooks thisweek(and other stories))
Date: Wed, 16 Feb 2005 09:03:02 -0500

> I often submit clearances for books I think I might get around to if only so
> that PG knows they are available PD and others can make use of the
> clearance. Not a dog in the manger thing.

From shalesller at writeme.com Wed Feb 16 07:47:10 2005
From: shalesller at writeme.com (D. Starner)
Date: Wed Feb 16 07:47:31 2005
Subject: Canceling Clearances (Was: [gutvol-d] Top 100 EBooks thisweek(and other stories))
Message-ID: <20050216154710.8511B4BE64@ws1-1.us4.outblaze.com>

"N Wolcott" writes:

> I often submit clearances for books I think I might get around to if only so
> that PG knows they are available PD and others can make use of the
> clearance. Not a dog in the manger thing.
Except in the rare case of renewal notices, if someone cares if they are
PD, it's trivial to check. I can't make use of the clearance, as I have to
clear my own copy of the book, but your clearance will tell me that I
shouldn't bother, since other people are working on it.

From hart at pglaf.org Wed Feb 16 12:52:30 2005
From: hart at pglaf.org (Michael Hart)
Date: Wed Feb 16 12:52:31 2005
Subject: [gutvol-d] Tamil eBooks site
In-Reply-To:
References:
Message-ID:

On Tue, 15 Feb 2005, Andrew Sly wrote:

> Doing a little bit of browsing of ebooks in other languages,
> I found a project working on Tamil texts that appears to
> be using the DP software to process their texts.
>
> See:
> http://www.tamil.net/projectmadurai/dppm.html
>
> It's nice to see more efforts out there digitizing old
> literature, but as usual, they are not as free about
> having their texts redistributed as PG is.
>
> Andrew

We've never required that anyone using PG services, even
copyright research, give their results back to PG.

We are here to encourage the creation and distribution of eBooks.

We don't have to create and distribute all the eBooks
we have some involvement with.

"There is no end to the great things we can accomplish
if we don't worry about who gets the credit." - Anon.

Life is an open book test, without any time limits.

So let's provide more books.

The continuing standard of living of humankind
is how we measure the value of our work.

Michael

From ag737 at freenet.carleton.ca Wed Feb 16 13:47:12 2005
From: ag737 at freenet.carleton.ca (Wallace J.McLean)
Date: Wed Feb 16 13:47:23 2005
Subject: [gutvol-d] Re: Canceling Clearances
Message-ID: <571a7156e479.56e479571a71@ncf.ca>

It would be a perfectly good idea, if the clearance system and IP list
were dynamic and allowed for contact between volunteers.

I'm shocked that the system is still the way it is. I'm half surprised
it's not on the back of envelopes and napkins.

> That's probably not a good idea. The only "list" we have of things
> people are supposedly working on is the clearance list. I know a lot of
> people will skip a book if they see someone else has cleared it
> recently. You should probably only clear something if you plan on
> working on it... and probably only if you plan on working on it soon.
> DP churns through a lot of books and if you clear something but don't
> work on it for a year, you're blocking that content from DP.

From joshua at hutchinson.net Wed Feb 16 14:06:49 2005
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed Feb 16 14:06:58 2005
Subject: [gutvol-d] Re: Canceling Clearances
Message-ID: <20050216220649.E51FC1099CE@ws6-4.us4.outblaze.com>

Well, if pigs flew ... fried pork wings would be a perfectly good idea, too.

The point is, we don't have a dynamic system. We have what we have (until
someone makes something better). That means, currently, creating book
clearances you don't intend to use shortly is a "bad idea."

I'm not trying to be rude; I'm just trying to point out a behavior that
doesn't work well in the currently implemented clearance system.

Josh

----- Original Message -----
From: "Wallace J.McLean"

> It would be a perfectly good idea, if the clearance system and IP list
> was dynamic and allowed for contact between volunteers.
>
> I'm shocked that the system is still the way it is. I'm half surprised
> it's not on the back of envelopes and napkins.

From alex at awstudios.net Thu Feb 17 07:52:09 2005
From: alex at awstudios.net (Alex Wilson)
Date: Thu Feb 17 08:39:26 2005
Subject: [gutvol-d] David Wyllie Email address
Message-ID:

About a month ago Greg Newby offered to get me in touch with David
Wyllie--who provided the English translation of Kafka's Metamorphosis for
PG--and I haven't heard from him since. I'm thinking Greg's emails or mine
are ending up in a junk mail folder, so I'm wondering if anyone here knows
how I can get in touch with Mr. Wyllie.

Thanks.

Alex.

http://www.telltaleweekly.org - Funding a Free Audiobook Library

From hart at pglaf.org Thu Feb 17 10:01:53 2005
From: hart at pglaf.org (Michael Hart)
Date: Thu Feb 17 10:01:54 2005
Subject: [gutvol-d] David Wyllie Email address
In-Reply-To:
References:
Message-ID:

I sent the address,
unless someone has a better one.
Michael

On Thu, 17 Feb 2005, Alex Wilson wrote:

> About a month ago Greg Newby offered to get me in touch with David
> Wyllie--who provided the English translation of Kafka's Metamorphosis for
> PG--and I haven't heard from him since. I'm thinking Greg's emails or mine
> are ending up in a junk mail folder, so I'm wondering if anyone here knows
> how I can get in touch with Mr. Wyllie.
>
> Thanks.
>
> Alex.
>
> http://www.telltaleweekly.org - Funding a Free Audiobook Library

From miranda_vandeheijning at blueyonder.co.uk Sat Feb 19 02:23:03 2005
From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning)
Date: Sat Feb 19 02:23:33 2005
Subject: [gutvol-d] 500th French book
In-Reply-To:
References:
Message-ID: <42171387.5020807@blueyonder.co.uk>

Hi guys,

There are 485 French books in PG at the moment, so we will be reaching
500 pretty soon. Has any thought been given yet to what could be the
500th book? If no decision has been made, there are quite a few George
Sand titles coming up from DP and they may be suitable, considering that
we are working on providing her complete works.

Secondly, are there any statistics on which are the most popular French
books? I know that Le Kama Soutra is quite a crowdpleaser, but what
about the rest?

Miranda

From gbnewby at pglaf.org Sat Feb 19 21:49:56 2005
From: gbnewby at pglaf.org (Greg Newby)
Date: Sat Feb 19 21:49:58 2005
Subject: [gutvol-d] 500th French book
In-Reply-To: <42171387.5020807@blueyonder.co.uk>
References: <42171387.5020807@blueyonder.co.uk>
Message-ID: <20050220054956.GB30309@pglaf.org>

On Sat, Feb 19, 2005 at 10:23:03AM +0000, Miranda van de Heijning wrote:
>
> Hi guys,
>
> There are 485 French books in PG at the moment, so we will be reaching
> 500 pretty soon. Has any thought been given yet about what could be the
> 500th book? If no decision has been made, there are quite a few George
> Sand's coming up from DP and they may be suitable, considering that we
> are working on providing her complete works.

I don't think anyone has suggested one yet. Sand sounds
like a good choice. We also have a nice array of Jules Verne
and Victor Hugo, and I've noticed some Shakespeare translations.

> Secondly, are there any statistics on which are the most popular French
> books? I know that Le Kama Soutra is quite a crowdpleaser, but what
> about the rest?

There's a "top 100" list at http://gutenberg.org/catalog
There is also a non-public analysis of the download
statistics. Both of these are for ibiblio only, so while they're
useful they don't represent other download sources (notably,
our many mirrors).

You'd need to look through the download list "by hand" to spot the
French titles. Email me if you want the URL & username+password,
and I'll dig it up.
  -- Greg

From miranda_vandeheijning at blueyonder.co.uk Sun Feb 20 03:11:33 2005
From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning)
Date: Sun Feb 20 03:11:59 2005
Subject: [gutvol-d] 500th French book
In-Reply-To: <20050220054956.GB30309@pglaf.org>
References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org>
Message-ID: <42187065.4060107@blueyonder.co.uk>

Hi all,

I have just looked through the download info which Marcello very kindly
compiled for me and I would like to suggest we post as the 500th book
part 1 of 'Sodome et Gomorrhe'.

It is part of Proust's classic A la recherche du temps perdu and the
only remaining volume which we can actually post to PG. This is because
the other parts of the series were published after his death, between
1923 and 1927. We already have Sodome et Gomorrhe 2.

Sodome et Gomorrhe 1 is close to finishing proofing at Distributed
Proofreaders (162 pages to go in round 2) so I expect it will be
available for post-processing/posting soon.

Or are there any other suggestions?

Miranda

From bill at truthdb.org Sun Feb 20 22:20:34 2005
From: bill at truthdb.org (bill jenness)
Date: Sun Feb 20 22:20:54 2005
Subject: [gutvol-d] pgdvd access index file error
Message-ID: <1272.134.117.137.41.1108966834.squirrel@134.117.137.41>

I have just downloaded the dvd from ibiblio and there seems to be a
problem with the index.htm in the access directory. Here is an excerpt to
illustrate the problem:

James Fenimore Cooper

The file referred to on the dvd is actually gtnletc.htm and this error is
propagated throughout the index. The gtnanon.htm file is correctly linked
but the links for the others are not. This may not be a problem on a
Windows machine but it is on a case-sensitive filesystem. The download
file was 10802.iso, the file date is Nov 22, 2003. It should be fairly
trivial to fix.

Two ways to correct this are to change the hotlinks in access/index.htm or
to change the filenames in that directory to agree. Either way it would
mean opening the iso for editing to make the repair, then correcting the
md5sum to match.

Is this something that has already been looked at?
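
The first of the two fixes described above (changing the links in
access/index.htm rather than renaming the files) might be sketched roughly
as follows. This is only an illustration, not something taken from the DVD:
it assumes the links are written as double-quoted href="..." attributes and
that every file in the access directory has an all-lower-case name (as
gtnletc.htm does), and the function name lowercase_hrefs is made up for the
example. Links containing "://" are left alone so that external URLs are
not altered.

```shell
# Hypothetical helper: lower-case the double-quoted href attributes of the
# HTML read on stdin and leave everything else (including link text) as is.
# Assumes the target files all have lower-case names.
lowercase_hrefs() {
  awk '{
    out = ""
    rest = $0
    # Walk through each href="..." attribute on the line.
    while (match(rest, /[Hh][Rr][Ee][Ff]="[^"]*"/)) {
      attr = substr(rest, RSTART, RLENGTH)
      # Skip absolute URLs such as http://..., whose paths may be case sensitive.
      if (index(attr, "://") == 0)
        attr = tolower(attr)
      out = out substr(rest, 1, RSTART - 1) attr
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}

# Possible use on a copy of the DVD contents (not on the read-only disc):
#   cp index.htm /tmp/oldindex.htm
#   lowercase_hrefs < /tmp/oldindex.htm > index.htm
```

Unlike lower-casing the whole file, this leaves the visible author names
and titles in the index untouched.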
From jon_niehof at yahoo.com Sun Feb 20 23:12:18 2005
From: jon_niehof at yahoo.com (Jon Niehof)
Date: Sun Feb 20 23:12:36 2005
Subject: [gutvol-d] pgdvd access index file error
In-Reply-To: <1272.134.117.137.41.1108966834.squirrel@134.117.137.41>
Message-ID: <20050221071218.218.qmail@web80905.mail.scd.yahoo.com>

> I have just downloaded the dvd from ibiblio and there seems to
> be a problem with the index.htm in the access directory. Here
> is an excerpt to illustrate the problem:
>
> James Fenimore Cooper
>
> The file referred to on the dvd is actually gtnletc.htm and
> this error is propagated throughout the index. The gtnanon.htm
> file is correctly linked but the links for the others are not.
> This may not be a problem on a Windows machine but it is on a
> case sensitive filesystem.

You don't say on what sort of system you had mounted the DVD. Is
it possible the DVD has Joliet or Rock Ridge extensions but they
are not being read by your system? If there are no such
extensions (and indeed they would seem to be against PG
philosophy of least common denominator), I would expect the OS
should treat the ISO9660 filesystem as case-insensitive; it's
often translated anyhow (e.g. filenames usually show up as
lowercase on my Linux box).

Of course, I haven't validated this behaviour as either required
or actually implemented ;) but if the filenames are to be
corrected due to case sensitivity, making them all caps would, I
believe, be more accurate.

From gbnewby at pglaf.org Mon Feb 21 00:03:50 2005
From: gbnewby at pglaf.org (Greg Newby)
Date: Mon Feb 21 00:03:52 2005
Subject: [gutvol-d] pgdvd access index file error
In-Reply-To: <1272.134.117.137.41.1108966834.squirrel@134.117.137.41>
References: <1272.134.117.137.41.1108966834.squirrel@134.117.137.41>
Message-ID: <20050221080350.GA5557@pglaf.org>

On Mon, Feb 21, 2005 at 01:20:34AM -0500, bill jenness wrote:
> I have just downloaded the dvd from ibiblio and there seems to be a
> problem with the index.htm in the access directory. Here is an excerpt to
> illustrate the problem:
>
> James Fenimore Cooper
>
> The file referred to on the dvd is actually gtnletc.htm and this error is
> propagated throughout the index. The gtnanon.htm file is correctly linked
> but the links for the others are not. This may not be a problem on a
> Windows machine but it is on a case sensitive filesystem. The download
> file was 10802.iso, the file date is Nov 22, 2003. It should be fairly
> trivial to fix.
>
> Two ways to correct this are change the hotlinks in access/index.htm or
> change the filenames in that directory to agree. Either way it would mean
> opening the iso for editing to make the repair then correcting the md5sum
> to match.
>
> Is this something that has already been looked at?

This is a known problem. It's on the list, with a few other things,
to fix in the next iteration of an ISO image. I keep thinking I'm
going to roll out a brand new ISO, but it hasn't happened yet.

Meanwhile, for most people, just editing the index.htm to all lower-case
is a good "quick hack" solution - assuming you've copied all the files
to your hard drive.
Something like this:

  cp index.htm /tmp/oldindex.htm
  cat /tmp/oldindex.htm | tr '[A-Z]' '[a-z]' > index.htm

  -- Greg

From hart at pglaf.org Mon Feb 21 11:06:02 2005
From: hart at pglaf.org (Michael Hart)
Date: Mon Feb 21 11:06:05 2005
Subject: [gutvol-d] 500th French book
In-Reply-To: <42187065.4060107@blueyonder.co.uk>
References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk>
Message-ID:

Don't forget, all of Proust can be posted at Project Gutenberg sites
with "life +50" and +70 copyrights, since he died so long ago.

Michael

On Sun, 20 Feb 2005, Miranda van de Heijning wrote:

> Hi all,
>
> I have just looked through the download info which Marcello very kindly
> compiled for me and I would like to suggest we post as the 500th book part 1
> of 'Sodome et Gomorrhe'.
>
> It is part of Proust's classic A la recherche du temps perdu and the only
> remaining volume which we can actually post to PG. This is because the other
> parts of the series were published after his death, between 1923 and 1927. We
> already have Sodome et Gomorrhe 2.
>
> Sodome et Gomorrhe 1 is close to finishing proofing at Distributed
> Proofreaders (162 pages to go in round 2) so I expect it will be available
> for post-processing/posting soon.
>
> Or are there any other suggestions?
>
> Miranda

From sly at victoria.tc.ca Mon Feb 21 21:53:23 2005
From: sly at victoria.tc.ca (Andrew Sly)
Date: Mon Feb 21 21:53:41 2005
Subject: [gutvol-d] Understatement of the day
Message-ID:

Here's a little mention of PG from Usenet...

Newsgroups: alt.games.video.nintendo.gameboy.advance
Date: 2005-02-19 12:22:54 PST

Files for books? Text files are available for books all over. Project
Gutenberg has hundreds of public domain text files.

From miranda_vandeheijning at blueyonder.co.uk Tue Feb 22 01:50:56 2005
From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning)
Date: Tue Feb 22 01:51:23 2005
Subject: [gutvol-d] 500th French book
In-Reply-To:
References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk>
Message-ID: <421B0080.8060402@blueyonder.co.uk>

My intention is to continue A la recherche du temps perdu on DP-EU and,
hopefully, one of the other PG sites will be able to publish the remaining
volumes. After that, we just need to wait for US copyright to move along a
few years and then PG-US will have the full lot as well. :-)

Miranda

Michael Hart wrote:

> Don't forget, all of Proust can be posted at Project Gutenberg sites
> with "life +50" and +70 copyrights, since he died so long ago.
>
> Michael

From marcello at perathoner.de Tue Feb 22 13:36:01 2005
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue Feb 22 16:13:07 2005
Subject: [gutvol-d] Gutenbergprosjektet (PG mentioned in Norwegian web site)
Message-ID: <421BA5C1.2070605@perathoner.de>

http://www.dinside.no/php/art.php?id=117187

--
Marcello Perathoner
webmaster@gutenberg.org

From nwolcott at dsdial.net Wed Feb 23 07:49:16 2005
From: nwolcott at dsdial.net (N Wolcott)
Date: Wed Feb 23 07:50:17 2005
Subject: [gutvol-d] Pepys' birthday
Message-ID: <005901c519bf$420ca4e0$ac9495ce@gw98>

Being his birthday maybe this is appropriate. Pepys gave his memoirs to
Cambridge University, but the full text was not published until 1970.
That being the case, would not the text (minus editorial comment and added
footnotes) be now public domain, as it is now more than 75 years since the
author's death?

N Wolcott
nwolcott2@post.harvard.edu

From hyphen at hyphenologist.co.uk Wed Feb 23 09:07:23 2005
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Wed Feb 23 09:07:47 2005
Subject: [gutvol-d] Pepys' birthday
In-Reply-To: <005901c519bf$420ca4e0$ac9495ce@gw98>
References: <005901c519bf$420ca4e0$ac9495ce@gw98>
Message-ID:

On Wed, 23 Feb 2005 10:49:16 -0500, "N Wolcott" wrote:

| Being his birthday maybe this is appropriate. Pepys gave his
| memoirs to Cambridge University,

Probably not the University; more likely a college.
Do you have any idea which college?

| but the full text was not
| published until 1970. That being the case would not the text
| (minus editorial comment and added footnotes) be now public
| domain as it is now more than 75 years since the author's death?

Probably, but how to obtain a scan to work from?

See also www.pepysdiary.com which appears to be everything on line.
-- Dave F From gbnewby at pglaf.org Wed Feb 23 09:16:46 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Wed Feb 23 09:16:47 2005 Subject: [gutvol-d] Pepys' birthday In-Reply-To: <005901c519bf$420ca4e0$ac9495ce@gw98> References: <005901c519bf$420ca4e0$ac9495ce@gw98> Message-ID: <20050223171646.GA15596@pglaf.org> On Wed, Feb 23, 2005 at 10:49:16AM -0500, N Wolcott wrote: > Being his birthday maybe this is appropriate. Pepys gave his memoirs to Cambridge University, but the full text was not published until 1970. That being the case would not the text (minus editorial comment and added footnotes) be now public domain as it is now more than 75 years since the author's death? > Are we missing something? Quotes and Images From The Diary of Samuel Pepys, by Samuel Pepys 7554 Jul 2003 Quotations From Diary of Samuel Pepys, by Widger [dwqspxxx.xxx] 4202 Diary of Samuel Pepys, Complete, by Samuel Pepys 4200 Diary of Samuel Pepys, 1669 N.S. Complete, by Samuel Pepys 4199 Diary of Samuel Pepys, Apr/May 1669, by Samuel Pepys 4198 Diary of Samuel Pepys, Feb/Mar 1668/69, by Samuel Pepys 4197 Diary of Samuel Pepys, January 1668/69, by Samuel Pepys 4196 Diary of Samuel Pepys, 1668 N.S. Complete, by Samuel Pepys 4195 Diary of Samuel Pepys, December 1668, by Samuel Pepys 4194 Diary of Samuel Pepys, November 1668, by Samuel Pepys 4193 Diary of Samuel Pepys, September/October 1668, by Samuel Pepys 4192 Diary of Samuel Pepys, August 1668, by Samuel Pepys 4191 Diary of Samuel Pepys, June/July 1668, by Samuel Pepys 4190 Diary of Samuel Pepys, May 1668, by Samuel Pepys 4189 Diary of Samuel Pepys, April 1668, by Samuel Pepys 4188 Diary of Samuel Pepys, March 1667/68, by Samuel Pepys 4187 Diary of Samuel Pepys, February 1667/68, by Samuel Pepys 4186 Diary of Samuel Pepys, January 1667/68, by Samuel Pepys 4185 Diary of Samuel Pepys, 1667 N.S. 
Complete, by Samuel Pepys 4184 Diary of Samuel Pepys, December 1667, by Samuel Pepys 4183 Diary of Samuel Pepys, November 1667, by Samuel Pepys 4182 Diary of Samuel Pepys, October 1667, by Samuel Pepys 4181 Diary of Samuel Pepys, September 1667, by Samuel Pepys 4180 Diary of Samuel Pepys, August 1667, by Samuel Pepys 4179 Diary of Samuel Pepys, July 1667, by Samuel Pepys 4178 Diary of Samuel Pepys, June 1667, by Samuel Pepys 4177 Diary of Samuel Pepys, May 1667, by Samuel Pepys 4176 Diary of Samuel Pepys, April 1666/67, by Samuel Pepys 4175 Diary of Samuel Pepys, March 1666/67, by Samuel Pepys 4174 Diary of Samuel Pepys, February 1666/67, by Samuel Pepys 4173 Diary of Samuel Pepys, January 1666/67, by Samuel Pepys 4172 Diary of Samuel Pepys, 1666 N.S. Complete, by Samuel Pepys 4171 Diary of Samuel Pepys, December 1666, by Samuel Pepys 4170 Diary of Samuel Pepys, November 1666, by Samuel Pepys 4169 Diary of Samuel Pepys, October 1666, by Samuel Pepys 4168 Diary of Samuel Pepys, August/September 1666, by Samuel Pepys 4167 Diary of Samuel Pepys, July 1666, by Samuel Pepys 4166 Diary of Samuel Pepys, May/June 1666, by Samuel Pepys 4165 Diary of Samuel Pepys, March/April 1665/66, by Samuel Pepys 4164 Diary of Samuel Pepys, January/February 1665/66, by Samuel Pepys 4163 Diary of Samuel Pepys, 1665 N.S. Complete, by Samuel Pepys 4162 Diary of Samuel Pepys, November/December 1665, by Samuel Pepys 4161 Diary of Samuel Pepys, October 1665, by Samuel Pepys 4160 Diary of Samuel Pepys, September 1665, by Samuel Pepys 4159 Diary of Samuel Pepys, August 1665, by Samuel Pepys 4158 Diary of Samuel Pepys, July 1665, by Samuel Pepys 4157 Diary of Samuel Pepys, May/June 1665, by Samuel Pepys 4156 Diary of Samuel Pepys, March/April 1664/65, by Samuel Pepys 4155 Diary of Samuel Pepys, January/February 1664/65, by Samuel Pepys 4154 Diary of Samuel Pepys, 1664 N.S.
Complete, by Samuel Pepys 4153 Diary of Samuel Pepys, December 1664, by Samuel Pepys 4152 Diary of Samuel Pepys, October/November 1664, by Samuel Pepys 4151 Diary of Samuel Pepys, August/September 1664, by Samuel Pepys 4150 Diary of Samuel Pepys, June/July 1664, by Samuel Pepys 4149 Diary of Samuel Pepys, April/May 1664, by Samuel Pepys 4148 Diary of Samuel Pepys, March 1663/64, by Samuel Pepys 4147 Diary of Samuel Pepys, January/February 1663/64, by Samuel Pepys 4146 Diary of Samuel Pepys, 1663 N.S. Complete, by Samuel Pepys 4145 Diary of Samuel Pepys, November/December 1663, by Samuel Pepys 4144 Diary of Samuel Pepys, September/October 1663, by Samuel Pepys 4143 Diary of Samuel Pepys, July/August 1663, by Samuel Pepys 4142 Diary of Samuel Pepys, May/June 1663, by Samuel Pepys 4141 Diary of Samuel Pepys, March/April 1662/63, by Samuel Pepys 4140 Diary of Samuel Pepys, January/February 1662/63, by Samuel Pepys 4139 Diary of Samuel Pepys, 1662 N.S. Complete, by Samuel Pepys 4138 Diary of Samuel Pepys, November/December 1662, by Samuel Pepys 4137 Diary of Samuel Pepys, September/October 1662, by Samuel Pepys 4136 Diary of Samuel Pepys, July/August 1662, by Samuel Pepys 4135 Diary of Samuel Pepys, May/June 1662, by Samuel Pepys 4134 Diary of Samuel Pepys, March/April 1661/62, by Samuel Pepys 4133 Diary of Samuel Pepys, January/February 1661/62, by Samuel Pepys 4132 Diary of Samuel Pepys, 1661 N.S. Complete, by Samuel Pepys 4131 Diary of Samuel Pepys, November/December 1661, by Samuel Pepys 4130 Diary of Samuel Pepys, September/October 1661, by Samuel Pepys 4129 Diary of Samuel Pepys, June/July/August 1661, by Samuel Pepys 4128 Diary of Samuel Pepys, April/May 1661, by Samuel Pepys 4127 Diary of Samuel Pepys, January/February/March 1660/61, by Samuel Pepys 4126 Diary of Samuel Pepys, 1660 N.S. 
Complete, by Samuel Pepys 4125 Diary of Samuel Pepys, October/November/December 1660, by Samuel Pepys 4124 Diary of Samuel Pepys, August/September 1660, by Samuel Pepys 4123 Diary of Samuel Pepys, June/July 1660, by Samuel Pepys 4122 Diary of Samuel Pepys, May 1660, by Samuel Pepys 4121 Diary of Samuel Pepys, March/April 1659/1660, by Samuel Pepys 4120 Diary of Samuel Pepys, February 1659/1660, by Samuel Pepys 4119 Diary of Samuel Pepys, January 1659/1660, by Samuel Pepys 4118 Diary of Samuel Pepys, Unabridged, Preface and Life, by Samuel Pepys 4117 Jul 2002 The Diary of Samuel Pepys, Lord Braybrooke/Editor [pepysxxx.xxx] 3331 From shimmin at uiuc.edu Wed Feb 23 09:33:18 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Wed Feb 23 09:33:24 2005 Subject: [gutvol-d] Pepys' birthday In-Reply-To: <005901c519bf$420ca4e0$ac9495ce@gw98> References: <005901c519bf$420ca4e0$ac9495ce@gw98> Message-ID: <421CBE5E.7060801@uiuc.edu> N Wolcott wrote: > Being his birthday maybe this is appropriate. Pepys gave his memoirs to > Cambridge University, but the full text was not published until 1970. > That being the case would not the text (minus editorial comment and > added footnotes) be now public domain as it is now more than 75 years > since the author's death? In the US, a work first published in 1970 has a 95-year term, and won't hit the public domain until 2066. In the UK, posthumous works are no different than other works today, but that has only been the case since 1988. Before 1988, posthumous works got a 50-year copyright (2021). This may have been extended to 70 years since then (2041). Canada was also offering a 50-year copyright to first publications of posthumous works at the time, and I know they haven't extended their term. 
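The term arithmetic in the message above can be checked with a few lines of Python. This is only a sketch of the sums: the term lengths (95, 50 and 70 years) are the ones quoted in the message, and the convention that a term runs to the end of the calendar year, so the work becomes free on the following 1 January, is an assumption chosen because it reproduces the years given. The function name is made up for illustration.

```python
# Illustrative arithmetic only. The term lengths come from the message
# above; nothing here is a statement of current copyright law.

def first_public_domain_year(publication_year: int, term_years: int) -> int:
    """First calendar year in which the work is free.

    Assumption: the term runs through the end of the Nth calendar year
    after publication, so the work is free from the following 1 January.
    """
    return publication_year + term_years + 1


PUBLICATION_YEAR = 1970  # first full publication of the diary, per the thread

# Labels are descriptive only; the numbers are the terms named in the message.
terms = {
    "US, 95-year term": 95,
    "UK, 50-year posthumous term (pre-1988 rule)": 50,
    "UK, if that term was extended to 70 years": 70,
}

for label, term in terms.items():
    print(f"{label}: {first_public_domain_year(PUBLICATION_YEAR, term)}")
```

Under that convention the three results agree with the years 2066, 2021 and 2041 given in the message.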
-- RS From Gutenberg9443 at aol.com Wed Feb 23 12:42:58 2005 From: Gutenberg9443 at aol.com (Gutenberg9443@aol.com) Date: Wed Feb 23 12:43:11 2005 Subject: [gutvol-d] Pepys' birthday Message-ID: <1e.3fc5749d.2f4e44d2@aol.com> In a message dated 2/23/2005 8:50:44 AM Mountain Standard Time, nwolcott@dsdial.net writes: Pepys gave his memoirs to Cambridge University, but the full text was not published until 1970. That being the case would not the text (minus editorial comment and added footnotes) be now public domain as it is now more than 75 years since the author's death? Oh horrors! I forgot Pepys's birthday! Oh well, there's still time to bake a cake. (We celebrate the birthdays of Shakespeare, Robert Burns, and Rudyard Kipling already.) As to the diaries, they're already posted. I gave them on CD to a very dear neighbor for Christmas two or three years ago. She never got around to reading them, mainly because despite an IQ somewhat stratospheric, she never did figure out her computer. But she was grateful that I had thought of giving them to her. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050223/b745fa39/attachment.html From Gutenberg9443 at aol.com Wed Feb 23 12:44:51 2005 From: Gutenberg9443 at aol.com (Gutenberg9443@aol.com) Date: Wed Feb 23 12:45:11 2005 Subject: [gutvol-d] Pepys' birthday Message-ID: <1e2.36266c1c.2f4e4543@aol.com> In a message dated 2/23/2005 10:33:50 AM Mountain Standard Time, shimmin@uiuc.edu writes: In the US, a work first published in 1970 has a 95-year term, and won't hit the public domain until 2066. In the UK, posthumous works are no different than other works today, but that has only been the case since 1988. Before 1988, posthumous works got a 50-year copyright (2021). This may have been extended to 70 years since then (2041). So who is going to complain? 
There is a new edition as of about 24 years ago, which includes all Pepys's XXX comments that are omitted from the earlier edition. Anne -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050223/09bc8b3f/attachment.html From krooger at debian.org Wed Feb 23 13:35:39 2005 From: krooger at debian.org (Jonathan Walther) Date: Wed Feb 23 13:35:55 2005 Subject: [gutvol-d] Pepys' birthday In-Reply-To: <1e2.36266c1c.2f4e4543@aol.com> References: <1e2.36266c1c.2f4e4543@aol.com> Message-ID: <20050223213539.GB14264@reactor-core.org> On Wed, Feb 23, 2005 at 03:44:51PM -0500, Gutenberg9443@aol.com wrote: > So who is going to complain? There is a new edition as of about 24 > years ago, which includes all Pepys's XXX comments that are omitted > from the earlier edition. I seem to recall that Pepys diaries were written in a special shorthand. The current editions may claim copyright on their "transcriptions" of the shorthand. Anyone game for scanning in the original shorthand, and transcribing it? Jonathan -- It's not true unless it makes you laugh, but you don't understand it until it makes you weep. Eukleia: Jonathan Walther Address: 12706 99 Ave, Surrey, BC V3V2P8 (Canada) Contact: 604-684-1319 (daytime) Contact: 604-582-9308 (morning and evening) Puritan: Purity of faith, Purity of doctrine. Sola Scriptura! Patriarchy, Polygamy, Slavery === Fatherhood, Husbandry, Mastery Matriarchy, Monogamy, Prisons === Wickedness, Stupidity, Buggery From marcello at perathoner.de Wed Feb 23 11:31:50 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Feb 23 14:44:16 2005 Subject: [gutvol-d] Filesystem changes to the web site Message-ID: <421CDA26.7060507@perathoner.de> At my request ibiblio is moving our site to the new file server. Our new Apache documentroot will be: /public/vhost/g/gutenberg/html/ our old documentroot was: /public/html/gutenberg/ The steps are: 1. 
(done) The new directory has been created and the old directory has been copied to the new directory. The new directory is accessible thru the development web server at: http://www-dev.gutenberg.org 2. (in progress) I will update the files in the new directory and test them. 3. ibiblio will switch the production server to the new directory. 4. ibiblio will delete the old directory. How does this affect you ? If you are editing files in /public/html/gutenberg/ you should copy your edits to the corresponding files in /public/vhost/g/gutenberg/html/. At least you should keep a list of which files you did edit so you can copy them over before we switch the production servers. -- Marcello Perathoner webmaster@gutenberg.org From gbuchana at rogers.com Wed Feb 23 17:42:39 2005 From: gbuchana at rogers.com (Gardner Buchanan) Date: Wed Feb 23 17:43:01 2005 Subject: [gutvol-d] Ibn Batuta (Was Re: Fwd: Project Googleberg) In-Reply-To: <41CB953A.50402@dsl.pipex.com> Message-ID: Hi all, So after all the enthusiastic chatter about this in December, I'm a little surprised two months later to find myself the first mover, but here I am. I splashed out $12 for a facsimile edition of the 1829 Lee translation from Amazon, and yesterday I got hold of it. I've submitted it for clearance and will do the scans as time permits. I intend to push the scans into a DP project versus trying to handle it myself. There will be difficulty however: this is a scholarly translation and is full of footnotes, pronunciation notation and is stuffed with arabic passages. Have a look at a sample here: http://unixcomputer.net/new-photo/cd/p12.gif Anyone got ideas or suggestions for handling this sort of material? See you, On 04:04:10 Holden McGroin wrote: > Gutenberg9443@aol.com wrote: >> By the way, does ANYBODY know where we can get a public domain copy of >> Ibn Batuta? I've had no luck finding one online. I even asked the king >> of Saudi Arabia for a copy, but His Majesty didn't answer. 
The few >> snippets I've seen are fascinating. He left his home to go on a haj, and >> then kept going, spending 29 years travelling and writing fascinating >> notes of where he went, namely everywhere you could get to without going >> to Arctica, Antarctica, or the Americas. > > I have to agree with Anne. Every time I hear about Ibn Batuta's amazing > travels, I feel the urge to read his writings. Is there any chance we > could get them online as part of Gutenberg's collection? ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From la_joconde_orange at yahoo.com Wed Feb 23 19:49:03 2005 From: la_joconde_orange at yahoo.com (Melissa) Date: Wed Feb 23 19:49:19 2005 Subject: [gutvol-d] Ibn Batuta (Was Re: Fwd: Project Googleberg) In-Reply-To: Message-ID: <20050224034904.99107.qmail@web20224.mail.yahoo.com> Footnotes should not be a problem for DP, we handle them all the time. Arabic is another question. An html edition could certainly be made, with the help of someone who knows arabic to do those transcriptions, and a transcriber's note added that to view the arabic text, an arabic font must be installed. An ascii text would just be plaintext of course and therefore incomplete, but there could be a transcriber's note in that edition too, pointing the reader to the html edition for the complete text. Many at DP are not scared by the thought of a scholarly work or making faithful renditions of them. Some even relish the challenge. With the collaboration of a speaker of Arabic, whether someone at DP or elsewhere, such a project could be reliably done at DP. On DP, la_joconde On PG, Melissa Er-Raqabi (you may search my name at pgdp.net. Some of my recent uploads for Black History Month are non-fiction works with transcriber's notes. Higher project numbers are obviously more recent.) 
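The plain-text fallback described above (an ASCII edition that omits the Arabic and points the reader to the HTML edition) could be produced mechanically from a Unicode master. The following is a minimal sketch, assuming the Arabic passages fall in the basic Arabic Unicode block (U+0600 to U+06FF); the function name, the "[Arabic]" marker and the wording of the note are all hypothetical, not DP or PG conventions.

```python
import re

# Assumption: only the basic Arabic block (U+0600-U+06FF) is covered.
# A run is one or more Arabic words separated by whitespace.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*")

# Hypothetical wording for the transcriber's note.
NOTE = ("[Transcriber's note: Arabic passages are omitted from this "
        "plain-text edition; see the HTML edition for the complete text.]")


def ascii_edition(text: str, marker: str = "[Arabic]") -> str:
    """Replace each run of Arabic script with a marker."""
    return ARABIC_RUN.sub(marker, text)


def with_note(text: str) -> str:
    """Append the transcriber's note only if something was replaced."""
    plain = ascii_edition(text)
    return plain if plain == text else plain + "\n\n" + NOTE
```

A text with no Arabic in it passes through `with_note` unchanged, so the same step could be run over every page of a project.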
--Melissa Gardner Buchanan wrote: Hi all, So after all the enthusiastic chatter about this in December, I'm a little surprised two months later to find myself the first mover, but here I am. I splashed out $12 for a facsimile edition of the 1829 Lee translation from Amazon, and yesterday I got hold of it. I've submitted it for clearance and will do the scans as time permits. I intend to push the scans into a DP project versus trying to handle it myself. There will be difficulty however: this is a scholarly translation and is full of footnotes, pronunciation notation and is stuffed with arabic passages. Have a look at a sample here: http://unixcomputer.net/new-photo/cd/p12.gif Anyone got ideas or suggestions for handling this sort of material? See you, On 04:04:10 Holden McGroin wrote: > Gutenberg9443@aol.com wrote: >> By the way, does ANYBODY know where we can get a public domain copy of >> Ibn Batuta? I've had no luck finding one online. I even asked the king >> of Saudi Arabia for a copy, but His Majesty didn't answer. The few >> snippets I've seen are fascinating. He left his home to go on a haj, and >> then kept going, spending 29 years travelling and writing fascinating >> notes of where he went, namely everywhere you could get to without going >> to Arctica, Antarctica, or the Americas. > > I have to agree with Anne. Every time I hear about Ibn Batuta's amazing > travels, I feel the urge to read his writings. Is there any chance we > could get them online as part of Gutenberg's collection? ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050223/c5a803f2/attachment.html From lofstrom at lava.net Wed Feb 23 20:49:37 2005 From: lofstrom at lava.net (Karen Lofstrom) Date: Wed Feb 23 20:49:53 2005 Subject: [gutvol-d] Ibn Batuta (Was Re: Fwd: Project Googleberg) In-Reply-To: References: Message-ID: On Wed, 23 Feb 2005, Gardner Buchanan wrote: > I splashed out $12 for a facsimile edition of the > 1829 Lee translation from Amazon, and yesterday I got hold of it. > I've submitted it for clearance and will do the scans as time > permits. I intend to push the scans into a DP project versus trying > to handle it myself. There will be difficulty however: this is a > scholarly translation and is full of footnotes, pronunciation > notation and is stuffed with arabic passages. Do put it through DP-EU (the European Distributed Proofreaders). They are using Unicode and can handle Arabic text. In fact, they were or are doing some Urdu texts in Arabic script. -- Karen Lofstrom Zora on DP From Gutenberg9443 at aol.com Fri Feb 25 07:04:23 2005 From: Gutenberg9443 at aol.com (Gutenberg9443@aol.com) Date: Fri Feb 25 07:05:16 2005 Subject: [gutvol-d] Pepys' birthday Message-ID: <7b.3fb076d7.2f509877@aol.com> In a message dated 2/23/2005 2:36:14 PM Mountain Standard Time, krooger@debian.org writes: I seem to recall that Pepys diaries were written in a special shorthand. The current editions may claim copyright on their "transcriptions" of the shorthand. Anyone game for scanning in the original shorthand, and transcribing it? I don't know how you could get a copy of the diaries OR the shorthand. I'm sure the recent edition would claim copyright, but I still don't think the older edition is likely to be a problem. It's been up for a while and so far as I know, nobody's complained. Considering how long it took to transliterate them the two times they were transliterated, I expect most of us have more to do with the next fifteen years of our lives. 
I think we should stay with what we have. If anybody needs something more thorough, that person will probably need to go to the closest major university library. Anne From ron at zytrax.com Fri Feb 25 08:38:18 2005 From: ron at zytrax.com (Ron Aitchison) Date: Fri Feb 25 08:39:55 2005 Subject: [gutvol-d] Enlightened self-interst Message-ID: <421F547A.1080007@zytrax.com> Having discovered Jane Austen regrettably late in life I have down-loaded a couple of novels and since I find the raw text format unpleasant to read I have reformatted for my own use. It seems to me since I have the ability to produce PDFs and OpenOffice formats and even - heaven forfend - MS doc format should they be wanted, it would be churlish not to make such an offer. If you can point me at a standard for PDF, page width, font size etc, etc., and let me know what formats you do want I would be happy to undertake the small additional work for the two novels I have currently downloaded. I cannot supply DocBook at this time but hope to have that available shortly. Regards -- Ron Aitchison From sly at victoria.tc.ca Fri Feb 25 09:36:54 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri Feb 25 09:37:44 2005 Subject: [gutvol-d] Enlightened self-interst In-Reply-To: <421F547A.1080007@zytrax.com> References: <421F547A.1080007@zytrax.com> Message-ID: One possible problem is that PDF files are not easily editable. All of our older texts are being gradually worked through, corrected, supplied with a new PG header (which puts all the legal "small print" at the end of the file instead of the beginning) and REPosted into the current directory structure. When this process is done it will make some of the back-end organization much easier to deal with.
However, if during this process, we come across a non-editable file (PDF, Lit, whatever), we cannot update it, and it's generally moved into an "old" directory, where it is still available if someone goes looking for it, but otherwise is not shown in the catalog. Andrew On Fri, 25 Feb 2005, Ron Aitchison wrote: > Having discovered Jane Austen regrettably late in life I have > down-loaded a couple of novels and since I find the raw text format > unpleasant to read I have reformatted for my own use. > It seems to me since I have the ability to produce PDFs and OpenOffice > formats and even - heaven forfend - MS doc format should they be wanted, > it would be churlish not to make such an offer. > If you can point me at a standard for PDF, page width, font size etc, > etc., and let me know what formats you do want I would be happy to > undertake the small additional work for the two novels I have currently > downloaded. > I cannot supply DocBook at this time but hope to have that available > shortly. > Regards > > From ron at zytrax.com Fri Feb 25 14:24:49 2005 From: ron at zytrax.com (Ron Aitchison) Date: Fri Feb 25 14:26:33 2005 Subject: [gutvol-d] Enlightened Self Interest Message-ID: <421FA5B1.2080806@zytrax.com> I understand the issue of editing. My proposal would be to supply an editable file in OpenOffice or MS doc format (BTW if you are not using the Open Source OpenOffice suite I recommend you check it out - the features are great, at least as rich as MS Word, plus one-button PDF creation, output as doc, text or native XML format and a great price = $0! http://www.openoffice.org ). I propose to take nothing away; you will have edit control over the file. This also opens up another question about which base document formats you have standardized on for editability and portability, e.g. OASIS etc. Maybe that is a topic for another list. Finally I note you have PDF formats available for some other books.
Andrew Sly wrote: >One possible problem is that PDF files are not easily editable. > >All of our older texts are being gradually worked through, >corrected, supplied with a new PG header (which puts all the >legal "small print" at the end of the file instead of the >beginning) and REPosted into the current directory structure. >When this process is done it will make some of the back-end >organization much easier to deal with. > >However, if during this process, we come across a non-editable file >(PDF, Lit, whatever), we cannot update it, and it's generally moved >into an "old" directory, where it is still available if someone >goes looking for it, but otherwise is not shown in the catalog. > >Andrew > >> Having discovered Jane Austen regrettably late in life I have >> down-loaded a couple of novels and since I find the raw text format >> unpleasant to read I have reformatted for my own use. >> It seems to me since I have the ability to produce PDFs and OpenOffice >> formats and even - heaven forfend - MS doc format should they be wanted, >> it would be churlish not to make such an offer. >> If you can point me at a standard for PDF, page width, font size etc, >> etc., and let me know what formats you do want I would be happy to >> undertake the small additional work for the two novels I have currently >> downloaded. >> I cannot supply DocBook at this time but hope to have that available >> shortly. >> Regards >> >> > > -- Ron Aitchison From cannona at fireantproductions.com Fri Feb 25 15:16:13 2005 From: cannona at fireantproductions.com (Aaron Cannon) Date: Fri Feb 25 15:20:06 2005 Subject: [gutvol-d] Filesystem changes to the web site In-Reply-To: <421CDA26.7060507@perathoner.de> References: <421CDA26.7060507@perathoner.de> Message-ID: <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> Any chance we could get a specific timeline on when these changes would be taking place? I just want to be sure we don't miss any CD/DVD requests. Thanks.
Sincerely Aaron Cannon At 01:31 PM 2/23/2005, you wrote: >At my request ibiblio is moving our site to the new file server. > > >Our new Apache documentroot will be: > > /public/vhost/g/gutenberg/html/ > >our old documentroot was: > > /public/html/gutenberg/ > > >The steps are: > >1. (done) > >The new directory has been created and the old directory has been copied >to the new directory. > >The new directory is accessible thru the development web server at: > > http://www-dev.gutenberg.org > >2. (in progress) > >I will update the files in the new directory and test them. > >3. > >ibiblio will switch the production server to the new directory. > >4. > >ibiblio will delete the old directory. > > >How does this affect you ? > >If you are editing files in /public/html/gutenberg/ you should copy your >edits to the corresponding files in /public/vhost/g/gutenberg/html/. At >least you should keep a list of which files you did edit so you can copy >them over before we switch the production servers. > > > >-- >Marcello Perathoner >webmaster@gutenberg.org > > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) From bruce at zuhause.org Fri Feb 25 16:11:02 2005 From: bruce at zuhause.org (Bruce Albrecht) Date: Fri Feb 25 16:11:56 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <421FA5B1.2080806@zytrax.com> References: <421FA5B1.2080806@zytrax.com> Message-ID: <16927.48790.533173.228950@celery.zuhause.org> I think the long term view, at least from the Distributed Proofreader's supply chain, is to provide a TEI-Lite document for each text, and from it programmatically create HTML, plain text, PDF, etc on the fly. I'm not sure when this will happen, but I expect that some of the precursor activities at DP will take place this year. 
I don't know if DP will try to replace all previous versions of texts with TEI-Lite documents, but my guess is that once a system is in place, there will be volunteers who will go back and rework the texts, just as we have volunteers today providing revised editions of earlier texts with HTML and text versions that follow the current formatting guidelines. As always, volunteer in the ways you see fit, but I suspect many here (at least us DPers) would argue that working on new texts hitherto unavailable to PG is probably a better use of your time than providing multiple reformatted versions of existing works. Bruce http://www.pgdp.net/vision/ For Charlz' vision http://www.tei-c.org/Lite/ For information on TEI-Lite http://www.pgdp.net For volunteering at Distributed Proofreaders Ron Aitchison writes: > I understand the issue of editing. My proposal would be to supply an > editable file in OpenOffice or MS doc format (BTW if you are not using > the Open Source OpenOffice suite I recommend you check it out - the > features are great, at least as rich as MS Word, plus one-button > PDF creation, output as doc, text or native XML format and a > great price = $0! http://www.openoffice.org ). > I propose to take nothing away; you will have edit control over the file. > This also opens up another question about which base document formats you > have standardized on for editability and portability, e.g. OASIS etc. Maybe > that is a topic for another list. > Finally I note you have PDF formats available for some other books.
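As a rough illustration of the "one master, several outputs" idea in the message above, here is a minimal Python sketch that derives both HTML and wrapped plain text from a tiny TEI-like fragment. Everything in it is an assumption made for the example: the sample markup uses only `head` and `p` elements, the mapping to `h2` and upper-case headings is arbitrary, and the function names are invented. It is not DP's or PG's actual tooling, which had not been specified at the time of this thread.

```python
import textwrap
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample master: a tiny TEI-like fragment, not a real PG text.
SAMPLE = """<TEI.2><text><body><div>
<head>Chapter 1</head>
<p>A short sample paragraph,
wrapped across two source lines.</p>
</div></body></text></TEI.2>"""


def _clean(element) -> str:
    """Collect an element's text and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def to_html(root) -> str:
    """Map head -> h2 and p -> p, in document order."""
    parts = []
    for el in root.iter():
        if el.tag == "head":
            parts.append(f"<h2>{escape(_clean(el))}</h2>")
        elif el.tag == "p":
            parts.append(f"<p>{escape(_clean(el))}</p>")
    return "\n".join(parts)


def to_plain(root, width: int = 72) -> str:
    """Upper-case headings, wrap paragraphs, blank line between blocks."""
    parts = []
    for el in root.iter():
        if el.tag == "head":
            parts.append(_clean(el).upper())
        elif el.tag == "p":
            parts.append(textwrap.fill(_clean(el), width=width))
    return "\n\n".join(parts)


root = ET.fromstring(SAMPLE)
print(to_html(root))
print(to_plain(root))
```

The point of the sketch is only that both outputs come from the same parsed tree, so a correction made once in the master shows up in every derived format.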
From jon_niehof at yahoo.com Fri Feb 25 17:20:08 2005 From: jon_niehof at yahoo.com (Jon Niehof) Date: Fri Feb 25 17:21:04 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <16927.48790.533173.228950@celery.zuhause.org> Message-ID: <20050226012008.95779.qmail@web41601.mail.yahoo.com> > As always, volunteer in the ways you see fit, but I suspect > many here (at least us DPers) would argue that working on new > texts hitherto unavailable to PG is probably a better use of > your time than providing multiple reformatted versions of > existing works. I would agree; it seems to me that converting into a format that cannot be programmatically converted into other formats (including other "master" formats like DP-TEI, whenever that gets specified), is rather a waste of one's time. Anything that isn't a value-add (like converting straight text to Word or PDF without adding, say, bookmark information) also strikes me as not too useful. I could blast all of PG into Weasel format without a lot of trouble, for example, but I don't see a benefit as anybody who could make use of it could easily do the conversion as well. (pie-in-the-sky: being able to on-the-fly convert TEI to format of user's choice on download would be nearly Grail-like.) __________________________________ Do you Yahoo!? Yahoo! Mail - now with 250MB free storage. Learn more. 
http://info.mail.yahoo.com/mail_250 From jtinsley at pobox.com Fri Feb 25 18:04:52 2005 From: jtinsley at pobox.com (Jim Tinsley) Date: Fri Feb 25 18:05:53 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <20050226012008.95779.qmail@web41601.mail.yahoo.com> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> Message-ID: <20050226020452.GA24272@panix.com> On Fri, Feb 25, 2005 at 05:20:08PM -0800, Jon Niehof wrote: >> As always, volunteer in the ways you see fit, but I suspect >> many here (at least us DPers) would argue that working on new >> texts hitherto unavailable to PG is probably a better use of >> your time than providing multiple reformatted versions of >> existing works. > >I would agree; it seems to me that converting into a format that >cannot be programmatically converted into other formats >(including other "master" formats like DP-TEI, whenever that >gets specified), is rather a waste of one's time. > >Anything that isn't a value-add (like converting straight text >to Word or PDF without adding, say, bookmark information) also >strikes me as not too useful. I could blast all of PG into >Weasel format without a lot of trouble, for example, but I don't >see a benefit as anybody who could make use of it could easily >do the conversion as well. > Well put. What we call "blind format conversions" -- conversions from one format to another, based on your own preferences, without any value-added input such as, say, illustrations from an eligible edition -- are not things that we really want to post, without some special reason. We have done it in the past, and it hasn't worked well. Sites like Blackmask http://blackmask.com do a better job of managing such content than we do, and in fact David Moynihan of Blackmask has offered us all of his converted files if we want them. We discussed it a few years ago, and decided against. 
>(pie-in-the-sky: being able to on-the-fly convert TEI to format >of user's choice on download would be nearly Grail-like.) You don't need TEI just for conversion. Today, HTML is the Universal Format for converting _from_. It may not be so always, and HTML has limits; it ain't great on mathematical texts, for instance, but given HTML, you can very easily get to any of the common reader formats in one step. jim From prosfilaes at gmail.com Fri Feb 25 18:57:42 2005 From: prosfilaes at gmail.com (David Starner) Date: Fri Feb 25 18:58:39 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <421FA5B1.2080806@zytrax.com> References: <421FA5B1.2080806@zytrax.com> Message-ID: <6d99d1fd05022518573b49c21a@mail.gmail.com> On Fri, 25 Feb 2005 17:24:49 -0500, Ron Aitchison writes: > Finally I note you have PDF formats available for some other books. Primarily from TeX, which makes it easy to generate, and primarily for mathematical and scientific documents that pretty much have to be done in TeX. Jim Tinsley writes: > Today, HTML is the Universal > Format for converting _from_. It may not be so always, and > HTML has limits; it ain't great on mathematical texts, More importantly, HTML can't really do footnotes, and I doubt anything is doing decent transformations on what we kludge sidenotes into. From ron at zytrax.com Fri Feb 25 19:04:01 2005 From: ron at zytrax.com (Ron Aitchison) Date: Fri Feb 25 19:05:48 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <20050226020452.GA24272@panix.com> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> Message-ID: <421FE721.2000004@zytrax.com> Whoa there. Clearly I walked into a minefield and feel in imminent danger of having various limbs blasted from my poor undeserving corpus. Let me state my point of view or why I made the offer and why I think perhaps trees and forests may be getting a little confused. 
Now I'm new to this stuff and many of you good folks have labored for years so if I lay a few mines of my own - so be it. 1. The primary reason for my offer was simply that, since I found the simple text version unpleasant to read, I thought there might be others who feel the same, and that having a choice of formats available might make the output - the books - more approachable and hence reach a wider audience, with all the good things that must flow from that. Seems to me this is what PG is about - outreach. 2. I fully understand the issue of editable text and rampant variations - a maintenance nightmare. Untenable. So let me address the issue of maintenance and incidentally why I do not think that my offer need cause the end of the world as we know it. There are two parts to this argument: 1. The basic format that I have converted to is OpenOffice's XML format, from which multiple conversions - PDF and MS doc if you want - are derived, all essentially driven from a set of DTDs. My brief reading of TEI is that it too uses an XML base. So we have a trivial level of commonality as a starting point. By looking at the conversion processes we could have a WYSIWYG editor off-the-shelf at $0 cost, with output convertible to TEI by driving it through appropriate XSLTs and all that good stuff. OpenOffice has a pilot development with DocBook to do something similar. It is not making much progress but with the right effort it could. 2. The second point relates to the difficulty, or the likelihood of success, of conversion. I used 4 styles in the book: Header 1, paragraph, page header and page footer (the last two could be easily removed but are tactically useful because of page numbering). For a simple text book I see no reason to use any more, and the cost of replacing the header/footer with an alternate implementation is trivial in the extreme.
Hard pagination is perhaps a bit more difficult to handle and I'm not sure I should have done it but in the absence of any instructions/suggestions to the contrary I did. So a set of simple rules in the period before an idealized solution is available would significantly reduce difficulties. Now whether TEI is better than DocBook or a converged OASIS standard is not for me to say. But it does seem to me there is a way forward in the short term by making the right intercepts - a combination of technology and rules - without building up a redundant and unmanageable nightmare. Or am I wrong? Finally does anyone want my pathetic conversions of Northanger Abbey and Persuasion !! -:) Or is it thanks but no thanks! Jim Tinsley wrote: >On Fri, Feb 25, 2005 at 05:20:08PM -0800, Jon Niehof wrote: > > >>>As always, volunteer in the ways you see fit, but I suspect >>>many here (at least us DPers) would argue that working on new >>>texts hitherto unavailable to PG is probably a better use of >>>your time than providing multiple reformatted versions of >>>existing works. >>> >>> >>I would agree; it seems to me that converting into a format that >>cannot be programmatically converted into other formats >>(including other "master" formats like DP-TEI, whenever that >>gets specified), is rather a waste of one's time. >> >>Anything that isn't a value-add (like converting straight text >>to Word or PDF without adding, say, bookmark information) also >>strikes me as not too useful. I could blast all of PG into >>Weasel format without a lot of trouble, for example, but I don't >>see a benefit as anybody who could make use of it could easily >>do the conversion as well. >> >> >> > >Well put. What we call "blind format conversions" -- conversions >from one format to another, based on your own preferences, >without any value-added input such as, say, illustrations from >an eligible edition -- are not things that we really want to >post, without some special reason. 
We have done it in the past, >and it hasn't worked well. > >Sites like Blackmask http://blackmask.com do a better job of >managing such content than we do, and in fact David Moynihan >of Blackmask has offered us all of his converted files if >we want them. We discussed it a few years ago, and decided >against. > > > >>(pie-in-the-sky: being able to on-the-fly convert TEI to format >>of user's choice on download would be nearly Grail-like.) >> >> > >You don't need TEI just for conversion. Today, HTML is the Universal >Format for converting _from_. It may not be so always, and >HTML has limits; it ain't great on mathematical texts, for >instance, but given HTML, you can very easily get to any of >the common reader formats in one step. > >jim > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > -- Ron Aitchison http://www.zytrax.com ZyTrax mailto:r.aitchison@zytrax.com 70 rue Notre Dame West Montreal Quebec H2Y 1S6 Tel:(514) 285.9088 From jon at noring.name Fri Feb 25 19:26:13 2005 From: jon at noring.name (Jon Noring) Date: Fri Feb 25 19:27:12 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <421FE721.2000004@zytrax.com> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> Message-ID: <7815066468.20050225202613@noring.name> Ron wrote: > 1. The basic format that I have converted to is OpenOffice 's XML format > from which multiple conversions - PDF and MS doc if you want - are > derived. . All essentially driven from a set of DTD's. My brief reading > of TEI is that it too uses an XML base. So we have a trivial level of > commonality as a starting point. 
By looking at the conversion processes > we could have a WSYIWYG editor off-the-shelf at $0 cost with output > convertible to TEI output by driving it through appropriate XLST's and > all that good stuff. OpenOffice has a pilot development with DocBook to > do something similar. It is not making much progress but with the right > effort it could. For maximum archivability, repurposeability and accessibility, it is important for the XML markup vocabulary used in the master document to be wholly structural and semantic. Except where absolutely necessary (and maybe best solved using SVG and MathML), presentational markup should be avoided. TEI is primarily structural/semantic, but there are some presentational components. The base DP-TEI (I envision three levels of DP-TEI), when it comes into being, should not specify any presentational markup components. I am not familiar with OpenOffice's XML vocabulary, but I would guess that it, too, is a mix of structural/semantic tags with presentation tags (I also guess that it is much more presentationally-oriented than TEI, and doesn't have the structural/semantic richness of TEI.) If OpenOffice's XML vocabulary is to be used, it should be subsetted (at least at the base level) to not allow presentational markup. I do not recommend DocBook as the primary markup vocabulary for general books, but certainly it is intriguing to consider it as a second "blessed" vocabulary for particular types of documents it is designed for (primarily technical documents.) Just my $0.02 worth. 
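To make the repurposing argument a little more concrete, here is a rough sketch in Python of one structurally marked-up fragment being turned into two different presentations. The element names (div, head, p, note with place="foot") are ordinary TEI ones, but the sample text, the function names and both renderers are invented purely for illustration; this is not DP-TEI and not any existing PG tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample: the markup says what things ARE, not how they should look.
SAMPLE = """<div type="chapter">
  <head>Chapter I</head>
  <p>It was a dark night.<note place="foot">A later edition says stormy.</note> Nobody stirred.</p>
</div>"""


def body_and_notes(p, notes):
    """Return paragraph text with [n] markers; note texts are appended to `notes`."""
    pieces = [p.text or ""]
    for child in p:
        if child.tag == "note":
            notes.append("".join(child.itertext()))
            pieces.append("[%d]" % len(notes))
        else:
            pieces.append("".join(child.itertext()))
        pieces.append(child.tail or "")
    return "".join(pieces)


def to_html(xml_source):
    """One possible presentation: headings as <h2>, footnotes collected at the end."""
    root = ET.fromstring(xml_source)
    notes, out = [], []
    for el in root:
        if el.tag == "head":
            out.append("<h2>%s</h2>" % escape(el.text or ""))
        elif el.tag == "p":
            out.append("<p>%s</p>" % escape(body_and_notes(el, notes)))
    for n, note in enumerate(notes, 1):
        out.append('<p class="footnote">[%d] %s</p>' % (n, escape(note)))
    return "\n".join(out)


def to_plain_text(xml_source):
    """Another presentation from the same source: upper-case headings, blank-line paragraphs."""
    root = ET.fromstring(xml_source)
    notes, out = [], []
    for el in root:
        if el.tag == "head":
            out.append((el.text or "").upper())
        elif el.tag == "p":
            out.append(body_and_notes(el, notes))
    for n, note in enumerate(notes, 1):
        out.append("[%d] %s" % (n, note))
    return "\n\n".join(out)


if __name__ == "__main__":
    print(to_html(SAMPLE))
    print(to_plain_text(SAMPLE))
```

Neither renderer had to guess what a heading or a footnote was, which is the whole point; had the source said "bold, 14 point" instead of head, the plain text version would have had nothing to go on.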
Jon Noring From jtinsley at pobox.com Fri Feb 25 19:29:24 2005 From: jtinsley at pobox.com (Jim Tinsley) Date: Fri Feb 25 19:30:22 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <421FE721.2000004@zytrax.com> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> Message-ID: <20050226032924.GA29574@panix.com> On Fri, Feb 25, 2005 at 10:04:01PM -0500, Ron Aitchison wrote: >Whoa there. Clearly I walked into a minefield and feel in imminent >danger of having various limbs blasted from my poor undeserving corpus. Minefield, yes. We really should put a sign up at the gates. :-) But nobody wants to blast you, I promise. It's an old, old subject, and we've tried various things at various times over the last 5 years or so -- some tries even pre-date that. I don't think there's one we don't regret. So it's not like we're dismissing your idea out of hand; it's one of those things that we've all thought of, and we'd all like to do, and we never quite forget it, and it pops up now and again even among old hands, but it's a net negative. And there's a lot of people here who have a lot of experience of the subject. There was probably a time when even I thought that posting individual blind format conversions was a good idea, but it must have been long ago. >Let me state my point of view or why I made the offer and why I think >perhaps trees and forests may be getting a little confused. Now I'm new >to this stuff and many of you good folks have labored for years so if I >lay a few mines of my own - so be it. >1. The primary reason for my offer was simply that since I found the >simple text version unpleasant to read I thought there may be others and >that having a choice of formats available may make the output - the >books - more approachable hence reach a wider audience and all the good >things that must flow from that.
Seems to me this is that GP is about - >outreach. >2. I fully understand the issue of editable text. and rampant variations >- a maintenance nightmare. Untenable. >So let me address the issue of maintenance and incidentally why I do not >think that my offer need cause the end of the world as we know it. >There are two parts to this argument: >1. The basic format that I have converted to is OpenOffice 's XML format >from which multiple conversions - PDF and MS doc if you want - are >derived. . All essentially driven from a set of DTD's. My brief reading >of TEI is that it too uses an XML base. So we have a trivial level of >commonality as a starting point. By looking at the conversion processes >we could have a WSYIWYG editor off-the-shelf at $0 cost with output >convertible to TEI output by driving it through appropriate XLST's and >all that good stuff. OpenOffice has a pilot development with DocBook to >do something similar. It is not making much progress but with the right >effort it could. >2. The second point relates to the difficulty, of success possibility, >of conversion. I used 4 styles in the book. Header 1, paragraph, page >header and page footer (the last two could be easily removed but are >tactically useful because of page numbering). For a simple text book I >see no reason to use any more and the cost of replacement of header >/footer with an alternate implementation is trivial in the extreme. Hard >pagination is perhaps a bit more difficult to handle and I'm not sure I >should have done it but in the absence of any instructions/suggestions >to the contrary I did. So a set of simple rules in the period before an >idealized solution is available would significantly reduce difficulties. >Now whether TEI is better than DocBook or a converged OASIS standard is >not for me to say. 
>But it does seem to me there is a way forward in the short term by >making the right intercepts - a combination of technology and rules - >without building up a redundant and unmanageable nightmare. Or am I wrong? >Finally does anyone want my pathetic conversions of Northanger Abbey and >Persuasion !! -:) Or is it thanks but no thanks! Your conversions may well be lovely; their quality isn't at all an issue here. It's just not something that we do, except under some compelling special circumstances. jim From Bowerbird at aol.com Fri Feb 25 20:01:24 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Feb 25 20:02:28 2005 Subject: [gutvol-d] Enlightened Self Interest Message-ID: jon niehof said: > (pie-in-the-sky: > being able to on-the-fly convert TEI > to format of user's choice on download > would be nearly Grail-like.) well, then _totally_ "grail-like" would be for users to have a tool that enables them to convert the sole maintained download format to any other format they might have a need for. i created the format -- zen markup language -- and am rapidly finishing programming the tool... that is all for now. talk amongst yourselves... -bowerbird From donovan at abs.net Sat Feb 26 04:29:38 2005 From: donovan at abs.net (D Garcia) Date: Sat Feb 26 04:31:54 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <16927.48790.533173.228950@celery.zuhause.org> References: <421FA5B1.2080806@zytrax.com> <16927.48790.533173.228950@celery.zuhause.org> Message-ID: <200502260729.38650.donovan@abs.net> On Friday 25 February 2005 07:11 pm, Bruce Albrecht wrote: > I don't know if DP will try to replace all previous versions of texts > with TEI-Lite documents, but my guess is that once a system is in > place, there will be volunteers that will go back and rework the > texts, just as we have volunteers today providing revised editions of > earlier texts with HTML and text versions that follow the current > formatting guidelines. 
> > As always, volunteer in the ways you see fit, but I suspect many here > (at least us DPers) would argue that working on new texts hitherto > unavailable to PG is probably a better use of your time than providing > multiple reformatted versions of existing works. I'm one of the volunteers who is going back and providing reworked versions of existing older PG texts, and my approximate criteria for selection are: Older than (roughly) number 7000, is only in text version at PG, text version has many "hard" errors (tbe, arc, arid, etc. as opposed to "soft" problems such as formatting), illustrations not present, and most importantly, ones that I have a physical copy of the book from which to make the corrections from. This clearly falls under the "value-added" category of thinking. While I share your position that simple reformatting is mostly a waste of time, going back and rehabilitating existing works is not, and I hope that people interested in working on that aspect are not discouraged. I think of it much like carpentry; there are some people who are more of a framing temperament, those who are interested in finish work, and those who like to do restoration or renovation work. All of those skills/mindsets are necessary to complete a strong and attractive project. From bruce at zuhause.org Sat Feb 26 07:43:23 2005 From: bruce at zuhause.org (Bruce Albrecht) Date: Sat Feb 26 07:44:30 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <200502260729.38650.donovan@abs.net> References: <421FA5B1.2080806@zytrax.com> <16927.48790.533173.228950@celery.zuhause.org> <200502260729.38650.donovan@abs.net> Message-ID: <16928.39195.282423.849192@celery.zuhause.org> D Garcia writes: > I'm one of the volunteers who is going back and providing reworked versions of > existing older PG texts, and my approximate criteria for selection are: Older > than (roughly) number 7000, is only in text version at PG, text version has > many "hard" errors (tbe, arc, arid, etc. 
as opposed to "soft" problems such > as formatting), illustrations not present, and most importantly, ones that I > have a physical copy of the book from which to make the corrections from. > > This clearly falls under the "value-added" category of thinking. While I share > your position that simple reformatting is mostly a waste of time, going back > and rehabilitating existing works is not, and I hope that people interested > in working on that aspect are not discouraged. I agree that your type of updates is needed for the older PG titles, and don't consider it a waste of time. However, it was my impression that Ron was offering to provide uncorrected reformatted editions of the titles in question. From joshua at hutchinson.net Sat Feb 26 08:25:13 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Sat Feb 26 08:26:01 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <20050226032924.GA29574@panix.com> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> <20050226032924.GA29574@panix.com> Message-ID: <4220A2E9.7010300@hutchinson.net> Ok, I leave the computer for one night and you all go nuts with the posts! :) hehe Anyway, as one of the people working on PGTEI, I figure this discussion could use an update on where things stand. Currently, my efforts have concentrated on two fronts. 1 - Converting those texts that come through me from DP into PGTEI master format. I then use the online PGTEI -> HTML conversion routine to convert them to HTML for posting to PG. Most of them are not converted to TEXT simply because someone else at DP did the text version before I got to them. In other words, I've been mostly concentrating on the PGTEI format itself and the HTML output that results from it. Here is a recent link to a posted book... from off the top of my head.
There are many more; I just don't have the list here on this computer. (Last count there were 20+ documents that I've put in PGTEI format sitting on my computer... most of which have been posted to the PG archives in HTML and/or TEXT format.) http://www.gutenberg.org/dirs/1/4/9/8/14986/14986-h/14986-h.htm Experimental Researches in Electricity, Volume 1 This is a pretty straightforward text, but it has an automatically produced Table of Contents and the generated footnotes, so it gives some idea of where we are at. One of the things I plan on fixing in the future is the lack of links from the footnote text BACK to the footnote anchor in the main text. 2 - Updating/expanding the PGTEI documentation. I've got more notes than I know what to do with and many many pages of additional documentation written in a rough draft. *** The eventual end I am hoping for is a standard encoding that makes conversion to other formats easy and quick. For instance, one of my next projects will be to take one of the VERY nasty math texts that DP has produced in TeX format and convert it to PGTEI. TEI uses TeX encoding for the math equations themselves, but the rest of the formatting is a little more intuitive AND because of the validation routines we have available, much easier to develop and fix. But, since I haven't tried the TeX on a massive scale yet within a PGTEI document, I don't know what bugs and gotchas I'm going to find. If there are any questions (or if anyone wants to see some of the PGTEI documents I've created, rough drafts of the documentation I've been working on, etc), please let me know. Josh From Bowerbird at aol.com Sat Feb 26 10:27:07 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sat Feb 26 10:28:26 2005 Subject: [gutvol-d] one more thing, for jon noring Message-ID: oh yeah, one more thing before i return to my laboratory. for jon noring. or 2 things, actually. no, make that 3.
first, jon, since you've been makin' some big noises about "my antonia", could you please make available a .zip file containing all of your image-scans and the o.c.r. output? i plan on using them in a nice little project of mine, and downloading the scans one at a time is a pain in the neck. second, since you regularly assert your insistence that markup must be "semantic" rather than "presentational", can you elucidate the structural aspects that typically should be marked up in books? that list would include things like chapter-headings, footnotes, block-quotes; and what else? would also be nice if you could say _how_ these things should be marked up, with actual examples, but since even the .tei experts can't seem to agree on it... third, over on the bookpeople list, john mark ockerbloom moderated out my replies to your late-december posts where you issued some "friendly challenges" to me; but let it be known that my replies accepted your challenges. i'll be creating a space soon where we can discuss them... -bowerbird From marcello at perathoner.de Sat Feb 26 10:57:35 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat Feb 26 13:08:56 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <4220A2E9.7010300@hutchinson.net> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> <20050226032924.GA29574@panix.com> <4220A2E9.7010300@hutchinson.net> Message-ID: <4220C69F.1060301@perathoner.de> Joshua Hutchinson wrote: > If there are any questions (or if anyone wants to see some of the PGTEI > documents I've created, rough drafts of the documentation I've working > on, etc), please let me know. I'd like to see the draft documentation. 
-- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Sat Feb 26 11:05:13 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat Feb 26 13:08:59 2005 Subject: [gutvol-d] Filesystem changes to the web site In-Reply-To: <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> Message-ID: <4220C869.9010406@perathoner.de> Aaron Cannon wrote: > Any chance we could get a specific timeline on when these changes would > be taking place? I just want to be sure we don't miss any CD/DVD requests. The timeline is: as soon as I get it done. I assume your form processor writes a log file of the requests somewhere. At present there are two copies of your program. You should go to the /public/vhost/g/gutenberg/html/ directory on login.ibiblio.org and edit that copy of the form processor so it writes the log into the new file hierarchy. The old copy under /public/html/gutenberg/ will still write the log to the old location. When we switch over, the new form will start writing the new log and you'll just have to pick up the old log once manually before we delete the old directory. Test your new form under www-dev.gutenberg.org -- Marcello Perathoner webmaster@gutenberg.org From jon at noring.name Sat Feb 26 13:06:31 2005 From: jon at noring.name (Jon Noring) Date: Sat Feb 26 13:09:07 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: References: Message-ID: <1817862109.20050226140631@noring.name> Bowerbird wrote: > first, jon, since you've been makin' some big noises about > "my antonia", could you please make available a .zip file > containing all of your image-scans and the o.c.r. output? > i plan on using them in a nice little project of mine, and > downloading the scans one at a time is a pain in the neck. Good idea. Unfortunately I do not have OCR output, but I have the page scans. 
I'll zip up the 600 dpi 2-color (B&W) scans which have already gone through a clean-up stage (they will be PNG files, and occupy if memory serves me right, about 50 megs of space.) These should import nicely into an OCR program. If you don't have an OCR program, someone here may offer to do that for you. (Note that the page scans which are individually linked from the My Antonia online document were resampled from the 600 dpi 2-color scans to 120 dpi with greyscale antialiasing to improve legibility at lower resolutions -- the 120 dpi versions probably are not as good to use for OCRing.) Anyone? > second, since you regularly assert your insistence that > markup must be "semantic" rather than "presentational", > can you elucidate the structural aspects that typically > should be marked up in books? that list would include > things like chapter-headings, footnotes, block-quotes; > and what else? would also be nice if you could say _how_ > these things should be marked up, with actual examples, > but since even the .tei experts can't seem to agree on it... Also a very good suggestion. Remind me if I don't answer anytime soon. Got a lot of projects on my plate (and just got done with a several day project to upgrade the hardware, OS and software on my main computer.) Yes, the TEI people also disagree, but that's because the full vocabulary of TEI is quite extensive. When I talked with Charles last year on this topic, his vision at the time seemed to be that DP will settle upon a required base subset, maybe an extended subset that those who are interested can use but that's not required for basic support (e.g., including semantic information as to who speaks a particular quote, which can be marked up but is probably overkill for basic markup support.) I should probably make the inquiry over at the DP forums, but those working with DP who are familiar with DP's consideration of blessing a TEI subset for its master documents, let me know. 
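On the resampling mentioned above: going from 600 dpi to 120 dpi is a factor of 5 in each direction, so each low-resolution pixel stands for a 5x5 block of bitonal pixels, and the grey level can simply reflect how much of that block is ink. The following is only a sketch of that idea in plain Python (a simple box filter on a list of rows). The function name and the 1 = ink, 0 = paper convention are assumptions for the example, and it is not necessarily what the image software actually used for the My Antonia scans does.

```python
def downsample_bitonal(pixels, factor=5):
    """Shrink a bitonal image (rows of 1 = ink, 0 = paper) by `factor`.

    Each output pixel is a grey level from 0 (all ink) to 255 (all paper),
    computed from the share of ink in the matching factor x factor block.
    Rows and columns that do not fill a whole block are dropped.
    """
    height = len(pixels) // factor
    width = len(pixels[0]) // factor if pixels else 0
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            ink = sum(
                pixels[y][x]
                for y in range(by * factor, (by + 1) * factor)
                for x in range(bx * factor, (bx + 1) * factor)
            )
            row.append(255 - (255 * ink) // (factor * factor))
        out.append(row)
    return out


if __name__ == "__main__":
    # Invented 5x10 strip: the left half is solid ink, the right half is blank paper.
    strip = [[1] * 5 + [0] * 5 for _ in range(5)]
    print(downsample_bitonal(strip))
```

A block that is only partly ink comes out as an intermediate grey, which is where the improved legibility at low resolution comes from, and also why the result is a poorer input for OCR than the original bitonal scan.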
> third, over on the bookpeople list, john mark ockerbloom > moderated out my replies to your late-december posts > where you issued some "friendly challenges" to me; but > let it be known that my replies accepted your challenges. > i'll be creating a space soon where we can discuss them... Thanks. I look forward to it! (Really, I do.) Jon From jon at noring.name Sat Feb 26 14:12:44 2005 From: jon at noring.name (Jon Noring) Date: Sat Feb 26 14:14:01 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: References: Message-ID: <9221835046.20050226151244@noring.name> Bowerbird asked: > first, jon, since you've been makin' some big noises about > "my antonia", could you please make available a .zip file > containing all of your image-scans and the o.c.r. output? The 600 dpi bitonal page scans of My Antonia (as PNG, archived in ZIP) now available via: http://www.openreader.org/myantonia I encourage others to download the ZIP to preserve the page scans. But be forewarned the ZIP file is 49 megs in size. Using one of the CCITT bitonal compression algorithms it would be possible to do better with lossless compression, maybe 50% better than the currently used PNG. But virtually everyone can view PNG files, while those CCITT algorithms (usually encapsulated in TIFF) are oftentimes obscure. Jon From jon at noring.name Sat Feb 26 17:03:31 2005 From: jon at noring.name (Jon Noring) Date: Sat Feb 26 17:04:53 2005 Subject: [gutvol-d] ZML added (was one more thing, for jon noring) In-Reply-To: <9221835046.20050226151244@noring.name> References: <9221835046.20050226151244@noring.name> Message-ID: <3032082515.20050226180331@noring.name> Btw, to the "My Antonia" beta page I've added an entry for "regularized" plain text, with one format in this category being Bowerbird's ZML. I have heard of a couple other systems being touted for regularized plain text, but none of them are being discussed in Project Gutenberg. 
Jon From joshua at hutchinson.net Sat Feb 26 18:39:27 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Sat Feb 26 18:40:21 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: References: Message-ID: <422132DF.4040508@hutchinson.net> Bowerbird@aol.com wrote: >second, since you regularly assert your insistence that >markup must be "semantic" rather than "presentational", >can you elucidate the structural aspects that typically >should be marked up in books? that list would include >things like chapter-headings, footnotes, block-quotes; >and what else? would also be nice if you could say _how_ >these things should be marked up, with actual examples, >but since even the .tei experts can't seem to agree on it... > Hmm, I've yet to find a TEI "expert" that doesn't agree on the fundamental markups.

<p> is a paragraph container.
<head> for a divisional (chapter, section, part, etc) heading.
<note place="foot"> for a footnote.
* replace "foot" with "margin" or "endnote" as appropriate for other note markers.
<quote> for a block quote.
<figure> for an inline illustration.
***
The problems with TEI don't tend to lie in the markup, but rather in the conversion of said markup to a final presentation format. And usually then it is in markup that requires a bit of intelligence on the part of the rendering engine ... like complex tables, for instance. Josh From Bowerbird at aol.com Sun Feb 27 01:31:43 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sun Feb 27 01:33:10 2005 Subject: [gutvol-d] one more thing, for jon noring Message-ID: <157.4b53b442.2f52ed7f@aol.com> jon noring said: > Unfortunately I do not have OCR output did you do o.c.r. on it? if you can retrieve the output, that would be good. it would allow people to do research on assessing/improving o.c.r. quality, and assist programmers in developing post-o.c.r. text-cleanup programs. (but, from later posts, it looks like you grabbed the text from elsewhere. so what you've done is "blessed" somebody else's work as "trustworthy", presumably after checking it, and maybe correcting it. you could also have done that same thing using project gutenberg's version of the text, since my comparison of the two files shows them to be very similar, so much so that i expect they were indeed based on the same version.) > I'll zip up the 600 dpi 2-color (B&W) scans > which have already gone through a clean-up stage > (they will be PNG files, and occupy if memory serves me right, > about 50 megs of space those are too big for my purposes, and for me to download. but if i could reimburse you for sending them to me on a cd? or the 120-dpi versions would work just fine for my project, the same ones that are on the website, just zipped together. > Remind me if I don't answer anytime soon. sure thing. > Thanks. I look forward to it! (Really, I do.) great.
-bowerbird From cannona at fireantproductions.com Sun Feb 27 05:30:22 2005 From: cannona at fireantproductions.com (Aaron Cannon) Date: Sun Feb 27 05:32:45 2005 Subject: [gutvol-d] Filesystem changes to the web site In-Reply-To: <4220C869.9010406@perathoner.de> References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> <4220C869.9010406@perathoner.de> Message-ID: <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com> At 01:05 PM 2/26/2005, you wrote: >The timeline is: as soon as I get it done. > >I assume your form processor writes a log file of the requests somewhere. >At present there are two copies of your program. > >You should go to the /public/vhost/g/gutenberg/html/ directory on >login.ibiblio.org and edit that copy of the form processor so it writes >the log into the new file hierarchy. The old copy under >/public/html/gutenberg/ will still write the log to the old location. When >we switch over, the new form will start writing the new log and you'll >just have to pick up the old log once manually before we delete the old >directory. > >Test your new form under > > www-dev.gutenberg.org I'm actually thinking it might be easier to take the system down for a couple days during the switch over. That way, I can copy the old database into the new directory without having to wonder which requests went where. I assume that you will be giving the go-ahead to Ibiblio once you've tested everything. Would it be at all possible to drop me an e-mail a day or so before you think you'll be contacting them so I can take things offline? Thanks! Sincerely Aaron Cannon >-- >Marcello Perathoner >webmaster@gutenberg.org > > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) 
From hacker at gnu-designs.com Sun Feb 27 06:21:24 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Sun Feb 27 06:23:30 2005 Subject: [gutvol-d] Filesystem changes to the web site In-Reply-To: <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com> References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> <4220C869.9010406@perathoner.de> <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com> Message-ID: > I'm actually thinking it might be easier to take the system down for > a couple days during the switch over. That way, I can copy the old > database into the new directory without having to wonder which > requests went where. How do these changes affect those of us who maintain mirrors? I've noticed that rsync'ing the main filesystem to keep up to date has left me with two copies of the tree now. Is this intentional? It's also taking twice the amount of space. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From nwolcott at dsdial.net Sun Feb 27 03:59:00 2005 From: nwolcott at dsdial.net (N Wolcott) Date: Sun Feb 27 08:43:09 2005 Subject: [gutvol-d] Pepys' birthday References: <1e2.36266c1c.2f4e4543@aol.com> <20050223213539.GB14264@reactor-core.org> Message-ID: <006801c51ceb$18df3220$399495ce@gw98> It occurred to me that if there are not too many xxx portions of the diary, then under the "fair use" doctrine one could write a "scholarly article" on the topic of "editorial squeamishness" and include the referenced passages as footnotes and publish it in a scholarly place like the PG archive?? ----- Original Message ----- From: "Jonathan Walther" To: "Project Gutenberg Volunteer Discussion" Sent: Wednesday, February 23, 2005 4:35 PM Subject: Re: [gutvol-d] Pepys' birthday > On Wed, Feb 23, 2005 at 03:44:51PM -0500, Gutenberg9443@aol.com wrote: > > So who is going to complain?
There is a new edition as of about 24 > > years ago, which includes all Pepys's XXX comments that are omitted > > from the earlier edition. > > I seem to recall that Pepys diaries were written in a special shorthand. > The current editions may claim copyright on their "transcriptions" of > the shorthand. > > Anyone game for scanning in the original shorthand, and transcribing it? > > Jonathan > > -- > It's not true unless it makes you laugh, > but you don't understand it until it makes you weep. > > Eukleia: Jonathan Walther > Address: 12706 99 Ave, Surrey, BC V3V2P8 (Canada) > Contact: 604-684-1319 (daytime) > Contact: 604-582-9308 (morning and evening) > Puritan: Purity of faith, Purity of doctrine. Sola Scriptura! > > Patriarchy, Polygamy, Slavery === Fatherhood, Husbandry, Mastery > Matriarchy, Monogamy, Prisons === Wickedness, Stupidity, Buggery > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hyphen at hyphenologist.co.uk Sun Feb 27 10:26:45 2005 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sun Feb 27 10:28:40 2005 Subject: [gutvol-d] Pepys' birthday In-Reply-To: <006801c51ceb$18df3220$399495ce@gw98> References: <1e2.36266c1c.2f4e4543@aol.com> <20050223213539.GB14264@reactor-core.org> <006801c51ceb$18df3220$399495ce@gw98> Message-ID: On Sun, 27 Feb 2005 06:59:00 -0500, "N Wolcott" wrote: | It occurred to me that if there are not too many xxx portions of the diary, | then under the "fair use" doctrine one could write a "scholarly article" on | the topic of "editorial squeamishness" and include the referenced passages | as footnotes and publish it in a scholarly place like the PG arachive?? I don't understand why there should be any problems with xxx portions. Or indeed why you thought it necessary to use xxx. Nowadays absolutely anything goes, a bit of sex does not cause any problems whatsoever, in the UK at least. 
-- Dave F From jon at noring.name Sun Feb 27 12:16:17 2005 From: jon at noring.name (Jon Noring) Date: Sun Feb 27 12:17:55 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: <157.4b53b442.2f52ed7f@aol.com> References: <157.4b53b442.2f52ed7f@aol.com> Message-ID: <15216999687.20050227131617@noring.name> Bowerbird wrote: > jon noring said: > did you do o.c.r. on it? if you can retrieve the output, that would > be good. it would allow people to do research on assessing/improving > o.c.r. quality, and assist programmers in developing post-o.c.r. > text-cleanup programs. No, I did not OCR the scans for producing My Antonia (I did experiment with scanning though). But since the scans exist, any OCR package will import them and scan them. So nothing is "lost". There's no law that says one must OCR them at the same time they are scanned -- they are separate processes and can be decoupled with no loss of anything to anyone at any time. If you need OCRing done, you can probably post a "plea for help" and find someone who has the OCR software packages you'd like to try (I don't have the robust, up-to-date ones -- like Abbyy which I asked a friend to help out with for my experiments -- I just have the cheapo freebies.) The scans are available online for download, as you know. > (but, from later posts, it looks like you grabbed the text from > elsewhere. so what you've done is "blessed" somebody else's work as > "trustworthy", presumably after checking it, and maybe correcting > it. you could also have done that same thing using project > gutenberg's version of the text, since my comparison of the two > files shows them to be very similar, so much so that i expect they > were indeed based on the same version.) I won't go into the gory details, but yes, I took two versions and then combined/diffed them. I then did a very thorough comparison page-by-page to the original source page scans, a la DP. 
We are now in the process of having several people do the same (a page-by-page comparison of the XHTML version to the page scans -- I want at least two people to go over each page) -- anyone reading this, you are welcome to volunteer and help us -- do a few pages just like for DP! The error rate in the XHTML version (still beta) is now very low, and can be considered for all practical purposes a very accurate and textually faithful reproduction of the 1918 1st edition. (But then, maybe I'll be surprised and find a serious error in the text.) In retrospect, this process should have been done via DP instead. But there was a deadline to finish the first beta of the cleaned-up text, so there was no time to have this done at DP. However, I do plan to post a request to the relevant DP forums for final proofing help, as well as to seek help from the DP folk on other matters (such as TEI markup). If DP wishes to go over it in some fashion and incorporate it into their "archive" as well as submit it to PG, that's fine by me (I will not directly contribute the text to PG as I've noted on TeBC.) Regarding the PG version of My Antonia compared to the 1918 1st edition, there are a *lot* of differences. I regularized both texts and ran 'diff' between them, and found over 200 differences, mostly spelling (the PG version uses mostly British spelling but even here it is strangely inconsistent!), but also oddities in punctuation, wrong paragraph breaks, some missing accented characters, a couple places with changed wording, a few misspellings, etc. Of course, whenever I encountered a difference, I went to the original page scans of the 1918 1st edition to verify what was done there. 
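A minimal sketch of the regularize-then-diff comparison described above, assuming Python and its standard difflib module rather than the command-line 'diff' mentioned in the message. The function names and the sample sentences are hypothetical and are not taken from either edition, and the regularization rules (straighten curly quotes, ignore line breaks and spacing) are only one plausible choice. Spelling, punctuation, and accented characters are left alone so that those differences still show up.

```python
import difflib
import re


def regularize(text):
    """Split a text into word and punctuation tokens, ignoring layout."""
    # Straighten curly quotes so typographic variants compare equal
    # (an assumed rule; a real project would pick its own rule set).
    for curly, plain in (("\u2018", "'"), ("\u2019", "'"),
                         ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, plain)
    # Words (with internal apostrophes) or single punctuation marks;
    # whitespace and line breaks are dropped entirely.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def differences(text_a, text_b):
    """Return (kind, a_tokens, b_tokens) for every span where the texts differ."""
    tokens_a, tokens_b = regularize(text_a), regularize(text_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [
        (tag, " ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up sample lines standing in for two editions of the same passage.
edition_a = "The colour of the\nsky was grey."
edition_b = "The color of the sky was gray."
for kind, left, right in differences(edition_a, edition_b):
    print(kind, repr(left), "->", repr(right))
```

Each reported span would still need checking against the page scans to decide which reading is the correct one.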
All 200+ differences with respect to the original text were with the PG version, which I surmise was derived from the British edition of My Antonia, which is noted to have been mangled in editing (Willa Cather was supposedly furious over the quality of editing in the British edition which went beyond just using British spelling for words, such as 'colour' instead of 'color'.) Anyway, when the final proofing is done, I believe the textual error rate will be very low, near zero (but one cannot say it is perfect.) So I think it will be useful for OCR accuracy experiments (which I assume is what you want to perform?) Of course, there's always the issue of hyphenated compound words, figuring out if a hyphenated compound word will have a dash in it or not, but that's another matter. I believe we did pretty good on this, with help from the UNL information as well as textual analysis. >> I'll zip up the 600 dpi 2-color (B&W) scans >> which have already gone through a clean-up stage >> (they will be PNG files, and occupy if memory serves me right, >> about 50 megs of space > those are too big for my purposes, and for me to download. Oops, sorry. They are pretty large for downloading by modem (but with DSL/Cable they can be downloaded pretty quick.) > but if i could reimburse you for sending them to me on a cd? It's on me. In private email send me your address and I'll burn and mail you a disk of the 600 dpi and 120 dpi scans. I do have the original 600 dpi 24-bit color scans (which is overkill -- next time I'll do the raw scans for B&W pages at greyscale), but in PNG they occupy over 5 gigs of disk space! (Don't have a DVD burner yet otherwise I'd send those, too.) > or the 120-dpi versions would work just fine for my project, > the same ones that are on the website, just zipped together. Unfortunately, since the 120-dpi scans are antialiased greyscale (while the 600 dpi are bitonal), the size difference is surprisingly not that different. 
I updated the My Antonia index page to offer all the 120-dpi scans as a single ZIP download, which is still over 30 megs in size:

http://www.openreader.org/myantonia/

Bowerbird, I'll be happy to put up a ZML regularized text version of My Antonia. If I put up plain text, I want the plain text to follow some regularization rules, and ZML is the only game in town actively working with etexts (as far as I know at least -- I do recall two other text regularization schemas, but don't know if the authors are doing anything with them.)

Jon

From marcello at perathoner.de Sun Feb 27 11:25:14 2005
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun Feb 27 12:19:41 2005
Subject: [gutvol-d] Filesystem changes to the web site
In-Reply-To:
References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> <4220C869.9010406@perathoner.de> <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com>
Message-ID: <42221E9A.9050202@perathoner.de>

David A. Desrosiers wrote:

> How do these changes affect those of us who maintain mirrors? I've
> noticed that rsync'ing the main filesystem to keep up to date, has
> duplicated two copies of the tree now. Is this intentional? It's also
> taking twice the amount of space.

Nothing is going to change for the file archive. Just the web site files will be moved to a different file server.

You are not supposed to keep mirrors of the web site. We will implement a net of squids to take load off the main site.

-- 
Marcello Perathoner
webmaster@gutenberg.org

From hacker at gnu-designs.com Sun Feb 27 12:23:20 2005
From: hacker at gnu-designs.com (David A. Desrosiers)
Date: Sun Feb 27 12:26:34 2005
Subject: [gutvol-d] Filesystem changes to the web site
In-Reply-To: <42221E9A.9050202@perathoner.de>
References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> <4220C869.9010406@perathoner.de> <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com> <42221E9A.9050202@perathoner.de>
Message-ID:

> You are not supposed to keep mirrors of the web site. We will
> implement a net of squids to take load off the main site.

I'm mirroring the archive, not the website. Something changed recently, and all of the directories have been moved to a completely new layout, duplicating the tree in a secondary location inside the same parent root. It's doubled the amount of space the archive consumes, which is why I was concerned.

David A. Desrosiers
desrod@gnu-designs.com
http://gnu-designs.com

From marcello at perathoner.de Sun Feb 27 12:57:35 2005
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun Feb 27 12:58:47 2005
Subject: [gutvol-d] Filesystem changes to the web site
In-Reply-To:
References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> <4220C869.9010406@perathoner.de> <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com> <42221E9A.9050202@perathoner.de>
Message-ID: <4222343F.4050404@perathoner.de>

David A. Desrosiers wrote:

> I'm mirroring the archive, not the website. Something changed
> recently, and all of the directories have been moved to a completely
> new layout, duplicating the tree in a secondary location inside the
> same parent root. Its doubled the amount of space the archive
> consumes, which is why I was concerned.

I cannot understand that. The file archive was moved a while ago to the new fileserver but mounted on the same directory.

What commandline are you using to rsync the archive?
-- 
Marcello Perathoner
webmaster@gutenberg.org

From nwolcott at dsdial.net Sun Feb 27 13:03:09 2005
From: nwolcott at dsdial.net (N Wolcott)
Date: Sun Feb 27 13:27:32 2005
Subject: [gutvol-d] Good site
Message-ID: <003401c51d12$d1807240$bc9495ce@gw98>

http://copac.ac.uk/ is a marvellous site I just discovered, at the University of Manchester, probably already known to you experts. On it you can search all the combined British, Scottish, and Irish library catalogues at one time, with results which can be sorted, downloaded in various formats, and saved. A real boon if you are looking for pre-1900 books. The expanded catalogue entries often have information as to pseudonyms, publisher, dates, etc.

N Wolcott nwolcott2@post.harvard.edu

From Bowerbird at aol.com Sun Feb 27 13:35:39 2005
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Sun Feb 27 13:37:16 2005
Subject: [gutvol-d] one more thing, for jon noring
Message-ID: <1ec.35c2cb43.2f53972b@aol.com>

jon said:
> But there was a deadline to finish the first beta of the
> cleaned-up text, so there was no time to have this done at DP.

this text would fly through d.p. in a matter of hours...

> and found over 200 differences

but my comparison shows that most of those are minor, to the point of total insignificance to the average reader. when the focus is narrowed to meaningful differences, the number is less than 20. it is good to correct them -- and the less-significant ones too -- very good, but this is hardly a good example of an error-ridden e-text.

> Unfortunately, since the 120-dpi scans are
> antialiased greyscale (while the 600 dpi are bitonal),
> the size difference is surprisingly not that different.
> I updated the My Antonio index page to include > downloading all the 120-dpi scans in a ZIP file, > which is still over 30 megs in size: the .pngs on the website would seem to be much smaller. roughly 400 of those, at about 20k each, would be 8 megs. is my arithmetic wrong? or am i missing something? -bowerbird From jon at noring.name Sun Feb 27 15:57:13 2005 From: jon at noring.name (Jon Noring) Date: Sun Feb 27 15:58:54 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: <1ec.35c2cb43.2f53972b@aol.com> References: <1ec.35c2cb43.2f53972b@aol.com> Message-ID: <19430256015.20050227165713@noring.name> Bowerbird wrote: > jon said: >> But there was a deadline to finish the first beta of the cleaned-up >> text, so there was no time to have this done at DP. > this text would fly through d.p. in a matter of hours... Well, yes, once it's been put in the queue. Anyone here from DP care to comment on typical times for a book to be proofed in the DP system? (I would have been happy to contribute to the post-processing markup stage.) (Btw, I finished the XHTML before I even finished the scanning, so it would have been delayed in the DP system anyway. But yet, I would have preferred the job be done in the DP system. At this stage it probably won't fit into their work flow.) >> Unfortunately, since the 120-dpi scans are >> antialiased greyscale (while the 600 dpi are bitonal), >> the size difference is surprisingly not that different. >> I updated the My Antonio index page to include >> downloading all the 120-dpi scans in a ZIP file, >> which is still over 30 megs in size: > the .pngs on the website would seem to be much smaller. > roughly 400 of those, at about 20k each, would be 8 megs. > is my arithmetic wrong? or am i missing something? Most of the PNGs are in the 70-80k range (I just rechecked at the online site to make sure something weird didn't happen.) So the 30+ MBytes for the 400+ scans at 120 dpi greyscale/antialiased is about right. 
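The size estimates in this exchange come down to one multiplication. A small sketch in Python, assuming decimal megabytes (1 MB = 1000 KB) and round figures of 400 pages and 75 KB per page as approximations of the numbers quoted above; the function name is made up for illustration.

```python
def total_megabytes(pages, kb_per_page):
    """Estimate the combined size of a set of page scans in decimal megabytes."""
    return pages * kb_per_page / 1000


# The earlier guess: roughly 400 pages at about 20 KB each.
print(total_megabytes(400, 20))

# The rechecked figure: most PNGs are in the 70-80 KB range,
# so 75 KB is used here as an assumed midpoint.
print(total_megabytes(400, 75))
```

Zipping changes these totals very little, since PNG data is already compressed, so the archive stays close to the sum of the individual file sizes.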
Let me know if you want me to snail-mail the scans on CD-ROM.

Jon

From nwolcott at dsdial.net Sun Feb 27 13:57:33 2005
From: nwolcott at dsdial.net (N Wolcott)
Date: Sun Feb 27 16:27:09 2005
Subject: [gutvol-d] Pepys' birthday
References: <1e2.36266c1c.2f4e4543@aol.com><20050223213539.GB14264@reactor-core.org><006801c51ceb$18df3220$399495ce@gw98>
Message-ID: <00cf01c51d2b$e7516e80$bc9495ce@gw98>

The problem with the xxx portions is that they are only available in the copyrighted version of the diaries circa 1970. Since the Mynors-Bright versions indicate where the cuts were made, one could easily marry the additions with the original. Hence the need to use the U.S. "fair use" doctrine if it still exists.

----- Original Message -----
From: "Dave Fawthrop"
To: "Project Gutenberg Volunteer Discussion"
Sent: Sunday, February 27, 2005 1:26 PM
Subject: Re: [gutvol-d] Pepys' birthday

> On Sun, 27 Feb 2005 06:59:00 -0500, "N Wolcott"
> wrote:
>
> | It occurred to me that if there are not too many xxx portions of the diary,
> | then under the "fair use" doctrine one could write a "scholarly article" on
> | the topic of "editorial squeamishness" and include the referenced passages
> | as footnotes and publish it in a scholarly place like the PG arachive??
>
> I don't understand why there should be any problems with xxx portions. Or
> indeed why you thought it necessary to use xxx.
> Nowadays absolutely anything goes, a bit of sex does not cause any problems
> whatsoever, in the UK at least.
>
> --
> Dave F

From Bowerbird at aol.com Sun Feb 27 16:35:49 2005
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Sun Feb 27 16:37:33 2005
Subject: [gutvol-d] one more thing, for jon noring
Message-ID: <85.22504fbb.2f53c165@aol.com>

jon said:
> Most of the PNGs are in the 70-80k range

yes, that was my mistake, sorry.
i was misled by the .djvu version, where most of the pages are <10k. (i've _got_ to see how to use that!) of course, at 30 megs per book -- gosh, i remember when 30 megs was a good-sized _hard-drive_! :+) -- we see why project gutenberg has never put all its scans online. distributed proofreaders keeps _saying_ that they are going to, but as of yet, they haven't done it. d.p. still seems to be constrained by disk-space, even with generous help from ibiblio and internet archive. -bowerbird From jon at noring.name Sun Feb 27 17:00:29 2005 From: jon at noring.name (Jon Noring) Date: Sun Feb 27 17:02:12 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: <85.22504fbb.2f53c165@aol.com> References: <85.22504fbb.2f53c165@aol.com> Message-ID: <9734051828.20050227180029@noring.name> Bowerbird wrote: > jon said: >> Most of the PNGs are in the 70-80k range > yes, that was my mistake, sorry. > i was misled by the .djvu version, > where most of the pages are <10k. > (i've _got_ to see how to use that!) DjVu is cool, but the "openness" and long-term viability of the format is still open to question. The Internet Archive uses it, so they must feel it is open enough to use. There's a dearth of free tools using DjVu, but from what I understand there's no impediment to open source DjVu compile tools. > distributed proofreaders keeps > _saying_ that they are going to, > but as of yet, they haven't done it. Yes, if this is the case, it is mysterious since IA will gladly host them once the etext version is out the door. I do know that some scan sets are encumbered (they are "loaned" to DP under some sort of arrangement, but cannot be made public -- this is somewhat troubling, but hopefully the scans will be made available elsewhere at a future time, such as through IA's scanning activities. One thing I do know is that DP does keep full source metadata for each text they produce, even if that data is not turned over to PG.) > d.p. 
still seems to be constrained > by disk-space, even with generous > help from ibiblio and internet archive. I know that for production purposes they want to use their own servers -- IA is not reliable enough. IA's focus is on archiving and storing, so 24-7 with full-throttle availability is a lower priority to IA, while DP *must* have 24-7 availability and sufficient speed to not keep volunteers waiting. Thus, disk space is an issue for the DP production process, especially in that DP is still a shoestring operation. Anyway, this is my interpretation of what Juliet told me a few months ago. Maybe someone from DP will reply to this... Jon From servalan at ar.com.au Sun Feb 27 18:47:50 2005 From: servalan at ar.com.au (Pauline) Date: Sun Feb 27 18:50:44 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: <9734051828.20050227180029@noring.name> References: <85.22504fbb.2f53c165@aol.com> <9734051828.20050227180029@noring.name> Message-ID: <42228656.5000105@ar.com.au> Jon Noring wrote: > I know that for production purposes they want to use their own servers > -- IA is not reliable enough. IA's focus is on archiving and storing, > so 24-7 with full-throttle availability is a lower priority to IA, > while DP *must* have 24-7 availability and sufficient speed to not > keep volunteers waiting. Thus, disk space is an issue for the DP > production process, especially in that DP is still a shoestring > operation. Anyway, this is my interpretation of what Juliet told me a > few months ago. Maybe someone from DP will reply to this... The short version - very busy tying shoestrings :) : DP does fairly well for 24/7 uptime now. Since migrating to our own server last year, we've had minor network glitches due to routing hassles at the ISP & a few scheduled outages due to upgrades to the DP code. All our projects in various stages of production are kept on the DP production server. 
DP has had a production imbalance: more books are proofed than are post-processed & subsequently posted to PG. Projects are archived off the production server only after they have been posted to PG. Hence over time, we have wound up with ever-decreasing amounts of free disk space. The lack of disk space is not really the issue, the imbalance is.

We are doing our best to address the imbalance by further distributing workload & post-processing more of our in-progress projects. In the interim, disk space is tight, but we are managing for the moment. I have posted a few times about this issue to the DP Forums.

Want to help? - sign up to smooth-read texts before they get posted & help our volunteer post-processors (PPers) & post-processing verifiers (PPVers) post more books to PG. If you can read an ebook, you can help. More info here: http://www.pgdp.net/phpBB2/viewtopic.php?t=13677

An accessible archive of posted projects & images is in the works.

Thanks,
P - one of the DP Site Admins - (pourlean @ DP)

From prosfilaes at gmail.com Sun Feb 27 20:06:35 2005
From: prosfilaes at gmail.com (David Starner)
Date: Sun Feb 27 20:08:15 2005
Subject: [gutvol-d] one more thing, for jon noring
In-Reply-To: <9734051828.20050227180029@noring.name>
References: <85.22504fbb.2f53c165@aol.com> <9734051828.20050227180029@noring.name>
Message-ID: <6d99d1fd05022720067eebfde7@mail.gmail.com>

On Sun, 27 Feb 2005 18:00:29 -0700, Jon Noring wrote:

> There's a dearth of free tools using
> DjVu, but from what I understand there's no impediment to open source
> DjVu compile tools.

What do you mean a dearth of free tools? The djvulibre set seems to be a pretty complete set of tools.
From jon at noring.name Sun Feb 27 20:28:51 2005 From: jon at noring.name (Jon Noring) Date: Sun Feb 27 20:30:53 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: <6d99d1fd05022720067eebfde7@mail.gmail.com> References: <85.22504fbb.2f53c165@aol.com> <9734051828.20050227180029@noring.name> <6d99d1fd05022720067eebfde7@mail.gmail.com> Message-ID: <10946554359.20050227212851@noring.name> David Starner wrote: > Jon Noring wrote: >> There's a dearth of free tools using >> DjVu, but from what I understand there's no impediment to open source >> DjVu compile tools. > What do you mean a dearth of free tools? The djvulibre set seems to be > a pretty complete set of tools. I stand corrected. I'll try the viewer plugin for Opera/Firefox. It looks interesting. Hopefully a Windows-based encoder with GUI front end will eventually be developed. From this perspective, there does appear to be a dearth of free tools for DjVu encoding. Jon From jon at noring.name Sun Feb 27 20:37:18 2005 From: jon at noring.name (Jon Noring) Date: Sun Feb 27 20:39:16 2005 Subject: [gutvol-d] one more thing, for jon noring In-Reply-To: <10946554359.20050227212851@noring.name> References: <85.22504fbb.2f53c165@aol.com> <9734051828.20050227180029@noring.name> <6d99d1fd05022720067eebfde7@mail.gmail.com> <10946554359.20050227212851@noring.name> Message-ID: <19547060671.20050227213718@noring.name> >Jon Noring wrote: > David Starner wrote: >> Jon Noring wrote: >>> There's a dearth of free tools using >>> DjVu, but from what I understand there's no impediment to open source >>> DjVu compile tools. >> What do you mean a dearth of free tools? The djvulibre set seems to be >> a pretty complete set of tools. > I'll try the viewer plugin for Opera/Firefox. It looks interesting. Oops, there's not yet a djvulibre browser viewer plugin for Windows (I use both Opera 7 and FireFox, so I got excited that I could view DjVu files using these browsers in Windows. 
But nada -- stuck with IE6 and LizardTech's plugin.) So for those who do most of their text and graphics processing on Windows, we're still stuck with the payware encoders from LizardTech. This may be one reason why DjVu has not taken off -- the djvulibre developers seem to have little interest at this time in encoders and viewers for Windows-based systems. Not exactly a great marketing decision. Jon From scott_bulkmail at productarchitect.com Sun Feb 27 20:37:56 2005 From: scott_bulkmail at productarchitect.com (Scott Lawton) Date: Sun Feb 27 20:42:52 2005 Subject: [gutvol-d] plain text formats [was: one more thing, for jon noring] In-Reply-To: <15216999687.20050227131617@noring.name> References: <157.4b53b442.2f52ed7f@aol.com> <15216999687.20050227131617@noring.name> Message-ID: >If I put up plain text, I want the plain text to follow >some regularization rules, and ZML is the only game in town actively >working with etexts (as far as I know at least -- I do recall two >other text regularization schemas, but don't know if the authors are >doing anything with them.) There are several schemes in active use, including: wiki markup: http://en.wikipedia.org/wiki/Wiki_markup STX: http://www.zope.org/Members/jim/StructuredTextWiki/FrontPage (Structured Text) by Jim Fulton, e.g. for Zope and ZWiki reStructuredText: http://docutils.sourceforge.net/rst.html for Python's DocUtils Markdown: http://daringfireball.net/projects/markdown/ by John Gruber of Daring Fireball (I'm not sure if any are being used for the same type of etexts as PG, but it seems likely that the overall level and diversity of activity and tools are more important. e.g. I think all of the above include source code, typically using a friendly "attribution" license.) My own approach (from 1995) plus links to several others is here: No-Tags Markup: http://prefab.com/ssl/notagsmarkup.html -- Cheers, Scott S. 
Lawton http://Classicosm.com/ - classic books From jon at noring.name Sun Feb 27 21:09:00 2005 From: jon at noring.name (Jon Noring) Date: Sun Feb 27 21:10:45 2005 Subject: [gutvol-d] plain text formats [was: one more thing, for jon noring] In-Reply-To: References: <157.4b53b442.2f52ed7f@aol.com> <15216999687.20050227131617@noring.name> Message-ID: <12848962687.20050227220900@noring.name> Scott Lawton wrote: >> If I put up plain text, I want the plain text to follow >> some regularization rules, and ZML is the only game in town actively >> working with etexts (as far as I know at least -- I do recall two >> other text regularization schemas, but don't know if the authors are >> doing anything with them.) > There are several schemes in active use, including: > > wiki markup: http://en.wikipedia.org/wiki/Wiki_markup > > STX: http://www.zope.org/Members/jim/StructuredTextWiki/FrontPage > (Structured Text) by Jim Fulton, e.g. for Zope and ZWiki > > reStructuredText: http://docutils.sourceforge.net/rst.html for > Python's DocUtils > > Markdown: http://daringfireball.net/projects/markdown/ by John > Gruber of Daring Fireball Thanks! Markdown is especially interesting since it produces regularized plain text which looks and reads the most like PG plain text, other than Bowerbird's ZML. It's also interesting that there's another ZML in use, so Bowerbird may need to change the acronym he is using for his regularized plain text schema, such as to ZenML: http://rx4rdf.liminalzone.org/RhizML > (I'm not sure if any are being used for the same type of etexts as > PG, but it seems likely that the overall level and diversity of > activity and tools are more important. e.g. I think all of the > above include source code, typically using a friendly "attribution" > license.) > > My own approach (from 1995) plus links to several others is here: > No-Tags Markup: http://prefab.com/ssl/notagsmarkup.html Very useful information. Thanks. 
Jon From Bowerbird at aol.com Sun Feb 27 23:19:33 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sun Feb 27 23:21:23 2005 Subject: [gutvol-d] plain text formats [was: one more thing, for jon noring] Message-ID: jon said: > It's also interesting that there's another ZML in use, > so Bowerbird may need to change the acronym he is using > for his regularized plain text schema, such as to ZenML: fat chance! somebody better warn the new interloper... ;+) interesting page, scott. (except i got a flock of 404s.) one of these days, no-markup markup is finally gonna hit its critical mass... my efforts are aimed at offline e-books, for which i am creating the viewer-app, so my "zen markup language" files aren't intended to be transformed into (x)html (who needs the hassle?), but read directly. the real noise will be about my _viewer_. the sizzle on the steak is that you feed it plain old text, and still get immense power. eventually i'll write a z.m.l. browser plug-in, but for now the more-important priority is to wean people off the browser for reading e-books. so i do not see myself as competing in any way with any of the other no-markup systems today. (nor do i see any of them as competition to me.) i also intend to be much simpler than any of them! -bowerbird From gbnewby at pglaf.org Mon Feb 28 00:13:16 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Mon Feb 28 00:13:17 2005 Subject: [gutvol-d] Filesystem changes to the web site In-Reply-To: <4222343F.4050404@perathoner.de> References: <421CDA26.7060507@perathoner.de> <6.1.2.0.0.20050225171445.01be73b0@mail.fireantproductions.com> <4220C869.9010406@perathoner.de> <6.1.2.0.0.20050227072235.01e2dec0@mail.fireantproductions.com> <42221E9A.9050202@perathoner.de> <4222343F.4050404@perathoner.de> Message-ID: <20050228081316.GB27826@pglaf.org> On Sun, Feb 27, 2005 at 09:57:35PM +0100, Marcello Perathoner wrote: > David A. Desrosiers wrote: > > > I'm mirroring the archive, not the website. 
Something changed > >recently, and all of the directories have been moved to a completely > >new layout, duplicating the tree in a secondary location inside the > >same parent root. Its doubled the amount of space the archive > >consumes, which is why I was concerned. > > I cannot understand that. The file archive was moved a while ago to the > new fileserver but mounted on the same directory. > > What commandline are you using to rsync the archive? I'm just confirming that my mirrors don't seem to show any duplication (total size is ~143.6GB). For sample command lines and mirroring methods, see: http://gutenberg.org/howto/mirror-howto -- Greg From widger at cecomet.net Wed Feb 23 08:24:33 2005 From: widger at cecomet.net (David Widger) Date: Mon Feb 28 00:19:54 2005 Subject: [gutvol-d] Pepys' birthday In-Reply-To: <005901c519bf$420ca4e0$ac9495ce@gw98> References: <005901c519bf$420ca4e0$ac9495ce@gw98> Message-ID: <6.0.1.1.2.20050223112214.027c8c48@mail.adelphia.net> At 10:49 AM 2/23/2005, N Wolcott wrote: >Being his birthday maybe this is appropriate. Pepys gave his memoirs to >Cambridge University, but the full text was not published until 1970. That >being the case would not the text (minus editorial comment and added >footnotes) be now public domain as it is now more than 75 years since the >author's death? > > >N Wolcott nwolcott2@post.harvard.edu >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d Here is the PG Pepys. 
David

Samuel Pepys Unabridged Diary

Entire Gutenberg Edition of The Diary of Samuel Pepys (6.6 mb)
Quotes & Images

1660  Intro  Jan  Feb  Mar/Apr  May  Jun/Jul  Aug/Sep  Oct/Nov/Dec
1661  Jan/Feb/Mar  Apr/May/Jun  Jul/Aug  Sep/Oct  Nov/Dec
1662  Jan/Feb  Mar/Apr  May/Jun  Jul/Aug  Sep/Oct  Nov/Dec
1663  Jan/Feb  Mar/Apr  May/Jun  Jul/Aug  Sep/Oct  Nov/Dec
1664  Jan/Feb  Mar  Apr/May  Jun/Jul  Aug/Sep  Oct/Nov  Dec
1665  Jan/Feb  Mar/Apr  May/Jun  Jul  Aug  Sep  Oct  Nov/Dec
1666  Jan/Feb  Mar/Apr  May/Jun  Jul  Aug/Sep  Oct  Nov  Dec
1667  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
1668  Jan  Feb  Mar  Apr  May  Jun/Jul  Aug  Sep/Oct  Nov  Dec
1669  Jan  Feb/Mar  Apr/May

From webmaster at gutenberg.org Sun Feb 27 12:32:06 2005
From: webmaster at gutenberg.org (Marcello Perathoner)
Date: Mon Feb 28 00:19:56 2005
Subject: [gutvol-d] [Fwd: Carl Ludwig Schleich/Inka Weide]
Message-ID: <42222E46.6040802@gutenberg.org>

Any DPers here that can get this msg to Inka?

-------- Original Message --------
Subject: Carl Ludwig Schleich/Inka Weide
Date: Sun, 27 Feb 2005 17:28:23 +0100
From: jp-com
To: webmaster@gutenberg.org

Hello, please send this mail to Inka Weide.

Hello Inka,

I was delighted to read the essays by Carl Ludwig Schleich on Gutenberg. You can find out more about my great-granduncle at www.carl-ludwig-schleich.de

Regards,
Jürgen Pohl
Eichenhain 13
D-31311 Uetze

-- 
Marcello Perathoner
webmaster@gutenberg.org

From inka at 21torr.com Mon Feb 28 03:11:50 2005
From: inka at 21torr.com (inka@21torr.com)
Date: Mon Feb 28 03:10:41 2005
Subject: [gutvol-d] [Fwd: Carl Ludwig Schleich/Inka Weide]
In-Reply-To: <42222E46.6040802@gutenberg.org>
References: <42222E46.6040802@gutenberg.org>
Message-ID:

On Sun, 27 Feb 2005, Marcello Perathoner wrote:

> Any DPers here that can get this msg to Inka?
>

Hm, yes, I think I may be able to reach me :)

Thanks - the first 'reader feedback' for a book I worked on.

Inka

From shimmin at uiuc.edu Mon Feb 28 06:08:36 2005
From: shimmin at uiuc.edu (Robert Shimmin)
Date: Mon Feb 28 06:08:40 2005
Subject: [gutvol-d] one more thing, for jon noring
In-Reply-To: <19430256015.20050227165713@noring.name>
References: <1ec.35c2cb43.2f53972b@aol.com> <19430256015.20050227165713@noring.name>
Message-ID: <422325E4.1060907@uiuc.edu>

Jon Noring wrote:

> Well, yes, once it's been put in the queue. Anyone here from DP care to
> comment on typical times for a book to be proofed in the DP system? (I
> would have been happy to contribute to the post-processing markup
> stage.)

This depends greatly on which queue it gets put in. English-language novels tend to go quickly. If they qualify for the "easy" queue, they spend little time queuing, and often complete proofreading within a few days. Complex texts on "dry" subjects might spend a few weeks queueing, and then require several weeks to go through the proofreading rounds.

-- RS

From Bowerbird at aol.com Mon Feb 28 07:38:41 2005
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Mon Feb 28 07:38:51 2005
Subject: [gutvol-d] one more thing, for jon noring
Message-ID: <2b.6ddabc74.2f549501@aol.com>

robert said:
> English-language novels tend to go quickly.
> If they qualify for the "easy" queue,
> they spend little time queuing, and
> often complete proofreading within a few days.

when i said "fly through d.p.
in a few hours" i meant literally, not figuratively. (well, i guess the "fly" part was figurative, :+) but the "few hours" part was quite literal, especially since first-time producers get to go to the head of the queue. go ahead, time it.) -bowerbird p.s. a text that is already this clean would most definitely be put into the "easy" queue. moreover, re-doing existing e-texts might be a very good test of the formatting rounds that are now being contemplated for the d.p. future. (of course, those are a complete waste of time, in my humble opinion, but they _are_ the plan...) p.p.s. robert, didn't you make a d.p. forum post that also listed a bunch of plain-text formats recently? or was that dazb? or someone else? From jon at noring.name Mon Feb 28 12:33:30 2005 From: jon at noring.name (Jon Noring) Date: Mon Feb 28 12:34:00 2005 Subject: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.) Message-ID: <15820479578.20050228133330@noring.name> [I'm forwarding the following message by "NetWorker" posted to The eBook Community. Any followup to the specific points NetWorker raises are maybe better posted over there, especially if you'd like NetWorker to see your comments ("ebook-community" at YahooGroups). NetWorker is very thorough... Jon] NetWorker wrote [a few days prior]: > Project Gutenberg e-texts may yet have a role to play in the > production of high-quality e-books. Rising to the bait, I have placed > a hold on my local library's one(!) copy of Frankenstein, which I > will scan when it arrives. I will try to create a highly structured > e-text (certainly not as fine as what Jon did with "My Antonia"). I > will then try to find a way to preprocess both the OCR'ed text and > the Project Gutenberg e-text in such a way that the two files can be > meaningfully "diffed." Hopefully, I can come up with a method that > will allow existing PG e-texts to be an automated "proofread" of next > generation Public Domain e-books. 
Boy, _that_ was an interesting experience! The project goals: As an e-book consumer, I want an e-book that contains _lots_ of metadata; the more the better. I want the metadata to be patterned, so that I can use automated tools to manage a collection; sorting by author, genre, publication date, publisher, contributors such as editors and illustrators, etc. I also want the actual text to be marked up in such a way that 1) I can view the text with all the presentational richness traditionally associated with a paper book, if I choose to do so, 2) I can convert unambiguously from one markup language to another, and 3) I can do a structural analysis of the book using automated tools. I also want a mechanism to know if apparent errors in the text are due to transcription errors or the author's intent -- this can be accomplished by including source information in the metadata, or providing access to page scans; both would be preferable. Project Gutenberg e-texts satisfy none of these wishes; to create an e-book which _does_ satisfy them pretty much requires starting from scratch. Scanning technologies are quite advanced these days, but OCR is still not 100% accurate, and automated spell checking can only go so far. Clearly the most time-consuming -- and most error-prone -- part of producing a reasonably accurate e-book is proof-reading by a human being. My goal was to discover if Project Gutenberg e-texts, which are presumably fairly accurate as to the _words_, if nothing else, could be used as yet another automated preprocessing step to reduce typographical errors to a minimum before the actual proof-reading begins. The process: To test my theory, I decided to use the novel _Frankenstein_, by Mary Wollstonecraft Shelley. _Frankenstein_ is clearly in the public domain, is known to have at least two versions, and has been the subject of a fair amount of discussion on this list in the recent past.
I obtained a copy of Frankenstein from the public library; it was published in the "Barnes & Noble Classics" series in 2000. I was fairly pleased with the edition, as it was printed in a rather old-seeming type-face which gave the appearance that it was in fact a photo reproduction of a much older text; it seemed likely that it had not gone through much in the way of re-editing to modern conventions. I scanned and OCR'ed the book using ABBYY FineReader. I then did a spell-check of the book from within FineReader so I could compare "misspelled" words to the actual scanned image. I then saved the text as an HTML file. In the past I have written a couple of programs to help in the creation of e-books. TidyeBook is based on the HTML Tidy code base. It fixes some of the inaccurate HTML produced by ABBYY, strips headers and footers but leaves page numbers intact, if invisible, and merges broken paragraphs when it can do so without question. html2txt, based on an earlier C++ version of HTML Tidy, takes an HTML document and reduces it to simple text similar to that used by Project Gutenberg. Next, I ran "frankenstein.html" through TidyeBook to clean up the HTML. I then hand-edited the HTML to fix paragraph breaks not fixed in the automated process, or which should not have been broken. I also fixed those instances where hyphenated words spanned a page break (very easy to do given the output of TidyeBook). I then generated an Impoverished Text Format version of the HTML text using html2txt. My strategy was to use the GNU "diff" program to detect differences between the simplified version of my work product and the Project Gutenberg version. Because "diff" is line-oriented, I needed to normalize the two texts so there was a greater likelihood that lines would be correctly matched.
I did this by writing yet another program (this could probably have been done more efficiently by a Perl or AWK script, but I am not very familiar with scripting languages and am a highly proficient C/C++ programmer, so it was easiest for me to use the tools at my disposal). The new program would reduce each file to lines of no more than 60 characters (the shorter the line, the easier for a human to find the difference detected). Additionally, the program would start a new line whenever it encountered what is conventionally accepted as sentence-ending punctuation (!.?) or two newline characters in a row, which would signal the beginning of a new paragraph. All whitespace was reduced to a space character, including multiple whitespace characters. I used the new program to normalize the text produced by html2txt and that of frank14.txt from Project Gutenberg. I then compared the two resultant files using GNU diff and Microsoft's WinDiff. The results: I was quite surprised to find literally thousands of differences between the two texts. Most of the differences were changes in punctuation and capitalization. Many em-dashes were converted to semicolons or omitted altogether, and many semicolons were converted to commas. Some words capitalized in my scan (e.g. Paradise) were converted to lower case (paradise). Some phrases were "fixed" ("our uncle Thomas's book" became "our Uncle Thomas' book"; "an European" became "a European"). Some words were Americanized ("tranquillise" became "tranquillize") yet other words were not ("favourite" remained "favourite"). In an attempt to discover the source of these differences, I visited a number of not-so-local libraries, and checked out a number of different printings of _Frankenstein_. Two of the most interesting are Leonard Wolf's _The Annotated Frankenstein_, Clarkson N.
Potter, 1977, which claims that "In order to ensure the authenticity of the text, we arranged with the Library of Congress in Washington, D.C., to microfilm a copy of the first edition. That text has been reproduced in this volume by the photo-offset process," and the Penguin Classics edition which includes an appendix identifying the differences between the 1818 and 1831 editions (while significant, they are neither as pervasive nor as substantive as has been earlier suggested). Neither of these editions contained the punctuation or spelling changes of the Project Gutenberg edition. One of the books I checked out, rather serendipitously as it turns out, was the Bantam Classic edition, which was first published in 1981. Of all the editions I consulted, only the Bantam edition contains virtually all of the changes I noted in the Project Gutenberg e-text. The PG edition is apparently based on Mary Shelley's revised 1831 edition (although it has lost both the "Author's Introduction to the Standard Novels Edition (1831)" by Mary Shelley, and the "Preface" to the 1818 edition by Percy Shelley). I thus believe that the PG edition is based on the Bantam Classic edition of 1981. Interestingly, copyright law provides protection to changed versions of public domain texts if those changes are of a nature that they are more than mechanical and provide some modicum of creativity. Clearly, the punctuation changes are not merely mechanical, and in some cases actually change subtly the nuances of the text. Ironically, of all the textual bases that Project Gutenberg could have used for its e-text of _Frankenstein_, it chose the one which is apparently still protected by copyright! I modified my text normalization program to discard all punctuation except hyphens and underscores (and of course excluding the sentence-ending punctuation mentioned earlier).
This reduced the noise-to-signal ratio enough that the differences started to become meaningful, although it still resulted in at least 500 differences. It allowed me to discover a handful of OCR errors that had been missed by the earlier automated methods, and I have so far also found a handful of errors in the PG text ("But must finish." should be "But I must finish.", "every sight ... seem still to" should be "every sight ... seems still to", "destroy radiant innocence" should be "destroy such radiant innocence", etc.) Conclusions: Of course, the goal of this exercise was not to establish the provenance of the Project Gutenberg e-text of _Frankenstein_, nor to discover if there are any errors in the PG e-text, but to determine if there was an automated method of reducing errors in newly scanned e-books for which a Project Gutenberg e-text already exists. I'm afraid the jury is still out on this question. If the texts are as different as the PG edition of _Frankenstein_ and virtually all other editions, the process of sorting through the chaff to find the grain of wheat may not be worthwhile; I believe that the OCR errors discovered so far were blatant enough that they would have been easily discovered in the first proof-reading, and I believe that human proof-reading will always be required no matter how good our automated tools become. Some time ago I produced an HTML e-book version of Mark Twain's _Pudd'n'head Wilson_. I believe I will put that e-book through that same process. I will then attempt a new scan of some other, perhaps more obscure, PD work that already has a PG version. Having at least three data points, I will report again later. p.s. -- If someone from Project Gutenberg wants my diff file to update the PG e-text, I will be happy to e-mail it to you; it is approximately 95k in size.
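The normalization and comparison steps NetWorker describes can be sketched roughly as follows. This is not NetWorker's C/C++ program, which was not posted; it is a minimal Python illustration assuming the rules stated in the message (lines of at most 60 characters, a new line after sentence-ending punctuation or at a paragraph break, whitespace collapsed to single spaces, and optionally all punctuation dropped except hyphens, underscores and sentence enders). The function names are hypothetical, and breaking only where a sentence ender is followed by a space is an added assumption.

```python
import difflib
import re
import textwrap

MAX_WIDTH = 60  # line-length limit mentioned in the message


def normalize(text, strip_punctuation=False):
    """Reduce text to short, sentence-aligned lines that a line diff can match."""
    lines = []
    # Two or more newlines in a row mark the start of a new paragraph.
    for paragraph in re.split(r"\n\s*\n", text):
        # Collapse every run of whitespace to a single space.
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        # Assumption: break only where . ! or ? is followed by a space.
        for sentence in re.split(r"(?<=[.!?]) ", paragraph):
            if strip_punctuation:
                # Keep letters, digits, underscores, spaces, hyphens and . ! ?
                sentence = re.sub(r"[^\w\s.!?-]", "", sentence)
                sentence = " ".join(sentence.split())
            # Wrap anything still longer than the limit.
            lines.extend(textwrap.wrap(sentence, width=MAX_WIDTH))
    return lines


def diff_texts(scan_text, pg_text, strip_punctuation=False):
    """Return a unified diff of the two normalized texts, without context lines."""
    return list(
        difflib.unified_diff(
            normalize(scan_text, strip_punctuation),
            normalize(pg_text, strip_punctuation),
            fromfile="scan",
            tofile="pg",
            lineterm="",
            n=0,
        )
    )
```

Calling `diff_texts(scan, pg, strip_punctuation=True)` corresponds to the second pass described above, where punctuation-only differences are suppressed so that word-level differences stand out.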
From Bowerbird at aol.com Mon Feb 28 13:30:00 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Feb 28 13:30:20 2005 Subject: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.) Message-ID: <1d7.37a74a84.2f54e758@aol.com> jon said networker said: > Project Gutenberg e-texts satisfy none of these wishes well, i guess networker will have to start his own project, eh? give him my best wishes! :+) > Conclusions > Of course, the goal of this exercise was not to establish the > provenance of the Project Gutenberg e-text of _Frankenstein_, maybe not. but having done so, it is _refreshing_ to know that -- when that's factored in -- only "a handful" of errors surface. so once again, in spite of some very big noises, it ends up that this fails to stand as a good example of an error-ridden e-text. > nor to discover if there are any errors in the PG e-text, but > to determine if there was an automated method of reducing errors > in newly scanned e-books for which a Project Gutenberg e-text > already exists. I'm afraid the jury is still out on this question. as for this "conclusion", the jury may still be out in _his_ mind, but in mine, the answer is very clear, and i've said it before here: if you do the scanning properly, manipulate those scans correctly, use abbyy in the best way, and subject its results to the right tools, you will reduce the errors in your text to a relatively small number. (the number we've been kickin' around is 1 error for every 10 pages, and at that point, proofreading by the public becomes very viable.) if you then have the rare luxury of evaluating your output against an existing version of the book -- like a project gutenberg e-text -- with the right tool (which networker obviously does not yet have), the comparison between the two, alongside the page-images, should make the process of coming to an error-free version simply a breeze. 
since this is _exactly_ what will need to be done _increasingly_, as the page-images from the internet archive and (we hope) google -- plus the work done by individual people scanning everywhere -- emerge into cyberspace, that's where my tool-development efforts are now being focused. i suggest networker start reading my blog; it should start being updated on a daily basis starting next week... -bowerbird From Bowerbird at aol.com Mon Feb 28 13:46:49 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Feb 28 13:47:01 2005 Subject: [gutvol-d] ok, this is my last post for the time being, really Message-ID: <129.57a41ac4.2f54eb49@aol.com> jon said: > Bowerbird, I'll be happy to put up > a ZML regularized text version of My Antonia. if you prepare your browser-based versions carefully, copying text from the browser-window will give z.m.l., so there's no real reason to create a separate version... (of course, if you don't do the correct preparation...) for me, "round-tripping" is one of the biggest priorities. and by that, i mean that when a z.m.l. file is presented, the end-user should be able to copy text out _as_ z.m.l., so that -- with just a few global search-and-replaces -- when reloaded into a z.m.l.-viewer, it will look the same. (even if its text-styling is stripped away, as can happen when it's saved as a plain .txt file, it should be restored. automatically.) i have attained this in the z.m.l. viewer-program already. (this was pretty simple, as i control all the operations.) i've also attained it in my .pdf version, where i'm able to work around the limitations of acrobat's copy operation -- if you've ever copied text from a .pdf, you realize how awful it mangles the formatting -- by controlling what my viewer-program writes to the .pdf in the first place. (to answer the first question of a knowledgeable person, i write a dummy-line as a separator between paragraphs, so a global replace restores the blank line between them.) 
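The dummy-line trick mentioned above can be illustrated with a short sketch. The actual separator bowerbird's viewer writes is not given, so the marker string and function names below are hypothetical; the idea is only that a visible marker line survives a copy operation that drops blank lines, and one global replace turns it back into a blank line.

```python
MARKER = "~~~"  # hypothetical dummy line; the real separator is not stated


def add_markers(text):
    """Replace the blank line between paragraphs with a dummy marker line."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return f"\n{MARKER}\n".join(paragraphs)


def restore_blank_lines(copied):
    """Undo add_markers with a single global replace."""
    return copied.replace(f"\n{MARKER}\n", "\n\n")
```

The marker has to be a string that cannot occur as a line of the book itself, otherwise the replace would split a paragraph in the wrong place.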
when i get around to making my zml-to-html converter, i will try to make sure that the .html that's created will copy out of the browser-window correctly too. however, browsers do some funky crap in their copy operations, so it might not be possible to preserve _everything_, at least until the browser-programmers tighten up their act there. when i do that work, i'll share any tips people need to know in order to prepare an .html version to copy out good .zml... but in many cases, even now, a copy out of a browser-window can produce text that is .zml, or can be easily converted to it. for instance, jon, your website that gives your listserve rules creates a nice .zml file. consistent formatting yields good .zml. *** jon said: > Yes, if this is the case, it is mysterious > since IA will gladly host them > once the etext version is out the door. sometimes i wonder if the internet archive is quite as accommodating as you always seem to make them out to be. i don't know otherwise, but it seemed if they were, then d.p. would've put their scans up long ago. (unlike their site, page-scans wouldn't need quick response-time.) pourlean said: > An accessible archive of posted > projects & images is in the works. i look forward to the day it comes online! (if there is any particular stumbling block, do please let me know, as maybe i can help.) *** jon said: > It's on me. In private email send me your address and > I'll burn and mail you a disk of the 600 dpi and 120 dpi scans hey, thanks for the gift, jon, i appreciate it! but i can't use 30 megs worth of scans in my little project; it's just a demo. so i wrote a quick program to grab a few dozen from the site. and now that i've done that, i can grab 'em all, if i ever need; since i was just looking for a way to do it in one fell swoop, i shoulda just done that straightaway, instead of bugging you. but maybe someone else will now be able to make good use of the zipped package of the scans that you added to your site... 
-bowerbird p.s. now if you'll all excuse me, i really need to go back to work... :+) From jon at noring.name Mon Feb 28 14:55:13 2005 From: jon at noring.name (Jon Noring) Date: Mon Feb 28 14:55:25 2005 Subject: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.) In-Reply-To: <1d7.37a74a84.2f54e758@aol.com> References: <1d7.37a74a84.2f54e758@aol.com> Message-ID: <11628983187.20050228155513@noring.name> Bowerbird wrote: > jon said networker said: > maybe not. but having done so, it is _refreshing_ to know that > -- when that's factored in -- only "a handful" of errors surface. But the bigger issue is not constrained to errors (differences) with respect to the source text used, as you continue to focus on. The issues deal with the larger areas of trust, verifiability, proper digital preservation of the Public Domain, and using acceptable sources (with proper documentation), not just blindly grabbing anything off the shelf as appears to have happened with the PG version of Frankenstein, which now exposes PG to legal liability. The lack of proper processes, procedures and guidelines to build the non-DP portion of the PG library (which comprises about half the collection and is heavily skewed towards the more classic works) is leading to serious questions about the integrity and trustworthiness of the whole PG library (I've discussed this at length the last couple of weeks on The eBook Community.) It can certainly be fixed, but the fix will require: 1) redoing most of the non-DP works using DP, 2) Proper selection of sources so they are acceptable, both legally and from those knowledgeable as to the better sources to use, and 3) Proper documentation as to source, including making available all the original page scans (and not just the title page, which proves nothing.)
(Btw, NetWorker presented evidence in his message to indicate PG's version of Frankenstein was taken from a copyrighted edition, which itself had a significant number of emendations from the original, which in essence act like a "fingerprint" as to the pedigree. This is NOT good. It casts PG's archive in a negative light, and may even lead to a legal demand by Bantam for PG to remove the current version of Frankenstein. It also calls into question the provenance of a large number of other pre-DP texts where there's no source metadata given and no page scans to prove proper provenance. NetWorker himself is a former attorney, and he has thoroughly researched copyright law the last couple of years as it relates to ebooks, so Michael and Greg should seriously sit up and take notice of the problem with PG's version of Frankenstein, and many of its other texts where an acceptable source cannot be demonstrated. Even if PG is "right" in a legal sense, that it could use the 1981 Bantam Classics edition as it *might* have done, does it want to even fight this in court, or to try to explain it away to the trusting public?) > so once again, in spite of some very big noises, it ends up that > this fails to stand as a good example of an error-ridden e-text. Well, at least you seem to indicate from your interest in very low error rate OCR that every etext PG includes in its archive should be a textually faithful reproduction of some known source. That is, if any later emendations are made, they should be properly documented. Otherwise, leave the text as it is in the print source. Is this your thinking, or do you believe that textual faithfulness and proper source identification and verification are not necessary at all? That is, just let people take any text in the PG library and then "edit it" as they see fit?
> if you do the scanning properly, manipulate those scans correctly, > use abbyy in the best way, and subject its results to the right tools, > you will reduce the errors in your text to a relatively small number. I don't believe anyone disagrees with you here in general. But NetWorker was not only interested in OCR errors, but the bigger issues as mentioned above -- they are all interlinked. > (the number we've been kickin' around is 1 error for every 10 pages, > and at that point, proofreading by the public becomes very viable.) I doubt this error rate (let's say for even half of the public domain printings out there) is accomplishable without sentient-level AI. But if proofreading is to be done anyway by the public, as is *now done* by DP, what difference is there between an OCR error of one every 10 pages, and one every page? The key is that for the aspect of building *trust* in the final product, it is a very good idea to involve the volunteer proofreaders to go over the texts, even if *you don't have to*. Having (and proving to anyone who asks) at least two independent people who proofed every page, adds to its trustworthiness. Include source metadata, and access to the original page scans used as the source, and the highest level of trust is built (as well as greater immunity to legal challenge.) That's what makes DP's system so powerful. But look at PG's edition of Frankenstein: 1) Which original edition it represents is not documented (Mary Shelley issued two substantially different editions). I think the reader should know which one it is in the PG cataloging information. This lack of care about different editions is troubling. 2) The source document is not given at all. I'm not sure if the person who did the first etext version is even recorded anywhere (or even known.) (Btw, this person, should Bantam press the issue, which I hope they don't, would probably become a co-defendant.
This shows that the lack of proper guidelines, processes and verification methods in the building of the non-DP portion of PG's collection exposes the volunteer donors of texts to potential legal liability! This is another demonstration that if a project is to do something, it needs to *do it right* from the start, and not just do the "ready-fire-aim" approach to everything.) 3) It is unknown what subsequent "edits" were done along the way -- they are not documented, as far as I know. (How do we know that whole paragraphs were removed or inserted?) 4) It now appears, but is not proven, that the source document was the 1981 Bantam Classics edition. This certainly does not give one warm fuzzies as to the trustworthiness of the non-DP portion of the PG collection. As a user of PG texts, it is important, for both moral, legal and aesthetic reasons, that the texts are: 1) textually faithful reproductions of *known* sources, 2) provable as such (include access to the full page scans, and not just the title page), and 3) the sources of which are themselves acceptable to use, both legally and from those knowledgeable (both professional and amateur) with the Work in question. (For Works which were only published once and never republished by anyone, this last point does not apply provided the source is itself Public Domain.) > if you then have the rare luxury of evaluating your output against > an existing version of the book -- like a project gutenberg e-text -- > with the right tool (which networker obviously does not yet have), > the comparison between the two, alongside the page-images, should > make the process of coming to an error-free version simply a breeze. There will always be hand work necessary to compare two different etexts of the same Work (note that oftentimes there are multiple editions of multiple versions: The Work/Expression/Manifestation (WEM) principle.) 
Even the issue of hyphenation of compound words requires a human being to ascertain what the author intended. Of course, if this is not important to you, then what can I say? >since this is _exactly_ what will need to be done _increasingly_, >as the page-images from the internet archive and (we hope) google >-- plus the work done by individual people scanning everywhere -- >emerge into cyberspace, that's where my tool-development efforts >are now being focused. i suggest networker start reading my blog; >it should start being updated on a daily basis starting next week... Tools such as yours will likely work for some types of texts, and not work for others, where there'll be a need for human beings to not only proof for errors, but to properly structure the document. I'm now assessing the digitizing of records of historical and genealogical significance, and these documents usually have quite complex table layouts, very poor quality printing (and oftentimes handwriting). Scans of these records are insufficient for use, so having human beings read them and transcribe the information into properly structured etext form is necessary. I'll post an announcement to TeBC of your blog if you'd like me to (although I don't know the address of your blog -- had it and then lost it.) Jon From Bowerbird at aol.com Mon Feb 28 16:53:21 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Feb 28 16:53:42 2005 Subject: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.) Message-ID: <80.22755cdc.2f551701@aol.com> jon said: > But the bigger issue is not constrained to errors (differences) > with respect to the source text used, as you continue to focus on. i think it was you who made "errors" the issue, revolving around the concept of "trustworthiness". 
if, once that house of cards falls down, you want to turn the issue to one of "which source-text to use", well then i think that michael's "i'm open to all of 'em" stance covers _that_ quite nicely, thank you very much. if you don't like the version of my antonia that's in the library now, add your own! the same goes for all the versions of "frankenstein". casting aspersions on the edition that _is_ there isn't constructive. provide all the meta-data you want on the version that you furnish; heck, you can even put a pointer in to your project at librarycity.org; these days i see a lot of e-texts referencing an .rtf version in france. > the PG version of Frankenstein, > which now exposes PG to legal liability. i don't agree. but if the lawyers to whom "bantam classics" is paying good money decide to send a cease-and-desist, let 'em. going by results obtained by the "gone with the wind" lawyers, the project gutenberg people will probably fold very quickly; without any money, you can't play poker against deep pockets. but hey, i would like to hear the laughter that would resound when bantam's lawyers argued that the way they can _prove_ that this e-text copied their book is because of the _errors_ (map-makers can pull that trick. but book-publishers? ha!) who knows, jon, maybe the project gutenberg lawyers will call _you_ to the stand, to throw your arms in the air and rant about how those terrible mistakes are ruining the fragile public domain, and therefore bantam doesn't _deserve_ the protection of the law. wouldn't that be ironic? :+) > The lack of proper processes, procedures and guidelines well, i don't agree with that either, jon. you might not agree with the procedures, but that doesn't mean there is a "lack" of them. maybe you don't agree with their choice of source-text for frankenstein. but it _was_ good enough for bantam. > is leading to serious questions about the integrity > and trustworthiness of the whole PG library not in my mind. 
and not in the minds of most people, i don't think. not any more so than with any paper-book i might find in a store. like the "frankenstein" version that was being _sold_ by bantam. > 1) redoing most of the non-DP works using DP, let's find out how many d.p. people want me to go over _their_ work with a fine-tooth comb. go ahead, speak up, i'd _love_ the challenge. > Well, at least you seem to indicate from > your interest in very low error rate OCR > that every etext PG includes in its archive > should be a textually faithful reproduction > of some known source. not necessarily. if someone wants to play editor and combine editions, i don't have any problem with that. in some sense, that's what the public domain is about. i don't see it in black/white terms as something frozen. if you _are_ going to represent something as faithful, i think it should _be_ faithful. but even then, that is _to_the_best_of_your_ability_. as long as you do that, and give your end-users a means of "checking your work", including a solid mechanism for improving it to perfection, then i think you've done your job. so yes, i agree with you, that scans should absolutely be furnished to the end-users, for works that purport to replicate that edition, certainly... however, i understand why they haven't been, up to this point, and so do you -- disk-space just hasn't been affordable enough, even now, if it were not for the largess of ibiblio and brewster, we couldn't even be entertaining the thought of posting the scans. > I doubt this error rate (let's say for even half of the public domain > printings out there) is accomplishable without sentient-level AI. i'm trying to get back off this listserve. i don't like contributing to the discourse in a place where my voice has been muffled before. so let me set up a place where you and i can fight... i mean, discuss... but this doubt of yours is rather easy to dispel, and quickly. you did a pretty good job of scanning that copy of "my antonia". 
and it looks like you processed (e.g., straightened) the scans well. so now we need to put them through o.c.r., using abbyy finereader; please have that done as follows: save results out to an .rtf file, one for each page; retaining line-breaks and paragraph indentation. do this for 20-50 pages, and zip the output up and e-mail it to me. i will reply to you with feedback on if the o.c.r. was done correctly. then i'll run it through programs that will soon be made available, at no cost, and we'll see what kind of an error-rate we end up with. or, if you prefer, follow this same procedure with some other book. then, if you still want to discuss this matter, we'll do it elsewhere. > But if proofreading is to be done anyway by the public, > as is *now done* by DP, what difference is there between > an OCR error of one every 10 pages, and one every page? when i talk about "the public", i mean _end-users_ who are reading the book for the purpose of reading the book, and _not_ specifically to be "proofreading" it per se. for that type of reader, one error on every page is too many, but one error on every tenth page is not. especially since -- if we give them an easy means of checking for errors and reporting them, and then reward readers for finding them -- errors won't persist for very long, and the e-text will instead progress very quickly on its merry way to a state of perfection. in a practical sense, this means that before you turn an e-text loose for download in an all-in-one file, you make it available _page-by-page_ on the web. anyone who might want to read it has to do so in that form. right alongside the text for each page is the image, so the person can easily check any possible errors. you let 'em know you are asking for their help to find mistakes. if they find one, they fill out a form right on the page, and their input is recorded -- wiki-style -- immediately. later readers can either confirm the error, or question it, or make comments. 
first person to find each error gets a credit in the final e-text. you also give people a viewer-program that allows them to download the appropriate page-image if they suspect an error -- displaying it right there in the viewer-app next to the text -- and which simplifies the process of reporting it if they find one. (by, for instance, filling out an e-mail they can send with a click.) > The key is that for the aspect of building *trust* in > the final product, it is a very good idea to involve > the volunteer proofreaders to go over the texts, > even if *you don't have to*. what i just described does a good job of doing that. this is the system of "continuous proofreading" i outlined on this listserve a very long time ago. you recently mistakenly credited it to james linden. my offer to develop this system was largely snubbed. for _that_, the project gutenberg "people in charge" rightly deserve to be criticized. for the tiny stuff that you have been complaining about, they do not... > Having (and proving to anyone who asks) at least > two independent people who proofed every page, > adds to its trustworthiness. not nearly as well as putting text and image side-by-side, and allowing any number of "volunteer proofreaders" to examine 'em. you might be surprised by the number of errors that "slip by" the proofreaders through two rounds of eyeballing over at d.p. (indeed, many even slip by the "third round" of post-processing and whitewashing, and sit there big and ugly in the final e-text.) even if a dozen people look at a page, an error might _still_ be there. but with eternal transparency, there is always hope it will be fixed. anyway, jon, i hope you take up the friendly challenge i issued here. and if any d.p. people want to call me on the challenge i made to them, you just let me know. in the meantime, i'll let you get in the last word on this thread, jon, because i _really_ need to be going. use it wisely... ;+) -bowerbird
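The "continuous proofreading" bookkeeping described in this message (readers report a suspected error on a page, later readers confirm it, and the first person to find each error is credited) could be modelled along the following lines. No such program is shown in this thread, so the class and field names are hypothetical and the sketch covers only the record-keeping, not the web form, the page images or the viewer.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorReport:
    page: int
    found: str        # text as it appears in the e-text
    suggested: str    # what the page image shows
    reporter: str     # first person to report this error
    confirmations: list = field(default_factory=list)


class PageLedger:
    """Records error reports per page; the first reporter keeps the credit."""

    def __init__(self):
        self._reports = {}

    def report(self, page, found, suggested, reporter):
        key = (page, found, suggested)
        existing = self._reports.get(key)
        if existing is None:
            # First report of this error: record it immediately.
            existing = ErrorReport(page, found, suggested, reporter)
            self._reports[key] = existing
        elif reporter != existing.reporter and reporter not in existing.confirmations:
            # A later reader reporting the same error counts as a confirmation.
            existing.confirmations.append(reporter)
        return existing

    def credits(self):
        """Names to credit in the final e-text, one per first finder."""
        return sorted({r.reporter for r in self._reports.values()})
```

A fuller version would also need a way for later readers to question a report or attach comments, as the message describes; that is left out here to keep the sketch short.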