From maitriv at yahoo.com Tue Mar 1 11:28:15 2005 From: maitriv at yahoo.com (maitri venkat-ramani) Date: Tue Mar 1 11:28:25 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: Message-ID: <20050301192815.88274.qmail@web52310.mail.yahoo.com> Technological progress reaching end users in developing countries makes me so happy! They bear a lot of the brunt for our wellbeing. Is there any way we can get PG books to this school and others like it? Do we have any African contacts? Thanks, Maitri ============================================================ Kenyan school turns to handhelds By Julian Siddle BBC Go Digital At the Mbita Point primary school in western Kenya students click away at a handheld computer with a stylus. They are doing exercises in their school textbooks which have been digitised. It is a pilot project run by EduVision, which is looking at ways to use low cost computer systems to get up-to-date information to students who are currently stuck with ancient textbooks. Matthew Herren from EduVision told the BBC programme Go Digital how the non-governmental organisation uses a combination of satellite radio and handheld computers called E-slates. "The E-slates connect via a wireless connection to a base station in the school. This in turn is connected to a satellite radio receiver. The data is transmitted alongside audio signals." The base station processes the information from the satellite transmission and turns it into a form that can be read by the handheld E-slates. "It downloads from the satellite and every day processes the stream, sorts through content for the material destined for the users connected to it. It also stores this on its hard disc." Linux link The system is cheaper than installing and maintaining an internet connection and conventional computer network. But Mr Herren says there are both pros and cons to the project. "It's very simple to set up, just a satellite antenna on the roof of the school, but it's also a one-way connection, so getting feedback or specific requests from end users is difficult." The project is still at the pilot stage and EduVision staff are on the ground to attend to teething problems with the Linux-based system. "The content is divided into visual information, textual information and questions. Users can scroll through these sections independently of each other." EduVision is planning to include audio and video files as the system develops and add more content. Mr Herren says this would vastly increase the opportunities available to the students. He is currently in negotiations to take advantage of a project being organised by search site Google to digitise some of the world's largest university libraries. "All books in the public domain, something like 15 million, could be put on the base stations as we manufacture them. Then every rural school in Africa would have access to the same libraries as the students in Oxford and Harvard" Currently the project is operating in an area where there is mains electricity. But Mr Herren says EduVision already has plans to extend it to more remote regions. "We plan to put a solar panel at the school with the base station, have the E-slates charge during the day when the children are in school, then they can take them home at night and continue working." Maciej Sundra, who designed the user interface for the E-slates, says the project's ultimate goal is levelling access to knowledge around the world. 
"Why in this age when most people do most research using the internet are students still using textbooks? The fact that we are doing this in a rural developing country is very exciting - as they need it most." Story from BBC NEWS: http://news.bbc.co.uk/go/pr/fr/-/2/hi/technology/4304375.stm Published: 2005/02/28 11:47:23 GMT __________________________________ Do you Yahoo!? Yahoo! Sports - Sign up for Fantasy Baseball. http://baseball.fantasysports.yahoo.com/ From brandon at corruptedtruth.com Tue Mar 1 14:21:55 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Tue Mar 1 14:22:06 2005 Subject: [gutvol-d] [Fwd: [Public Knowledge] Trouble Locating Copyright Owners? Tell the Copyright Office Your Story] Message-ID: <4224EB03.5080903@corruptedtruth.com> I thought this would be of interest to list members =) -brandon -------------- next part -------------- An embedded message was scrubbed... From: publicknowledge-admin@publicknowledge.org Subject: [Public Knowledge] Trouble Locating Copyright Owners? Tell the Copyright Office Your Story Date: Tue, 01 Mar 2005 17:14:49 -0500 Size: 4497 Url: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050301/2d65269d/PublicKnowledgeTroubleLocatingCopyrightOwnersTelltheCopyrightOfficeYourStory.mht From brandon at corruptedtruth.com Tue Mar 1 21:35:28 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Tue Mar 1 21:35:47 2005 Subject: [gutvol-d] Repost: Public Knowledge Orphaned Works Project Message-ID: <422550A0.5050807@corruptedtruth.com> Sorry for the repost, just noticed the mailing list scrubbed my forward; Are you an artist, author, musician, or filmmaker? Maybe you're a scholar or librarian? If so, have you ever wanted to use a copyrighted work but been unable to locate the owner to clear the rights? It's a problem that happens all too often, and not only does it affect your work, but it also "orphans" the original owner's work. It's an unfortunate side effect of current copyright law that diminishes everyone's ability to create, innovate, and educate. Fortunately, we have good news: The U.S. Copyright Office wants to make it easier to locate copyright holders, and it's asking for the public's help. Before the Copyright Office can *address* the problem, it needs to gather evidence that there *is* a problem. This is where you come in: tell your story to the Copyright Office. Public Knowledge along with a number of other like-minded organizations have created Ophanworks.org: an easy way for you to submit your story to the Copyright Office. Now is your chance to tell the Office what personal difficulties you've had when trying to clear rights. To get started, go to: http://www.orphanworks.org Never tried to clear rights? Maybe you know someone who has. Forward them this message or visit: http://www.orphanworks.org to send them an email. You can always learn more about the problem of "orphan works" and the U.S. Copyright Office's notice, by visiting Public Knowledge's website: http:/www.publicknowledge.org/issues/ow ====================== Public Knowledge collaborated with the EFF to set up orphanworks.org as a resource for everyone to facilitate public participation in copyright policy. If you'd like to support this and future efforts, please make a contribution: http://publicknowledge.org/donate ====================== Thanks for participating! Your friends at Public Knowledge February 28, 2005 ____________________________ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050301/c50c1801/attachment.html From squawker at myrealbox.com Tue Mar 1 21:36:55 2005 From: squawker at myrealbox.com (Doug Adams) Date: Tue Mar 1 21:37:11 2005 Subject: [gutvol-d] quick question re: lack of date and copyright clearance Message-ID: <1109741815.8c22f19csquawker@myrealbox.com> How do i get a work cleared for copyright if it doesn't have a date on the cover page? I am working on a book that I know for a fact is from the nineteenth century. The publisher, however, neglected to include the date. From prosfilaes at gmail.com Tue Mar 1 21:52:48 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 1 21:53:07 2005 Subject: [gutvol-d] quick question re: lack of date and copyright clearance In-Reply-To: <1109741815.8c22f19csquawker@myrealbox.com> References: <1109741815.8c22f19csquawker@myrealbox.com> Message-ID: <6d99d1fd0503012152595273d6@mail.gmail.com> Doug Adams wrote: > How do i get a work cleared for copyright if it doesn't have a date on > the cover page? I am working on a book that I know for a fact is from > the nineteenth century. The publisher, however, neglected to include the date. If you can match it up with an edition in a library catalog (the Library of Congress or the British Library are good for this), it'll probably be clearable. Bonus points if it says it was printed in the US, since then you only have to establish printing pre-1989, not pre-1923 (in most cases.) From kouhia at nic.funet.fi Wed Mar 2 08:42:42 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Wed Mar 2 08:42:55 2005 Subject: [gutvol-d] Re: Enlightened Self Interest Message-ID: Hello. The master format should be the digitized images of the original book pages. No font, nor footnote, nor math, nor any problems in readability, nor in representing the original text. I find the digitized images more pleasant than any ascii, html, word or TeX text. I don't know the reason but perhaps the art of typesetting and printing was better then than it is now! Any other format can be generated from the digitized images. If some conversion between html and TeX (say) does not go well, one can always check against the original typesetting from the images. So, keep archiving the digitized images!! 200 dpi with 32 grey levels starts looking ok but 300 dpi with 256 levels should be enough even for math texts. Forget 1-bit digitizations completely!!! Best regards, Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From jon at noring.name Wed Mar 2 10:00:16 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 10:01:32 2005 Subject: [gutvol-d] Re: Enlightened Self Interest In-Reply-To: References: Message-ID: <2912024375.20050302110016@noring.name> Juhana wrote: > Hello. The master format should be the digitized images of the > original book pages. No font, nor footnote, nor math, nor any > problems in readability, nor in representing the original text. > > I find the digitized images more pleasant than any ascii, html, > word or TeX text. I don't know the reason but perhaps the art of > typesetting and printing was better then than it is now!... > > So, keep archiving the digitized images!! 200 dpi with 32 grey levels > starts looking ok but 300 dpi with 256 levels should be enough > even for math texts. Forget 1-bit digitizations completely!!! 
If the only purpose of scanning books is for OCRing whereupon the scans are either dumped or saved simply for "proving" provenance, then 300 dpi is *usually* sufficient: 8-bit greyscale for black and white, and 24-bit color for color pages. (If some type is very small, such as 5 point and less, then 600 dpi is usually required.) However, in my consultations with experts in the field, and personal experimentation (My Antonia at http://www.openreader.org/myantonia/ ), if the scans are to be used for multiple purposes besides OCR, such as for direct reading and other uses where sharpness is aesthetically important, then it is recommended to scan them at 600 dpi (optical) -- and 1200 dpi (optical) if the print is *very* small. Unfortunately, the resulting scan images become quite large (unless one uses lossy compression, such as DjVu, which is not recommended for the master archiving but alright for end-user delivery.) But if a job is worth doing, it is worth doing right. If there is one area in which DP seems to fall short (let me know if I'm wrong here), it is with respect to page scan resolution and archiving (or lack thereof). It is understandable considering the required disk space and bandwidth requirements (to move the scans around), but IA is a place to donate page scans once proofing is done (maybe this is already being done), and I'm sure others can be found who will gladly set up a terabyte storage box to store DP's 600 dpi page scans -- just post a plea to SlashDot and there'll probably be several volunteers who will step forward with spare terabytes available. Btw, if anyone here has made, or plans to make, 600 dpi (optical) greyscale or color scans of any public domain books including the book covers (and this includes books printed between 1923 and 1963 which may be public domain), I'll gladly accept donations of them on CD-ROM and DVD-ROM. I will also gladly accept the source books themselves, including if they've been chopped. I eventually will build a multi-terabyte hard disk storage system to support various activities including Distributed Scanners. Of course, the scans should be donated to IA as well so they can immediately be made available to the world. Jon Noring From brandon at corruptedtruth.com Wed Mar 2 10:20:21 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Wed Mar 2 10:20:47 2005 Subject: [gutvol-d] Re: Enlightened Self Interest In-Reply-To: <2912024375.20050302110016@noring.name> References: <2912024375.20050302110016@noring.name> Message-ID: <422603E5.6010105@corruptedtruth.com> I for one am both a lurker on here AND a slashdot reader =) How many terabytes do you think we'd need? Putting together a relatively cheap 3/4/5 TB NAS is fairly easy considering the price of 300/400 GB SATA drives has been dropping steadily. This may be something we'd want to talk to iBiblio about though, as they already have the infrastructure in place. No point in re-inventing the wheel. -brandon Jon Noring wrote: >Juhana wrote: > > > >>Hello. The master format should be the digitized images of the >>original book pages. No font, nor footnote, nor math, nor any >>problems in readability, nor in representing the original text. >> >>I find the digitized images more pleasant than any ascii, html, >>word or TeX text. I don't know the reason but perhaps the art of >>typesetting and printing was better then than it is now!... >> >>So, keep archiving the digitized images!! 
200 dpi with 32 grey levels >>starts looking ok but 300 dpi with 256 levels should be enough >>even for math texts. Forget 1-bit digitizations completely!!! >> >> > >If the only purpose of scanning books is for OCRing whereupon the >scans are either dumped or saved simply for "proving" provenance, then >300 dpi is *usually* sufficient: 8-bit greyscale for black and white, >and 24-bit color for color pages. (If some type is very small, such as >5 point and less, then 600 dpi is usually required.) > >However, in my consultations with experts in the field, and personal >experimentation (My Antonia at http://www.openreader.org/myantonia/ ), >if the scans are to be used for multiple purposes besides OCR, such as >for direct reading and other uses where sharpness is aesthetically >important, then it is recommended to scan them at 600 dpi (optical) -- >and 1200 dpi (optical) if the print is *very* small. Unfortunately, >the resulting scan images become quite large (unless one uses lossy >compression, such as DjVu, which is not recommended for the master >archiving but alright for end-user delivery.) But if a job is worth >doing, it is worth doing right. > >If there is one area which DP seems to fall short (let me know if I'm >wrong here) is with respect to page scan resolution and archiving (or >lack thereof). It is understandable considering the required disk >space and bandwidth requirements (to move the scans around), but IA >is a place to donate page scans once proofing is done (maybe this is >already being done), and I'm sure others can be found who will gladly >setup a terabyte storage box to store DP's 600 dpi page scans -- just >post a plea to SlashDot and there'll probably be several volunteers >who will step forward with spare terabytes available. > >Btw, if anyone here has made, and plans to make, 600 dpi (optical) >greyscale or color scans of any public domain books including the book >covers (and this includes books printed between 1923 and 1963 which >may be public domain), I'll gladly accept donations of them on CD-ROM >and DVD-ROM. I will also gladly accept the source books themselves, >including if they've been chopped. I eventually will build a >multi-terabyte hard disk storage system to support various activities >including Distributed Scanners. Of course, the scans should be >donated to IA as well so they can immediately be made available to the >world. > >Jon Noring > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050302/c63161e1/attachment-0001.html From jon at noring.name Wed Mar 2 11:46:00 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 11:46:41 2005 Subject: [gutvol-d] Re: Enlightened Self Interest In-Reply-To: <422603E5.6010105@corruptedtruth.com> References: <2912024375.20050302110016@noring.name> <422603E5.6010105@corruptedtruth.com> Message-ID: <118367750.20050302124600@noring.name> Brandon wrote: > I for one am both a lurker on here AND a slashdot reader =) I how > many terabytes do you think we'd need? Putting together a relatively > cheap 3/4/5 TB NAS is fairly easy considering the price of 300/400 > GB SATA drives has been dropping steadily. This may be something > we'd want to talk to iBiblio about though, as they already have the > infrastructure in place. No point in re-inventing the wheel. 
Since we are talking primarily about pre-1923 public domain books, most of them are black and white, so I'll restrict the analysis to those books. Color substantially adds to disk space requirements. (Also, many of the books published in the 1923-63 time frame, 90% of which are in the public domain, are black and white.) Ideally, we would like to scan the books at 600 dpi (optical), 8-bit greyscale, and store the images in some lossless compressed format (such as PNG). The images should not have gone through any lossy stage to get to this point, such as JPEG, since this adds annoying artifacts to the images. Unfortunately, this results in some pretty large scans. Using the data I have for the "My Antonia" project, a typical 600 dpi (optical) greyscale page saved as PNG occupies about 4.5 megs. So for a typical 300 page book, this works out to about 1.5 gigs per book (rounding up some to cover incidentals.) A terabyte hard disk storage system (optimized for data warehousing, since optimizing for server use increases the hardware cost) would thus hold about 700 books. This is not that many when there are potentially several million public domain books out there (especially if we include the many public domain books in the 1923-1963 range.) What could be done in the next few years, until multi-terabyte hard disk data warehousing systems become dirt cheap, is to backup the lossless greyscale scans onto DVD-ROM (which, granted, is risky), or even press DVDs (requires equipment to do this -- maybe someone will donate access to their DVD presser?) Of course, we should donate copies of the DVDs to IA and to other groups (?iBiblio) and hope they will preserve them, even moving them to hard disk. In the meanwhile, for public access and massive mirroring, we can convert the 600 dpi greyscale to 600 dpi bitonal (2-color black and white -- it is important to manually select the cutoff greyscale value for best quality.) This will save a *lot* of space and will be *minimally* acceptable as archival copies should the original greyscale scans get lost or become unreadable. Using 2-color PNG, a typical page now scrunches down to about 125 Kbytes, or about 40 Mbytes per book (using CCITT lossless compression, which is optimized for bitonal scans of text, it is possible to get the size down to about 60 Kbytes -- but this is an obscure format -- all web browsers will display PNG, but it requires a plugin or a special graphics program to display CCITT TIFFs. There may also be some proprietary problems with CCITT.) This way we can now store about 25,000 books on a terabyte server, which is very doable and will be sufficient for Distributed Scanners (or similar project) for a few years (in the meanwhile, disk space should continue to get cheaper and cheaper to the point we might even begin migrating the biggie-size greyscale scans stored on DVD or other storage medium back to mirrored hard disk servers.) Some of my thinking -- no doubt there's other approaches to consider. Should I start a "Distributed Scanners" discussion group at Yahoo? It seems like there may be enough people interested in this project. 
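As a quick check on the arithmetic above, here is a minimal back-of-the-envelope sketch in Python. The per-page sizes are the rough figures quoted in the message -- about 4.5 MB for a 600 dpi greyscale PNG page and about 125 KB for a 600 dpi bitonal PNG page -- not measurements, and the 10% allowance for covers and incidentals is an assumption.

    # Back-of-the-envelope storage estimates for page-scan archiving.
    # Per-page sizes are the rough figures quoted above, not measurements.
    MB = 1024 ** 2
    TB = 1024 ** 4
    PAGES_PER_BOOK = 300

    def books_per_terabyte(page_bytes, pages=PAGES_PER_BOOK, overhead=1.1):
        """Approximate books per 1 TB, padding 10% for covers and incidentals."""
        book_bytes = page_bytes * pages * overhead
        return TB / book_bytes, book_bytes

    for label, page_bytes in [
        ("600 dpi greyscale PNG, ~4.5 MB/page", 4.5 * MB),
        ("600 dpi bitonal PNG, ~125 KB/page", 125 * 1024),
    ]:
        count, book_bytes = books_per_terabyte(page_bytes)
        print("%s: ~%.0f MB/book, ~%d books/TB" % (label, book_bytes / MB, count))

This reproduces the estimates in the message: roughly 1.5 GB per book and about 700 books per terabyte for the greyscale masters, versus roughly 40 MB per book and about 25,000 books per terabyte for the bitonal derivatives.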
Jon From hart at pglaf.org Wed Mar 2 12:22:07 2005 From: hart at pglaf.org (Michael Hart) Date: Wed Mar 2 12:22:08 2005 Subject: [gutvol-d] quick question re: lack of date and copyright clearance In-Reply-To: <6d99d1fd0503012152595273d6@mail.gmail.com> References: <1109741815.8c22f19csquawker@myrealbox.com> <6d99d1fd0503012152595273d6@mail.gmail.com> Message-ID: If you include the number of pages, physical dimensions, binding type and color, and provide this data to a reference librarian, the odds go way up in identifying the particular edition. mh On Tue, 1 Mar 2005, David Starner wrote: > Doug Adams wrote: >> How do i get a work cleared for copyright if it doesn't have a date on >> the cover page? I am working on a book that I know for a fact is from >> the nineteenth century. The publisher, however, neglected to include the date. > > If you can match it up with an edition in a library catalog (the Library > of Congress or the British Library are good for this), it'll probably > be clearable. Bonus points if it says it was printed in the US, since > then you only have to establish printing pre-1989, not pre-1923 (in > most cases.) > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From squawker at myrealbox.com Wed Mar 2 12:45:02 2005 From: squawker at myrealbox.com (Doug Adams) Date: Wed Mar 2 12:45:11 2005 Subject: [gutvol-d] Re: quick question re: lack of date and copyright Message-ID: <1109796302.333c8dbcsquawker@myrealbox.com> >From: David Starner >If you can match it up with an edition in a library catalog >(the Library of Congress or the British Library are good for >this), it'll probably be clearable. Bonus points if it says it >was printed in the US, since then you only have to >establish printing pre-1989, not pre-1923 (in most cases.) Thanks David! I've found my version in the Library of Congress. The listing says it was published in: Chicago, Belford, Clarke [187-?] So even the LOC doesn't have a date for the book. Now a technical question. How do I submit this to get clearance without the date. Do I need to do it by email to someone. (I've previously used the internet form.) From vze3rknp at verizon.net Wed Mar 2 12:52:17 2005 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Wed Mar 2 12:52:25 2005 Subject: [gutvol-d] Re: quick question re: lack of date and copyright In-Reply-To: <1109796302.333c8dbcsquawker@myrealbox.com> References: <1109796302.333c8dbcsquawker@myrealbox.com> Message-ID: <42262781.7010207@verizon.net> You can submit on the internet form. Just write the word "none" where the date would usually go. You can add a link to the LoC listing in the comments section. Be sure to include scans of both the title page and the verso. JulietS Doug Adams wrote: >>From: David Starner >>If you can match it up with an edition in a library catalog >>(the Library of Congress or the British Library are good for >>this), it'll probably be clearable. Bonus points if it says it >>was printed in the US, since then you only have to >>establish printing pre-1989, not pre-1923 (in most cases.) >> >> > >Thanks David! I've found my version in the Library of Congress. The listing says it was published in: > >Chicago, Belford, Clarke [187-?] > >So even the LOC doesn't have a date for the book. Now a technical question. How do I submit this to get clearance without the date. Do I need to do it by email to someone. (I've previously used the internet form.) 
> >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > From marcello at perathoner.de Wed Mar 2 12:17:08 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 2 12:55:19 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org Message-ID: <42261F44.1000005@perathoner.de> We are ready to migrate the web site to the new fast file server. Also some slight changes were made to the online catalog to make it better cacheable: The dynamic authrec pages have been dropped in favour of the static browse-by-author pages. Browse-by-author now includes all information from the authrec pages. Redirects are in place. The search has been optimized to redirect simple searches (searches for author only, title only) to the appropriate browse-by-author and browse-by-title pages. A preview is online at: www-dev.gutenberg.org Please test and report any oddities. -- Marcello Perathoner webmaster@gutenberg.org From jeroen.mailinglist at bohol.ph Wed Mar 2 13:28:17 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed Mar 2 13:28:02 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <7815066468.20050225202613@noring.name> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> <7815066468.20050225202613@noring.name> Message-ID: <42262FF1.2050504@bohol.ph> I didn't notice this discussion was heading to my favourite subject... TEI. I guess enlightened is on my mental spam filter... Jon Noring wrote: > >For maximum archivability, repurposeability and accessibility, it is >important for the XML markup vocabulary used in the master document to >be wholly structural and semantic. Except where absolutely necessary >(and maybe best solved using SVG and MathML), presentational markup >should be avoided. > > > Since we are reproducing printed works, it is often not possible to reconstruct the intended semantics of the author. This is especially true of books before the mid 19th century, when typographic conventions were not as well established. For many older books the best we can do is capture the typography in some "reduced" way. The good thing about TEI is that it actually supports that. >TEI is primarily structural/semantic, but there are some presentational >components. The base DP-TEI (I envision three levels of DP-TEI), when >it comes into being, should not specify any presentational markup >components. > >I am not familiar with OpenOffice's XML vocabulary, but I would guess >that it, too, is a mix of structural/semantic tags with presentation >tags (I also guess that it is much more presentationally-oriented than >TEI, and doesn't have the structural/semantic richness of TEI.) If >OpenOffice's XML vocabulary is to be used, it should be subsetted (at >least at the base level) to not allow presentational markup. > > > OpenOffice XML has a lot of features geared towards an office application and the nasty details of presentation. It is quite presentational, and I wouldn't recommend it as a long-term archive format. However, it is much better structured than Microsoft .DOC format, and considerably more compact (using zip as it does). 
>I do not recommend DocBook as the primary markup vocabulary for >general books, but certainly it is intriguing to consider it as a >second "blessed" vocabulary for particular types of documents it >is designed for (primarily technical documents.) > > > Reminds me of that old saying about standards, good to have so many to choose from... DocBook is fine for technical manuals written from scratch, not for capturing a nineteenth century novel, or sixteenth century history. Jeroen. > > > From jeroen.mailinglist at bohol.ph Wed Mar 2 13:30:36 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed Mar 2 13:30:20 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <4220A2E9.7010300@hutchinson.net> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> <20050226032924.GA29574@panix.com> <4220A2E9.7010300@hutchinson.net> Message-ID: <4226307C.1050109@bohol.ph> Joshua Hutchinson wrote: > > 1 - Converting those texts that come through me from DP into PGTEI > master format. I then use the online PGTEI -> HTML conversion routine > to convert them to HTML for posting to PG. Most of them are not > converted to TEXT simply because someone else at DP did the text > version before I got to them. In other words, I've been mostly > concentrating on the PGTEI format itself and the HTML output that > results from it. > I've been producing all my ebooks as TEI (since 1997), but since Gutenberg can't deal with it, I've hardly ever been able to post them. Please don't convert any text I've submitted before asking me. All my HTML comes from a single stylesheet. Jeroen. From joshua at hutchinson.net Wed Mar 2 13:52:21 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Wed Mar 2 13:52:33 2005 Subject: [gutvol-d] Enlightened Self Interest Message-ID: <20050302215221.9DF079E8FF@ws6-2.us4.outblaze.com> ----- Original Message ----- From: "Jeroen Hellingman (Mailing List Account)" > > Joshua Hutchinson wrote: > > > > > 1 - Converting those texts that come through me from DP into PGTEI master > > format. I then use the online PGTEI -> HTML conversion routine to convert > > them to HTML for posting to PG. Most of them are not converted to TEXT > > simply because someone else at DP did the text version before I got to them. > > In other words, I've been mostly concentrating on the PGTEI format itself > > and the HTML output that results from it. > > > I've been producing all my ebooks as TEI (since 1997), but since Gutenberg > can't deal with it, I've hardly ever been able to post them. Please don't > convert any text I've submitted before asking me. All my HTML comes from a > single stylesheet. > Oh, I don't grab works at random! ;) I've been helping people that don't want to learn HTML, mostly. They send me a finished text version (with the page breaks still intact) and I convert that to TEI. The longest part of the conversion is fixing up image links. Pretty much everything else is handled through RegEx. Right now, the vast majority of the TEI documents come through the conversion process to HTML with near perfection. I just put an inline style section in place of the linked css file and convert the "TeX" style single quotes back into straight single quotes ('). 
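The quote cleanup Josh mentions is a one-line substitution in most scripting languages. Here is a minimal sketch in Python, assuming the TeX-style single quotes survive into the HTML output as backtick/apostrophe pairs (`like this'); the actual characters emitted by the PGTEI toolchain may differ, so the pattern is illustrative rather than definitive:

    import re

    def straighten_quotes(html):
        # `word' -> 'word'  (assumed TeX-style quoting; adjust the pattern
        # to whatever characters the converter actually emits)
        html = re.sub(r"`([^`']*)'", r"'\1'", html)
        # catch any stray backticks left over
        return html.replace("`", "'")

    print(straighten_quotes("He said `hello' and left."))   # He said 'hello' and left.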
Josh From Bowerbird at aol.com Wed Mar 2 16:01:31 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 2 16:01:48 2005 Subject: [gutvol-d] so jon Message-ID: <7EA3EC8D.678C432A.023039A8@aol.com> so jon, are you going to take up my challenge? if not, tell me, and i'll do that o.c.r. myself. we're gonna see what accuracy-level we can get on your nice hi-res scans of "my antonia"... -bowerbird From jon at noring.name Wed Mar 2 16:28:10 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 16:28:25 2005 Subject: [gutvol-d] so jon In-Reply-To: <7EA3EC8D.678C432A.023039A8@aol.com> References: <7EA3EC8D.678C432A.023039A8@aol.com> Message-ID: <8935298093.20050302172810@noring.name> Bowerbird wrote: > so jon, are you going to take up my challenge? I didn't know you issued a challenge. > if not, tell me, and i'll do that o.c.r. myself. > we're gonna see what accuracy-level we can get > on your nice hi-res scans of "my antonia"... I can only ask my friend so much for Abbyy scanning (he effectively pays a per page fee for using Abbyy), and your request does not qualify as anything important enough for me to spend "capital" on. So feel free to go ahead and do what you will with the "My Antonia" scans. That's why they're online (I will need to make some sort of usage statement for them, maybe a Creative Commons license -- but the intent is for the whole world to have ready access to them.) I'm curious to know how well various OCR packages will perform on "My Antonia" since the XHTML version is very accurate to the original -- so it can form sort of a test base. Of course, if you or anyone else finds an error in the XHTML version as a result of the OCR test, I'll appreciate being informed so I can make the correction. Others here who use their own OCR package, feel free to test it out on the My Antonia scans. Go to: http://www.openreader.org/myantonia/ Jon (btw, I plan to soon scan my original edition of Burton's "Kama Sutra", and that will be a much greater challenge to any OCR package, even if it were new, because of very small print, overall poor typesetting, and poor print quality.) From jmdyck at ibiblio.org Wed Mar 2 16:43:26 2005 From: jmdyck at ibiblio.org (Michael Dyck) Date: Wed Mar 2 16:44:22 2005 Subject: [gutvol-d] DP anniversary? Message-ID: <42265DAE.C60930D1@ibiblio.org> In today's PG Weekly newsletter, and in a posting to the Book People mailing list, Michael Hart says: "This is the 4th Anniversary of The Distributed Proofreaders!!!" However, if you go to the DP site , you'll see that it says that DP was founded in 2000. Moreover, Charles Franks posted to the gutvol-d list on April 20, 2000, saying (in part) "I have completed the working beta of a distributed proofreaders website." and giving a link. I'm not sure if that was the first public announcement of DP, but in any case, DP is about 5 years old. Michael Hart appears to be referring to the 4th anniversary of March 13th, 2001, which is when the PG Weekly newsletter says DP completed its first book (PG #3320). However, that book ("Mohammed Ali and His House" by Louise Muhlbach) was actually posted April 2nd, 2001, so it's unclear where the March 13 date comes from. Moreover, a month or so *before* then, the PG newsletter for February 2001[a] says "The Online Distributed Proofreading Team has completed 8 books since mid October 2000!". 
This suggests that DP completed its first book in mid-Oct 2000, which then might have appeared in the list at the bottom of the mid-October "PG needs you" email[b], but I don't see any DP books there. Instead, I believe the first DP book to be posted by PG was #3059 (Homer's "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. [a] http://www.gutenberg.org/newsletter/archive/PGMonthly_2001_02_07.txt [b] http://www.gutenberg.org/newsletter/archive/Other_2000_10_18_Project_Gutenberg_needs_you.txt [c] http://www.gutenberg.org/newsletter/archive/PGMonthly_2000_12_06.txt -Michael Dyck From Bowerbird at aol.com Wed Mar 2 17:11:54 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 2 17:12:14 2005 Subject: [gutvol-d] so jon Message-ID: <319A7C97.25B22035.023039A8@aol.com> jon said: > I didn't know you issued a challenge. i sure did. :+) it came through at 4:53 pacific, on 2005/2/28. i have appended a copy for your convenience... basically, you said "i doubt it" in direct response to my claim that correct processing of the whole process -- from scanning through a few hours of post-o.c.r. work -- could result in an accuracy-rate of 1 error per 10 pages, so i challenged you to a test with your "my antonia" scans. since you don't seem to want to have the o.c.r. done, for understandable reasons, i will do it myself. by the way, i have done some extensive comparisons of the project gutenberg version of "my antonia" and yours. the more deeply i go into it, the more i become convinced most differences are due to intentional edits, and _not_ due to sloppiness in the original preparation of the work. so this appears to be exactly like the "frankenstein" case -- a simple use of a different edition as the source-text. in view of the insinuations you cast against the "accuracy" of the project gutenberg e-text, perhaps you should apologize? -bowerbird Subj: Re: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.) Date: 2/28/2005 7:53:21 PM Eastern Standard Time From: Bowerbird To: gutvol-d@lists.pglaf.org, Bowerbird jon said: > But the bigger issue is not constrained to errors (differences) > with respect to the source text used, as you continue to focus on. i think it was you who made "errors" the issue, revolving around the concept of "trustworthiness". if, once that house of cards falls down, you want to turn the issue to one of "which source-text to use", well then i think that michael's "i'm open to all of 'em" stance covers _that_ quite nicely, thank you very much. if you don't like the version of my antonia that's in the library now, add your own! the same goes for all the versions of "frankenstein". casting aspersions on the edition that _is_ there isn't constructive. provide all the meta-data you want on the version that you furnish; heck, you can even put a pointer in to your project at librarycity.org; these days i see a lot of e-texts referencing an .rtf version in france. > the PG version of Frankenstein, > which now exposes PG to legal liability. i don't agree. but if the lawyers to whom "bantam classics" is paying good money decide to send a cease-and-desist, let 'em. going by results obtained by the "gone with the wind" lawyers, the project gutenberg people will probably fold very quickly; without any money, you can't play poker against deep pockets. 
but hey, i would like to hear the laughter that would resound when bantam's lawyers argued that the way they can _prove_ that this e-text copied their book is because of the _errors_ (map-makers can pull that trick. but book-publishers? ha!) who knows, jon, maybe the project gutenberg lawyers will call _you_ to the stand, to throw your arms in the air and rant about how those terrible mistakes are ruining the fragile public domain, and therefore bantam doesn't _deserve_ the protection of the law. wouldn't that be ironic? :+) > The lack of proper processes, procedures and guidelines well, i don't agree with that either, jon. you might not agree with the procedures, but that doesn't mean there is a "lack" of them. maybe you don't agree with their choice of source-text for frankenstein. but it _was_ good enough for bantam. > is leading to serious questions about the integrity > and trustworthiness of the whole PG library not in my mind. and not in the minds of most people, i don't think. not any more so than with any paper-book i might find in a store. like the "frankenstein" version that was being _sold_ by bantam. > 1) redoing most of the non-DP works using DP, let's find out how many d.p. people want me to go over _their_ work with a fine-tooth comb. go ahead, speak up, i'd _love_ the challenge. > Well, at least you seem to indicate from > your interest in very low error rate OCR > that every etext PG includes in its archive > should be a textually faithful reproduction > of some known source. not necessarily. if someone wants to play editor and combine editions, i don't have any problem with that. in some sense, that's what the public domain is about. i don't see it in black/white terms as something frozen. if you _are_ going to represent something as faithful, i think it should _be_ faithful. but even then, that is _to_the_best_of_your_ability_. as long as you do that, and give your end-users a means of "checking your work", including a solid mechanism for improving it to perfection, then i think you've done your job. so yes, i agree with you, that scans should absolutely be furnished to the end-users, for works that purport to replicate that edition, certainly... however, i understand why they haven't been, up to this point, and so do you -- disk-space just hasn't been affordable enough, even now, if it were not for the largess of ibiblio and brewster, we couldn't even be entertaining the thought of posting the scans. > I doubt this error rate (let's say for even half of the public domain > printings out there) is accomplishable without sentient-level AI. i'm trying to get back off this listserve. i don't like contributing to the discourse in a place where my voice has been muffled before. so let me set up a place where you and i can fight... i mean, discuss... but this doubt of yours is rather easy to dispel, and quickly. you did a pretty good job of scanning that copy of "my antonia". and it looks like you processed (e.g., straightened) the scans well. so now we need to put them through o.c.r., using abbyy finereader; please have that done as follows: save results out to an .rtf file, one for each page; retaining line-breaks and paragraph indentation. do this for 20-50 pages, and zip the output up and e-mail it to me. i will reply to you with feedback on if the o.c.r. was done correctly. then i'll run it through programs that will soon be made available, at no cost, and we'll see what kind of an error-rate we end up with. 
or, if you prefer, follow this same procedure with some other book. then, if you still want to discuss this matter, we'll do it elsewhere. > But if proofreading is to be done anyway by the public, > as is *now done* by DP, what difference is there between > an OCR error of one every 10 pages, and one every page? when i talk about "the public", i mean _end-users_ who are reading the book for the purpose of reading the book, and _not_ specifically to be "proofreading" it per se. for that type of reader, one error on every page is too many, but one error on every tenth page is not. especially since -- if we give them an easy means of checking for errors and reporting them, and then reward readers for finding them -- errors won't persist for very long, and the e-text will instead progress very quickly on its merry way to a state of perfection. in a practical sense, this means that before you turn an e-text loose for download in an all-in-one file, you make it available _page-by-page_ on the web. anyone who might want to read it has to do so in that form. right alongside the text for each page is the image, so the person can easily check any possible errors. you let 'em know you are asking for their help to find mistakes. if they find one, they fill out a form right on the page, and their input is recorded -- wiki-style -- immediately. later readers can either confirm the error, or question it, or make comments. first person to find each error gets a credit in the final e-text. you also give people a viewer-program that allows them to download the appropriate page-image if they suspect an error -- displaying it right there in the viewer-app next to the text -- and which simplifies the process of reporting it if they find one. (by, for instance, filling out an e-mail they can send with a click.) > The key is that for the aspect of building *trust* in > the final product, it is a very good idea to involve > the volunteer proofreaders to go over the texts, > even if *you don't have to*. what i just described does a good job of doing that. this is the system of "continuous proofreading" i outlined on this listserve a very long time ago. you recently mistakenly credited it to james linden. my offer to develop this system was largely snubbed. for _that_, the project gutenberg "people in charge" rightly deserve to be criticized. for the tiny stuff that you have been complaining about, they do not... > Having (and proving to anyone who asks) at least > two independent people who proofed every page, > adds to its trustworthiness. not nearly as well as putting text and image side-by-side, and allowing any number of "volunteer proofreaders" to examine 'em. you might be surprised by the number of errors that "slip by" the proofreaders through two rounds of eyeballing over at d.p. (indeed, many even slip by the "third round" of post-processing and whitewashing, and sit there big and ugly in the final e-text.) even if a dozen people look at a page, an error might _still_ be there. but with eternal transparency, there is always hope it will be fixed. anyway, jon, i hope you take up the friendly challenge i issued here. and if any d.p. people want to call me on the challenge i made to them, you just let me know. in the meantime, i'll let you get in the last word on this thread, jon, because i _really_ need to be going. use it wisely... 
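No such page-by-page reporting system existed at the time; purely as an illustration of the bookkeeping the "continuous proofreading" idea above implies, here is a minimal sketch in Python. All names are hypothetical and do not correspond to any real PG or DP software:

    from dataclasses import dataclass, field

    @dataclass
    class ErrorReport:
        # one suspected error on one page of a posted e-text, recorded wiki-style
        page: int
        reported_by: str              # first reporter gets the credit
        excerpt: str                  # the text as it currently reads
        suggestion: str               # what the reader thinks it should say
        confirmations: list = field(default_factory=list)
        disputes: list = field(default_factory=list)

    reports = [ErrorReport(42, "reader1", "the the house", "the house")]
    reports[0].confirmations.append("reader2")        # a later reader confirms
    print(sorted({r.reported_by for r in reports}))   # ['reader1'] gets credited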
;+) -bowerbird From jon at noring.name Wed Mar 2 20:48:03 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 20:48:22 2005 Subject: [gutvol-d] so jon In-Reply-To: <319A7C97.25B22035.023039A8@aol.com> References: <319A7C97.25B22035.023039A8@aol.com> Message-ID: <6550890890.20050302214803@noring.name> Bowerbird wrote: > since you don't seem to want to have the o.c.r. done, > for understandable reasons, i will do it myself. Great! Hopefully others here will run it through their favorite OCR program and share the results with you and with gutvol-d. Please!, others, OCR the scans, which are available at: http://www.openreader.org/myantonia/ > by the way, i have done some extensive comparisons of > the project gutenberg version of "my antonia" and yours. > the more deeply i go into it, the more i become convinced > most differences are due to intentional edits, and _not_ > due to sloppiness in the original preparation of the work. How do we know? We don't know what source edition was used for PG's version of "My Antonia", but I now believe (but cannot prove until someone does the actual comparison) that the source was the "mangled" British edition, as noted below. So, the way to know for sure is to secure a copy of that "mangled" British edition and do the comparison. (Which I won't do because it is futile because the British edition is itself unacceptable.) > so this appears to be exactly like the "frankenstein" case > -- a simple use of a different edition as the source-text. Yes, and this is why I called the PG version of "My Antonia" "mangled", because it is -- it is based on a mangled British edition which Willa Cather herself was very unhappy about regarding the sloppy editing and printing. She was very "painstaking" with regards to her books -- more than the average author (and she had the status to dictate the editing and typography of her books to her publisher -- most lesser authors didn't have this luxury.) Again, my focus on the problems with the PG collection go beyond the error rates from some source -- it goes to the general aspects of trust and using the proper (acceptable) editions as source, to properly identify the source, and to provide means for easier verification the etext faithfully conforms to the source (primarily making the scans available, which is now possible -- I agree with you things were tougher a few years ago vis-a-vis providing page scans online.) For example, if NetWorker's analysis is correct (posted to The eBook Community), it now appears that the edition used for PG's version of "Frankenstein" is based on a 1981 Bantam Classics Edition, which did significant editing of the text (in essence, creating a convenient "fingerprint"), and which NetWorker (who was an attorney at one time, I believe) surmises may border on a copyright infringement (and not just a "sweat of the brow" sort of thing.) Hopefully Bantam will not catch wind of this -- but if they do, they probably won't do anything anyway. Nevertheless, one wonders how many other earlier PG texts, where there's no source information given, were derived from post-1923 emended editions? Could those ebook publishers who today use PG texts be potentially liable because of the lack of source information and a means to verify provenance? 
Even if the title page of a Work was photocopied and sent to PG for copyright clearance, how do we know that the person did not then use an easy-to-obtain and available modern edition for the actual scanning -- and simply photocopied the title page from a non-circulating, non-scannable copy of the rarer original edition? I believe most of those individuals who submitted etexts to PG's collection did it faithfully and followed common sense rules and expectations with regards to sources ---> But *how do we know*, and *how can we know*? We can't -- there's no mechanism to verify these things. This is where having the full source information, and having all the page scans of the source and making them available, builds trust in (and protects from copyright infringement claims) the particular etext and the associated collection it belongs to. It is also the morally right thing to do. > in view of the insinuations you cast against the "accuracy" > of the project gutenberg e-text, perhaps you should apologize? Why? The differences in the PG edition of "My Antonia" likely came from a mangled British edition which Willa Cather apparently was upset about. These changes are, in essence, errors. In addition, we have no idea as to what emendments may have been made to the first and subsequent PG etext editions since (until possibly now) we didn't know what edition was used as the original source! You certainly don't have access to the edition used to generate the PG edition of "My Antonia", do you? If not, then *how do you know* it is accurate to some original source edition? We can't talk about what is an error and what is not an error when we don't have the source information, and better yet page scans to immediately verify. That's why Michael Hart's interest in "correcting" the errors in the non-DP portion of the PG corpus is beyond futile and will not build trust in the collection -- how can one reliably correct an etext when the original source is not known/available to consult with? It's ludicrous, and a complete waste of time. It's better to redo the etexts via DP where the source info is recorded and page scans are (hopefully) available, as well as having the proofing done by a number of independent proofers, rather than just one person. Multiple, independent proofers adds trust to the process, in addition to having the source info and scans available. After all, intentional misspellings are common in many books (e.g., "My Antonia", Mark Twain's books, etc. -- and many pre-19th century books use variant spellings since rigorous spelling was not then an established norm) so how does one know if an "error" is really an error? And there are errors which cannot be caught by simple reading or even programs, such as missing (or added) accented characters, wrong punctuation (such as replacing an em-dash with a colon), and wrong paragraph breaks. (Most of which we see in "My Antonia".) Many of these "not discernable" errors can sometimes tweak the meaning of the etexts. We owe readers, even the casual readers, an excellent product with full disclosure. For example, the poll I'm conducting on this topic at The eBook Community indicates (but not proves -- consider this a preliminary assessment) that a significant percentage of those who read public domain digital texts *prefer* (note carefully this word) the texts they use to come from acceptable, known editions, and be faithful renditions of those editions. This only makes common sense. 
To dismiss this is essentially saying that the vast majority of people don't give a damn about whether the public domain texts they spend hours and hours of their valuable time reading are reasonably faithful to the original. Does anyone want to make the claim that the vast majority of people (99% as it seems like PG's online info says) don't care one whit? And trying to prove that claim by pointing to the large number of people using PG texts is not proof since I believe most people have innocent blind faith that PG did things correctly. Furthermore, anyone doing a major effort in delivering the public domain to the public has a moral responsibility to do it correctly and to state in sufficient detail the provenance and any edits of the texts. If it is a heavily emended text, then it should be specified to the public with sufficient detail *in that etext, not elsewhere* so the reader *knows* a text they are reading has been emended (one doesn't have to list the edits item by item, but it should be made clear the text has been substantially edited and to give a general overview of the types of edits done.) I've explained this on TeBC in more detail. This is a *responsibility*, which places restrictions on how PG and similar groups should conduct themselves. This is a serious endeavor: digitally transferring and preserving the public domain. This is not child's play. It is true that the Public Domain exists for anyone to do anything with it as they see fit, but like any freedom, there are associated responsibilities. Full disclosure is one of them, and is a common sense responsibility. Trying to be faithful in transcribing texts is another one when no disclaimers are given in the texts themselves since people assume the texts they are reading are reasonably faithful to the original. Jon From sly at victoria.tc.ca Wed Mar 2 21:08:10 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Mar 2 21:08:25 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <42261F44.1000005@perathoner.de> References: <42261F44.1000005@perathoner.de> Message-ID: On Wed, 2 Mar 2005, Marcello Perathoner wrote: > We are ready to migrate the web site to the new fast file server. > > Also some slight changes were made to the online catalog to make it > better cacheable: > I have an issue with the way that following an author name from a bibrec page leads to an anchor in an "author by first letter of last name" page. To me, this does not look like a long-term solution. It could work for a while, but as the collection continues to grow, these files will inevitably get too large to be easily useful for general browsing. Take a look at the New General Catalog of Old Books and Authors where Phillip has begun to break some of the files of author records into smaller sub-groupings. We could certainly do something like that here as well, but that would create extra work to identify which files are largest, and what the best way to split them up would be. Andrew From gbdavis at harborside.com Wed Mar 2 22:48:19 2005 From: gbdavis at harborside.com (George Davis) Date: Wed Mar 2 22:48:41 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <20050303011217.817C08C8EC@pglaf.org> References: <20050303011217.817C08C8EC@pglaf.org> Message-ID: <4226B333.3070908@harborside.com> Michael Dyck wrote: > Subject: > [gutvol-d] DP anniversary? 
> From: > Michael Dyck > Date: > Wed, 02 Mar 2005 16:43:26 -0800 > To: > gutvol-d > > To: > gutvol-d > > > In today's PG Weekly newsletter, and in a posting to the Book People > mailing list, Michael Hart says: > "This is the 4th Anniversary of The Distributed Proofreaders!!!" > > However, if you go to the DP site , you'll see > that it says that DP was founded in 2000. Moreover, Charles Franks > posted to the gutvol-d list on April 20, 2000, saying (in part) "I have > completed the working beta of a distributed proofreaders website." and > giving a link. I'm not sure if that was the first public announcement > of DP, but in any case, DP is about 5 years old. > > Michael Hart appears to be referring to the 4th anniversary of > March 13th, 2001, which is when the PG Weekly newsletter says DP > completed its first book (PG #3320). However, that book ("Mohammed > Ali and His House" by Louise Muhlbach) was actually posted April 2nd, > 2001, so it's unclear where the March 13 date comes from. > > Moreover, a month or so *before* then, the PG newsletter for February > 2001[a] says "The Online Distributed Proofreading Team has completed 8 > books since mid October 2000!". This suggests that DP completed its first > book in mid-Oct 2000, which then might have appeared in the list at the > bottom of the mid-October "PG needs you" email[b], but I don't see any DP > books there. Checking the "Completed Gold E-texts" page, in ascending order by submission date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: 1) "Mohammed Ali and His House", L. Muhlbach () Uploaded: Tuesday, March 13th, 2001 The link for that etext is to #3320. The above was relied upon in arriving at the March 13th date. Which has run in the newsletter since June 16, 2004, with no notice of correction. > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. Are in you for a surprise: DP didn't do that one. From the etext: "This etext was prepared by Sandra Stewart and Jim Tinsley " #3320 has a credit to C.F. and D.P. > [a] http://www.gutenberg.org/newsletter/archive/PGMonthly_2001_02_07.txt > [b] http://www.gutenberg.org/newsletter/archive/Other_2000_10_18_Project_Gutenberg_needs_you.txt > [c] http://www.gutenberg.org/newsletter/archive/PGMonthly_2000_12_06.txt > > -Michael Dyck Hopefully, someone from D.P. will step up and provide more meaningful activity updates for inclusion in the newsletter. People outside of DP would be interested in seeing what's going on over there, and why not include such in the weekly PG newsletter? And, boy, I'd like to be a fly on the wall in Jim's office when he reads this! [eorge] -- No virus found in this outgoing message. Checked by AVG Anti-Virus. 
Version: 7.0.300 / Virus Database: 266.5.5 - Release Date: 3/1/2005 From sly at victoria.tc.ca Wed Mar 2 23:05:22 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Mar 2 23:05:39 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <4226B333.3070908@harborside.com> References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> Message-ID: It may not be relevant, but to see a bit of history, the old PG volunteer web board is still in place: http://promo.net/pg/vol/wwwboard/index.html Here are two particular messages that mention DP: http://promo.net/pg/vol/wwwboard/messages/1063.html http://promo.net/pg/vol/wwwboard/messages/1557.html Andrew On Wed, 2 Mar 2005, George Davis wrote: > > Checking the "Completed Gold E-texts" page, in ascending order by submission > date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: > > 1) "Mohammed Ali and His House", L. Muhlbach () > Uploaded: Tuesday, March 13th, 2001 > > The link for that etext is to #3320. > > The above was relied upon in arriving at the March 13th date. Which has run in > the newsletter since June 16, 2004, with no notice of correction. > > > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's > > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. > > Are in you for a surprise: DP didn't do that one. From the etext: > > "This etext was prepared by Sandra Stewart > and Jim Tinsley " > > #3320 has a credit to C.F. and D.P. > > > [a] http://www.gutenberg.org/newsletter/archive/PGMonthly_2001_02_07.txt > > [b] http://www.gutenberg.org/newsletter/archive/Other_2000_10_18_Project_Gutenberg_needs_you.txt > > [c] http://www.gutenberg.org/newsletter/archive/PGMonthly_2000_12_06.txt > > > > -Michael Dyck > > Hopefully, someone from D.P. will step up and provide more meaningful activity > updates for inclusion in the newsletter. People outside of DP would be > interested in seeing what's going on over there, and why not include such in the > weekly PG newsletter? > From jmdyck at ibiblio.org Thu Mar 3 02:11:35 2005 From: jmdyck at ibiblio.org (Michael Dyck) Date: Thu Mar 3 02:17:52 2005 Subject: [gutvol-d] re: DP Anniversary References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> Message-ID: <4226E2D7.B6C16F40@ibiblio.org> George Davis wrote: > > Checking the "Completed Gold E-texts" page, in ascending order by submission > date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: > > 1) "Mohammed Ali and His House", L. Muhlbach () > Uploaded: Tuesday, March 13th, 2001 Ah, so it does. Mind you, it also says that books 2 through 160 were uploaded on January 1st, 2002, which is pretty implausible. The bottom line is, don't trust the dates on that page. (A little thing affects them. A slight disorder of the projects table makes them cheats.) > The above was relied upon in arriving at the March 13th date. Which has > run in the newsletter since June 16, 2004, with no notice of correction. Until now. > > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's > > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. > > Are in you for a surprise: DP didn't do that one. From the etext: > > "This etext was prepared by Sandra Stewart > and Jim Tinsley " Sorry, no surprise -- I read that attribution before I posted my earlier message. The lack of mention of DP didn't convince me that the text hadn't gone through DP. We'll see what Jim says. > Hopefully, someone from D.P. 
will step up and provide more meaningful > activity updates for inclusion in the newsletter. Perhaps someone will. I'm not sure I share your hope though. > People outside of DP would be interested in seeing what's going on > over there, and why not include such in the weekly PG newsletter? If people want to know what's going on, they're welcome to visit the DP website and see for themselves. (They may need to register -- it depends what they want to see.) -Michael Dyck From jtinsley at pobox.com Thu Mar 3 04:39:53 2005 From: jtinsley at pobox.com (Jim Tinsley) Date: Thu Mar 3 04:40:19 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <4226E2D7.B6C16F40@ibiblio.org> References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> <4226E2D7.B6C16F40@ibiblio.org> Message-ID: <20050303123953.GA17119@panix.com> On Thu, Mar 03, 2005 at 02:11:35AM -0800, Michael Dyck wrote: >George Davis wrote: >> >> Checking the "Completed Gold E-texts" page, in ascending order by submission >> date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: >> >> 1) "Mohammed Ali and His House", L. Muhlbach () >> Uploaded: Tuesday, March 13th, 2001 > >Ah, so it does. Mind you, it also says that books 2 through 160 were >uploaded on January 1st, 2002, which is pretty implausible. The bottom >line is, don't trust the dates on that page. (A little thing affects >them. A slight disorder of the projects table makes them cheats.) > >> The above was relied upon in arriving at the March 13th date. Which has >> run in the newsletter since June 16, 2004, with no notice of correction. > >Until now. > >> > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's >> > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. >> >> Are in you for a surprise: DP didn't do that one. From the etext: >> >> "This etext was prepared by Sandra Stewart >> and Jim Tinsley " > >Sorry, no surprise -- I read that attribution before I posted my earlier >message. The lack of mention of DP didn't convince me that the text >hadn't gone through DP. We'll see what Jim says. > Jim has already said more than plenty on the DP Forums when the question came up there. I ransacked my old e-mails, and you can see the whole thread at http://www.pgdp.net/phpBB2/viewtopic.php?t=5726 The Lang Iliad was an unusual case. Sandra and I had the same translation, but in different printings -- I had very small pages, a kind of pocket book; she had normal sized ones. She was typing from the _end_ of the book backwards by chapter; I was scanning and OCRing from the start forward. We were going to meet in the middle. Charlz' site came up, and, IIRC, I fed it the middle bit that neither of us had covered yet. It was the first text submitted, and there was no concept at the time of a credit for the site itself or the page-proofers thereat. Which, I suspect, bothered me, because I added one in the Pope Odyssey, which was next on my list. I can't find an e-mail from those days discussing credit lines for the site, but the first three posted books were: Lang Iliad: No mention of DP Pope Odyssey: This etext was prepared by Jim Tinsley with much help from the proofers at http://charlz.dynip.com/gutenberg Irish Race: This etext was produced by Charles Franks and the Distributed Proofreaders Team. and Charlz' formula is the one, more or less, that has been used since. >> Hopefully, someone from D.P. will step up and provide more meaningful >> activity updates for inclusion in the newsletter. 
> >Perhaps someone will. I'm not sure I share your hope though. > >> People outside of DP would be interested in seeing what's going on >> over there, and why not include such in the weekly PG newsletter? > >If people want to know what's going on, they're welcome to visit the DP >website and see for themselves. (They may need to >register -- it depends what they want to see.) > >-Michael Dyck > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d From hart at pglaf.org Thu Mar 3 09:31:08 2005 From: hart at pglaf.org (Michael Hart) Date: Thu Mar 3 09:31:10 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <20050303123953.GA17119@panix.com> References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> <4226E2D7.B6C16F40@ibiblio.org> <20050303123953.GA17119@panix.com> Message-ID: I'll be glad to put revised dates in the Newsletter if/when and "official" date is picked, along with any other items that should be included. Thanks! Michael On Thu, 3 Mar 2005, Jim Tinsley wrote: > On Thu, Mar 03, 2005 at 02:11:35AM -0800, Michael Dyck wrote: >> George Davis wrote: >>> >>> Checking the "Completed Gold E-texts" page, in ascending order by submission >>> date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: >>> >>> 1) "Mohammed Ali and His House", L. Muhlbach () >>> Uploaded: Tuesday, March 13th, 2001 >> >> Ah, so it does. Mind you, it also says that books 2 through 160 were >> uploaded on January 1st, 2002, which is pretty implausible. The bottom >> line is, don't trust the dates on that page. (A little thing affects >> them. A slight disorder of the projects table makes them cheats.) >> >>> The above was relied upon in arriving at the March 13th date. Which has >>> run in the newsletter since June 16, 2004, with no notice of correction. >> >> Until now. >> >>>> Instead, I believe the first DP book to be posted by PG was #3059 (Homer's >>>> "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. >>> >>> Are in you for a surprise: DP didn't do that one. From the etext: >>> >>> "This etext was prepared by Sandra Stewart >>> and Jim Tinsley " >> >> Sorry, no surprise -- I read that attribution before I posted my earlier >> message. The lack of mention of DP didn't convince me that the text >> hadn't gone through DP. We'll see what Jim says. >> > > Jim has already said more than plenty on the DP Forums when the question > came up there. I ransacked my old e-mails, and you can see the whole > thread at > > http://www.pgdp.net/phpBB2/viewtopic.php?t=5726 > > The Lang Iliad was an unusual case. Sandra and I had the same translation, > but in different printings -- I had very small pages, a kind of pocket > book; she had normal sized ones. She was typing from the _end_ of the > book backwards by chapter; I was scanning and OCRing from the start > forward. We were going to meet in the middle. Charlz' site came up, and, > IIRC, I fed it the middle bit that neither of us had covered yet. > > It was the first text submitted, and there was no concept at the > time of a credit for the site itself or the page-proofers thereat. > Which, I suspect, bothered me, because I added one in the Pope > Odyssey, which was next on my list. 
> > I can't find an e-mail from those days discussing credit lines for > the site, but the first three posted books were: > > Lang Iliad: No mention of DP > > Pope Odyssey: This etext was prepared by Jim Tinsley > with much help from the proofers at http://charlz.dynip.com/gutenberg > > Irish Race: This etext was produced by Charles Franks and the > Distributed Proofreaders Team. > > and Charlz' formula is the one, more or less, that has been used since. > > >>> Hopefully, someone from D.P. will step up and provide more meaningful >>> activity updates for inclusion in the newsletter. >> >> Perhaps someone will. I'm not sure I share your hope though. >> >>> People outside of DP would be interested in seeing what's going on >>> over there, and why not include such in the weekly PG newsletter? >> >> If people want to know what's going on, they're welcome to visit the DP >> website and see for themselves. (They may need to >> register -- it depends what they want to see.) >> >> -Michael Dyck >> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d@lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Thu Mar 3 09:46:29 2005 From: hart at pglaf.org (Michael Hart) Date: Thu Mar 3 09:46:30 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: > On Wed, 2 Mar 2005, Marcello Perathoner wrote: > >> We are ready to migrate the web site to the new fast file server. >> >> Also some slight changes were made to the online catalog to make it >> better cacheable: >> I got an email from one person who suggested that how to volunteer should be listed up with the donation finromation in addition to where it is in the "In Depth" section [marked <<< below]. Apparently some people don't read "In Depth" until they are already involved, and this person just wanted to know how volunteer. + Donate. How to make a donation to Project Gutenberg. + News and Events. The news. + Contacts. How to get in touch. + Partners, Affiliates and Resources. A collection of links. + Credits. Thanks to our most prominent volunteers. * In Depth Information. All you ever wanted to know about Project <<< Gutenberg. + Volunteering. How you can help Project Gutenberg. <<< From hart at pglaf.org Thu Mar 3 09:47:41 2005 From: hart at pglaf.org (Michael Hart) Date: Thu Mar 3 09:47:42 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: I suppose while these updates are going on, we should also update 13,000 to 15,000 in the opening: Project Gutenberg is the oldest producer of free electronic books (eBooks or etexts) on the Internet. Our collection of more than 13.000 <<< eBooks was produced by hundreds of volunteers. 
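That opening figure could, in principle, be generated from the catalog database at page-build time instead of being edited by hand. The sketch below is only an illustration of that idea, assuming a PostgreSQL catalog with a hypothetical "books" table; the real gutenberg.org schema is not shown here and may differ.

    # Hedged sketch: regenerate the "more than N eBooks" blurb from the catalog.
    # The database name, user, and "books" table are assumptions for illustration.
    import psycopg2

    conn = psycopg2.connect(dbname="gutenberg", user="pgweb")
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM books")
    (total,) = cur.fetchone()
    cur.close()
    conn.close()

    rounded = (total // 1000) * 1000   # "more than 15,000" reads better than "15,231"
    print("Project Gutenberg is the oldest producer of free electronic books "
          "(eBooks or etexts) on the Internet. Our collection of more than "
          "{:,} eBooks was produced by hundreds of volunteers.".format(rounded))

As Marcello points out further down, the harder question is what should count as one ebook, which no query by itself settles.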
From brandon at corruptedtruth.com Thu Mar 3 09:53:50 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Thu Mar 3 09:53:59 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: <42274F2E.8010000@corruptedtruth.com> It's too bad we can't make that dynamic, feeding off of a database =) -brandon Michael Hart wrote: > > I suppose while these updates are going on, we should also update > 13,000 to 15,000 in the opening: > > Project Gutenberg is the oldest producer of free electronic books > (eBooks or etexts) on the Internet. Our collection of more than > 13.000 <<< > eBooks was produced by hundreds of volunteers. > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > >
From kouhia at nic.funet.fi Thu Mar 3 10:39:39 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Thu Mar 3 10:39:53 2005 Subject: [gutvol-d] Re: Enlightened Self Interest Message-ID: >From: Jon Noring > >Btw, if anyone here has made, and plans to make, 600 dpi (optical) >greyscale or color scans of any public domain books including the book >covers (and this includes books printed between 1923 and 1963 which >may be public domain), I'll gladly accept donations of them on CD-ROM >and DVD-ROM. I have scanned about 3400 pages of math text, at 300 dpi only, as that looked good enough. The images are on CDs, and my CD-ROM drive has been broken for three months already. Four of the books are journal volumes (600 pages per book) of Mathematische Annalen. Random pages of them were also scanned at 600 dpi because I wanted to extract all the fonts. Unfortunately, I decided to wait for a digital camera, because one would be needed for good-quality fonts (600 and 1200 dpi on a scanner look blurry compared to a camera). Unfortunately so, because our library sold two of the four books as part of a standard book-clearing procedure. In the four books there were rare letters which appeared on only one page of one book. If my CDs lose their data, I cannot rescan them. The images are in zip files, which are themselves very fragile. E.g., removing two bytes from the end, or removing the TOC of the zip, makes the whole archive unusable. And libraries want more taxpayers' money! For what? Why should libraries which destroy books be supported at all? Better to give the money to institutions that preserve history. Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software
From Bowerbird at aol.com Thu Mar 3 11:39:49 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 3 11:40:13 2005 Subject: [gutvol-d] new thread for noring Message-ID: well, jon, i'd have thought you could have used "the last word" in that thread a bit more wisely. because i believe that you ain't gonna have a leg to stand on once my results come in... but that won't be until next week, so please enjoy your brief reprieve... :+) before i get too deep into the o.c.r./correction process for "my antonia", though, i'd like to know how much time you spent, jon, on (a) scanning and (b) image-manipulation. because my general working rule-of-thumb will be that people should spend _less_ time on the post-o.c.r. steps than they did on the scanning and image-manipulation steps. now, until i get all my procedures hitting on all cylinders, that might be a pipe-dream, but that's my rule-of-thumb...
i'd estimate that you spent at least 4 hours on the project, jon. (probably more, since you were still learning the curve, but if you had to repeat the whole thing, you could do it in 4.) that's for the scanning _as_well_as_ the image-manipulation. if i'm badly wrong, in either direction, do please let me know. otherwise, i will give myself a time-limit of 4 hours on this, and we'll see what i can come up with... and jon, please allow me to say a few nice things to you... ;+) first of all, you did a bang-up job on the "my antonia" scans. even though the world doesn't really have a place yet for high-resolution scans like these, it's very good to do them. you can always downsample to lower-resolution, if need be. i understand why many places aren't yet doing high-resolution -- like internet archive, distributed proofreaders, and google -- and i absolutely do _not_ fault them for the practical decision. at the same time, though, i applaud people doing high-resolution. it's not as if what you've done is unprecedented. bennett kobb, for instance, has high-res scans of _nearly_one_hundred_books_, (http://fax.libs.uga.edu) making your single one pale in comparison. (his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html) but nonetheless, your quality output is rare enough to merit applause. second, the image-manipulation you did on the scans is first-rate, as far as i can tell from cursory examination. the scans look great! they are straight! and their positioning is standardized very well! (these last two factors are _very_ important in getting good o.c.r.) there is no question in my mind that we'll get good o.c.r. out of 'em. third, you used a reasonable naming-scheme for your image-files! the scan for page 3, for instance, is named 003.png! fantastic! and when you had a blank page, your image-file says "blank page"! please pardon me for making a big deal out of something so trivial -- and i'm sure some lurkers wrongly think i'm being sarcastic -- but most people have no idea how uncommon this common sense is! when you're working with hundreds of files, it _really_ helps you if you _know_ that 183.png is the image of page 183. immensely. even the people over at distributed proofreaders, in spite of their immense experience, haven't learned this first-grade lesson yet. (well, a few of 'em have, and won't go back to that stupidity, but an amazing number of others will even _argue_ with you about it!) what this means, for those of you reading along at home, is that when you scan, start scanning at page 1. (and if the text starts on page 3, like "my antonia" did, then start 2 pages before that.) scan the blank pages. if there are picture "plates" in the book or other unnumbered pages, _skip_'em_, so numbers stay in sync; then do them later, at the _end_ of the regular numbered pages. that's also when you'll do the cover, and all of the front-matter. (this includes a forward, preface, anything with roman numerals.) fourth, jon, you scanned the headers and footers! again, bravo! some people don't, when they scan, and that is a big mistake. let the post-o.c.r. processing software eliminate them later. for now, they are worthwhile to keep in your master images; also later, if you view the images as a book, they're a nice touch. they aren't really necessary, in most cases, but why delete 'em? fifth, your dedication in driving the text to perfection is exemplary. you put together a team of a half-dozen people dedicated to the task, and it shows. 
while i don't think this approach can scale very well -- your team might well burn itself out after doing a couple books, while page-a-day people at distributed proofreaders go on and on, and an even better approach is to turn readers into proofreaders -- i do think that, as a special effort, what you've done is admirable. drawing attention to the importance of error-free e-texts is great. and setting a positive example, as you've done with your own file, is far superior to the vacuous criticism you make against p.g. files. you've put your time and energy where your mouth is, and i approve. sixth, i understand that you are motivated by good intentions, and i respect your courage in standing up for them while some people (including myself) are kicking you in the teeth, because we disagree. (and _their_ intentions and motivations are just as good as yours.) in case you haven't noticed, i have the exact same type of fortitude, and whenever i see it in other people, i hold it in very high esteem. seventh, i can't think of anything else, but i like to have 7 points, rather than 6, and i'm sure i'll think of the other when i hit "send". anyway, i hope i haven't embarrassed you, saying nice things and all... *** oh yeah, one more thing, just so nobody else wastes any time: jon suggested that people with a range of o.c.r. packages could run it on his scans. i do not think that's necessary, not at all. there's a ton of o.c.r. expertise here, all pointing the same: abbyy finereader v7.x is superior to any other o.c.r. program. combined with proper post-o.c.r. processing, its recognition gives a level of accuracy that is as good as can be expected. until other o.c.r. programs can deliver to us near-perfection, or results equivalent to abbyy's for free, they waste our time. *** anyway, off i go. i'll let you know when i have some results... :+) -bowerbird From jmdyck at ibiblio.org Thu Mar 3 12:12:25 2005 From: jmdyck at ibiblio.org (Michael Dyck) Date: Thu Mar 3 12:13:39 2005 Subject: [gutvol-d] re: DP Anniversary References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> <4226E2D7.B6C16F40@ibiblio.org> <20050303123953.GA17119@panix.com> Message-ID: <42276FA9.86E8BB7C@ibiblio.org> Jim Tinsley wrote: > > Jim has already said more than plenty on the DP Forums when the question > came up there. I ransacked my old e-mails, and you can see the whole > thread at > > http://www.pgdp.net/phpBB2/viewtopic.php?t=5726 Thanks for that link -- lots of good information there. I must have missed it when it happened originally (probably because I was deep into copyright renewals at the time). -Michael From marcello at perathoner.de Thu Mar 3 09:03:37 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu Mar 3 13:29:55 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: <42274369.7020102@perathoner.de> Andrew Sly wrote: > I have an issue with the way that following an author name from a > bibrec page leads to an anchor in a "author by first letter of last name" > page. > > To me, this does not look like a long-term solution. It could work > for a while, but as the collections continues to grow, these files > will inevitably get too large to be easily useful for general browsing. The old author pages had the problem that they were too many to generate statically (5000+) and very database-intensive to generate on-the-fly. 
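The per-letter pages Marcello describes below amount to paying the database cost once per rebuild instead of once per robot visit. A minimal sketch of that batch step, with made-up rows and file layout (the real generator works from the live catalog and uses a list of regexes):

    # Sketch only: group author rows into one static page per initial letter,
    # so each bibrec page can link to "<letter-page>#a<author_id>".
    from collections import defaultdict

    rows = [                        # stand-in for one pass over the authors table
        (53, "Austen, Jane"),
        (125, "Balzac, Honore de"),
        (408, "Bronte, Charlotte"),
        (35, "Verne, Jules"),
    ]

    pages = defaultdict(list)
    for author_id, name in rows:
        initial = name[:1].upper()
        pages[initial if initial.isalpha() else "other"].append((author_id, name))

    for letter, authors in sorted(pages.items()):
        with open(letter.lower() + ".html", "w") as out:
            for author_id, name in authors:
                out.write('<p id="a%d">%s</p>\n' % (author_id, name))

If one letter's page grows too large, the same grouping key can simply be widened to the first two letters for that letter only.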
We have a fair share of obnoxious robots visiting us (kids on a dsl line that want to grab everything and don't respect robots.txt) and every such visit costs us 5000+ heavy database hits. (The bibrec pages are much lighter on database resources.) I'll try this way to see how it performs. The script uses a list of regexes to fill the pages with authors. If the "B" page (currently 219 KB) gets too big, we'll split it into "BA" and "BM". Also, modern browser will request compression, so the 219 KB page will boil down to a ~50 KB transmission. Many other web sites use images that big. Here are the actual sizes. -rw-r--r-- 1 marcello pgweb 120171 Mar 2 15:47 a.php -rw-r--r-- 1 marcello pgweb 219237 Mar 2 15:47 b.php -rw-r--r-- 1 marcello pgweb 168136 Mar 2 15:48 c.php -rw-r--r-- 1 marcello pgweb 124726 Mar 2 15:48 d.php -rw-r--r-- 1 marcello pgweb 54900 Mar 2 15:48 e.php -rw-r--r-- 1 marcello pgweb 68002 Mar 2 15:49 f.php -rw-r--r-- 1 marcello pgweb 93415 Mar 2 15:49 g.php -rw-r--r-- 1 marcello pgweb 182640 Mar 2 15:50 h.php -rw-r--r-- 1 marcello pgweb 17617 Mar 2 15:50 i.php -rw-r--r-- 1 marcello pgweb 61671 Mar 2 15:50 j.php -rw-r--r-- 1 marcello pgweb 52031 Mar 2 15:50 k.php -rw-r--r-- 1 marcello pgweb 132947 Mar 2 15:50 l.php -rw-r--r-- 1 marcello pgweb 184111 Mar 2 2005 m.php -rw-r--r-- 1 marcello pgweb 29596 Mar 2 2005 n.php -rw-r--r-- 1 marcello pgweb 38429 Mar 2 2005 o.php -rw-r--r-- 1 marcello pgweb 9530 Mar 2 03:20 other.php -rw-r--r-- 1 marcello pgweb 110174 Mar 2 03:19 p.php -rw-r--r-- 1 marcello pgweb 11253 Mar 2 03:19 q.php -rw-r--r-- 1 marcello pgweb 85506 Mar 2 03:19 r.php -rw-r--r-- 1 marcello pgweb 195736 Mar 2 03:19 s.php -rw-r--r-- 1 marcello pgweb 88693 Mar 2 03:19 t.php -rw-r--r-- 1 marcello pgweb 29340 Mar 2 03:20 u.php -rw-r--r-- 1 marcello pgweb 148515 Mar 2 03:20 v.php -rw-r--r-- 1 marcello pgweb 139151 Mar 2 03:20 w.php -rw-r--r-- 1 marcello pgweb 7759 Mar 2 03:20 x.php -rw-r--r-- 1 marcello pgweb 18127 Mar 2 03:20 y.php -rw-r--r-- 1 marcello pgweb 15734 Mar 2 03:20 z.php -- Marcello Perathoner webmaster@gutenberg.org From jon at noring.name Thu Mar 3 14:11:28 2005 From: jon at noring.name (Jon Noring) Date: Thu Mar 3 14:11:42 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: References: Message-ID: <3027651234.20050303151128@noring.name> Bowerbird wrote: > well, jon, i'd have thought you could have used > "the last word" in that thread a bit more wisely. laugh. > because i believe that you ain't gonna have > a leg to stand on once my results come in... Well, I hope you get an error rate that is one per ten pages for the "My Antonia" scans. And even if you do, I still believe a DP-like process is necessary to catch errors that OCR can't handle, and for someone to properly assemble the pages, structure the document, etc., after the OCRing/proofing is complete. I don't quite put the same level of faith in OCR as you seem to. Btw, I believe as you do that an error reporting system is a good idea so readers may submit errors they find in the texts they use -- sort of an ongoing post-DP proofing process. Obviously, it is necessary to make available the page scans of the source document to aid in this process. How can an error be properly verified and corrected when the source work is not available? > i'd estimate that you spent at least 4 hours on the project, > jon. (probably more, since you were still learning the curve, > but if you had to repeat the whole thing, you could do it in 4.) > that's for the scanning _as_well_as_ the image-manipulation. 
> if i'm badly wrong, in either direction, do please let me know. > otherwise, i will give myself a time-limit of 4 hours on this, > and we'll see what i can come up with... Scanning took quite a while (much more than four hours) since all I have at the moment is a flat bed scanner (an el cheapo and slow Microtek ScanMaker X6EL to be exact), so I had to hand place each page on the flat bed. Of course, 600 dpi optical resolution increases the per page scanning time (4 times as many pixels to capture, which slows everything down.) It would have gone a *lot faster* had I used a high-quality sheet feed scanner since I took apart the book to free the pages so as to get high quality, flat scans. Someday... > first of all, you did a bang-up job on the "my antonia" scans. Thanks! > even though the world doesn't really have a place yet for > high-resolution scans like these, it's very good to do them. > you can always downsample to lower-resolution, if need be. Exactly. It is my vision for Distributed Scanners that it should achieve at least this quality. > i understand why many places aren't yet doing high-resolution > -- like internet archive, distributed proofreaders, and google -- > and i absolutely do _not_ fault them for the practical decision. > at the same time, though, i applaud people doing high-resolution. > it's not as if what you've done is unprecedented. bennett kobb, > for instance, has high-res scans of _nearly_one_hundred_books_, > (http://fax.libs.uga.edu) making your single one pale in comparison. > (his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html) > but nonetheless, your quality output is rare enough to merit applause. Funny that I forgot about the UGA work. Quite an interesting and eclectic list of mostly 19th century works. Will need to contact Bennett one of these days. > third, you used a reasonable naming-scheme for your image-files! > the scan for page 3, for instance, is named 003.png! fantastic! > and when you had a blank page, your image-file says "blank page"! > please pardon me for making a big deal out of something so trivial > -- and i'm sure some lurkers wrongly think i'm being sarcastic -- > but most people have no idea how uncommon this common sense is!... Yes, I deemed it important for processing purposes that the name of the image contain semantic information of what it represents, and that naming be consistent for file sorting purposes. As an aside, it is interesting that in my copy of "My Antonia", which is a first edition, the Introduction starts on page 3. There is no page 1 and 2 -- at all. I carefully took the book apart (cutting the sewing) before scanning and proved by this process (plus referring to other info) that pages 1 and 2 never existed. The publisher simply chose to start at page 3. Was this common? (Hmmm, I probably need to take a trip to Utah University's library to check their first edition copy of My Antonia to make sure that there wasn't an inserted page, maybe of an illustration -- but the UNL online Cather edition shows nothing. Maybe there was an intent to insert a page there, which after typesetting it was decided not to.) > fourth, jon, you scanned the headers and footers! again, bravo! > some people don't, when they scan, and that is a big mistake. > let the post-o.c.r. processing software eliminate them later. > for now, they are worthwhile to keep in your master images; > also later, if you view the images as a book, they're a nice touch. > they aren't really necessary, in most cases, but why delete 'em? 
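For anyone wondering what "let the post-o.c.r. processing software eliminate them later" can look like in practice: running heads repeat almost verbatim at the top of most pages, so a simple pass over the per-page OCR output can spot and report them. The file layout and the threshold below are illustrative guesses, not anyone's actual tool.

    # Sketch: flag probable running heads in per-page OCR text files.
    import glob
    from collections import Counter

    def normalize(line):
        # Drop digits so "MY ANTONIA 118" and "MY ANTONIA 120" compare equal.
        return "".join(ch for ch in line if not ch.isdigit()).strip()

    first_lines = {}
    for path in sorted(glob.glob("ocr/p*.txt")):      # assumed layout
        with open(path, encoding="utf-8") as f:
            lines = [ln.strip() for ln in f if ln.strip()]
        if lines:
            first_lines[path] = lines[0]

    counts = Counter(normalize(line) for line in first_lines.values())

    for path, line in sorted(first_lines.items()):
        if counts[normalize(line)] >= 10:             # threshold is a guess
            print("probable running head in %s: %r" % (path, line))

The same idea, applied to the last line of each page, catches printed footers.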
It was my intent to reproduce each page for direct reading purposes -- that is, if somebody wanted to read the book as it was printed, then they could. I attempted *archival scanning*, not *scanning only for OCR*. That OCR benefits from archival quality scanning, though, is obvious. > fifth, your dedication in driving the text to perfection is exemplary. > you put together a team of a half-dozen people dedicated to the task, > and it shows. while i don't think this approach can scale very well > -- your team might well burn itself out after doing a couple books, > while page-a-day people at distributed proofreaders go on and on, > and an even better approach is to turn readers into proofreaders -- It is not my intent to proof the way we did -- I still believe in the DP approach for proofing. But we had to get something out the door for demo purposes and did not have the time to submit it to the DP process. Maybe we should have. Hindsight is 20-20. And thanks for the rest of your comments. Jon From miranda_vandeheijning at blueyonder.co.uk Thu Mar 3 14:39:43 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Thu Mar 3 14:39:58 2005 Subject: [gutvol-d] 500th French book--Sodome et Gomorrhe In-Reply-To: <421B0080.8060402@blueyonder.co.uk> References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk> <421B0080.8060402@blueyonder.co.uk> Message-ID: <4227922F.1050201@blueyonder.co.uk> Hi guys, Just to keep you all updated on progress: We are at 496 French books at the moment. Marcel Proust's Sodome et Gomorrhe 1 will come out of DP shortly and should be in time to become the official number 500, if that's okay with the rest of the PG community. Kind regards, Miranda Miranda van de Heijning wrote: > My intention is to continue A la recherche du temps perdu on DP-EU and > hopefully, one of the other PG sites will be able to publish them. > > After that, we just need to wait for US copyright to move along a few > years and then PG-US will have the full lot as well. :-) > > Miranda > > > Michael Hart wrote: > >> >> Don't forget, all of Proust can be posted at Project Gutenberg sites >> with "life +50" and +70 copyrights, since he died so long ago. >> >> Michael >> >> >> On Sun, 20 Feb 2005, Miranda van de Heijning wrote: >> >>> Hi all, >>> >>> I have just looked through the download info which Marcello very >>> kindly compiled for me and I would like to suggest we post as the >>> 500th book part 1 of 'Sodome et Gomorrhe'. >>> >>> It is part of Proust's classic A la recherche du temps perdu and the >>> only remaining volume which we can actually post to PG. This is >>> because the other parts of the series were published after his >>> death, between 1923 and 1927. We already have Sodomo et Gomorrhe 2. >>> >>> Sodome et Gomorrhe 1 is close to finishing proofing at Distributed >>> Proofreaders (162 pages to go in round 2) so I expect it will be >>> available for post-processing/posting soon. >>> >>> Or are there any other suggestions? >>> >>> Miranda >>> >>> >>> >>> Greg Newby wrote: >>> >>>> On Sat, Feb 19, 2005 at 10:23:03AM +0000, Miranda van de Heijning >>>> wrote: >>>> >>>>> Hi guys, >>>>> >>>>> There are 485 French books in PG at the moment, so we will be >>>>> reaching 500 pretty soon. Has any thought been given yet about >>>>> what could be the 500th book? 
If no decision has been made, there >>>>> are quite a few George Sand's coming up from DP and they may be >>>>> suitable, considering that we are working on providing her >>>>> complete works. >>>>> >>>> >>>> I don't think anyone has suggested one yet. Sands sounds >>>> like a good choice. We also have a nice array of Jules Verne >>>> and Victor Hugo, and I've noticed some Shakespeare translations. >>>> >>>> >>>>> Secondly, are there any statistics on which are the most popular >>>>> French books? I know that Le Kama Soutra is quite a crowdpleaser, >>>>> but what about the rest? >>>>> >>>> >>>> There's a "top 100" list at http://gutenberg.org/catalog >>>> There is also a non-public analysis of the download >>>> statistics. Both of these are for ibiblio only, so while they're >>>> useful they don't represent other download sources (notably, >>>> our many mirrors). >>>> >>>> You'd need to look through the download list "by hand" to spot the >>>> French titles. Email if if you want the URL & username+password, >>>> and I'll dig it up. >>>> -- Greg >>>> >>>> >>>> >>>>> Michael Hart wrote: >>>>> >>>>> >>>>>> I sent the address, >>>>>> unless someone has a better one. >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> On Thu, 17 Feb 2005, Alex Wilson wrote: >>>>>> >>>>>> >>>>>>> About a month ago Greg Newby offered to get me in touch with David >>>>>>> Wyllie--who provided the English translation of Kafka's >>>>>>> Metamorphosis for >>>>>>> PG--and I haven't heard from him since. I'm thinking Greg's >>>>>>> emails or mine >>>>>>> are ending up in a junk mail folder, so I'm wondering if anyone >>>>>>> here knows >>>>>>> how I can get in touch with Mr. Wyllie. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> Alex. >>>>>>> >>>>>>> http://www.telltaleweekly.org - Funding a Free Audiobook Library >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gutvol-d mailing list >>>>>>> gutvol-d@lists.pglaf.org >>>>>>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> gutvol-d mailing list >>>>>> gutvol-d@lists.pglaf.org >>>>>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>>>>> >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> gutvol-d mailing list >>>>> gutvol-d@lists.pglaf.org >>>>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>>>> >>>> >>>> >>>> >>>> >>> >>> _______________________________________________ >>> gutvol-d mailing list >>> gutvol-d@lists.pglaf.org >>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>> >> >> >> > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > From Bowerbird at aol.com Thu Mar 3 15:03:14 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 3 15:03:31 2005 Subject: [gutvol-d] new thread for noring Message-ID: jon said: > I hope you get an error rate that is one per ten pages i'll do my best. :+) > even if you do, I still believe a DP-like process > is necessary to catch errors that OCR can't handle human readers will _always_ be necessary. (and easy enough to find. if no one wants to read a book, there's little call to digitize it.) thus a system of "continuous proofreading" will be quite good enough if we can make the computer-guided processing accurate enough. > and for someone to properly assemble the pages, > structure the document, etc., after the OCRing/proofing > is complete. that's part of what i include in "post-o.c.r. processing". 
> I don't quite put the same level of > faith in OCR as you seem to. except that once you see the evidence i lay out, you will realize "faith" has nothing to do with it. as i've been saying all along, for professionally typeset books, the structure is _in_ the presentation. so o.c.r. gives you all the information you need, if you know how to look for it, and do so diligently. > Btw, I believe as you do that an error reporting system > is a good idea so readers may submit errors they find > in the texts they use -- sort of an ongoing > post-DP proofing process. post-d.p.? i see it _replacing_ d.p. for most books. and good thing, too. once the coming avalanche of scanned-books engulfs us, it'll be the only way most books have a chance to surface. that will take the pressure off distributed proofreaders, and they'll be able to focus on the books that _need_ them. > Obviously, it is necessary to make available > the page scans of the source document to aid in this process. > How can an error be properly verified and corrected > when the source work is not available? i've always said i think that page-scans should be publicly available. particularly if your mission is _transcribing_an_existing_edition_. (although, to remind people again, copyism is _not_ the mission that michael hart chose to embed within his project gutenberg.) but even in the case of project gutenberg's "amalgamated" e-texts, i believe that a page-image graphic-version should be made available. this would allow people to view it on a dvd-player, just as an example. > Scanning took quite a while (much more than four hours) that doesn't surprise me. nonetheless, i'll limit myself to 4 hours. that's quite enough time to devote to it. and to prove the point too. > I deemed it important for processing purposes that > the name of the image contain semantic information > of what it represents, and that > naming be consistent for file sorting purposes. as one improvement, i would suggest _not_ using "001.png", etc. instead, preface each one with a string that will make it _unique_, such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app -- you just type it in one time -- and it's an unmistakable stamp. and of course, if you're going to do hundreds or thousands of books, you want to cook up a naming convention that conveys information. on big multimedia projects, it is not at all uncommon to have one _full-time_ employee dedicated _solely_ to maintaining filenames. because if things go wrong, it can waste a whole lot of man-hours. oh yeah, one more suggestion. your front-matter filenames were prefaced with an "r". my typical recommendation is that they be prefaced with an "f", and that the regular pages be named with a "p", so the front-matter files will sort _on_top_of_ the regular pages. i want to be able to depend on the operating-system filename sort to give me pages in the exact order they appear in the book itself. so i use a "q" on back-matter files, so they will drop to the bottom. for illustration plates, i use a name that sorts _them_ correctly; for instance, if an illustration page is between pages 168 and 169, name it "p168a.png". (and don't forget the blank verso side either!, which you will name "p168b.png".) > The publisher simply chose to start at page 3. Was this common? it's not uncommon. oftentimes there is a "title-page", consisting of nothing more than the name of the book, which is considered "page 1", with its blank verso being "page 2", so chapter 1 starts on "page 3". 
sometimes chapter 1 starts on page 7. or page 11. publishers are weird. > Maybe there was an intent to insert a page there, > which after typesetting it was decided not to.) sometimes that happens too, yep. an "unnecessary" page gets dropped when the typesetter realizes they didn't plan the signatures correctly. or when the preface runs two pages longer than was originally intended. or any number of other snafus spring up. shit happens. > It was my intent to reproduce each page for direct reading purposes -- > that is, if somebody wanted to read the book as it was printed, > then they could. yeah, and sometimes people want to do exactly that. which is why the page-images should be made available. for many illustrated books, the text alone is not enough. you want to be able to see the pages as they were printed. my viewer-program will work with either, text or images. it'll even work in "hybrid" mode, so you can display the text in one of the 2-up pages, and the page-image on the other side. (and of course that is the mode which is used for proofreading.) that's why things like _blank_pages_ are so important to include. because if you toss them out, you screw up the left/right sequence. a convention of paper-books is that odd pages always go on the right. screw that up and you make yourself look silly. anyway, that's all for now. -bowerbird From jon at noring.name Thu Mar 3 17:15:58 2005 From: jon at noring.name (Jon Noring) Date: Thu Mar 3 17:16:32 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: References: Message-ID: <838721171.20050303181558@noring.name> Bowerbird wrote: > as one improvement, i would suggest _not_ using "001.png", etc. > instead, preface each one with a string that will make it _unique_, > such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app > -- you just type it in one time -- and it's an unmistakable stamp. Yes, a very good suggestion, and one that is being planned. I held off because we are still thinking through the exact syntax of the book identifier, although it *might* be based somewhat on the WEMI (Work/ Expression/Manifestation/Item) principle. The LibraryCity ID used at the current "My Antonia" site is just a quick improvisation of the WEMI principle. For example: Work: "Frankenstein" by Mary Shelley Expression: Second edition (which differs a lot from the First) Manifestation: 1895 printing edited by John Doe (just a dummy example) Item: XHTML So in Trusted Editions, filed under the WorkID for "Frankenstein", we could have multiple Expressions each with its own ExprID, e.g. First Edition, Second Edition, a lost manuscript for a third edition, etc. (many books will have only Expression since they did not become popular and no author manuscript exists.) Under Manifestation we could have several (with ManfID's) based on later edited editions as well as a modern "Michael Hart" style amalgamated/edited edition. And then for each Manifestation we can have several formats (Items, ItemID -- yeah, this is a small twist on WEMI as it officially exists since 'item' in the pbook world usually refers to a particular printed copy of a Manifestation, with coffee stains and page rips and all -- but this works well for ebooks/etexts where each item is a duplicatable digital format derived from the paper Manifestation. This is not yet etched in concrete -- it is still in the idea stage.) So, as an example, we might have for Identifiers: WorkID: 00000000025 (enough for 100 billion general Works.) 
ExprID: 02 ManfID: 03 ItemID: 008 (referring to some standardized list which expands over time) So the overall ID for a particular format of a particular source paper book might be: 00000000025-02-03-008 (yeah, it's long) Page scans only need the WEM portion of the ID for prefixing on the filename: 00000000025-02-03-p295.png (If we only care about 100 million Works, then we may have: 00000025-02-03-p295.png ) Of course, the WEM-ID itself does not contain any metadata other than identifiers, but that would mesh with a database. It is very problematic to include any Dublin Core type of metadata within an identifier. It is understandable maybe using the two first letters associated with the first two words of the title (ignoring articles), such as MA for "My Antonia", but that's as far as I'd go. > and of course, if you're going to do hundreds or thousands of books, > you want to cook up a naming convention that conveys information. > on big multimedia projects, it is not at all uncommon to have one > _full-time_ employee dedicated _solely_ to maintaining filenames. > because if things go wrong, it can waste a whole lot of man-hours. Every scanned image is a unique digital object, so it needs to have a unique identifier in the object's file name, applied when it is created, along with a metadata record somewhere to describe and keep track of it. The catalogers will take care of the identifers and metadata, which go hand in hand. > oh yeah, one more suggestion. your front-matter filenames were > prefaced with an "r". my typical recommendation is that they be > prefaced with an "f", and that the regular pages be named with a "p", > so the front-matter files will sort _on_top_of_ the regular pages. > i want to be able to depend on the operating-system filename sort > to give me pages in the exact order they appear in the book itself. > so i use a "q" on back-matter files, so they will drop to the bottom. > for illustration plates, i use a name that sorts _them_ correctly; > for instance, if an illustration page is between pages 168 and 169, > name it "p168a.png". (and don't forget the blank verso side either!, > which you will name "p168b.png".) Also an excellent suggestion. The 'r' stands for "Roman", but I noticed in sorting that the pages are not ordered, so the front-/body-/end-matter approach makes sense. Too bad 'b' comes before 'f', as you noted. >> It was my intent to reproduce each page for direct reading purposes -- >> that is, if somebody wanted to read the book as it was printed, then >> they could. > that's why things like _blank_pages_ are so important to include. > because if you toss them out, you screw up the left/right sequence. > a convention of paper-books is that odd pages always go on the right. > screw that up and you make yourself look silly. Definitely! I will certainly need to relook at what I did to make sure it's all there. Handling inserted illustrations is a problem name-wise since in "My Antonia", the illustrations were inserts between numbered pages. So for naming/sorting purposes that will need to be worked out. Thanks for the ideas. Jon From Bowerbird at aol.com Thu Mar 3 23:10:47 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 3 23:11:08 2005 Subject: [gutvol-d] bob bob bobbing along Message-ID: <25.5a67ce0d.2f5963f7@aol.com> i said: > bennett kobb, for instance, has > high-res scans of _nearly_one_hundred_books_, "bennett kobb" is actually "bob kobres". 
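To make Jon's identifier sketch above concrete: a small helper that builds page-scan names from the WEM portion plus the f/p/q page prefixes suggested earlier. The field widths follow the example IDs given above, and the function itself is hypothetical, since the scheme is still at the idea stage.

    def scan_filename(work_id, expr_id, manf_id, section, page, insert=""):
        """Sketch of a WEM-prefixed page-scan name, e.g. 00000000025-02-03-p295.png.

        section: "f" front matter, "p" body pages, "q" back matter, so a plain
        filename sort reproduces the reading order of the book.
        insert: "a", "b", ... for unnumbered plates (and their blank versos)
        bound in after the given page.
        """
        wem = "%011d-%02d-%02d" % (work_id, expr_id, manf_id)
        return "%s-%s%03d%s.png" % (wem, section, page, insert)

    print(scan_filename(25, 2, 3, "p", 295))        # 00000000025-02-03-p295.png
    print(scan_filename(25, 2, 3, "p", 168, "a"))   # plate between pages 168 and 169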
:+) -bowerbird From blondeel at clipper.ens.fr Fri Mar 4 02:53:46 2005 From: blondeel at clipper.ens.fr (Sebastien Blondeel) Date: Fri Mar 4 02:54:10 2005 Subject: 500th French book: Rimbaud? (Re: [gutvol-d] 500th French book--Sodome et Gomorrhe) In-Reply-To: <4227922F.1050201@blueyonder.co.uk> References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk> <421B0080.8060402@blueyonder.co.uk> <4227922F.1050201@blueyonder.co.uk> Message-ID: <20050304105346.GA28659@clipper.ens.fr> I have been reading this 500th book discussion without connecting it to a book I have in PP and that may be eligible. It is a very famous book, maybe the best or second best know book of French poetry: Rimbaud's Les Illuminations, Une Saison en Enfer (projectID3fbe0069d630e from PGDP US) I finish it over the week/end if that I needed. From Bowerbird at aol.com Fri Mar 4 12:53:12 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 12:53:24 2005 Subject: [gutvol-d] march forth Message-ID: <84.409159b4.2f5a24b8@aol.com> well, that was a lot easier than i thought it would be... :+) i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent. after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text. except for exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the string of words are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location. most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone? i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.) another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such the file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.) i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.) i also changed high-bit characters to low, to ease comparison, so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary. 
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it. so, now back to the quality of the recognition... almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of know scannos, and the variants appear just 5 times, total, so it's an easy test to do. most of the errors were of two types -- periods and quote-marks. both these error-types are easy to program routines to check them, even if they aren't flagged in spellcheck -- many of them would be. it's relatively easy to detect sentences, so as to check for periods. and quote-marks are usually nested in pairs, and thus easy to check. but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r. on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half. it's not surprising to me that my tools would find all the errors here. this is a relatively straightforward text, with very few complications. total time to do the o.c.r. on this book, once i know what i'm doing? i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images. i'm guessing that those 2 hours of o.c.r. and post-o.c.r. work would make the accuracy level about 1 error for every 50-100 pages or so. and those errors would be in the less-serious arena of punctuation. i won't be able to say for sure until i've done the second-half test, of course, but given the highly accurate recognition of the words that i found on this half, i feel rather safe making that prediction. in this half, of 200+ pages, the only errors that i might have missed -- but found because i had noring's version to compare it to -- were "layout/lay out" and "fairy tale/fairytale". i _might_ have caught "fairytale", because it's not contained in my spellcheck dictionary in its joined variant, and the split variant _is_ in the book (twice). i probably would not have caught "layout", since it's in my dictionary. (but i should take it out of the dictionary for checking older books. old-time typographers _did_ layout, but they didn't _call_ it that.) either way, i'm sure you'll agree that those two errors are trivial. if all the errors in our books were that meaningless, it'd be great. wait, i might have even caught _those_ errors, as they are _right_ in the _project_gutenberg_ e-text, which has been out for years! well, that wraps up my report. for those who might be curious, i'll be releasing my post-o.c.r. tool in the late spring. look for it! anyway... i believe this makes it very clear that i am correct when i say that if you do the scanning carefully, manipulate those scans correctly, use abbyy finereader v7.x to do the o.c.r., and subject its results to a good post-o.c.r. program, it is relatively quick and easy to process an o.c.r. text to the state where it can become a high-powered e-book. the notion that these procedures are difficult or time-consuming is just plain wrong. wrong, wrong, wrong. in one word -- _untrue_. 
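For the curious, here is a hedged sketch of two of the routines described above: flagging words that sit on a stealth-scanno list, and flagging paragraphs whose double quotes do not pair up. The scanno list and the paragraph rule are illustrative only, not the actual forthcoming tool.

    # Sketch of two post-o.c.r. checks: stealth scannos and unbalanced quotes.
    import re

    STEALTH_SCANNOS = {"cur": "our", "oven": "over", "arid": "and"}  # sample pairs only

    def check_text(text):
        reports = []
        for m in re.finditer(r"[A-Za-z']+", text):
            word = m.group().lower()
            if word in STEALTH_SCANNOS:
                reports.append("possible scanno %r (for %r?) at offset %d"
                               % (m.group(), STEALTH_SCANNOS[word], m.start()))
        for i, para in enumerate(text.split("\n\n"), 1):
            if para.count('"') % 2:                   # quote-marks should nest in pairs
                reports.append("unbalanced quotes in paragraph %d" % i)
        return reports

    sample = 'the snow blew in cur faces.\n\n"they still come? he asked.'
    for report in check_text(sample):
        print(report)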
-bowerbird p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+) p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other error, which might or might not be an air, i'll sack that one. ----------------------------------------------------------------- 524 a group of people stood huddied 524 a group of people stood huddled 2442 of Jacob whom He loved. SelahP 2442 of Jacob whom He loved. Selah." 4562 grandmother's hand. The oldest son, Ambro2, 4562 grandmother's hand. The oldest son, Ambroz@, 5564 up like a hare. "Tatinek, Tatinekl" 5564 up like a hare. "Tatinek, Tatinek!" 6344 grumbled, but realized it was Important 6344 grumbled, but realized it was important 10749 was fixed for me by chance; 10749 was fixed for me by chance, 12887 the familiar road. "They still come? "he 12887 the familiar road. "They still come?" he 13132 they were always unfortunate. When PavePs 13132 they were always unfortunate. When Pavel's 16303 "You not mind my poor mamenka> 16303 "You not mind my poor mamenka, 17531 probably, in some deep Bohemian forest..... 17531 probably, in some deep Bohemian forest... 17718 would be lost ten times oven 17718 would be lost ten times over. 18300 the talking tree of the fairytale; 18300 the talking tree of the fairy tale; 21478 Ambrosch found him." "Krajiek could V 21478 Ambrosch found him." "Krajiek could 'a' 23282 that, too, Jelinek. But we beiieve 23282 that, too, Jelinek. But we believe 25309 and I went into the Shimerdas9 25309 and I went into the Shimerdas' 25594 of his long, shapely hands layout 25594 of his long, shapely hands lay out 26036 which is also Thy mercy seat," 26036 which is also Thy mercy seat." 26157 While the tempest still is high." 26157 While the tempest still is high."... 27674 milk like what your grandpa s&y. 27674 milk like what your grandpa say. 29075 in a spiteful, crowing -- "Jake-y, 29075 in a spiteful, crowing voice: -- "Jake-y 29418 kept for hot applications when cur 29418 kept for hot applications when our 30016 Shimerda dropped the rope, ran aftet 30016 Shimerda dropped the rope, ran after 36061 misfortune, his wife, "Crazy Mary," iried 36061 misfortune, his wife, "Crazy Mary," tried 36513 fine, making eyes at the men!?.." 36513 fine, making eyes at the men!..." 37226 given him one of Tiny SoderbalPs 37226 given him one of Tiny Soderball's 38713 pump water for the cattle. '"Oh, 38713 pump water for the cattle. "'Oh, From marcello at perathoner.de Fri Mar 4 11:58:33 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Mar 4 13:26:26 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: <4228BDE9.1010007@perathoner.de> Michael Hart wrote: > I got an email from one person who suggested that how to volunteer > should be listed up with the donation finromation in addition to where > it is in the "In Depth" section [marked <<< below]. Apparently some > people don't read "In Depth" until they are already involved, and this > person just wanted to know how volunteer. > > > + Donate. How to make a donation to Project Gutenberg. 
> + News and Events. The news. > + Contacts. How to get in touch. > + Partners, Affiliates and Resources. A collection of links. > + Credits. Thanks to our most prominent volunteers. > * In Depth Information. All you ever wanted to know about Project <<< > Gutenberg. > + Volunteering. How you can help Project Gutenberg. <<< Duplicating menu entries just creates confusion. We could move "Volunteering" into the "About" section, but I think it's better placed in the "In Depth" section. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Fri Mar 4 12:09:59 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Mar 4 13:26:30 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <42274F2E.8010000@corruptedtruth.com> References: <42261F44.1000005@perathoner.de> <42274F2E.8010000@corruptedtruth.com> Message-ID: <4228C097.8030803@perathoner.de> Brandon Galbraith wrote: > > I suppose while these updates are going on, we should also update >> 13,000 to 15,000 in the opening: > > It's too bad we can't make that dynamic, feeding off of a database =) Not worth the trouble ... First, we had to agree on what counts as an ebook in its own right. Eg. we have a Bible in the collection, where every chapter got its own ebook number. Also, many books are posted in parts, and every part got its own number besides the complete book. To get a meaningful count of ebooks we first had to get rid of such shameless stuffings. -- Marcello Perathoner webmaster@gutenberg.org From miranda_vandeheijning at blueyonder.co.uk Fri Mar 4 13:38:57 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Fri Mar 4 13:39:08 2005 Subject: 500th French book: Rimbaud? (Re: [gutvol-d] 500th French book--Sodome et Gomorrhe) In-Reply-To: <20050304105346.GA28659@clipper.ens.fr> References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk> <421B0080.8060402@blueyonder.co.uk> <4227922F.1050201@blueyonder.co.uk> <20050304105346.GA28659@clipper.ens.fr> Message-ID: <4228D571.5030106@blueyonder.co.uk> Hi Sebastien, Thanks, this sounds like a very appropriate suggestion as well. PG is currently holding back Proust awaiting book #499, so it looks like we now have #501 as well. I will leave it up to PG which one to post as which number. It is great to have such great classics to mark this wonderful milestone! Miranda Sebastien Blondeel wrote: >I have been reading this 500th book discussion without connecting it to >a book I have in PP and that may be eligible. > >It is a very famous book, maybe the best or second best known book of >French poetry: Rimbaud's > >Les Illuminations, Une Saison en Enfer > >(projectID3fbe0069d630e from PGDP US) > >I will finish it over the weekend if that is needed. 
>_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > > From gbnewby at pglaf.org Fri Mar 4 14:29:09 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Fri Mar 4 14:29:11 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <4228C097.8030803@perathoner.de> References: <42261F44.1000005@perathoner.de> <42274F2E.8010000@corruptedtruth.com> <4228C097.8030803@perathoner.de> Message-ID: <20050304222909.GA32543@pglaf.org> On Fri, Mar 04, 2005 at 09:09:59PM +0100, Marcello Perathoner wrote: > Brandon Galbraith wrote: > > >> I suppose while these updates are going on, we should also update > >>13,000 to 15,000 in the opening: > > > >It's too bad we can't make that dynamic, feeding off of a database =) > > Not worth the trouble ... First, we had to agree on what counts as an > ebook in its own right. > > Eg. we have a Bible in the collection, where every chapter got its own > ebook number. Also, many books are posted in parts, and every part got > its own number besides the complete book. > > To get a meaningful count of ebooks we first had to get rid of such > shameless stuffings. That's an unwarranted poke, Marcello. We do have a count, and it's eBook #s as used as the primary access point to our files. Agreeing on what counts as an eBook is not necessary. We know how many eBook #s we have, even if there is disagreement on what counts as an eBook. There are plenty of words (in GUTINDEX.ALL and elsewhere) to augment this simplistic number. -- Greg From jon at noring.name Fri Mar 4 14:48:52 2005 From: jon at noring.name (Jon Noring) Date: Fri Mar 4 14:49:17 2005 Subject: [gutvol-d] march forth In-Reply-To: <84.409159b4.2f5a24b8@aol.com> References: <84.409159b4.2f5a24b8@aol.com> Message-ID: <726605875.20050304154852@noring.name> Bowerbird wrote: > i did o.c.r. on half of jon noring's page-scans for "my antonia", > using abbyy finereader v7.x. the results were quite excellent. Great! > after doing a small number of global corrections to the o.c.r., > i checked it against noring's "trustworthy" version of the text. What type of global corrections were these? One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word. > except for exceptions i'll discuss right after this paragraph, > most results are given below; each pair of lines represents a > difference found between abbyy and noring, with the last word > in each line being the point of difference. the number listed at > the start of each line is the word-number in the file, and the > string of words are the ones preceding the point-of-difference > in the file, so that you can easily pinpoint the correct location. Great! > most of the o.c.r. errors were on _punctuation_, not _letters_. > in particular, there were many instances where a _period_ was > misrecognized as a comma. i did not bother to list these cases, > mostly to avoid clutter. i do not know what caused these errors. > i don't know if it's a _typical_ misrecognition that abbyy makes, > if jon's manipulation of the images somehow caused confusion, > if i set one of the options incorrectly, or what. help, anyone? I originally scanned the pages at 600 dpi (optical) 24-bit color (which in the future I won't do for b&w works since I determined it is unnecessary overkill.) 
Then for the online scans they were reduced as follows: original --> 600 dpi bitonal --> 120 dpi greyscale antialiased I'm not sure which set of scans you used (you don't have the original since they occupy 5 gigs of space.) Hopefully you used the 600 dpi bitonal which should OCR the best. Antialiasing actually causes problems (notwithstanding the much lower resolution.) One thing you could do is to look at the 600 dpi pages at 100% size for which the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is. Regardless, punctuation is a toughie for OCR to exactly get right, from what I understand. 600 dpi *helps* resolve the fine detail of punctuation. 300 dpi is marginal for a lot of punctuation because the characters are so small and don't occupy enough pixels (while letters retain enough pixels to better identify them.) > i have also not listed differences found in hyphenation, since > i don't have the time to write a decent routine to check them. > (i just accepted the dehyphenation abbyy did automatically.) Ah, ok (answering my comments at the beginning.) Resolving this usually requires a human being to go over, especially for Works from the 18th and 19th century where compound words with dashes were much more common than today (e.g., "to-morrow".) Sometimes one has to see what the author did elsewhere in the text. In a few cases a guess is necessary based on understanding what the author did in similar cases in the text. Some of this can be automated. In other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here. > another set of differences not listed here is the _n't_ words. > words like "couldn't" and "shouldn't" were set with the _n't_ > part distinct from the first part. jon's version retained this. > abbyy did not. personally, i find it an unnecessary distraction; > the first thing i'd do with such the file is to change it globally. > jon probably considers that "tampering with trustworthiness". > i think it's common sense recognition of a changed convention. > if you prefer jon's way, use his text. if not, you can use mine. > (the change is global throughout the book, so it is easy to do.) Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version. My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection because the original faithful rendition is *already* there. The only requirements I would place (and this applies in general for any Work) are 1) the original textually faithful etext version has already been done and is in the collection, and 2) the type of modernizations done for the modern parallel editions are noted in the texts themselves (such as within an Editor's Introduction.) > i note that jon _did_ close up some "there's" where the "'s" was > set off distinctively, so he was a bit inconsistent in this arena. > (i didn't check to see if there were other apostrophe-s words > that were set apart, because i would've closed 'em up myself.) I spent some time looking at the " 's " issue last week. In many cases in the original print edition the spacing between the preceding word and the apostrophe s is quite small -- and for the same combination elsewhere was larger -- indicating this was more of a typesetter's convention rather than something Cather specified. 
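As an aside, the reduction steps described earlier in this message (a full-resolution bitonal image for OCR, plus a small antialiased greyscale for the web) are straightforward to script. Here is a minimal sketch using the Pillow imaging library; the threshold value and output names are assumptions for illustration, not the settings actually used for these scans.

    from PIL import Image

    def reduce_scan(src_path):
        """From a 600 dpi colour scan, derive (a) a 600 dpi bitonal image
        for OCR and (b) a 120 dpi antialiased greyscale image for the web.
        Threshold and file names are illustrative only."""
        grey = Image.open(src_path).convert("L")

        # (a) bitonal at full resolution -- the best input for OCR
        bitonal = grey.point(lambda p: 255 if p > 160 else 0).convert("1")
        bitonal.save(src_path + ".bitonal.png", dpi=(600, 600))

        # (b) one-fifth size, Lanczos-filtered (antialiased) greyscale for viewing
        w, h = grey.size
        grey.resize((w // 5, h // 5), Image.LANCZOS).save(
            src_path + ".web.png", dpi=(120, 120))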
[note] In addition, the UNL Cather Edition closed off all the apostrophe s (no spaces), but kept the space for many of " n't" words. So here again I followed the UNL Cather Edition. (Btw, I found quite a few errors in the online UNL Cather Edition of "My Antonia" which have been forwarded to the team overseeing it -- sadly the professor overseeing the online project passed away a few months ago. We are in touch with other Cather scholars.) But I've put the "'s" issue on my "to look at again" list. [note] Cather wanted the line length to be fairly short, so this puts extra pressure on typesetters who will either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page, and whether certain words can be hyphenated or not. > i also changed high-bit characters to low, to ease comparison, You mean accented characters? > so those differences are not listed. yes, the book _did_ print > "antonia" with a squiggle over the "a". to me, it's unnecessary. But that's what is in the original, the "A acute". The squigly is an 'acute', btw. :^) Accented characters are *always* important to preserve under all situations. There's no need anymore, in these days of Unicode and the like to stick with 7-bit ASCII. I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing, something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying. Preserving accented characters are important. > (but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) > whichever way you like it, it is just one more global change. > that's one beauty of plain-text -- it's so easy to manipulate it. Unicode is plain text. Just more characters to play with. :^) Btw, for those who are interested, here's the "non-Basic Latin" (non-ASCII) alphabetic characters used in "My Antonia": A acute AE ligature ae ligature e acute e circumflex i umlaut n tilde small z with caron > almost all of the words were correctly identified. the ones that > were not would be flagged by a simple spell-check, with merely > 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". > i imagine that these pairs are on the lists of know scannos, and > the variants appear just 5 times, total, so it's an easy test to do. > > most of the errors were of two types -- periods and quote-marks. Which makes sense. But these are the toughest to correct sometimes, and punctuation changes can sometimes subtly affect the meaning. They are hopefully caught by human proofers/readers when grammar checkers don't (I do use Word to help find both spelling and punctuation errors -- when they find something, I then manually check it in the page scans and the master XML.) > both these error-types are easy to program routines to check them, > even if they aren't flagged in spellcheck -- many of them would be. They are "sometimes" easy to spot. Other times the automatic routines will not catch errors (e.g. ":" vs. ";") > it's relatively easy to detect sentences, so as to check for periods. Usually true, but there are some rare exceptions where an abbreviation can be mistaken for an end of a sentence. Then there's the ellipsis issue where sometimes an ellipsis is at the end of the sentence and sometimes it is not (and incorrectly used.) > and quote-marks are usually nested in pairs, and thus easy to check. 
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.) Also, apostrophes are sometimes confused with single right quote marks. Here's a fictional example (imagine the straight quotes and apostrophe marks being represented in print with the appropriate "curly" marks): "And Harry told me, 'the voters' confidence in the candidate waned.' To which I replied to Harry, 'I don't believe so.'" With a smart enough grammar and parser, the above might be properly parsed and the apostrophe correctly differentiated from the single right quote mark. But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process. > but my routines for checking these two items are still back in my > prototyping test-app, awaiting migration to the current version; > that's why i didn't bother doing o.c.r. on the second half of the scans; > once i've incorporated the routines, i'll refine 'em on the first half, > and then do a solid test of them using the text from the second half. great! > total time to do the o.c.r. on this book, once i know what i'm doing? OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part. > i'd estimate it at about an hour. and for all post-o.c.r. processing? > i'd estimate that about the same. total time for the book -- 2 hours. > that's much less time than it took to scan and manipulate the images. Yes. > p.s. although jon's highly accurate version of the text gave us > little opportunity to find errors in his work, we _did_ find two. > (one is an error in the text, i'd say, but jon did not preserve it.) > if michael would like to have another "my antonia" in the library, > i'll submit the _entirely_ correct version to project gutenberg, > and maybe jon can use it to find the error that eluded his team. :+) Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it. I would like to receive error reports as well for this text, since Brewster wants highly proofed texts for some experiments he plans to run similar to yours. But if I have to use the version you donate to PG, so be it. :^) > p.p.s. i _did_ just drop a hint. one i can use later to show that > i did indeed find the one error that is non-equivocal. as for the > other error, which might or might not be an air, i'll sack that one. Oh, a clue. :^) Anyway, great work! Jon (p.s., I did find one error in my text based on the list you gave. Thanks. There should be a comma after the first "Jake-y" in "Jake-y Jake-y". So that's been corrected already in the online and archive version. I rechecked the PG edition, and they get the comma right in the text, which I oddly missed doing my "diff" (probably because there were quite a few differences to pour over.) But then they enclose the surrounding sentence within a single quote mark (following the British convention), while the original first edition uses a double quote mark. The PG edition seems to be inconsistent with regards to quotation marks and to British/American spelling, which is why I surmise the PG edition is based on some non-Cather-approved British edition and might have subsequently been selectively and inconsistently edited in trying to "re-Americanize" it. 
I assume you discovered the several different paragraph breaks in the PG edition?) From Bowerbird at aol.com Fri Mar 4 15:23:39 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 15:23:54 2005 Subject: 500th French book: Rimbaud? (Re: [gutvol-d] 500th French book--Sodome et Gomorrhe) Message-ID: <29.6e499476.2f5a47fb@aol.com> miranda said: > PG is currently holding back Proust awaiting book #499, > so it looks like we now have #501 as well. i'd use "une saison" as #499. it'd be a shame for someone to come along later and say, "what?, you did 500 books _before_ that one?" ;+) -bowerbird From donovan at abs.net Fri Mar 4 17:08:42 2005 From: donovan at abs.net (D Garcia) Date: Fri Mar 4 17:10:08 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: References: Message-ID: <200503042008.42482.donovan@abs.net> > > The publisher simply chose to start at page 3. Was this common? > > it's not uncommon. oftentimes there is a "title-page", consisting of > nothing more than the name of the book, which is considered "page 1", Those are called "half-titles," btw. From Bowerbird at aol.com Fri Mar 4 17:14:14 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 17:14:34 2005 Subject: [gutvol-d] re: march forth Message-ID: <8.638ae14b.2f5a61e6@aol.com> jon said: > What type of global corrections were these? the type that is made easy by my tool. that's all i'll say for now. > One area is how to handle hyphenation, > and whether there was a short dash in the compound word > in the first place before the typesetter hyphenated the word. as i said, i ignored the issue of hyphenation for the time being. my tool will give a number of ways to deal with hyphenation, but the routines haven't been brought into the current version. but i can give a general overview. end-line hyphenation is removed. the hyphen in compound words is retained. to tell the difference, when there is ambiguity, you look at the rest of the text, to see if the word was handled consistently there. if it was, you match that. if not, you have more work to do. that's where it gets interesting. to go any further is to give too much information for here and now. > Hopefully you used the 600 dpi bitonal which should OCR the best. i did. > Antialiasing actually causes problems > (notwithstanding the much lower resolution.) right. i first thought the periods misrecognized as commas were the effect of anti-aliasing, but i used the 600-dpi scans. so it must be something else causing that problem. > One thing you could do is to look at the 600 dpi pages at 100% size > for which the punctuation was not correctly discerned. You probably > will see some errant pixels that fooled the OCR into thinking > it was some other punctuation mark than it is. i didn't care that much, really. the post-o.c.r. software can solve the problem well enough. i mentioned it for the record, for the sake of full disclosure, and to see if anybody knew why. > punctuation is a toughie for OCR to exactly get right, even if the recognition is admittedly somewhat difficult, i expect abbyy to correct "mr," and "mrs,", for instance. but even if abbyy doesn't, that's easy for me to program. > Resolving this usually requires a human being to go over, > especially for Works from the 18th and 19th century > where compound words with dashes were much more common if you want to retain those arcane spellings, it's difficult. if you wanna update them, the computer does it very easily. "to-day" and "to-morrow" become "today" and "tomorrow". instantly. 
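The "look at the rest of the text" rule sketched above reduces to a few lines of code. A minimal sketch in Python; the function name and the review marker are illustrative and not from any released tool.

    import re

    def join_linebreak_hyphen(head, tail, full_text):
        """Decide how to rejoin a word split across a line end as 'head-' + 'tail'.
        If the book elsewhere uses the hyphenated compound, keep the hyphen;
        if it uses the closed-up form, close it up; otherwise leave a marker
        so a human makes the call."""
        hyphened, closed = head + "-" + tail, head + tail
        if re.search(r"\b" + re.escape(hyphened) + r"\b", full_text):
            return hyphened
        if re.search(r"\b" + re.escape(closed) + r"\b", full_text):
            return closed
        return closed + "{?}"    # ambiguous -- flag for review

    # e.g. join_linebreak_hyphen("to", "morrow", book_text) keeps "to-morrow"
    # in an older text that spells it that way elsewhere; modernizing it to
    # "tomorrow" is then a separate, deliberate global change.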
> Sometimes one has to see what the author did > elsewhere in the text. is there some reason you think the computer can't do that? > In a few cases a guess is necessary based on understanding > what the author did in similar cases in the text. oh, i see. it takes "understanding". one of those rare precious human-being things. well then, i guess there's no way to program it. > Some of this can be automated. In other cases > it requires a human being to make a final decision. > I followed the UNL Cather Edition here. it's always easier to let other people make the decision, isn't it? ;+) > Whether or not it is an "unnecessary" distraction, > it is better to preserve the original text in the master etext version. well see, jon, that's where i differ with you. and other people do too. but like i said, as long as it's just one global change away, no big deal. i see lots of other cases, as well, where you diverge from the paper. a good many of the quotation-marks are set apart from their words. you're making editorial decisions whether you acknowledge it or not. > My thinking is that if someone wants to produce > a derivative "modern reader" edition of "My Antonia", > they are welcome to do so and add it to the collection > because the original faithful rendition is *already* there. whose "collection" are we talking about here jon? yours? do you have any intention of adding more "my antonia" editions? specifically a "derivative modern reader"? if so, i will submit mine. but surely you don't mean michael hart's project gutenberg collection? because, according to you anyway, he doesn't have a "faithful" rendition in his library, not even one, not *already* anyway. just a mangled one. another difference between your collection and michael's is you have 1 book in your collection and he has 10-15 thousand in his collection, depending on who is in charge of defining how the official counting is tabulated these days, it appears. whether you like it or not, that's a comment on the philosophies. > indicating this was more of a typesetter's convention > rather than something Cather specified. well that's a convenient dodge, isn't it? and of course you have no real _evidence_ that this is the case, do you? so you _really_ should enter each case as it _appears_, shouldn't you? at least if you want to stick to your philosophy? > In addition, the UNL Cather Edition closed off all the apostrophe s > (no spaces), but kept the space for many of " n't" words. > So here again I followed the UNL Cather Edition. and that's the difficulty with following an authority, ain't it? there are often so many, it's hard to know which one to follow! i know i can't keep up even with the editions of this one book! so how would a person possibly keep up on tens of thousands! and before you know it, you're having arguments about _that_! and not reading the book, or digitizing it, or playing at the park. and i don't know about you, jon, i don't think you're being consistent. you said you were reproducing what is right there in black-and-white on the page itself, even made high-resolution scans to prove it to us, and now you're making judges that are easy to spot. and to justify it, you're quoting some other figure of "authority". that's inconsistent. but heck, i have to be honest here. even if you _were_ consistent, and kept all of those quirks from the paper-book that _i_ consider to be distracting, the first thing i'm gonna do is global-change 'em. so all that hard work you did was for no good purpose to me. 
> Cather wanted the line length to be fairly short, > so this puts extra pressure on typesetters > who will either have to extend character spacing > for a particular line or scrunch it up more than usual, > depending upon the situation with the rest of the typesetting > on the page, and whether certain words can be hyphenated or not. oh!, hold it!, wait!, did i just hear you say what you just said? i think i did! yes, i'm quite sure i did! "cather wanted the line length to be fairly short". wow. you mean author-intent can go to _the_length_of_lines_? do you realize how significant that is to your philosophy, jon? it means you will need to respect willa's wishes on the matter. none of the long lines you might get in a web-browser! no sir! willa wanted short lines! (is that why the book looks so narrow?) > You mean accented characters? if they aren't in the lower-128 of the ascii range ("true ascii"), yes. > Accented characters are *always* important > to preserve under all situations. according to you, maybe. according to me, it depends. in this case, i say no. that's my prerogative as an editor. (and i _do_ consider myself an editor, not just a copyist.) > There's no need anymore, in these days of > Unicode and the like to stick with 7-bit ASCII. until unicode works flawlessly on every machine used by all the people i know, for texts like this that have only the occasional character outside the lower-127, where the meaning isn't changed, i'll stick to plain ascii. > I sense that you don't want to > properly deal with accented characters first of all, jon, i define what "properly" means for me, you don't. you can define it for yourself. but i won't let you define it for me. > I sense that you don't want to > properly deal with accented characters > since this poses extra problems with OCRing and proofing, nope. it's just that i see them as _unnecessary_ to this book. if a reader thinks it _is_ necessary, make the global-change. > something you are trying to avoid in your zeal to get everything > to automagically work. To me, that's going too far in simplifying. i'm not "simplifying". i'm consciously making a choice to use something that will work on the broad range of machines out there, as opposed to something that -- in far too many cases -- fails badly. it's a pragmatic decision based on real-life knowledge of the actual infrastructure of machines that exist out here in our real world. it's the same pragmatic decision that michael made when he crafted the philosophy guiding the building of this library of 10,000+ e-texts, in sharp contrast to your philosophy, which has built a 1-book library. > Preserving accented characters are important. in some cases, i'd agree with you. in others, not. in this case, not. > punctuation changes can sometimes subtly affect the meaning. you know, as a writer, i'd really like to think that's possible. as a person who uses a lot of commas, i _want_ to believe it. but i'll be darned if i can think of that many good examples. if you can, i would _love_ to hear them. and if you can show me _any_ in "my antonia", any at all, i'd give you extra bonus points. as it is, though, i just have to resign myself to the position that o.c.r punctuation errors are a distraction, but make no difference. i'll still root them out, due to my sense of professionalism, but i sure wish it felt _fun_, instead of feeling like _doing_chores_. and to the extent that i can automate the chores, i'll be _happy_. 
> They are hopefully caught by human proofers/readers > when grammar checkers don't (I do use Word to > help find both spelling and punctuation errors -- > when they find something, I then manually > check it in the page scans and the master XML.) oh, so you _do_ use an assist from your tools at times. that's good. > They are "sometimes" easy to spot. > Other times the automatic routines will not catch errors maybe the automatic routines you are using are just inferior. use my tool. if it doesn't spot something it should, let me know. > Usually true, but there are some rare exceptions where > an abbreviation can be mistaken for an end of a sentence. not if your routines are as smart as mine are. > Then there's the ellipsis issue i'm three-dozen layers deep on some of these issues, and you want to talk about level 2. i'm not interested. use my tool. if it doesn't give you the results you want, let me know. > This is also true, but as found in "My Antonia", > there are exceptions to pure nesting, such as > when a quotation spills over into several paragraphs > where the intermediate paragraphs are not terminated > by an end quotation mark (whether single or double.) is it really your considered opinion that i don't know this? that i haven't factored it into my thinking _and_ my tools? maybe you're grandstanding to the lurkers, but my goodness, jon, do you really think that _they_ are that stupid too? > Also, apostrophes are sometimes confused with single right quote marks. ditto. > With a smart enough grammar and parser, > the above might be properly parsed and the blah blah blah. use my tool. if it doesn't figure out your stuff, let me know. > But still, real-world texts tend to throw a lot of curve balls > that are sometimes hard to correctly machine process. i know how to hit 87 different pitches, from both sides of the plate, and you're telling me to "watch out for the curve balls". i laugh at you. > OCR is quite fast. It's making and cleaning up the scans > which is the human and CPU intensive part. wait! i thought you said _proofreading_ and _mark-up_ were the steps that take up the most time. didn't you? or do i have you confused with someone else? > Well, not all of the pages have been doubly proofed. > The team is not finished, and I plan to post a plea > somewhere for more eyeballs to go over it. have you heard about distributed proofreaders? might be able to find some people there... (ok, now you see what it feels like.) > I would like to receive error reports as well for this text, i'll tell you the same thing i told michael about project gutenberg: set up a system for the checking, reporting, correction, and logging of errors, a system that is transparent to the general public, and i will be more than happy to report errors to you, and help you out. otherwise, you waste my time, as i figure someone else can do it. which, by the way, is what everyone else is thinking. which is why errors in the texts are not being reported at nearly the frequency that they should be being reported. but i've got another message sitting here waiting to be sent where i discuss that topic in more detail, so i'll stop here now. > since Brewster wants highly proofed texts > for some experiments he plans to run similar to yours. i'll have to ask him about his tests. > But if I have to use the version you donate to PG, so be it. :^) probably, yep. if michael wants it. they say he'll take just about anything... > I did find one error in my text based on the list you gave. Thanks. you're welcome. 
but that's not the one i was talking about. :+) > I assume you discovered > the several different paragraph breaks in the PG edition? nope. i didn't even evoke the routines to examine paragraph-breaks. i considered doing so, once you said that there were differences, but decided it was just too inconsequential to even bother with it. it's another one of those things i would very much like to see a case where it made a difference, because i'd love to believe it _could_, but in the absence of a case (or even an _imaginary_ possibility, which i confess i can't come up with, not off the top of my head), i am forced to relegate it to the "too trivial to think about" pile. as above, i'll make the corrections, but i ain't gonna sweat 'em... -bowerbird From Bowerbird at aol.com Fri Mar 4 17:16:08 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 17:16:22 2005 Subject: [gutvol-d] new thread for noring Message-ID: <13e.e482be3.2f5a6258@aol.com> david said: > Those are called "half-titles," btw. oh cool, thanks. i figured they had a name, just didn't know it. -bowerbird From sly at victoria.tc.ca Fri Mar 4 23:04:12 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri Mar 4 23:04:32 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: <13e.e482be3.2f5a6258@aol.com> References: <13e.e482be3.2f5a6258@aol.com> Message-ID: On Fri, 4 Mar 2005 Bowerbird@aol.com wrote: > david said: > > Those are called "half-titles," btw. > > oh cool, thanks. i figured they had a name, just didn't know it. > One site that I bookmarked describing the "Anatomy of a Book" can be found here: http://www.bibliophilegroup.com/biblio/other/school/anatomy.html Andrew From Bowerbird at aol.com Sat Mar 5 05:12:02 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sat Mar 5 05:12:08 2005 Subject: [gutvol-d] re: Anatomy of a Book Message-ID: <12b.581be762.2f5b0a22@aol.com> andrew said: > http://www.bibliophilegroup.com/biblio/other/school/anatomy.html what a marvelous page! funny and informative at the same time! a winner! -bowerbird From jon at noring.name Sat Mar 5 10:35:52 2005 From: jon at noring.name (Jon Noring) Date: Sat Mar 5 10:36:07 2005 Subject: [gutvol-d] re: march forth In-Reply-To: <8.638ae14b.2f5a61e6@aol.com> References: <8.638ae14b.2f5a61e6@aol.com> Message-ID: <15697825812.20050305113552@noring.name> Bowerbird wrote: > jon said: >> Some of this can be automated. In other cases it requires a human >> being to make a final decision. I followed the UNL Cather Edition >> here. > it's always easier to let other people make the decision, isn't it? ;+) It's always *smarter* to leverage the experience and knowledge of others. The idea behind the "Trusted Edition" concept is to mobilize the help of both professional scholars and amateur enthusiasts, using community- oriented tools and processes, to assist with understanding the specific and unique bibliographic details of any particular Work. (Interestingly, when it comes to the more obscure public domain Works, which is the vast majority of them, they were only published once in one printing, so with respect to figuring out which edition is "acceptable" or "authoritative", it is pretty cut-and-dried. It's the famous classics, especially the much older ones which are written in some archaic fashion or in another language, where it can get quite complicated as to what is/are the acceptable editions to use as source(s). 
Neverthess, for the classics most of this has already been hashed out, and where there is no agreement between any two, do *both* of them!) >> Whether or not it is an "unnecessary" distraction, it is better to >> preserve the original text in the master etext version. > well see, jon, that's where i differ with you. and other people do too. And there are people who also agree, at least in general, with my position. I'm not alone on this. You make it out to be like I'm alone on this, like a "John the Baptist" in the desert. > but like i said, as long as it's just one global change away, no big deal. The problem is that sometimes global changes are easy to do in one direction, and much harder to do the other. When information is removed, such as converting accented characters to 7-bit ASCII with no traceback information, it is harder to go in the other direction because information has been lost. > i see lots of other cases, as well, where you diverge from the paper. > a good many of the quotation-marks are set apart from their words. > you're making editorial decisions whether you acknowledge it or not. There's not lots, but a few. The focus is to produce a *textually* accurate rendition which is presentationally-agnostic wherever possible. We took to heart a lot of the information provided by the UNL Cather Edition online information because that is the smart thing to do. We *are* in contact with a couple scholars of Willa Cather's works besides the UNL folks. To ignore expert advice is, to put it bluntly, stupid. And we are putting together a preliminary list of the top 500/1000 classic public domain works, and should the project launch, we plan to get these rigorously converted along the lines of "My Antonia", and to mobilize the help of the professional *and amateur* enthusiasts to help guide the process. >> My thinking is that if someone wants to produce a derivative >> "modern reader" edition of "My Antonia", they are welcome to do so >> and add it to the collection because the original faithful >> rendition is *already* there. > whose "collection" are we talking about here jon? > > yours? The collection (of one so far, and it is essentially a working demo for learning purposes) does not state to be "Jon Noring's" collection. Go to http://www.openreader.org/myantonia/ and tell me what it says there, and if it prominently mentions my name. There's another name given to it. Just because I'm the most visible person with regards to it here, does not mean it is mine. It is not. It is part of a fairly visible project mobilizing a group of people (but not visible on the particular forums you frequent, and not by the specific name, which doesn't matter.) Should this project go into production mode, what is produced will belong to the world. It's not going to be elitist or exclusive as some other etext projects are (I'm not talking about PG obviously) -- all work product will be made publicly available, as it should be since it is from the Public Domain. Anyway, what is this strange obsession with ownership and competition? Why do you keep talking about PG being Michael Hart's (more on this below)? > do you have any intention of adding more "my antonia" editions? > specifically a "derivative modern reader"? if so, i will submit mine. Sure. So long as the changes from the original acceptable source are sufficiently noted in the text file, such as an "Editor's Introduction", some boilerplate, or whatever you want to add. 
I'm not sure if you have an interest in taking the time to provide such editorial information, but we'll be happy to take your edited version and mark it the "Bowerbird Modernized Edition" or whatever. I am thinking of providing my own modernized edition as well (which will have very few changes in the case of "My Antonia".) ("Sufficiently noted" does not mean to spell in gory detail each and every change, but enough info so the reader will have a good general idea of how it was "modernized". Readers will appreciate the thoroughness expended to modernize a text for them, and will have warm fuzzies that it is "accurate" when the editor *takes the time* to explain what they did. This builds *trust* with the reader.) > but surely you don't mean michael hart's project gutenberg collection? So? What do you care? Is there a law saying any digital text version of a public domain work *must* be submitted to PG? Does PG have a government monopoly on the Public Domain? Of course not. And about this strange fixation you have on "ownership", PG is no longer "Michael Hart's". You seem to fail to understand that PG now belongs to the hundreds/thousands who have materially contributed in building it. (DP has greatly increased the ownership of PG several fold by its cool way at mobilizing thousands of volunteers.) Michael Hart is the pioneer and founder of the PG idea, but PG has gone well beyond him. He can die tomorrow (hopefully not!), and what he has started will continue unabated. If it were still his, it may die with him. [An outside example is the World Wide Web -- does it still "belong" to Tim Berners-Lee because he invented the general idea and some of the early standards and tools for it? If Tim Berners-Lee dies tomorrow, will the plug be pulled on the Web?] When you produce this magical "toolset" of yours and give it away to others to use (or do you plan to sell it?), it will no longer be "yours". So, should you die tomorrow (hopefully not!), will there be a community of people who will take all your ideas and code and continue on where you are now? Or will it die with you? So much for the benefits of ownership and control. This is why just about everything we've done for "My Antonia" is *already* online and downloadable, even though it is still an early beta/demo shake out things. There is more to put up. The Bible mentions "casting one's bread upon the waters, and it will be returned to you." The complementary logic of this is that those who develop their tools in secret, who don't strive to build partnerships with other like-minded folk, who are not transparent, etc., etc., are not casting their bread upon the water, and thus may not find the kinds of rewards they seek. Interestingly, Michael Hart cast his bread upon the water, and it has returned more than a hundred-fold. Of all the great contributions Michael Hart has made, it is to inspire a volunteer movement. I do have problems with how the earlier PG collection has been assembled (which DP has mostly, but not completely, resolved), but I recognize that Michael Hart has accomplished a lot *because he cast his bread upon the waters*. He not did do his thing in secret, and he welcomed volunteers from the beginning. DP is a result of his vision, of his casting his bread upon the waters. Even his PGII concept (which I think is ill-conceived for various reasons not germane to this particular discussion) is an attempt to expand the PG collection by embracing other collections into one big happy tent. 
And he talks about giving away trillions and trillions of etexts for free. I like this attitude. He is giving away, not taking. He is open and transparent -- he does not keep everything secret. If he were developing software, he would immediately open source it and ask for others to help write it. It will be free for all from the start. He does not keep his light hidden underneath a blanket. So if Michael Hart is your hero, then consider emulating his example. I think you catch the drift. That's why I keep asking when you plan to start a SourceForge or similar open source project to develop your system. > because, according to you anyway, he doesn't have a "faithful" > rendition in his library, not even one, not *already* anyway. just a > mangled one. With respect to PG's current "My Antonia". Yes, it is mangled. More importantly, it is not trustworthy, which goes beyond just errors or differences. I discussed this on TeBC, which I know you've read (either from an anonymous account or a friend who forwards messages. I don't really care.) And of course, in my discussion of the whole PG corpus, I carefully differentiate between the DP and the non-DP portions of it -- I've done this from the beginning. How convenient you ignore this important fact. > another difference between your collection and michael's is > you have 1 book in your collection and he has 10-15 thousand > in his collection, depending on who is in charge of defining > how the official counting is tabulated these days, it appears. > whether you like it or not, that's a comment on the philosophies. Hmmm. Sounds a lot like a school yard taunt: "Let's compare yours and mine and we'll see whose is bigger -- drop your pants..." So what? How many etexts did Michael have in "his" collection in 1991? Every journey starts with the first step. And why do you say "my collection" (in reference to the LibraryCity "My Antonia" project)? Why this obsession with possession and ownership: "My tool", "My idea", "My whatever"? And why do you view everything in a competitive color, rather than complementary and collaborative? In these days of open source development, collaborative efforts, etc., your approach to do everything in secret is really odd and out-of-synch. Why don't you cast your bread upon the waters and see what happens? Or are you afraid your bread won't return to you multiplied? >> indicating this was more of a typesetter's convention rather than >> something Cather specified. > well that's a convenient dodge, isn't it? No. > i know i can't keep up even with the editions of this one book! > so how would a person possibly keep up on tens of thousands! The idea of "Trusted Editions" as an archetype is that it won't rely on any one person. It is part of a bigger picture of building communities around noted etexts. To mobilize people. To not only bring digital texts to people (as PG has been doing), but to also bring people and community to digital texts (which PG is NOT doing now.) But so far I don't see much interest in your "calculus" to understand the important role people play in etexts, from creation to final use. And that the most viable contributions to Mankind come when people are mobilized in a cooperative/community way (either in a non-profit open source approach, or in a private for-profit approach using employees and contractors.) Technology is to provide tools to make a community of people work better together for a common end-goal, not to replace community. 
And the word "trust" is an important core human concept -- society works only when there is sufficient trust between people, and trust in the various products of their labors. So any human endeavor which does not put "trust" as #1 is prone to eventually fail. >> Cather wanted the line length to be fairly short, so this puts >> extra pressure on typesetters who will either have to extend >> character spacing for a particular line or scrunch it up more than >> usual, depending upon the situation with the rest of the >> typesetting on the page, and whether certain words can be >> hyphenated or not. > oh!, hold it!, wait!, did i just hear you say what you just said? > i think i did! yes, i'm quite sure i did! > > "cather wanted the line length to be fairly short". > > wow. you mean author-intent can go to _the_length_of_lines_? > > do you realize how significant that is to your philosophy, jon? *rolls eyes* > it means you will need to respect willa's wishes on the matter. > > none of the long lines you might get in a web-browser! no sir! > > willa wanted short lines! (is that why the book looks so narrow?) You really need to read less selectively. I've used the phrase "textually faithful" many times the last couple weeks for a reason. The reason? Because it is important that texts transcend the visual as much as possible, to become agnostic with respect to presentation type, yet contain sufficient structure and semantics so quite authentic visual presentation is possible. This is necessary not only for accessibility, but repurposeability and usability. (And this helps Michael Hart's long-term vision in universal language translations of digital texts.) With the right style sheet, most of Cather's stated preferences are possible to duplicate. There's a reason why the texts are marked up in XML. With one tiny change in the CSS for our "My Antonia" demo, we can duplicate quite well Willa Cather's apparent preference in visual presentation of her book. Interestingly, the UNL Cather Edition (the print version published by UNL's publishing house) uses longer line lengths and smaller print than Cather specified. They did not deem the exact visual presentation of the content to be as important as much as the textual faithfulness, even though they discuss it on their web site. >> Accented characters are *always* important to preserve under all >> situations. > according to you, maybe. according to me, it depends. > in this case, i say no. that's my prerogative as an editor. > (and i _do_ consider myself an editor, not just a copyist.) Sure, you can call yourself an editor, and do what editors do. But to throw away the richness of the expanded Western character set which many, many public domain books use -- is simply bizarre. This richness is what adds to the aesthetics of the text, and builds a better reading experience. It also *adds* trust because people will see the care you took in doing this -- in sweating out the details. >> There's no need anymore, in these days of Unicode and the like to >> stick with 7-bit ASCII. > until unicode works flawlessly on every machine used > by all the people i know, for texts like this that have > only the occasional character outside the lower-127, > where the meaning isn't changed, i'll stick to plain ascii. I believe this is a copout. You can convert most of the western-based Unicode characters to ISO-8859 (the "8-bit ASCII") if you want, and to other encoding schemes, so you have even more encoding options to handle just about everything everyone uses. 
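Converting down the encoding ladder is indeed easy; what is hard is going back up. Folding accented characters to plain ASCII is trivial, but it cannot be undone later unless a record of what was dropped is kept alongside the text -- which is the point about traceback information. Here is a minimal sketch using Python's standard unicodedata module; the function names and the traceback format are illustrative assumptions, not anyone's actual workflow.

    import unicodedata

    def fold_to_ascii(text):
        """Fold accents away ('Antonia' with an acute A becomes plain 'Antonia'),
        returning the folded text plus a traceback list of
        (position, original character). Without that list, 'naive' alone
        no longer tells you the page printed the dotted 'i' with a diaeresis."""
        out, trace, pos = [], [], 0
        for ch in text:
            plain = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode()
            if not plain:          # no ASCII equivalent at all: keep the original
                plain = ch
            if plain != ch:
                trace.append((pos, ch))
            out.append(plain)
            pos += len(plain)
        return "".join(out), trace

    def restore(folded, trace):
        """Undo the fold using the traceback list (applied back-to-front so
        earlier replacements cannot shift later positions)."""
        chars = list(folded)
        for pos, original in reversed(trace):
            width = len(unicodedata.normalize("NFKD", original)
                        .encode("ascii", "ignore").decode()) or 1
            chars[pos:pos + width] = original
        return "".join(chars)

    # fold_to_ascii("Ántonia") -> ("Antonia", [(0, "Á")]); restore() gets the
    # accented form back only because the traceback was kept.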
Today's web browsers handle Unicode very well. And since you are building your own ebook viewer, you can implement Unicode in it quite trivially (at least be able to handle, to start out with, the Latin-1, Latin-Extended and Greek character sets.) The problem with throwing away the higher-characters is that, contrary to what you say, it is not easy to reinsert them as they appeared in the original, unless you re-OCR the texts and the OCR accurately finds them. I can tell you that OCR, even Abbyy, still has some problems with accented characters, especially those which use very subtle accent marks that can easily be mistaken for serifs. As an example, I'm curious to know if Abbyy 7 will correctly recognize *all* the accented characters in the current "My Antonia" scans -- I listed them in my prior message. If you want, I will be happy to go through and list the actual page numbers they are found on. For example, the umlauted "i" in "naïve" and "naïvety" -- this is a particularly difficult character to recognize (it is often incorrectly recognized as a capital 'Y'), and it is often (as are most accented characters) used in words which will not be found in some lookup dictionary. >> I sense that you don't want to properly deal with accented >> characters > first of all, jon, i define what "properly" means for me, you don't. > you can define it for yourself. but i won't let you define it for me. A lot of people consider accented characters important to preserve. Since, as you say, it is easy to translate from accented characters to non-accented characters (but not vice-versa), then you can meet more people's needs (including those odd few who prefer *not* to read accented characters) by recognizing and preserving these characters. I'd like feedback from the DP folk as to their policy regarding reproducing the non-ASCII characters (Latin 1, Latin Extended, Greek, etc.) It would not surprise me if DP, as a matter of policy, reproduces them. >> I sense that you don't want to properly deal with accented >> characters since this poses extra problems with OCRing and >> proofing, > nope. it's just that i see them as _unnecessary_ to this book. > if a reader thinks it _is_ necessary, make the global-change. How? Unless you somehow record that information on accented characters in some master document, you can't go in the other direction. You are assuming all the words using accented characters are found in some dictionary, which is not true. >> something you are trying to avoid in your zeal to get everything >> to automagically work. To me, that's going too far in simplifying. > i'm not "simplifying". i'm consciously making a choice to use > something that will work on the broad range of machines out there, > as opposed to something that -- in far too many cases -- fails badly. Yes, but this is the fundamental flaw. You appear to be taking short-cuts to try to prove that people don't matter in the process to produce high-quality etexts that are repurposeable and trustworthy. Certainly it is much preferred to have better and more accurate tools, and hopefully the tools you are producing will make life easier for many *people* involved in creating structured digital texts of public domain works. > it's the same pragmatic decision that michael made when he crafted > the philosophy guiding the building of this library of 10,000+ e-texts, > in sharp contrast to your philosophy, which has built a 1-book library. As previously discussed, did Michael immediately go from 1 text to 10,000 etexts in two weeks? 
And did this growth occur solely by his own sweat of the brow? And note that almost half of the PG collection is done mostly right because Distributed Proofreaders *does* follow "my philosophy" fairly closely (or maybe better put I follow their philosophy fairly closely.) There's another purpose behind the "Trusted Editions" project. It is not intended to be a competitor to PG or other text projects, but to further benefit the various users of public domain texts. More options are better than fewer options. >> Preserving accented characters are important. > in some cases, i'd agree with you. in others, not. in this case, not. Can you explain how you decide when accented characters are to be reproduced? Or is this impossible to explain using an unambiguous, objective rule? (And will your toolset handle the full Western portion of the Unicode set? If so, then why not process *all* texts using the full character set? Why the need to reduce some of them to irreversible 7-bit ASCII?) > as it is, though, i just have to resign myself to the position that > o.c.r punctuation errors are a distraction, but make no difference. > i'll still root them out, due to my sense of professionalism, but > i sure wish it felt _fun_, instead of feeling like _doing_chores_. > and to the extent that i can automate the chores, i'll be _happy_. What's interesting is that there are lots of people who *enjoy* doing this. That's what makes DP so successful, because it brings together people with different interests. Does DP do what it does the best possible way at this time? Of course not. Is DP as good as it could ever be? Of course not. Charles himself noted that to me last year. DP is still a "beta" in progress, or maybe a version 1.0. But DP recognizes that mobilizing people is a critical requirement of success. Juliet could talk for hours about how important the people side of producing etexts really is. And note that there are millions of texts that *cannot* be handled by your toolset, such as handwritten records, horribly tabulated data with poor and ambiguous structure, etc. These texts are held by historical and genealogical societies, local governments, etc., etc. DP, or a DP-like process, properly cloned, is the best way to convert these texts to useful structured digital texts. Not only that, these local groups have a lot of enthusiastic supporters who will volunteer to scan and proof these texts. It will be done by people power, enabled by technology, and not solely by machine power -- unless, of course, someone soon invents truly sentient AI machines with real human intelligence, personalities and even emotions. >> They are hopefully caught by human proofers/readers when grammar >> checkers don't (I do use Word to help find both spelling and >> punctuation errors -- when they find something, I then manually >> check it in the page scans and the master XML.) > oh, so you _do_ use an assist from your tools at times. that's good. Of course! I use tools when I can, but I don't blindly use them. Do you think I use 3x5 cards for everything I do? >> They are "sometimes" easy to spot. Other times the automatic >> routines will not catch errors > maybe the automatic routines you are using are just inferior. *Shrug* After all, I put together "My Antonia" for the project by kludging together sub-optimum tools, hardware and processes (e.g., not having a high-quality sheet feed scanner). 
"My Antonia" is simply a pre-beta to test out several (but not all) of the important concepts, to shake down various things for the next stage effort. It is showing us the kind of tools and applications we will need to go into production (this includes the high-quality scanning and image preparation processes.) The discussion, both here, and on TeBC, both critical and supportive, both public and private, has been extremely useful at helping us to better understand various things. This feedback has shown things we've done wrong, things that could be improved, and different ways of looking at the various issues. So your asumption that we've finalized the "formula" and the "process" is incorrect. We feel comfortable in "casting our bread upon the waters", so we can inspire many people, supporters and critics, to provide valuable feedback. We obviously inspired you to reply -- your feedback has been very valuable. >> This is also true, but as found in "My Antonia", there are >> exceptions to pure nesting, such as when a quotation spills over >> into several paragraphs where the intermediate paragraphs are not >> terminated by an end quotation mark (whether single or double.) > is it really your considered opinion that i don't know this? > that i haven't factored it into my thinking _and_ my tools? > > maybe you're grandstanding to the lurkers, but my goodness, > jon, do you really think that _they_ are that stupid too? You seem to have blind faith that you will be able to sufficiently cover most every important "exception" found in most texts, and I don't believe it is yet possible. If you do, that'll be wonderful. But your apparent dismissal of the importance of universal handling of extended character sets is alone a show-stopper, in my opinion. Now if you do plan to soon universally support the Unicode character set (or at least the European subset of it), then I believe it will greatly make your toolset much more valuable. >> Well, not all of the pages have been doubly proofed. The team is >> not finished, and I plan to post a plea somewhere for more eyeballs >> to go over it. > have you heard about distributed proofreaders? > might be able to find some people there... I should have written "to post a plea to a few places", because yes, I plan to post a message to the DP forums about "My Antonia". But I want to do some more preliminary assessments before approaching them. Anyway, I've already posted here for some help, and have done some back channel chatting, so a few DPers already know about "My Antonia". :^) >> I would like to receive error reports as well for this text, > i'll tell you the same thing i told michael about project gutenberg: > set up a system for the checking, reporting, correction, and logging > of errors, a system that is transparent to the general public, and > i will be more than happy to report errors to you, and help you out. Now, I agree with you on this. Part of the community aspect of the bigger vision is a system for follow-on proofing. But we also, for the short-term, want to improve the "My Antonia" text the old-fashioned way of manual error report submissions. Properly designing the error feedback and updating system has to be integrated with the other community aspects of the digital texts since these are inextricably linked -- in addition, the "manual" process helps in better understanding the community-based system. > which, by the way, is what everyone else is thinking. 
> which is why errors in the texts are not being reported > at nearly the frequency that they should be being reported. > but i've got another message sitting here waiting to be sent > where i discuss that topic in more detail, so i'll stop here now. I agree with you on this. And the error reporting system is an important aspect of building user trust in any etext collection. >> since Brewster wants highly proofed texts for some experiments he >> plans to run similar to yours. > i'll have to ask him about his tests. brewster@archive.org Not sure what his current status is on this. >> I did find one error in my text based on the list you gave. Thanks. > you're welcome. but that's not the one i was talking about. :+) *shrug*. It will be found, unless it's something that you believe is an error in how we transcribed the original first edition, and we do not consider it to be an error. You alluded to that in your prior message (such as mentioning the small space that precedes a few question marks -- inspection of a large number of pages where question marks appear strongly supports my contention that this is a typesetting issue and not anything specified by Willa. Anyway, the original communications by Cather on her many preferences for "My Antonia" *exist* and scholars have pored over them with a fine-toothed comb. The UNL Cather Edition does not place any spaces before any question marks, nor do they place a space anywhere before an apostrophe s used in contractions.) However, the " 's" contraction issue is one I'm going to look at again today. One of my proofers noted this to me the other day, so with her feedback and yours, it will be looked at again. See, the system, primitive as it is at present, *is* working (even if it is currently a manual, short-term hack.) Jon From hart at pglaf.org Sat Mar 5 11:17:58 2005 From: hart at pglaf.org (Michael Hart) Date: Sat Mar 5 11:17:59 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <20050304222909.GA32543@pglaf.org> References: <42261F44.1000005@perathoner.de> <42274F2E.8010000@corruptedtruth.com> <4228C097.8030803@perathoner.de> <20050304222909.GA32543@pglaf.org> Message-ID: We resisted the temptation to divide the Bible and Shakespeare into various sections when others were claiming each of AEsop's Fables as an individual eBook to pad their bibliographies. However, when people started requesting individual Shakespeare plays and books of the Bible for research purposes, we did as they asked, which we nearly always try to do for our readers. I'm sure some people would also try to prevent paper publishers and libraries from publishing individual Shakespeare plays or books of the Bible. BTW, I think we put all the shortest books in one file, at least that was my intention. However, when someone donates a Shakespeare or Bible in their own particular favorite format and breakdown, that's totally up to them, and I'm not about to fight with them about it. . . . If someone wants a verse by verse eBible, I think we should zip it all in one huge file, but still let it unzip in the manner they prefer. mh On Fri, 4 Mar 2005, Greg Newby wrote: > On Fri, Mar 04, 2005 at 09:09:59PM +0100, Marcello Perathoner wrote: >> Brandon Galbraith wrote: >> >>>> I suppose while these updates are going on, we should also update >>>> 13,000 to 15,000 in the opening: >>> >>> It's too bad we can't make that dynamic, feeding off of a database =) >> >> Not worth the trouble ... First, we had to agree on what counts as an >> ebook in its own right. >> >> Eg.
we have a Bible in the collection, where every chapter got its own >> ebook number. Also, many books are posted in parts, and every part got >> its own number besides the complete book. >> >> To get a meaningful count of ebooks we first had to get rid of such >> shameless stuffings. > > That's an unwarranted poke, Marcello. > > We do have a count, and it's eBook #s as used as the primary > access point to our files. > > Agreeing on what counts as an eBook is not necessary. We know > how many eBook #s we have, even if there is disagreement on > what counts as an eBook. There are plenty of words (in > GUTINDEX.ALL and elsewhere) to augment this simplistic number. > -- Greg > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From servalan at ar.com.au Sat Mar 5 15:19:36 2005 From: servalan at ar.com.au (Pauline) Date: Sat Mar 5 15:20:07 2005 Subject: [gutvol-d] Database down? Message-ID: <422A3E88.5010501@ar.com.au> Hiya All, Did I miss an outage notice? The PG server appears to be having hassles: I keep seeing "Could not connect to database server." when I try to access etexts. Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From Bowerbird at aol.com Sat Mar 5 16:58:50 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sat Mar 5 16:59:09 2005 Subject: [gutvol-d] have a nice day Message-ID: <195.3a482828.2f5bafca@aol.com> i believe there's no interest in these threads, other than from jon and myself, so i will reply to him backchannel and be done with it. if anyone else does want a copy, let me know, and i'll send it to you as well. thank you. the proof, as always, is in the pudding. other than that, have a nice day... :+) -bowerbird From prosfilaes at gmail.com Sat Mar 5 19:49:57 2005 From: prosfilaes at gmail.com (David Starner) Date: Sat Mar 5 19:50:13 2005 Subject: [gutvol-d] re: march forth In-Reply-To: <15697825812.20050305113552@noring.name> References: <8.638ae14b.2f5a61e6@aol.com> <15697825812.20050305113552@noring.name> Message-ID: <6d99d1fd05030519495f5f2e91@mail.gmail.com> On Sat, 5 Mar 2005 11:35:52 -0700, Jon Noring wrote: > The collection (of one so far, and it is essentially a working demo > for learning purposes) does not state to be "Jon Noring's" collection. > Go to http://www.openreader.org/myantonia/ and tell me what it says > there, and if it prominently mentions my name. The comment about DjVu and IE6 seems out of place; there's plugins for Netscape there too. It seems like an interesting project. I'm not sure I have the time or ability to help, but I willing to make the offer. > Readers will appreciate the > thoroughness expended to modernize a text for them, and will have warm > fuzzies that it is "accurate" when the editor *takes the time* to > explain what they did. This builds *trust* with the reader.) I got into a bit of a flame war on bookpeople by suggesting that a translation might stand a few words on why. > So? What do you care? Is there a law saying any digital text version > of a public domain work *must* be submitted to PG? Does PG have a > government monopoly on the Public Domain? Of course not. I've cared because a central library makes it easier to find a work, instead of having to search in several places. 
Also, Project Gutenberg has a long history, indicating it will be around tomorrow and the day after that, and it's decentralized, meaning that if it's not, everything won't just disappear. > And the word "trust" is an important core human concept -- society > works only when there is sufficient trust between people, and trust > in the various products of their labors. So any human endeavor which > does not put "trust" as #1 is prone to eventually fail. I don't agree. PG has not put "trust" as an explicit concept, but people being as they are, they trust that the PG works are done competently. When I gave my sister a copy of "A Doll's House", I didn't check editions and quality of translation; I just bought a random copy. You want works to be verifiable, but most people just don't worry about that; they "trust" others to do a good job. > I'd like feedback from the DP folk as to their policy regarding > reproducing the non-ASCII characters (Latin 1, Latin Extended, Greek, > etc.) It would not surprise me if DP, as a matter of policy, > reproduces them. We mangle the Greek via transliteration still, but we always get Latin-1 right, and we more or less get Latin Extended correct. (OE is usually broken, but accents are recorded, and I assume most PMs are aware enough to catch the weird characters.) Hebrew, Arabic and friends are usually, hopefully, handled by the PPer. > > nope. it's just that i see them as _unnecessary_ to this book. > > if a reader thinks it _is_ necessary, make the global-change. Why judge that on a book-by-book basis? In fact, you can't, since your programs don't tend to support "accented" characters in any texts. Certainly, the majority of pre-1850 works have at least one Greek quote that ASCII will horribly and irrevocably mangle. French quotes aren't exactly uncommon in our era of books, either. From marcello at perathoner.de Sun Mar 6 10:57:41 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun Mar 6 11:19:06 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422A3E88.5010501@ar.com.au> References: <422A3E88.5010501@ar.com.au> Message-ID: <422B52A5.3020600@perathoner.de> Pauline wrote: > I keep seeing "Could not connect to database server." when I try to > access etexts. The PG site is just too popular. The database cannot serve more than ~30 requests at one time. Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. -- Marcello Perathoner webmaster@gutenberg.org From donovan at abs.net Sun Mar 6 12:28:42 2005 From: donovan at abs.net (D Garcia) Date: Sun Mar 6 12:30:08 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B52A5.3020600@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <200503061528.43221.donovan@abs.net> On Sunday 06 March 2005 01:57 pm, Marcello Perathoner wrote: > The PG site is just too popular. The database cannot serve more than ~30 > requests at one time. I don't think there's any such thing as PG being too popular. :) But it does sound as if the DB is too anemic for the current (and future) popularity. From servalan at ar.com.au Sun Mar 6 13:27:32 2005 From: servalan at ar.com.au (Pauline) Date: Sun Mar 6 13:28:12 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B52A5.3020600@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <422B75C4.5070705@ar.com.au> Marcello Perathoner wrote: > Pauline wrote: > >> I keep seeing "Could not connect to database server." when I try to >> access etexts.
> > > The PG site is just too popular. The database cannot serve more than ~30 > requests at one time. > > Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. I've been recommending PG texts & now have a bunch of replies saying PG is unusable due to this problem. :( If the server cannot be configured to cope with increased load, please at least consider changing the error message to something more useful for the user. e.g. "Project Gutenberg is too busy at the moment to handle your request, please try again later. Current slack times are 18.00->8.00 EST." Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From tb at baechler.net Sun Mar 6 19:43:02 2005 From: tb at baechler.net (Tony Baechler) Date: Sun Mar 6 19:41:35 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B75C4.5070705@ar.com.au> References: <422B52A5.3020600@perathoner.de> <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <5.2.0.9.0.20050306194202.037e4130@baechler.net> Hi. A slight workaround for this is to refresh the page. In Internet Explorer, Control + F5 does the trick. I got that same error and it refreshed fine. From gbnewby at pglaf.org Sun Mar 6 20:17:44 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Mar 6 20:17:45 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B52A5.3020600@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <20050307041744.GA10764@pglaf.org> On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > Pauline wrote: > > >I keep seeing "Could not connect to database server." when I try to > >access etexts. > > The PG site is just too popular. The database cannot serve more than ~30 > requests at one time. > > Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. Marcello, can you tell me what it would take to grow our capacity to handle hits? I know you're also looking at Web site mirrors (I can supply some sites for this, BTW). But if you could come up with some recommendations for what it would take for iBiblio to dramatically grow our capacity, I can try to put something together for them. 30 simultaneous requests to PostgreSQL does not seem like a whole lot, so I'm assuming that contention for resources with other hosted sites is the main problem. It would be nice to do better. I know that iBiblio claims network bandwidth is not an issue, but possibly we need to look at the whole system. Thanks for any ideas you (or others) can provide. -- Greg From brandon at corruptedtruth.com Sun Mar 6 20:20:07 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Sun Mar 6 20:20:23 2005 Subject: [gutvol-d] Database down? In-Reply-To: <20050307041744.GA10764@pglaf.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> Message-ID: <422BD677.203@corruptedtruth.com> Marcello, Could connection pooling fix this? Maybe combined with more concurrent connections to the database server? I'm not sure how big the database box is though. -brandon Greg Newby wrote: >On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > > >>Pauline wrote: >> >> >> >>>I keep seeing "Could not connect to database server." when I try to >>>access etexts. >>> >>> >>The PG site is just too popular. 
The database cannot serve more than ~30 >>requests at one time. >> >>Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. >> >> > >Marcello, can you tell me what it would take to grow our capacity to >handle hits? I know you're also looking at Web site mirrors (I can >supply some sites for this, BTW). But if you could come up with some >recommendations for what it would take for iBiblio to dramatically grow >our capacity, I can try to put something together for them. > >30 simultaneous requests to PostgreSQL does not seem like a whole lot, >so I'm assuming that contention for resources with other hosted sites is >the main problem. It would be nice to do better. > >I know that iBiblio claims network bandwidth is not an issue, but >possibly we need to look at the whole system. > >Thanks for any ideas you (or others) can provide. > -- Greg >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050306/27426d04/attachment.html From jlinden at projectgutenberg.ca Sun Mar 6 21:41:51 2005 From: jlinden at projectgutenberg.ca (James Linden) Date: Sun Mar 6 21:40:51 2005 Subject: [gutvol-d] Database down? In-Reply-To: <20050307041744.GA10764@pglaf.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> Message-ID: <422BE99F.7080302@projectgutenberg.ca> Migrating to MySQL might help -- and it's easier to replicate/mirror on the fly. -- James Greg Newby wrote: > On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > >>Pauline wrote: >> >> >>>I keep seeing "Could not connect to database server." when I try to >>>access etexts. >> >>The PG site is just too popular. The database cannot serve more than ~30 >>requests at one time. >> >>Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. > > > Marcello, can you tell me what it would take to grow our capacity to > handle hits? I know you're also looking at Web site mirrors (I can > supply some sites for this, BTW). But if you could come up with some > recommendations for what it would take for iBiblio to dramatically grow > our capacity, I can try to put something together for them. > > 30 simultaneous requests to PostgreSQL does not seem like a whole lot, > so I'm assuming that contention for resources with other hosted sites is > the main problem. It would be nice to do better. > > I know that iBiblio claims network bandwidth is not an issue, but > possibly we need to look at the whole system. > > Thanks for any ideas you (or others) can provide. > -- Greg > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From hyphen at hyphenologist.co.uk Mon Mar 7 00:50:11 2005 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon Mar 7 00:50:48 2005 Subject: [gutvol-d] Database down? In-Reply-To: <200503061528.43221.donovan@abs.net> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <200503061528.43221.donovan@abs.net> Message-ID: <8a5o21l78j4o9f1sjos6qav3tejkbegvpd@4ax.com> On Sun, 6 Mar 2005 15:28:42 -0500, D Garcia wrote: | On Sunday 06 March 2005 01:57 pm, Marcello Perathoner wrote: | > The PG site is just too popular. The database cannot serve more than ~30 | > requests at one time. 
| I don't think there's any such thing as PG being too popular. :) Agreed. Is there no way of transferring requests which cannot be handled to a mirror site? Cheaper than a bigger server? -- Dave F From brandon at corruptedtruth.com Mon Mar 7 00:54:33 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Mon Mar 7 00:54:53 2005 Subject: [gutvol-d] Database down? In-Reply-To: <8a5o21l78j4o9f1sjos6qav3tejkbegvpd@4ax.com> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <200503061528.43221.donovan@abs.net> <8a5o21l78j4o9f1sjos6qav3tejkbegvpd@4ax.com> Message-ID: <422C16C9.8050000@corruptedtruth.com> Usually, it's cheaper and easier to use a bigger database server than to try to redirect the requests to another site. The only time you'd want to redirect to another site would be in the event the primary site was down. Disclaimer: I'm a sysadmin at a hosting company. -brandon Dave Fawthrop wrote: >On Sun, 6 Mar 2005 15:28:42 -0500, D Garcia wrote: > >| On Sunday 06 March 2005 01:57 pm, Marcello Perathoner wrote: >| > The PG site is just too popular. The database cannot serve more than ~30 >| > requests at one time. >| I don't think there's any such thing as PG being too popular. :) > >Agreed. >Is there no way of transferring requests which cannot be handled to a >mirror site? Cheaper than a bigger server? > > > From bruce at zuhause.org Mon Mar 7 07:21:01 2005 From: bruce at zuhause.org (Bruce Albrecht) Date: Mon Mar 7 07:21:06 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422BE99F.7080302@projectgutenberg.ca> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BE99F.7080302@projectgutenberg.ca> Message-ID: <16940.29021.164180.819016@celery.zuhause.org> There's also clustering available for Postgresql, which might be easier than migrating to MySQL. Either way, it would probably take more human resource time than throwing hardware at it (for example, a dual Opteron system with 4 GB RAM and four 250 GB SATA drives in RAID 10 for about $3300, which might be overkill). James Linden writes: > Migrating to MySQL might help -- and it's easier to replicate/mirror > on the fly. > > -- James > > Greg Newby wrote: > > On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > > > >>Pauline wrote: > >> > >> > >>>I keep seeing "Could not connect to database server." when I try to > >>>access etexts. > >> > >>The PG site is just too popular. The database cannot serve more than ~30 > >>requests at one time. > >> > >>Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. > > > > > > Marcello, can you tell me what it would take to grow our capacity to > > handle hits? I know you're also looking at Web site mirrors (I can > > supply some sites for this, BTW). But if you could come up with some > > recommendations for what it would take for iBiblio to dramatically grow > > our capacity, I can try to put something together for them. > > > > 30 simultaneous requests to PostgreSQL does not seem like a whole lot, > > so I'm assuming that contention for resources with other hosted sites is > > the main problem. It would be nice to do better. > > > > I know that iBiblio claims network bandwidth is not an issue, but > > possibly we need to look at the whole system. > > > > Thanks for any ideas you (or others) can provide. From marcello at perathoner.de Mon Mar 7 09:28:02 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:07 2005 Subject: [gutvol-d] Database down?
In-Reply-To: <422BE99F.7080302@projectgutenberg.ca> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BE99F.7080302@projectgutenberg.ca> Message-ID: <422C8F22.7070500@perathoner.de> James Linden wrote: > Migrating to MySQL might help -- and it's easier to replicate/mirror > on the fly. Yuck! MySQL is just a glorified file system with an SQL interface. They barely got transactions working. They still can't do referential integrity, views and triggers. And if you happen to need transactions, they only work with the InnoDB backend, which is slower than Postgres. Postgres replicates very well. Just where should we replicate to? -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Mon Mar 7 10:04:31 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:09 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422BD677.203@corruptedtruth.com> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BD677.203@corruptedtruth.com> Message-ID: <422C97AF.5050003@perathoner.de> Brandon Galbraith wrote: > Could connection pooling fix this? Maybe combined with more concurrent > connections to the database server? I'm not sure how big the database > box is though. I'm not sure why the limit is so low. Maybe the folks at ibiblio have a good reason for it. We have to see what increasing the number of concurrent connections does to query response time. The database box is a dual-processor IBM whatever. I can look up the specs if you want. But this box and his brother are serving all sites hosted at ibiblio. Many of those sites are build with CMS and thus very database intensive. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Mon Mar 7 10:13:58 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:15 2005 Subject: [gutvol-d] Database down? In-Reply-To: <20050307041744.GA10764@pglaf.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> Message-ID: <422C99E6.8090000@perathoner.de> Greg Newby wrote: > Marcello, can you tell me what it would take to grow our capacity to > handle hits? I know you're also looking at Web site mirrors (I can > supply some sites for this, BTW). But if you could come up with some > recommendations for what it would take for iBiblio to dramatically grow > our capacity, I can try to put something together for them. We have doubled our page hits over the last year. We are now serving nearly 200.000 pages a day. Just recently we became a top 5000 internet site. See Alexa stats starting at: http://www.alexa.com/data/details/traffic_details?range=3m&size=large&compare_sites=gutenberg.net,promo.net&y=t&url=gutenberg.org To handle the ever increasing load we could implement one of the following solutions: 1) An array of on-site squids at ibiblio. But ibiblio isn't adding squids for the vhosted sites. At least that's what I was told. 2) Make ibiblio throw more hardware at us (all hosted sites). This may not be possible with the limited budget. They recently got a faster file server. 3) One or more dedicated squids for PG co-located at ibiblio. (Make ibiblio pay for the bandwidth.) Somebody had to donate us a server. Needs fast disks, lots of ram, average cpu, linux, ssh. 4) Big time solution. A hierarchy of squids distributed around the world. 
We would have a squid hierarchy like this:

www.gutenberg.org (apache)
 + us1.cache.gutenberg.org (squid)
 + us2.cache.gutenberg.org (squid)
 + au.cache.gutenberg.org (squid)
 + eu.cache.gutenberg.org (squid)
 + de.cache.gutenberg.org (squid)
 + en.cache.gutenberg.org (squid)
 + fr.cache.gutenberg.org (squid)

To do that we need squid 2.5 with the rproxy patch. I'm still exploring that solution, but if anybody has any experience please chime in. We need service providers to donate us (or co-locate our) servers and donate the bandwidth. Also we need to explore the legal implications of offering PG services outside the US. The PG web site w/o file downloads averages 5 GB of traffic / day. (The file downloads are 100 GB / day, but we ain't going to thrash the squids with the files.) > 30 simultaneous requests to PostgreSQL does not seem like a whole lot, > so I'm assuming that contention for resources with other hosted sites is > the main problem. It would be nice to do better. I just asked ibiblio to double that. I'm not sure why the limit is so low. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Mon Mar 7 11:22:10 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:25 2005 Subject: [gutvol-d] Shakespeare's Birthday April 23rd Message-ID: <422CA9E2.3090409@perathoner.de> I got a request from a proof-reader to celebrate Shakespeare's birthday with a banner on the site. (The original request being to celebrate St. George's Day, but I don't think that one qualifies.) Any ideas? -- Marcello Perathoner webmaster@gutenberg.org From brandon at corruptedtruth.com Mon Mar 7 11:41:27 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Mon Mar 7 11:41:39 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422C97AF.5050003@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BD677.203@corruptedtruth.com> <422C97AF.5050003@perathoner.de> Message-ID: <422CAE67.30909@corruptedtruth.com> Marcello, Maybe it's time to talk about doing a master/slave replication configuration of postgres to handle the database load.
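(A rough illustration, assuming Python and psycopg2 on the web side, of the connection pooling Brandon asks about earlier in this thread: keep a few backend connections open and reuse them rather than opening a fresh one per page hit and running into the ~30-connection cap. The DSN, pool sizes, and table/column names are invented for the example; this is not the actual gutenberg.org code.)

import psycopg2
import psycopg2.pool

# A small shared pool: at most 10 backend connections no matter how many
# page requests arrive, comfortably under a ~30-connection server cap.
# The DSN below is a placeholder, not PG's real configuration.
POOL = psycopg2.pool.ThreadedConnectionPool(
    2, 10, "dbname=gutenberg user=web host=localhost")

def fetch_title(ebook_no):
    """Look up one title, borrowing and then returning a pooled connection."""
    conn = POOL.getconn()
    try:
        cur = conn.cursor()
        try:
            # Table and column names are invented for illustration.
            cur.execute("SELECT title FROM books WHERE ebook_no = %s", (ebook_no,))
            row = cur.fetchone()
            return row[0] if row else None
        finally:
            cur.close()
    finally:
        POOL.putconn(conn)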
If you want, > contact me off list and I'd be willing to help any way I can. First ibiblio will have to host another database server and dedicate that server to PG. A dedicated database server would probably solve our problem. But we'll have to talk the ibiblio people into doing that for us. It means money, more maintenance hassles and maybe problems from other sites hosted at ibiblio, who want a faster server too. OTOH a dedicated squid for PG would help too and be much cheaper. Replication to an external server will not help much, as the latency will be too big. Our current database server is: IBM Netfinity 6000R Quad Xeon PIII 700 MHz 4.5 GB RAM 108 GB storage iblinux But this one is shared with other sites hosted at ibiblio. See also: http://www.ibiblio.org/systems/hardware-details.html -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Mon Mar 7 12:03:10 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Mar 7 12:03:25 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <15b.4c4a3125.2f5e0d7e@aol.com> lest the main message be missed in all the minutiae... you can take an average p-book from scans to e-book in one evening. one evening. the people who want to convince you that it's difficult are _wrong_. the fastest and _easiest_ way to get a million p-books digitized is for one million people to convert one book in the next month or two. *** also, for the record, all of the global changes i made to "my antonia" are completely reversible, if you're smart enough to know what you're doing. -bowerbird From kth at srv.net Mon Mar 7 11:33:24 2005 From: kth at srv.net (Kevin Handy) Date: Mon Mar 7 12:07:55 2005 Subject: [gutvol-d] Database down? In-Reply-To: <16940.29021.164180.819016@celery.zuhause.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BE99F.7080302@projectgutenberg.ca> <16940.29021.164180.819016@celery.zuhause.org> Message-ID: <422CAC84.1050703@srv.net> Bruce Albrecht wrote: >There's also clustering available for Postgresql, which might be >easier than migrating to MySQL. Either way, it would probably take >more human resource time than throwing hardware at it (for example, a >dual Opteron system with 4 GB RAM 4 250 GB SATA drive in 10 RAID for >about $3300, which might be overkill). > > > what is the number of connections for postmaster (-N). You may just need to up this value. From joshua at hutchinson.net Mon Mar 7 12:13:24 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon Mar 7 12:13:32 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> ----- Original Message ----- From: Bowerbird@aol.com > > you can take an average p-book from scans to e-book in one evening. > one evening. HAHAHAHAHAHAHA *gasp* *wheeze* HAHAHAHAHAHAHA That's the most laughable thing I've read in a long time. If laughter helps us live longer, you just added 5 years to my life. Josh From gbnewby at pglaf.org Mon Mar 7 13:31:15 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Mon Mar 7 13:31:16 2005 Subject: [gutvol-d] Database down? 
In-Reply-To: <422CB1C9.1020809@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BD677.203@corruptedtruth.com> <422C97AF.5050003@perathoner.de> <422CAE67.30909@corruptedtruth.com> <422CB1C9.1020809@perathoner.de> Message-ID: <20050307213115.GB4465@pglaf.org> On Mon, Mar 07, 2005 at 08:55:53PM +0100, Marcello Perathoner wrote: > Brandon Galbraith wrote: > > >Maybe it's time to talk about doing a master/slave replication > >configuration of postgres to handle the database load. If you want, > >contact me off list and I'd be willing to help any way I can. > > First ibiblio will have to host another database server and dedicate > that server to PG. A dedicated database server would probably solve our > problem. But we'll have to talk the ibiblio people into doing that for > us. It means money, more maintenance hassles and maybe problems from > other sites hosted at ibiblio, who want a faster server too. > > OTOH a dedicated squid for PG would help too and be much cheaper. > > Replication to an external server will not help much, as the latency > will be too big. > > Our current database server is: > > IBM Netfinity 6000R > Quad Xeon PIII 700 MHz > 4.5 GB RAM > 108 GB storage > iblinux > > But this one is shared with other sites hosted at ibiblio. Thanks, and also for the info about squids. I will pitch ibiblio on the idea of a new system - we'll see what the response is. A current top-end quad Xeon system from someplace like asaservers.com is ~$20,000, which is a little steep for PG to pay for. -- Greg > See also: > > http://www.ibiblio.org/systems/hardware-details.html > > > > -- > Marcello Perathoner > webmaster@gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From prosfilaes at gmail.com Mon Mar 7 16:11:05 2005 From: prosfilaes at gmail.com (David Starner) Date: Mon Mar 7 16:11:17 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <15b.4c4a3125.2f5e0d7e@aol.com> References: <15b.4c4a3125.2f5e0d7e@aol.com> Message-ID: <6d99d1fd05030716113ca7a9e@mail.gmail.com> On Mon, 7 Mar 2005 15:03:10 EST, Bowerbird@aol.com wrote: > also, for the record, all of the global changes i made to "my antonia" are > completely reversible, if you're smart enough to know what you're doing. If you're "smart enough", you could just retype the book from memory. Completely reversible, provided that you're already familiar with the work (which you have to be, else you wouldn't know that Antonia needs an accent), is a pretty lousy standard. From Bowerbird at aol.com Mon Mar 7 17:36:16 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Mar 7 17:36:34 2005 Subject: [gutvol-d] lest the message be missed Message-ID: josh said: > That's the most laughable thing I've read in a long time. evidently josh missed the message. and why does that not surprise me? nonetheless, i invite skepticism. when i do the entire "my antonia" -- sometime later in the week -- i'll log my time and document all of the changes i make on the file, and wipe all that skepticism away. *** david said: > If you're "smart enough", you could just retype the book from memory. i'm not that smart. are you? > Completely reversible, > provided that you're already familiar with the work > (which you have to be, else you wouldn't know that > Antonia needs an accent), > is a pretty lousy standard. yeah, that _would_ be "a pretty lousy standard". 
which is obviously why it's not the one i'm using. for those who are smart enough to think about it a bit, "completely reversible" in this type of situation means that once you change it, you can change it back any time. it doesn't mean you have to magically know what to change. (if you know that, _every_ change is completely reversible.) -bowerbird p.s. note to the list subscribers: i usually don't respond to david, since his points are too often paper bags that cannot hold water -- he's always clever enough to find a fault, but seemingly never clever enough to realize why it doesn't apply, or to find its obvious solution -- just like this post, but since this _was_ in regard to something that i was putting "on the record", i am compelled to respond to it. having done it once, though, i probably won't bother doing it again... From prosfilaes at gmail.com Mon Mar 7 17:49:01 2005 From: prosfilaes at gmail.com (David Starner) Date: Mon Mar 7 17:49:19 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: References: Message-ID: <6d99d1fd05030717497adc722c@mail.gmail.com> On Mon, 7 Mar 2005 20:36:16 EST, Bowerbird@aol.com wrote: > it doesn't mean you have to magically know what to change. > (if you know that, _every_ change is completely reversible.) You've stripped the accents; how am I supposed to know which accents to put back? Take a look at this text, from Garnett's translation of Elene that's currently going through DP. I have removed the accent; replace it. "Lo! that we heard through holy books, That the Lord to you gave blameless glory, 365 The Maker, mights' Speed, to Moses said How the King of heaven ye should obey, His teaching perform. Of that ye soon wearied, And counter to right ye had contended; Ye shunned the bright Creator of all, 370 > find its obvious solution But it's always too much work to show us this obvious solution. Show me the pudding; put the accent back into the above text. From Bowerbird at aol.com Mon Mar 7 22:14:23 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Mar 7 22:14:44 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1f8.5bc37bd.2f5e9cbf@aol.com> codepoints for macroman encoding: > 142 é Bénédictine > 142 é naïveté > 144 ê crêche > 149 ï naïve > 149 ï naïveté. > 150 ñ cañon > 170 ™ Edition™ > 174 Æ Æneid > 190 æ antennæ > 231 Á Ántonia -bowerbird From traverso at dm.unipi.it Tue Mar 8 01:04:48 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Tue Mar 8 01:02:54 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1f8.5bc37bd.2f5e9cbf@aol.com> (Bowerbird@aol.com) References: <1f8.5bc37bd.2f5e9cbf@aol.com> Message-ID: <200503080904.j2894mo11253@posso.dm.unipi.it> To bowerbird: How do you manage words that are written in the same way, except the accent? This is quite common in french and italian. And of course you find both in the same book, and sometimes in the same quotation that you can find in an english book. "Il a dit a toi" (he said to you). Which a has an accent? From shimmin at uiuc.edu Tue Mar 8 07:07:10 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Tue Mar 8 07:07:22 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <15b.4c4a3125.2f5e0d7e@aol.com> References: <15b.4c4a3125.2f5e0d7e@aol.com> Message-ID: <422DBF9E.20904@uiuc.edu> Bowerbird@aol.com wrote: > lest the main message be missed in all the minutiae... > > you can take an average p-book from scans to e-book in one evening. > > one evening.
With some of the OCR I've seen lately, this is probably about 90% right, for 90% of books, provided you can get good scans, and provided you are willing to let a few hard-to-detect classes of error go until post-production. -- RS From Bowerbird at aol.com Tue Mar 8 10:04:18 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 10:04:29 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1a4.33384a22.2f5f4322@aol.com> carlo said: > How do you manage words that are written in the same way, > except the accent?
i don't know how crafty _you_ might be, but my fellow poets are _highly_ skilled at hijacking office machinery for our own nefarious purposes... ;+) (heck, that's the only reason some of us get a job at all.) finally, there are millions of home computers out there that were sold with an all-in-one printer/scanner/fax. and, as before, the quickest and easiest way to get to a million scan-books is for a million people to scan _one_. so, it's fairly easy to predict that, from one or more of the above factors, there will soon be an _avalanche_ of books that have been scanned and need to be converted into text. and every scanned-book will have at least _one_ person who will want its text badly enough to do a little work. what we need to do is _give_that_person_a_good_tool_ that enables their little bit of work to get good results. that's what i intend to give them. all i ask of you is that you stop telling people that this job is difficult. it's not. > and provided you are willing to let a few > hard-to-detect classes of error you'll have to explain what you mean by "hard-to-detect". in my experience, if an error is serious (in any meaningful way), then it'll be detected by a person who's actually reading the book. some errors are unforgivable, such as an incorrect word that won't even pass spellcheck. those should _always_ be caught. trivial punctuation errors, like a missing comma, are... well, trivial. (although i haven't mentioned anything about it until now, a great way to catch some errors is to have the computer speak the text aloud to you as you follow along reading it; stealth scannos, for instance, are handily exposed by this.) > and provided you are willing to let a few > hard-to-detect classes of error > go until post-production. "post-production" has no meaning in my scenario. i repeat myself, again and again, by saying that once a person gets the error-level on an e-text down to 1 error in 10 pages, we can make it available via "continuous proofreading" and let readers-from-the-general-public zoom it to perfection. 1. scan. 2. "fix" the scans. 3. do the o.c.r. on them. 4. run the post-o.c.r. tool. 5. do quasi-public "continuous proofreading". 6. consolidate the corrections into a public release. 7. release the e-text out to the public as a single file. 8. continue doing a full-public "continuous proofreading". 9. take error-reports from the people reading the file offline. i will also state, for the record, that i think step #4 can do _far_ better than 1 error in 10 pages if we sharpen our tools. look at the "my antonia" example. jon had a team of _seven_ proofreading it. i don't know how many looked at each page, but, according to my analysis, they took the error-rate down to about 1 every 70 pages. when i used my bag of tricks on it, i removed 3 errors from the 210 pages i subjected to scrutiny. (which leads us to predict 3 more errors in the second half.) to the best of my knowledge, there are no errors in my file, i.e., the first half of the book. (full book by friday, hopefully.) i'm not saying it _is_ free of errors even now, since someone with a different set of tricks in _their_ bag might be able to locate 2 more remaining errors i couldn't find, but i will say that it is more than clean enough to turn loose on the public... (that is, i think we could skip steps #5 and #6 on this e-text.) 
-bowerbird From jon at noring.name Tue Mar 8 10:40:39 2005 From: jon at noring.name (Jon Noring) Date: Tue Mar 8 10:40:48 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1a4.33384a22.2f5f4322@aol.com> References: <1a4.33384a22.2f5f4322@aol.com> Message-ID: <19715944156.20050308114039@noring.name> Bowerbird wrote: > carlo said: >> How do you manage words that are written in the same way, >> except the accent? This is quite common in french and italian. > i wouldn't strip away high-bit characters on a french or ltalian book; > they are an essential part of those languages. i've said that repeatedly. Yes you have said that -- repeatedly. But I believe it is also essential to preserve all accented Latin and non-accented characters found in *all* books. This is where the differences of view arise. Throwing them out because they are "inconvenient" (which seems to be your motive, but I'm not sure) is not a valid excuse. Since your tool set (and viewing software) can handle any character set you want, then not supporting the non-ASCII characters is even more confusing. > jon noring seems to turn every listserve into an endless merry-go-round. Jon From Bowerbird at aol.com Tue Mar 8 11:33:12 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 11:33:29 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <86.236d71ad.2f5f57f8@aol.com> jon said: > But I believe it is also essential to preserve > all accented Latin and non-accented characters > found in *all* books. once again, the minutiae is being brought to the surface. why doesn't anyone here respond to the main message? because you have no response, that's why. the main point doesn't correspond to your petty-politics of throwing mud at michael, so y'all continue to try to shift the emphasis. > But I believe it is also essential to preserve > all accented Latin and non-accented characters > found in *all* books. we know that's what you believe, jon. you've said it over and over and over. and i have said, over and over and over, that _i_ believe it is _not_ essential, not in *all* books. so there we have it. i will do things my way, and i expect that you will do things your way. fine! let's leave the other people here alone! as usual, you look only at the _benefits_, without factoring _costs_ into the equation. the _cost_ of including high-bit characters is the e-text then _breaks_ for some users, ones who are using viewer-programs that are not encoding-savvy, or who don't have all of the correct fonts on their computer. or other reasons i haven't come across yet. if the unicode people had done their job right, and made unicode follow the mac philosophy -- "it just works" -- i would be up there on the unicode bandwagon with you and your friends. but it doesn't "just work", not for everyone -- not yet -- and until it does, i don't want to talk about it. and _after_ it does, i don't want to talk about it _then_, either, i just wanna use it and have it work. for everyone. wanna do something useful? _make_it_work_! not just on the new machines, with certain browsers and not any other viewer-programs -- on _every_ machine, with _every_ program. but until then, just stop bugging all of us about it. we've heard it, too often, and we are unconvinced. and buddy, you are _not_ going to convince us by repeating the same old argument _again_, or by asserting your beliefs again and again... 
with all the time i've wasted discussing this stupid topic for the 829th time, i could have cleaned up the rest of that "my antonia" text. go away. oh never mind, i will... -bowerbird From fvandrog at scripps.edu Tue Mar 8 11:43:57 2005 From: fvandrog at scripps.edu (Frank van Drogen) Date: Tue Mar 8 11:43:46 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <86.236d71ad.2f5f57f8@aol.com> References: <86.236d71ad.2f5f57f8@aol.com> Message-ID: <6.2.0.8.0.20050308114121.01e91f68@mail.scripps.edu> At 11:33 AM 3/8/2005, you wrote: >jon said: > > But I believe it is also essential to preserve > > all accented Latin and non-accented characters > > found in *all* books. > >once again, the minutiae is being brought to the surface. > >why doesn't anyone here respond to the main >message? Maybe because it got lost between all the other stuff you wrote? Ah, I see you mean: >you can take an average p-book from scans to e-book in one evening. Well that's great, so start going:) Frank From jon at noring.name Tue Mar 8 12:33:46 2005 From: jon at noring.name (Jon Noring) Date: Tue Mar 8 12:33:59 2005 Subject: [gutvol-d] Accented characters are important to reproduce in PG texts (was: lest the message be missed) In-Reply-To: <86.236d71ad.2f5f57f8@aol.com> References: <86.236d71ad.2f5f57f8@aol.com> Message-ID: <16622730312.20050308133346@noring.name> Bowerbird wrote: > jon said: >> But I believe it is also essential to preserve >> all accented Latin and non-accented characters >> found in *all* books. > once again, the minutiae is being brought to the surface. The devil is in the details. > as usual, you look only at the _benefits_, > without factoring _costs_ into the equation. On the other hand, there are certain minimum requirements for every project. As a corollary of an adage I've given earlier: "If a job is to be done, it is to be done right." > the _cost_ of including high-bit characters > is the e-text then _breaks_ for some users, > ones who are using viewer-programs that > are not encoding-savvy, or who don't have > all of the correct fonts on their computer. All web browsers today, and most more advanced formats, such as PDF, support the full Unicode set. That's the future. Embrace it, don't fight it. There's a saying: "I focus on the future since that's where I'm going to spend the rest of my life." > if the unicode people had done their job right, > and made unicode follow the mac philosophy > -- "it just works" -- i would be up there on the > unicode bandwagon with you and your friends. This is a specious argument. The Unicode working group is doing their job right because before Unicode things were a *real* mess and were NOT working. There is a clear need to unify the world's character sets and to create universal text encoding formats (e.g. UTF-8) There is still some controversy regarding some Han scripts, but by and large Unicode has been successful at its stated goals. > wanna do something useful? _make_it_work_! > not just on the new machines, with certain > browsers and not any other viewer-programs > -- on _every_ machine, with _every_ program. Throwing out important accented characters is unacceptable. Period. The author/publisher considered it important enough to spend the $$$ to include these characters (in the 19th century it took more effort to print books with accented and foreign characters.) It adds richness to the text, and it is hard to argue that the characters are not somehow an integral part of the text. 
Anyway, it is trivial, as *you said yourself*, to autoconvert text with accented characters to 7-bit ASCII text. So you *can* make your system work for the folk using legacy systems. It is far better to do the job right for the long-term future, than to compromise it for the short-term (legacy hardware and software that is rapidly becoming obsolete.) > but until then, just stop bugging all of us about it. > we've heard it, too often, and we are unconvinced. Who's "we"? It would not surprise me if the majority of PG and DP volunteers consider it important (or at least a very good idea) to reproduce the full character set in all Public Domain texts, especially now that it is easy to do (both by UTF-8/16 encoding, and using character entities in XML/XHTML/TEI.) Hopefully a few of the PGers and DPers will give their thoughts on this particular topic. > and buddy, you are _not_ going to convince us > by repeating the same old argument _again_, > or by asserting your beliefs again and again... Who's "us"? > with all the time i've wasted discussing this > stupid topic for the 829th time, i could have > cleaned up the rest of that "my antonia" text. If it weren't important *to you*, you would not have replied. I can only interpret your vociferous replies to mean that you consider permanently dumping accented characters to be an *important* requirement to implement your system. That's why I have used the word "inconvenient" since that's the only reason I can think of. But if you have another reason why you believe it o.k. to dump accented characters for most English language PG texts, let us know. You've not given a good reason why they should not be reproduced. (The argument of meeting legacy needs is not a compelling argument since, as you said and I'm repeating what I said above, one can autoconvert a Master document with accented characters to 7-bit ASCII for use by legacy-users. Thus, you can meet the needs of these people *and* the needs and preferences of future generations by preserving the non-ASCII characters. Instead, you inexplicably want to permanently remove accented characters from the digital *Master* versions of most public domain English-language digital texts.) There's a lot of aspects to Public Domain texts that are "inconvenient" which prevent easy digitizing. We figure out how to overcome these "inconveniences" and produce a high-quality product, not make short-term short-cuts so we can avoid dealing with them. Distributed Proofreaders is one example of not giving in to the "convenient", but rather to figure out how to do it right in a reasonably efficient way. Anyway, why the rush to digitize (make structured digital texts) out of page scans, to the point you are willing to sacrifice textual accuracy and quality? So long as the page scans are available for posterity, they can be transcribed any time, and done more carefully and thoughtfully. To me, the most critical thing is to make archival- quality scans of public domain texts and get them online via IA and similar organizations. In the meanwhile, the most popular of these texts can be carefully and methodically converted to Structured Digital Texts (SDT). There are about 1000 very classic Public Domain works (part of the pre-DP PG collection) that should be redone to at least the quality of the "My Antonia" demo project (for those who have not seen it, it is at: http://www.openreader.org/myantonia/ It is still an early "beta", but it's been a real learning experience for several of us working on it.) 
Jon From Bowerbird at aol.com Tue Mar 8 13:08:08 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 13:08:21 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <129.584c0d53.2f5f6e38@aol.com> jon said: > I can only interpret your vociferous replies > to mean that you consider permanently dumping accented characters "permanently"? as i said, as soon as unicode works everywhere, i will embrace it. those of us out here in the real world know that time is not yet here. in the meantime, my viewer-tool will actually _display_ all those accented characters, even when they are not present in the e-text, if the user chooses that option. (it's all about user-choice for me.) if you want to help with that, create a list of such accented words. > to be an *important* requirement to implement your system. "my" system? michael's philosophy of having the e-texts work on all machines, specifically including trailing-edge machinery, is _the_ factor that has made his e-library the premier one in all of cyberspace, thank you very much. you got a high-tech solution? fine, use it. and watch it wither, just like every other one before it has... as to my tools in particular, they will support unicode fully, long before unicode works with all the other tools out there, so you're barking up the wrong tree, buster. i'm done with this stupid thread! done, done, done! aarrgghh! :+) -bowerbird From Bowerbird at aol.com Tue Mar 8 13:10:40 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 13:11:01 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1a3.2eed3a51.2f5f6ed0@aol.com> frank said: > Maybe because it got lost between all the other stuff you wrote? no, i think it "got lost" because it was buried under minutiae. and it is _still_ being subjected to attempts to bury it... > Ah, I see you mean: > > you can take an average p-book from scans to e-book in one evening. well, a better phrasing of that main point might be that "_any_average_person_ can do an average book in an evening..." and the follow-on point would be this: "...so let's start informing the people who might wanna do a book, once the avalanche of scanned-books arrives, so they'll realize it is within the realm of possibility; let's stop spreading the false meme that it's difficult." > Well that's great, so start going:) it's wiser for me to build the tool that enables people to do a book in an evening, rather than spend my time doing books... but yes, i will "start going" on that, right away! just as soon as i help jon find the rest of his errors. (as i'm sure you realize, those two go hand-in-hand.) but really, there won't be a need for that tool _until_ after the avalanche of scanned-books becomes available. most people just won't be motivated enough to do a book until there is a scanned-book they want to have as text. until then, i'm in no hurry. been working on this tool for well over a year now. no reason to rush things. just waiting for jon to get some more books scanned... ;+) -bowerbird From prosfilaes at gmail.com Tue Mar 8 16:52:14 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 8 16:52:27 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1a4.33384a22.2f5f4322@aol.com> References: <1a4.33384a22.2f5f4322@aol.com> Message-ID: <6d99d1fd050308165275501188@mail.gmail.com> On Tue, 8 Mar 2005 13:04:18 EST, Bowerbird@aol.com wrote: > carlo said: > > How do you manage words that are written in the same way, > > except the accent? 
This is quite common in french and italian. > > i wouldn't strip away high-bit characters on a french or ltalian book; > they are an essential part of those languages. i've said that repeatedly. Then why take the time to remove the high-bit characters in an English book? There's lots of books that have important French quotations or use accents to denote unpredictable stress; why strip them from the books that you can, just because? It's not hard at all to deal with the handful of accents the average English book has. > if you want me to estimate some hard numbers, > i'd say 75% of the e-texts in the library now > could be done to our standard in one evening. Not by someone new to the job. To do a book in an evening requires that you be experianced with the job and the tools. And I get real tired of you using the average book in PG as a metric. The average book in PG was chosen because it was relatively easy to do. Out of the three floors of books in the library I'm sitting (excluding the governmental depository), the basement is full of science, math or technology books, and will require complex graphical work or mathematical work. Of the remaining 66% (probably more like 55% or 60%, since the third floor is small), many of them are art or music books or dictionaries and grammars, or archiac languages that OCR doesn't handle well, or archiac fonts that OCR doesn't handle well. > here's where the value of distributed proofreaders > will most come into play in the future, in my opinion, > being able to "fix up" the work of independent proofers. It's funny that if the value of DP is so limited, that the percentage of texts that have come in from DP is so high. Why don't we have more people doing books by hand alone? > that's what i intend to give them. all i ask of you is that > you stop telling people that this job is difficult. it's not. When you upload books to PG, what name do you put on them? For all your words, I can't recall ever seeing a book credited to you. I have no samples of what you've worked on alone and what your quality standards are to judge by. From prosfilaes at gmail.com Tue Mar 8 16:56:45 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 8 16:57:00 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <129.584c0d53.2f5f6e38@aol.com> References: <129.584c0d53.2f5f6e38@aol.com> Message-ID: <6d99d1fd05030816563202ce2e@mail.gmail.com> On Tue, 8 Mar 2005 16:08:08 EST, Bowerbird@aol.com wrote: > jon said: > > I can only interpret your vociferous replies > > to mean that you consider permanently dumping accented characters > > "permanently"? > > as i said, as soon as unicode works everywhere, i will embrace it. > those of us out here in the real world know that time is not yet here. The reason why Unicode doesn't work places is because idiots like you aren't bothering to support it. You're being part of the problem, and having the audicity to complain about _other_ people causing the problem. > in the meantime, my viewer-tool will actually _display_ all those > accented characters, even when they are not present in the e-text, How? You still haven't put the accent back in the sample from Elene. You're throwing the baby out with the bathwater and keep telling us how easy it is to refill the bathtub with water. 
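Both sides of this exchange agree on one technical point: a master text that keeps its accented characters can be downconverted to 7-bit ASCII automatically for readers on legacy systems. Below is a minimal sketch of that autoconversion, assuming Python 3 and its standard unicodedata module; the function name is made up for illustration, the override table is illustrative rather than exhaustive, and the example words are ones that come up later in this thread.

import unicodedata

# Characters that NFKD decomposition cannot reduce to plain ASCII by itself.
# Illustrative only -- a real table would be longer.
OVERRIDES = {
    "\u00e6": "ae", "\u00c6": "AE",   # ae / AE ligatures
    "\u0153": "oe", "\u0152": "OE",   # oe / OE ligatures
    "\u00df": "ss",                   # German sharp s
    "\u2122": "(TM)",                 # trademark sign
}

def to_ascii(text):
    """Derive a 7-bit ASCII rendering from a Unicode master text."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in OVERRIDES:
            out.append(OVERRIDES[ch])
        else:
            # NFKD splits an accented letter into base letter + combining
            # marks; encoding to ASCII with errors="ignore" drops the marks.
            base = unicodedata.normalize("NFKD", ch)
            out.append(base.encode("ascii", "ignore").decode("ascii"))
    return "".join(out)

print(to_ascii("My \u00c1ntonia, na\u00efvet\u00e9, ca\u00f1on, \u00c6neid"))
# prints: My Antonia, naivete, canon, AEneid

Because the conversion is one-way, it only makes sense to run it on a derived copy for legacy readers, never on the master itself.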
From brad at chenla.org Tue Mar 8 19:14:35 2005 From: brad at chenla.org (Brad Collins) Date: Tue Mar 8 19:17:10 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <129.584c0d53.2f5f6e38@aol.com> (Bowerbird@aol.com's message of "Tue, 8 Mar 2005 16:08:08 EST") References: <129.584c0d53.2f5f6e38@aol.com> Message-ID: Bowerbird@aol.com writes: > in the meantime, my viewer-tool will actually _display_ all those > accented characters, even when they are not present in the e-text, > if the user chooses that option. (it's all about user-choice for me.) > if you want to help with that, create a list of such accented words. Such a feature never would have occured to me. Toggle accents that aren't in a text? How many users have told you that they want to toggle accents? If you knowlingly strip out accents you give up any claim to have created a faithful and accurate edition of a text. Sorry but that's blown your credibility right there and drops your text down to the level of a bootleg Harry Potter translation[1]. Why is this so important? It's the old game of whispering a sentence into someone's ear and then they repeat it to someone else etc. After passing through a few people the sentence get's mangled. Unicode is very much ready for prime time. Hell, Unicode is even supported by Xterm. Man pages on Red Hat Linux use Unicode. If the command line in a unix terminal window uses Unicode, it's everywhere. b/ Footnotes: [1] BTW. Usually the Harry Potter translations come out a good few months after the English version so there is a real market for quicky translations for people who can't bear to wait and can't read the English. My wife can't read English very well, and she bought a bootleg translation of the second book in Thai. We compared the first few pages with the English edition and she said it was so horrible that she could wait for the official Thai translation which is quite good. I also saw a bootleg of Goblet of Fire in Chinese which came out a week after the English edition was published! From the look of it, it had been done in Shanghai. That's a 636 page book translated, printed and shipped to where I found it in the dingy dark dusty market stalls in Beijing in a week! Looking through the book you could see very distinct shifts in writing style and vocabulary every few pages. Even the translation of the names of some of the characters changed slighlty a couple of times in the book. They must have chopped up the book and split the translation between scores of translaters to do it in a day or two. -- Brad Collins , Bangkok, Thailand From Bowerbird at aol.com Tue Mar 8 20:18:59 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 20:19:26 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <1ad.33150c38.2f5fd333@aol.com> brad said: > How many users have told you that they want to toggle accents? toggle them on? or toggle them off? :+) according to some people here, there is a great desire out there to see the accents. so i'll try to reintroduce them when possible so far i have these words in my lookup-table: > 142 ? B?n?dictine -- Benedictine > 142 ? na?vet? -- naivete > 144 ? cr?che -- creche (my spellchecker gives another accent?) > 149 ? na?ve -- naive > 149 ? na?vet? -- naivete > 150 ? ca?on -- canon > 170 ? ? -- (TM) > 174 ? ?neid -- Aeneid > 190 ? antenn? -- antenna > 231 ? ?ntonia -- Antonia please do feel free to send me more... 
:+) > If you knowlingly strip out accents you give up any claim to > have created a faithful and accurate edition of a text. contrary to what some people would like for you to believe, that doesn't have to be the only objective, or even the main one. for me, _mass_usability_ reigns supreme. call me a heathen... > and drops your text down to the level of > a bootleg Harry Potter translation[1]. my understanding is that some harry potter pirate digitizations have attained an extremely high level of fidelity to the source... i'd be leery of the crummy ones. as would most people. which is probably why one writer organizations has been advised to put out crappy "pirate" digitizations, so as to sour people on underground editions. so perhaps the bad version your wife got was planted by the publisher? if so, it seems to have had exactly the desired effect, eh? (although it sounds like it was a paper-book as well?) i heard one "potter" e-book was finished _within_24_hours_ of the release of the paper-book. and when my tool is released, i expect that the pirates will become even _more_ efficient! > Why is this so important? i don't know. but to hear some people talk, you'd think it's a matter of life-and-death! > It's the old game of whispering a sentence into someone's ear > and then they repeat it to someone else etc. > After passing through a few people the sentence get's mangled. hence the importance of complete reversibility, already mentioned. > After passing through a few people the sentence get's mangled. like the way the word "gets" got mangled in your sentence? :+) or the way "knowingly" turned into "knowlingly" up above? ;+) -bowerbird From traverso at dm.unipi.it Tue Mar 8 22:11:33 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Tue Mar 8 22:09:33 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <1ad.33150c38.2f5fd333@aol.com> (Bowerbird@aol.com) References: <1ad.33150c38.2f5fd333@aol.com> Message-ID: <200503090611.j296BXj22745@posso.dm.unipi.it> > > so far i have these words in my lookup-table: > > > 142 ?? B??n??dictine -- Benedictine > > 142 ?? na??vet?? -- naivete > > 144 ?? cr??che -- creche (my spellchecker gives another accent?) > > 149 ?? na??ve -- naive > > 149 ?? na??vet?? -- naivete > > 150 ?? ca??on -- canon > > 170 ??? ??? -- (TM) > > 174 ?? ??neid -- Aeneid > > 190 ?? antenn?? -- antenna > > 231 ?? ??ntonia -- Antonia > Would you replace a four-voices canon with a four-voices ca??on ? Carlo From Bowerbird at aol.com Tue Mar 8 23:41:28 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 23:41:50 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <75.40ab08d7.2f6002a8@aol.com> carlo said: > Would you replace a four-voices canon > with a four-voices ca??on ? with a _what_? i see a capital "a" with a curvy squiggle above it, followed by a plus-or-minus sign. what you sent me back is _not_ what i sent you; it's been changed; just like that game of "telephone" that brad was telling us about. so it looks like our software here isn't handling the encoding correctly. which is precisely my point. and in the cases like this, it's great to give the user the option to go back to the 7-bit letters, so there is a semblance of normality. because we _know_ that "canon" is always going to be "canon". 
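For what it is worth, the lookup-table idea quoted above, and the limit Carlo is pointing at, both fit in a few lines. Here is a minimal sketch, assuming Python; the accented spellings are reconstructed from the ASCII equivalents in bowerbird's list (the list archive has reduced the accented forms themselves to "?"), the function name is made up for illustration, and words that are legitimate English either way -- canon/cañon, antenna/antennæ -- are deliberately excluded, since a bare word list cannot decide them.

import re

# Per-book restore table for an ASCII "My Antonia" text, reconstructed from
# the page/word pairs listed above.  Illustrative, not complete.
RESTORE = {
    "Benedictine": "B\u00e9n\u00e9dictine",
    "naivete":     "na\u00efvet\u00e9",
    "naive":       "na\u00efve",
    "creche":      "cr\u00e8che",
    "Aeneid":      "\u00c6neid",
    "Antonia":     "\u00c1ntonia",
}

# Words that are valid English both with and without the accent need a
# per-book or per-sentence rule, so they are flagged rather than touched.
AMBIGUOUS = {"canon", "antenna"}

def restore_accents(line):
    def fix(match):
        word = match.group(0)
        return word if word in AMBIGUOUS else RESTORE.get(word, word)
    return re.sub(r"[A-Za-z]+", fix, line)

print(restore_accents("Antonia kept a creche beside the canon."))
# prints: Ántonia kept a crèche beside the canon.

Whether this is genuinely reversible depends entirely on how many words end up in the ambiguous set, which is the point the following messages turn on.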
-bowerbird From prosfilaes at gmail.com Tue Mar 8 23:52:40 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 8 23:52:59 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <75.40ab08d7.2f6002a8@aol.com> References: <75.40ab08d7.2f6002a8@aol.com> Message-ID: <6d99d1fd050308235215046ff@mail.gmail.com> On Wed, 9 Mar 2005 02:41:28 EST, Bowerbird@aol.com wrote: > and in the cases like this, it's great to give the user the option to > go back to the 7-bit letters, so there is a semblance of normality. > because we _know_ that "canon" is always going to be "canon". Nice strawman. Everyone wants to give the users the option to go back to the 7-bit letters; it's whether we throw away the information at the start, so nobody has it, or at the point the users want it thrown away. BTW, for your list of accents, that -> th?t was the change in the section of Elene. Well, _one_ of the that's had an accent. From Bowerbird at aol.com Wed Mar 9 00:15:11 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 00:15:36 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <19f.2f0941df.2f600a8f@aol.com> carlo said: > Would you replace a four-voices canon > with a four-voices ca??on ? but to answer your question, there is only one "canon" in "my antonia". and yes, i will do the conversion on a book-by-book basis in cases where that is necessary. (how many of these terms do you think you can find -- with a non-accented and an accented version -- where _both_ are listed in an english dictionary? look away, my friends, because every one you find is one that makes my lookup-table more extensive.) and even then, if a change is not completely reversible, you'll need to give me the entire sentence in each case where the change is to be made. (or, if it's easier to do it the other way -- each sentence where the change is _not_ to be made -- you can do it that way instead. the only requirement is an absolute non-ambiguity.) you can _count_ on the fact that i have thought things through _well_past_ the first exception to everything. you will have to burrow down to a _much_ deeper layer if you really want to trip me up. and i dare you to try... -bowerbird From jon at noring.name Wed Mar 9 07:53:49 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 9 07:54:02 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <75.40ab08d7.2f6002a8@aol.com> References: <75.40ab08d7.2f6002a8@aol.com> Message-ID: <1783410265.20050309085349@noring.name> Bowerbird wrote: > and in the cases like this, it's great to give the user the option to > go back to the 7-bit letters, so there is a semblance of normality. > because we _know_ that "canon" is always going to be "canon". The closest American-English equivalent of ca?on is 'canyon', not canon. Interestingly in "My Antonia" Willa Cather used both variants: "canyons" on page xi, and "ca?on" on page 124. However, one can forgive Cather on this since page xi is part of the Introduction, spoken by the character Jake, and page 124 is part of the main story as told (in a "written" manuscript) by Jim. Jon From marcello at perathoner.de Wed Mar 9 09:24:06 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 9 09:23:56 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <86.236d71ad.2f5f57f8@aol.com> References: <86.236d71ad.2f5f57f8@aol.com> Message-ID: <422F3136.8060206@perathoner.de> Bowerbird@aol.com wrote: > once again, the minutiae is being brought to the surface. ... 
the minutiae *are* brought to the surface. If we are going to show off in Latin better get our numeri right. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Wed Mar 9 09:26:45 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 09:27:04 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: jon said: > ca?on i came across one project gutenberg e-text that used "ca?on" throughout, including a word-capped reference to "the grand ca?on" -- you know, the one in arizona with all the pretty colors. anyway, guys, call me when unicode works on all apps on all machines. until then, i have put this issue to bed. by the way, it is _still_ so easy to digitize the average book that, once you have the scans, an average person can do it in one evening. -bowerbird From Bowerbird at aol.com Wed Mar 9 09:48:42 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 09:48:53 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1e6.36eb827a.2f6090fa@aol.com> marcello said: > If we are going to show off in Latin better get our numeri right. a constructive comment from marcello! wow! that's a first! thanks! yes, my latin has been a bit rusty, for quite a while now... -bowerbird From hart at pglaf.org Wed Mar 9 09:49:26 2005 From: hart at pglaf.org (Michael Hart) Date: Wed Mar 9 09:49:27 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> References: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> Message-ID: On Mon, 7 Mar 2005, Joshua Hutchinson wrote: > ----- Original Message ----- > From: Bowerbird@aol.com >> >> you can take an average p-book from scans to e-book in one evening. >> one evening. > > > HAHAHAHAHAHAHA > > *gasp* *wheeze* > > HAHAHAHAHAHAHA > > That's the most laughable thing I've read in a long time. > > If laughter helps us live longer, you just added 5 years to my life. > > Josh Of course we should not forget people such as David Widger, who has produced nearly 3,000 eBooks, about one per day, over a period of years, nor David Price who sent us one eBook per week for years, or several others who prefer to remain anonymous. mh From Bowerbird at aol.com Wed Mar 9 10:44:45 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 10:44:58 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1b9.ef334b5.2f609e1d@aol.com> michael said: > Of course we should not forget people such as David Widger, > who has produced nearly 3,000 eBooks, about one per day, > over a period of years, nor David Price who sent us > one eBook per week for years, or > several others who prefer to remain anonymous. that's right! :+) of course, david widger is super-human, not "an average person". ;+) but -- with the right tool -- now an _average_ person can do it too! -bowerbird From marcello at perathoner.de Wed Mar 9 09:42:56 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 9 11:37:00 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <129.584c0d53.2f5f6e38@aol.com> References: <129.584c0d53.2f5f6e38@aol.com> Message-ID: <422F35A0.9060808@perathoner.de> Bowerbird@aol.com wrote: > in the meantime, my viewer-tool will actually _display_ all those > accented characters, even when they are not present in the e-text, > if the user chooses that option. Balderdash. You think you can sneak by using a word list? 
Then tell me how your forever-announced reader program is going to distinguish between the Italian words: e (meaning: and) è (meaning: is) Now put the accents back: La grappa e buona e la carne e cattiva. Don't be irritated by the fact that you don't understand the text. Your program also has to put the accents back without understanding the text. > if you want to help with that, create a list of such accented words. Get ispell or aspell or any other open-source spellchecker. They all have multilingual wordlists included. > michael's philosophy of having the e-texts work on all machines, > specifically including trailing-edge machinery, is _the_ factor > that has made his e-library the premier one in all of cyberspace, Prove this assertion. > as to my tools in particular, they will support unicode fully, > long before unicode works with all the other tools out there, > so you're barking up the wrong tree, buster. Your tools so far supported only your endless blabbing about them, buster. They never even got so mature as to print the greeting screen without crashing. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Wed Mar 9 11:58:26 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 11:58:43 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <87.2301818c.2f60af62@aol.com> marcello said: > Prove this assertion. history is the proof. study it. -bowerbird From marcello at perathoner.de Wed Mar 9 12:21:49 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 9 12:21:27 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <87.2301818c.2f60af62@aol.com> References: <87.2301818c.2f60af62@aol.com> Message-ID: <422F5ADD.8010802@perathoner.de> Bowerbird@aol.com wrote: >>> michael's philosophy of having the e-texts work on all machines, >>> specifically including trailing-edge machinery, is _the_ factor >>> that has made his e-library the premier one in all of cyberspace, >> >> Prove this assertion. > > history is the proof. study it. What can I reply to this blinding proof of bowerbirds superior argumentative powers? I bow to the Great Master. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Wed Mar 9 12:56:26 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 12:56:45 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting Message-ID: <66.5297a9a9.2f60bcfa@aol.com> here's one from last week that never got mailed out... i'll be leaving here again very shortly, since i have been reminded just why i had stayed away, because this place can be so negative and destructive and poisonous... ick! *** jon, you said the scanning took "much more than four hours". so how long _did_ it take? and if you were to do it again, with your present scanner, how long would it take you? also, how long did it take you to manipulate the images? and how did you do that? what specific steps did you take, in what order, and what program did you use to do all that? is there anything of all that which you'd do differently now? *** jon said: > OCR is quite fast. It's making and cleaning up the scans > which is the human and CPU intensive part. well, it all depends, jon, it all depends... with the right hardware -- like office-level machinery -- 60 pages a minute can get swallowed by the gaping maw. that's right. one page per second. that seems fast to me. that means your 450-page scan-job would take 7.5 minutes.
probably took you more time than that to cut the cover off. and the machine will automatically straighten those pages, o.c.r., and upload to the net, while you stare dumbfounded... likewise with the kirtas 1200, geared to scanning books. http://www.kirtas-tech.com/ it does "only" 20 pages a minute, but hey, 1000 pages/hour ain't nothing to sneeze at. they estimate that in a full-scale production environment, the price-per-scan is 3 cents a page. sounds like brewster should buy a half-dozen of these babies. so it all depends. the bottom line, though, is that if a person has experience, good equipment, solid software, and a concentrated focus, they can open a paper-book to start scanning it and move it all the way through to finished, high-power, full-on e-book in one evening, maybe two. *** i said: > third, you used a reasonable naming-scheme for your image-files! > the scan for page 3, for instance, is named 003.png! fantastic! > and when you had a blank page, your image-file says "blank page"! > please pardon me for making a big deal out of something so trivial > -- and i'm sure some lurkers wrongly think i'm being sarcastic -- > but most people have no idea how uncommon this common sense is! > when you're working with hundreds of files, it _really_ helps you > if you _know_ that 183.png is the image of page 183. immensely. > even the people over at distributed proofreaders, in spite of their > immense experience, haven't learned this first-grade lesson yet. i forgot to mention earlier that my processing tool can automatically rename your image and text-files, based on the page-numbers that it finds right in the text-files (which it extends in sequence for those files without a page-number -- usually the section-heading pages). so even if you're dealing with someone else's scans, and _they_ didn't name their files wisely, you don't have to deal with the consequences. *** jon said: > I believe as you do that an error reporting system is a good idea > so readers may submit errors they find in the texts they use -- > sort of an ongoing post-DP proofing process. i didn't elaborate earlier that it goes much deeper than that. a very important point here is that an error-reporting system -- over and above the obvious effect of getting errors fixed -- will actively incorporate readers into the entire infrastructure, making them active participants cumulating a world of e-books. if you have ever edited a page on a wiki, you're likely aware that the experience gives a very strong feeling of _empowerment_ -- because you can "leave your mark" right on a page, quite literally. if we set up a wiki-page to collect the error-reports for an e-text, in a system allowing people to check the text against a page-image, they'll be much more motivated to report errors than they are now, with the "send an e-mail" system. the feedback is more immediate, and compelling, with a wiki. furthermore, by collecting the reports, in the change-log right on the wiki, you can avoid duplicate reports. you can also give rational for rejecting any submitted error-reports, and/or engage people in a discussion about whether to act on a report. all of this makes your readers feel _responsible_ for the e-texts. a lifetime of experience with printed matter has made people very _passive_ about typographic errors. there's no reason to "report" an error they find in a newspaper, for instance, because hey, it's already been printed. the same with a magazine or a printed book. water under the bridge. 
and they translate that same attitude over to e-books, even though it _does_ do good to report errors there. so we need to do something to shake them out of their passivity, something to make them feel _responsible_ for helping fix errors. (just for the record, although i use the term "wiki", i don't mean it literally. what i have in mind is more of a "guestbook" type method, where people can _add_ their text to the page, but not necessarily _delete_ what other people have added. it's thus more like a blog, where everyone can add their comments to the bottom of the page, but the top part stays constant, to list the "official" information. but i'll still use the term "wiki" to connote a free-flowing attitude.) in addition to the wiki, you can build an error-reporting capability into the viewer-program that you give people to display the e-texts. if they doubt something in the e-text, they click a button and boom!, that page-image is downloaded into the program so they can see it. if they have indeed found an error, they copy the line in its bad form, correct it to its good form, and then click another button and boom!, the error-report is e-mailed right off to the proper e-mail address. this symbolic (and real!) incorporation of readers into our processes is a rad thing to do. but it's not the _only_ benefit of such a system; it also facilitates the automation of the error-correction procedures. the error-report can be formatted such that your software can automatically summon the e-text _and_ the relevant page-scan. so you see a screen with the page-scan _and_ the error-report. you check its merit, and if it's good, click the "approve" button and the e-text is automatically edited. further, the change-log is updated right on the wiki-page for that e-text, and anyone who requested error-notification gets an e-mail describing the change. auxiliary versions of the e-text -- like the .html and .pdf files -- are automatically updated. and all you did was click one button... face it, if you're dealing with 15,000+ e-texts, doing it manually is a sure-fire way to burn yourself out. who needs that hassle? i mocked up a demo up this, using a simple a.o.l. guestbook script. i'm sure you versatile script-kiddies here could do something that was much more sophisticated, but my version will give you the idea: http://users.aol.com/bowerbird/proof_wiki.html -bowerbird From prosfilaes at gmail.com Wed Mar 9 14:15:30 2005 From: prosfilaes at gmail.com (David Starner) Date: Wed Mar 9 14:15:40 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: References: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> Message-ID: <6d99d1fd05030914151ee0afd5@mail.gmail.com> On Wed, 9 Mar 2005 09:49:26 -0800 (PST), Michael Hart wrote: > Of course we should not forget people such as David Widger, > who has produced nearly 3,000 eBooks, about one per day, > over a period of years, nor David Price who sent us one > eBook per week for years, or several others who prefer > to remain anonymous. I'm sure it gets a lot easier after your hundredth book. For all those people doing thousands of books, a large group of books can be done in an evening. But the vast majority of people helping PG, those who sign up and proof a few hundred pages at DP and quit, or produce one or two books and wander off, don't have the skills and expertise to do a book in an evening. In any case, this applies to novels and simple non-fiction. You aren't doing many of the books I currently have up for proofing in one night. 
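Setting the one-evening debate aside for a moment: the error-report-and-approve workflow bowerbird sketches above (a reader submits the bad line and the corrected line, an approver checks it against the page scan, one click applies the fix and updates the change-log) is small enough to prototype. A minimal sketch, assuming Python; the file layout and field names are hypothetical, not anything PG or DP actually uses.

import json
import time
from pathlib import Path

def submit_report(queue, etext, page, bad, good):
    """Append a reader's correction to a guestbook-style, append-only queue."""
    report = {"etext": etext, "page": page, "bad": bad, "good": good,
              "submitted": time.strftime("%Y-%m-%d %H:%M"), "status": "pending"}
    with Path(queue).open("a", encoding="utf-8") as f:
        f.write(json.dumps(report) + "\n")

def approve(report, text_file, changelog):
    """One-click approval: apply the fix once and record it in the change-log."""
    path = Path(text_file)
    text = path.read_text(encoding="utf-8")
    if report["bad"] not in text:
        return False   # already fixed, or the report does not match the e-text
    path.write_text(text.replace(report["bad"], report["good"], 1),
                    encoding="utf-8")
    with Path(changelog).open("a", encoding="utf-8") as log:
        log.write("%s p.%s: %r -> %r\n" % (report["submitted"], report["page"],
                                           report["bad"], report["good"]))
    return True

Duplicate reports fall out of this naturally: once the first one has been applied, later reports for the same line no longer match and are rejected.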
From miranda_vandeheijning at blueyonder.co.uk Wed Mar 9 14:26:52 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Wed Mar 9 14:27:04 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1b9.ef334b5.2f609e1d@aol.com> References: <1b9.ef334b5.2f609e1d@aol.com> Message-ID: <422F782C.20600@blueyonder.co.uk> hi bowerbird, This sounds very exciting! I have a book which I want to put online, a grammar in three languages with loads of accents etc. It is very difficult and I expect it will take a long time to get through DP, which will be a shame as it is a very important text. I am encouraged to hear you can make this into an e-text in one evening! The scans are done and if you like I will mail you a copy. I'd like to have the proofed book back before the weekend, if that's not too much trouble. Thanks so much! Miranda van de Heijning Bowerbird@aol.com wrote: >michael said: > > >> Of course we should not forget people such as David Widger, >> who has produced nearly 3,000 eBooks, about one per day, >> over a period of years, nor David Price who sent us >> one eBook per week for years, or >> several others who prefer to remain anonymous. >> >> > >that's right! :+) > >of course, david widger is super-human, not "an average person". ;+) > >but -- with the right tool -- now an _average_ person can do it too! > >-bowerbird >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > > From Bowerbird at aol.com Wed Mar 9 14:38:25 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 14:38:41 2005 Subject: [gutvol-d] hey marcello Message-ID: <1a3.2f05cf2e.2f60d4e1@aol.com> hey marcello, since you recently noted that some of the e-texts are subsets of other e-texts -- like the separate e-texts for books of the bible -- how about if you continue your constructive streak and give us a summary of these duplicated e-texts? best would be to delete the subsets and give us just the larger "collection" -- so we would have the smallest possible list of all the unique books in the whole library -- but if it would be easier to do it the other way -- delete the collections -- that would be fine too. whenever you get a chance... thanks... -bowerbird From Bowerbird at aol.com Wed Mar 9 15:00:00 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 15:00:17 2005 Subject: [gutvol-d] lest the message be missed Message-ID: miranda said: > The scans are done and if you like I will mail you a copy. that would be great, miranda! i'd love to help you out! and you don't even have to mail them! just put them online, in one zip file, and let me know where they are. i'll go get 'em. oh yeah, i only have the _english_ module for abbyy finereader, and that won't work well with accented text, so you'll have to do the o.c.r. on the images with the correct language modules, ok? so put the o.c.r. files in a zip file too, so i can grab those. and i should say that my tool facilitates the proofing process, but can't help much if you're proofing a language you don't know, and i only know english. so you might not get the best results from me. so far, i'm just concentrating on doing _english_ books. once i get those down, then i can do work on helping people who speak other languages extend the tool for their purposes as well. oh yeah, one more thing. please include spell-check dictionaries for the languages that are contained in the text, because i only have an english one. 
marcello can probably help you find those... > I'd like to have the proofed book back before the weekend, > if that's not too much trouble. my schedule is full for the next week or more. i can't even get to the second half of "my antonia" until friday at the earliest, and probably next week. so it'll be a couple weeks before i can get to yours. and this sounds like it's not really "an average book", so it might take me two or three evenings, not just one. but i'd still love to take on your project sometime! so put those scans somewhere where i can grab 'em, and i'll get to them at my very first opportunity, ok? oh yeah, i do hope they are 600-dpi scans, like jon's. those were really fine. they gave very clean o.c.r., and they're very pleasant to look at, as well. nice. and let me know if you decide to start doing the project, because there's no reason for us to duplicate our efforts. it won't hurt my feelings if you get impatient waiting! but since there's lots of books over at d.p. for you to do, if you want to just hold off on this one and let me take it, feel free! -bowerbird From jon at noring.name Wed Mar 9 15:00:36 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 9 15:01:08 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <66.5297a9a9.2f60bcfa@aol.com> References: <66.5297a9a9.2f60bcfa@aol.com> Message-ID: <13029016843.20050309160036@noring.name> Bowerbird asked: > jon, you said the scanning took "much more than four hours". > so how long _did_ it take? and if you were to do it again, > with your present scanner, how long would it take you? It took about a minute or so to carefully place each page on the flat bed scanner, close the top, initiate the scanning, open the top, and replace the page with a new one. While one page was being scanned, I could do some related work such as naming and saving the previous scanned images. It got old pretty fast. So with a manual flat bed scanner, with an already chopped book, it took me about ten hours, spread over a few days, to do the 450 or so pages in "My Antonia" (I did it in cracks of time). If I had chosen 300 dpi scanning (rather than 600 dpi), it would have gone faster, but not four times faster -- maybe 20-30% faster as a rough guess. Of course, one goal was archival-quality scans -- I could have cut corners to make it go faster. Obviously, a fairly new model, professional-grade sheet feed scanner would have made life a lot easier. But lots of people, the average Joe, generally only have the el cheapo flat bed scanners which are *slow*, plus they may not have the necessary knowledge on scanning and image processing fundamentals to do a good job. I have a strong background in image processing (plus being an engineer helps in general, as well as an amateur photographer), so I caught on quite fast after talking with a few of the pros on scanner newsgroups. As an aside, I'm used to processing giant images, on the order of 24000x18000 in pixel dimensions (fractal art printing using Kodak LVT -- now it's Durst Lambda and equivalent machines) -- and I did this a few years ago on lower-horsepower PCs. > also, how long did it take you to manipulate the images? One needs enough *horsepower* to manipulate 600 dpi images (300 dpi images are *four* times smaller), plus some knowledge. Fortunately, most of today's basic Win XP boxes and laptops, and latest Mac OS X hardware, have sufficient horsepower (lots of memory helps.) > and how did you do that? 
what specific steps did you take, > in what order, and what program did you use to do all that? > is there anything of all that which you'd do differently now? There are "all-in-one" professional-level application tools that straighten out misaligned images, and crops them accordingly. I did this processing mostly by hand using Paint Shop Pro plus another tool for semi-automated alignment whose name eludes me at the moment (it was a 15-day trial software, and it expired the day after I completed the job -- they want $400 for that sucker. :^( ) For all of the above, this is why I'm advocating a semi-centralized project to scan public domain texts, working in parallel with other scanning projects, such as IA's: 1) We will use volunteers who have access to higher-end scanners (if not ones we supply), plus the knowledge on how to use them properly for books. 2) We probably can get $$$ to buy sheet feed scanners (which are not that expensive, less than 1% the cost of the automated page turning scanners IA is using in Canada, as will be discussed below.) 3) We will be able to afford the professional-level "all-in-one" scan processing software to do the automated alignment, consistent cropping, and image clean-up. 4) We will establish sufficient guidelines, plus QC procedures, to maintain a minimum scanned image quality. >> OCR is quite fast. It's making and cleaning up the scans >> which is the human and CPU intensive part. > with the right hardware -- like office-level machinery -- > 60 pages a minute can get swallowed by the gaping maw. > that's right. one page per second. that seems fast to me. The fairly good quality sheet feed scanners, which are "office- quality", may be able to do 5-7 archival-quality scans per minute (this includes down time due to setting up, stuck pages, etc.) So for scanning alone, not including keeping track of pages, page numbering, and other administrative details associated with scanning, the average 300 page book could be raw-scanned, by someone experienced, in about 45 minutes. This assumes 600 dpi optical (archival quality). It may go a little faster with 300 dpi optical settings -- not sure... > that means your 450-page scan-job would take 7.5 minutes. > probably took you more time than that to cut the cover off. Not possible, unless one bought the *big buck* (above office-level) sheet feed or page turning scanners, or one simply used a photocopy machine, and captured the low-rez images it produces. If you want to increase speed for a given technology, the scan quality (dpi and maybe color depth) has to be reduced. (Well, except maybe for photographic-type scanners, which are coming down in price, where a high-rez snapshot is taken at one moment of each page rather than running a scan head over the page. I see this as the long-term savior to produce archival quality scans, and do it more quickly. It may also be possible to autorotate the book to assure alignment, rather than doing alignment by image processing after-the-fact.) > and the machine will automatically straighten those pages, > o.c.r., and upload to the net, while you stare dumbfounded... The software exists, but this is *expensive*. You are not going to find the average person able to afford to buy the software. However, for the proposed "Distributed Scanners", we'll get the needed hardware and software to speed up the process, plus the book chopper for those books which can be chopped (both Charles and Juliet at DP have these guillotine-type page choppers -- they are quite impressive. 
) > likewise with the kirtas 1200, geared to scanning books. > http://www.kirtas-tech.com/ > it does "only" 20 pages a minute, but hey, 1000 pages/hour > ain't nothing to sneeze at. they estimate that in a full-scale > production environment, the price-per-scan is 3 cents a page. > sounds like brewster should buy a half-dozen of these babies. Brewster is already using something like the Kirtas for the Canada book scanning project. Not sure if it is a Kirtas or some other brand, though. I was told, or read somewhere, that the page turning scanner cost IA about $100,000. This is *major* bucks. Whether such machines will come down a lot in cost remains to be seen -- I doubt they will come down very much. These are fairly complex robotic machines, designed to handle all kinds of variations found in books, and to be very gentle on them -- yet produce a reasonably good image. I don't see a big enough market for these machines to substantially come down in cost by the power of competition. The Kirtas cost quote of 3 cents per page (which I assume includes labor, but unsure whether it includes capital equipment amortization) works out to about $10/book, which is IA's goal, btw. It requires a trained person to operate it. > the bottom line, though, is that if a person has experience, > good equipment, solid software, and a concentrated focus, > they can open a paper-book to start scanning it and move it > all the way through to finished, high-power, full-on e-book > in one evening, maybe two. Yes, but this is not for the average, ordinary Joe working in his basement. This requires a lot of $$$ in upfront investment to get this fancy equipment and software. For books which can be chopped (such as books where the cover is falling off, or very common old printings), then one can use $1000 (or less) sheet feed scanners, which maybe run at an average 5-7 pages per minute. Of course, with a "fleet" of sheet feed scanners, and the right image capture system, it is possible to run them in parallel -- above two machines, though, it probably requires two people to keep the machines properly fed (I don't think one person can operate any more than two sheet feed scanners and keep them occupied -- just a guess.) There's still need for the whiz-bang scan cleanup software, which I know is expensive. It can be done by hand, but it is laborious. (This cleanup could be centralized at one place, but there's the issue of moving the raw scans to the central location.) > i forgot to mention earlier that my processing tool can automatically > rename your image and text-files, based on the page-numbers that it > finds right in the text-files (which it extends in sequence for those > files without a page-number -- usually the section-heading pages). > > so even if you're dealing with someone else's scans, and _they_ didn't > name their files wisely, you don't have to deal with the consequences. Well, yes. However, in "My Antonia" a lot of pages were not numbered at all (such as the last page in each chapter). I had to be especially careful not to mess up and lose which page is which. Of course, with the Kirtas or a sheet feed scanner properly run, it is possible to keep all the scans in the proper order (which for a monoplex sheet feed scanner just run the ordered stack through once, and then once again.) > i didn't elaborate earlier that it goes much deeper than that. 
> > a very important point here is that an error-reporting system > -- over and above the obvious effect of getting errors fixed -- > will actively incorporate readers into the entire infrastructure, > making them active participants cumulating a world of e-books. This is *exactly* what we have in mind for LibraryCity's role in this, Bowerbird. We planned for this at least six months ago, but not implemented anything yet -- we have bigger fish to fry at the moment. But we envision enabling readers to build community around digital texts, and this includes mechanisms for error reporting/correction -- but not limited to just that. > if you have ever edited a page on a wiki, you're likely aware that > the experience gives a very strong feeling of _empowerment_ -- > because you can "leave your mark" right on a page, quite literally. Yes, LibraryCity plans to use wiki, or wiki-like, technology in various of its processes to build community, to enable people to become an integral part of the texts themselves, and to create new content -- to make the old texts come alive. > if we set up a wiki-page to collect the error-reports for an e-text, > in a system allowing people to check the text against a page-image, > they'll be much more motivated to report errors than they are now, > with the "send an e-mail" system. the feedback is more immediate, > and compelling, with a wiki. furthermore, by collecting the reports, > in the change-log right on the wiki, you can avoid duplicate reports. > you can also give rational for rejecting any submitted error-reports, > and/or engage people in a discussion about whether to act on a report. > > all of this makes your readers feel _responsible_ for the e-texts. Yes. This, btw, is also the power of Distributed Proofreaders -- it is an environment which not only increases trust in the work product, but it helps volunteers to feel like they are a part of something big. > in addition to the wiki, you can build an error-reporting capability > into the viewer-program that you give people to display the e-texts. > if they doubt something in the e-text, they click a button and boom!, > that page-image is downloaded into the program so they can see it. > if they have indeed found an error, they copy the line in its bad form, > correct it to its good form, and then click another button and boom!, > the error-report is e-mailed right off to the proper e-mail address. With our XML-based approach, we have the power of XPointer/etc. to enable not only error reporting, but full annotation, interpublication linking and so on. We're going to let the public annotate the books they read (the annotations will point to the XML internally, not alter the documents themselves.) This is just one of many things we are thinking of. (Btw, one has to be careful in how to reconcile error correction of texts with their usefulness in a full hypertext setting -- we don't want error corrections to break the already-established links for annotations, interpublication linking, RDF/topic maps for indexing, and so forth.) > the error-report can be formatted such that your software can > automatically summon the e-text _and_ the relevant page-scan. > so you see a screen with the page-scan _and_ the error-report. > you check its merit, and if it's good, click the "approve" button > and the e-text is automatically edited. further, the change-log > is updated right on the wiki-page for that e-text, and anyone who > requested error-notification gets an e-mail describing the change. 
> auxiliary versions of the e-text -- like the .html and .pdf files -- > are automatically updated. and all you did was click one button... > face it, if you're dealing with 15,000+ e-texts, doing it manually > is a sure-fire way to burn yourself out. who needs that hassle? Hmmm, this is a lot like what James Linden is developing, which may be incorporated into PG of Canada's operations. It is a good idea to maintain change tracking of all texts. And to answer your last point. Doing 15,000 texts, or a million texts, still needs some manual processing. It is also important to produce them correctly and uniformly in the first place, gather full metadata about them and put the metadata into a library acceptable form (e.g., MARC), and for various fields, such as author name, to maintain a single authority database as librarians do. PG's collection has been assembled so ad-hoc that trying to consistently autoprocess the collection is nigh impossible. That's why, to me, it is more important to redo the collection, put it on a common, surer footing (including building trust), before launching into doing a lot more texts. Imagine how difficult it would be to process one million texts if they were produced in the same ad-hoc fashion, without following some common standards. In the meanwhile, while most of the pre-DP portion of the collection is redone, a strong focus can be made on the archival scanning and *public access* of public domain books (including tackling the 1923-63 era in the U.S.) and getting them online as soon as possible (including properly done metadata and copyright clearance). Then, when the next-gen systems are in place to resume major text production, the scans will be there, available, and already online for associating with the SDT versions. And this is where we diverge -- I don't believe the full process can be done totally by machine, there's still need for people to go over every text to make sure the markup for document structure and inline text semantics are correctly done. This is *very* important for the more advanced usages of the digital texts: indexing, interpublication linking, multiple output formats and presentation types, cataloging, data mining, and Michael Hart's dream of eventual language translation. PG's ad hoc approach up to now (which DP has partly fixed), works against making the text collection capable of meeting these very advanced needs. XML (or some other text structuring technology with similar fine granularity) is necessary -- it can't be done using any plain text regularization scheme, unless the scheme is made very complex, whereupon going to XML simply makes sense because it follows the general trends of XML in the publishing workflow. Jon Noring From Bowerbird at aol.com Wed Mar 9 15:49:07 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 15:49:23 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting Message-ID: <2b.6e87e0b0.2f60e573@aol.com> jon said: > Not possible, unless one bought the *big buck* (above office-level) > sheet feed or page turning scanners, or one simply used > a photocopy machine, and captured the low-rez images it produces. my girlfriend's office has a $10,000 lanier just down the hall. that's the kind of machine i was talking about. their website says that their high-end machines can scan 60+ pages an hour. 
but i grant you that a scanning time of a few hours (or more) is much more in line with what most normal people can attain, even those with lots of experience like yourself... > Yes, but this is not for the average, ordinary Joe > working in his basement. This requires a lot of $$$ > in upfront investment to get this fancy equipment > and software. i think you might be surprised in the coming months, jon. > There's still need for the whiz-bang scan cleanup software, > which I know is expensive. donovan was working on some open-source deskewing routines. might want to check that out. and i'm told that abbyy does a fairly good job setting brightness and contrast automatically. so the other thing that needs to be done is to standardize the placement of each scan relative to each other, which isn't hard. (removing curvature is a bear, but the best new scanner out -- the optik? -- lets you lay the book on the edge of the bed, which i understand effectively cures the curvature problems.) > in "My Antonia" a lot of pages were not numbered at all that's not uncommon. > (such as the last page in each chapter). yes, i noticed that. _that_ is a little uncommon. but like i said earlier, publishers can be weird. > I had to be especially careful not to mess up > and lose which page is which. it's _fairly_ easy to do each page in sequence -- just have to pay some attention turning the page -- and then using the auto-increment-name option will ensure that all of the files are named correctly. > Hmmm, this is a lot like what James Linden is developing, > which may be incorporated into PG of Canada's operations. if you check the archives you'll find i'm the one who posted it. i also offered to write all the software. all that was ignored. doesn't matter though, i'm proceeding to build my own system. if james took my post to heart, then he's smart. :+) > Doing 15,000 texts, or a million texts, > still needs some manual processing. if you're manually opening every file, and manually summoning every scan you need to check, you're going to burn yourself out. _plus_ expose yourself to the reality of inadvertent changes. you have to have a system that tracks every change that's made, so you can review the log to make sure it was the correct change, and that nothing else was changed. reviewing the log is "manual", and so is the decision as to _approval/rejection_ of the change, but the change itself should be totally automated. > That's why, to me, it is more important to redo the collection, > put it on a common, surer footing (including building trust), > before launching into doing a lot more texts. the library needs to be _corrected_, yes, but _not_ "redone". and i think you do more damage than good when you talk about e-texts being done "incorrectly", when what you _really_ mean is that an edition was used that you don't happen to approve of, or that metadata isn't included, just to use some most examples. there are _real_ errors in the e-texts. honest-to-goodness mistakes. we need to concentrate on _those_, not on some edition that uses the british spellings instead of american ones. (even if that _was_ silly.) but distributed proofreaders is more interested in doing new books than fixing old ones. they're volunteers who set their own priorities. > Imagine how difficult it would be to process one million texts > if they were produced in the same ad-hoc fashion, > without following some common standards. i don't have to "imagine" it. that's the way the library is now. 
and i made my fair share of efforts to try and convince the powers that that situation needed to be addressed with some standardization. but the difficulty of doing it with the type of heavy-markup that you like has held up that whole darn process. if we would have proceeded with the "zen markup language" that i like, the library would have been clean now. > PG's ad hoc approach up to now (which DP has partly fixed) the d.p. e-texts still exhibit a large degree of inconsistencies. and contrary to what you imply, they are not generally error-free. some are, but others are not. the same is true of earlier e-texts. the quality has improved, yes, surely. it is still not highest quality. but they are volunteers, and thus they set their own bar for quality. and they certainly deliver quality that is high enough that we could use "continuous proofreading" and have the public zoom us to perfect. > it can't be done using any plain text regularization scheme you're wrong. dead wrong. *** anyway, jon, thanks for the information on your scanning experience. i come away from hearing it with an even more firm conclusion that scanning and image-cleanup is indeed the biggest part of the process. -bowerbird From jon at noring.name Wed Mar 9 16:14:46 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 9 16:15:01 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <2b.6e87e0b0.2f60e573@aol.com> References: <2b.6e87e0b0.2f60e573@aol.com> Message-ID: <3633467218.20050309171446@noring.name> Bowerbird wrote: > jon said: >> Not possible, unless one bought the *big buck* (above office-level) >> sheet feed or page turning scanners, or one simply used >> a photocopy machine, and captured the low-rez images it produces. > my girlfriend's office has a $10,000 lanier just down the hall. > that's the kind of machine i was talking about. their website > says that their high-end machines can scan 60+ pages an hour. But what resolution? With scanners that move something with respect to the page, the higher the resolution, the slower it is. (On the other hand, today's 12 megapixel digital cameras, which for "My Antonia" would produce approximately 600 dpi quality, take a snapshot of the whole page, and can transfer the file in very short time, short than it takes to turn the page.) > but i grant you that a scanning time of a few hours (or more) > is much more in line with what most normal people can attain, > even those with lots of experience like yourself... Well, I'm not an experienced scanner (there's a difference between understanding the principles, and actual experience), but I think by the time I got finished with My Antonia, I gained a few stripes. >> There's still need for the whiz-bang scan cleanup software, >> which I know is expensive. > donovan was working on some open-source deskewing routines. > might want to check that out. O.k., thanks. Open source, high-quality deskewing routines are definitely needed! Now, it's a matter to also get a high-quality open source cropping and normalization application. > and i'm told that abbyy does a > fairly good job setting brightness and contrast automatically. > so the other thing that needs to be done is to standardize the > placement of each scan relative to each other, which isn't hard. > (removing curvature is a bear, but the best new scanner out > -- the optik? -- lets you lay the book on the edge of the bed, > which i understand effectively cures the curvature problems.) 
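On the open-source deskewing routines asked for a few paragraphs up: a serviceable deskew does not have to be elaborate. A minimal sketch, assuming NumPy and a reasonably recent Pillow (library choice, function name, and the page-183 filename are all illustrative): rotate a downsampled, binarized copy of the page through a small range of angles and keep the angle whose row-ink profile is sharpest, i.e. where the text lines sit squarely on pixel rows.

import numpy as np
from PIL import Image

def deskew(path, max_angle=3.0, step=0.1):
    """Return the page image rotated by the angle that best squares its text lines."""
    img = Image.open(path).convert("L")
    small = img.resize((max(1, img.width // 4), max(1, img.height // 4)))
    ink = np.array(small) < 128          # True where there is ink

    def sharpness(angle):
        rotated = Image.fromarray(ink.astype(np.uint8) * 255).rotate(
            angle, expand=False, fillcolor=0)
        rows = np.asarray(rotated, dtype=float).sum(axis=1)
        return float(((rows[1:] - rows[:-1]) ** 2).sum())   # crisper lines score higher

    angles = np.arange(-max_angle, max_angle + step, step)
    best = max(angles, key=sharpness)
    return img.rotate(float(best), expand=True, fillcolor=255)

deskew("183.png").save("183_deskewed.png")

This only handles skew; the gutter curvature discussed in the quoted passage above is a separate and much harder problem.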
Yes, I've heard of these book-oriented scanners which are more gentle on bindings (but even here the binding is stressed.) There's a web site somewhere giving a review of the model you describe, but don't have the URL handy. > but distributed proofreaders is more interested in doing new books > than fixing old ones. they're volunteers who set their own priorities. Yes, that is true. There is a lot of interest in DP to redo a lot of the pre-DP classics in the PG corpus, from what I understand, so it may get done anyway even if PG does not encourage it. Jon From Bowerbird at aol.com Wed Mar 9 17:04:58 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 17:05:14 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting Message-ID: <129.58668f74.2f60f73a@aol.com> jon said: > But what resolution? their website tells you. in a $10,000 machine, it better be good. > There's a web site somewhere giving a review > of the model you describe, but don't have the URL handy. we don't need a review on a website, as there's plenty of d.p. people here who'll vouch it's an amazing machine. > Yes, that is true. There is a lot of interest in DP > to redo a lot of the pre-DP classics in the PG corpus, > from what I understand, so it may get done anyway > even if PG does not encourage it. you didn't read what i wrote. it is _distributed_proofreaders_ that -- as a whole -- is more interested in doing new books than re-doing old ones. if they wanted to do it before now, they would have. but they haven't... (a few of 'em have redone old books. including some html versions that jim recently asked them to fix up. but as a course of action, not much.) michael doesn't tell d.p. what to do. he doesn't tell _anyone_ what to do. even if you _ask_ him for guidance, he's usually too stubborn to give it. -bowerbird From fielden3 at aol.com Thu Mar 10 00:23:56 2005 From: fielden3 at aol.com (Kent Fielden) Date: Thu Mar 10 00:24:20 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <129.58668f74.2f60f73a@aol.com> References: <129.58668f74.2f60f73a@aol.com> Message-ID: <4230041C.7060606@aol.com> At the risk of coming into the middle: My experience is that the time consuming part of going from book to E-book is the proofreading. The scanning, cropping, and OCR are probably less than about a quarter of the time. - I use a Canon S230 3 megapixel camera in a copy stand to get about 300 dpi scans. I can do 4-6 pages a minute, without destroying the book. I have been quite happy with using a camera as scanner, but 600 dpi would halve the processing speed. I tried 2 pictures per page, but I did not find any improvement in the OCR quality. - I use Abby FineReader 5.0, which was not that expensive, and it usually finds the right text, flipping pages and cropping automatically. A pass over the pages using FineReader to find basic OCR issues takes about 15 seconds a page. So up to this point, a 250 page book could be done in 2-3 hours of concentrated work. I would guess I have 2-4 errors per page at this point. - then comes a first pass proofreading, also fixing headers and footers. this is often 30 seconds per page. - then a full second pass of proofreading, again about 30 seconds a page. I probably find an error a page in this pass. Then I ship it. I could believe it could be done in a day's elapsed time, but I don't think I can focus that hard all in a single day. 
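A rough check of the resolution numbers being traded in the last few messages, assuming a 6 in x 9 in page; the trim size, the pixel counts, and the helper function are assumptions made for illustration, not figures anyone in the thread states.

def effective_dpi(px_long, px_short, page_long_in=9.0, page_short_in=6.0):
    """Resolution a single camera frame delivers across a whole page."""
    return min(px_long / page_long_in, px_short / page_short_in)

print(round(effective_dpi(2048, 1536)))   # 3 megapixel frame  -> ~228 dpi
print(round(effective_dpi(4000, 3000)))   # 12 megapixel frame -> ~444 dpi
# Framing only the text block (say 4.5 in x 7 in) lifts the 3 MP figure to
# roughly 290 dpi, which is where "about 300 dpi" estimates come from; smaller
# trim sizes likewise bring a 12 MP frame closer to the 600 dpi mentioned above.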
The real problem is that my day job is using up most of my available concentration, so I don't feel up to spending too much time proofing. my 2 cents... Kent Fielden From traverso at dm.unipi.it Thu Mar 10 03:09:17 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Thu Mar 10 03:07:03 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <3633467218.20050309171446@noring.name> (message from Jon Noring on Wed, 9 Mar 2005 17:14:46 -0700) References: <2b.6e87e0b0.2f60e573@aol.com> <3633467218.20050309171446@noring.name> Message-ID: <200503101109.j2AB9H232437@posso.dm.unipi.it> >>>>> "Jon" == Jon Noring writes: Jon> Yes, I've heard of these book-oriented scanners which are Jon> more gentle on bindings (but even here the binding is Jon> stressed.) There's a web site somewhere giving a review of Jon> the model you describe, but don't have the URL handy. I have one, Plustek OpticBook 3600, and I am very much satisfied with it, but scanning books in book mode trims away at least 1cm. in the middle, so it can be used only if the margins are generous. To use it you have to open the book at 90 degrees, usually possible. I am satisfied nevertheless with the speed, the depth of the scan (there is almost no shadow in the gutter), the overall quality for the price. I see it quoted now $239, but it is difficult to find it in online shops: apparently there is much demand. However, in my experience, the limit is not scanning quality: it is print quality. OCR quality is pretty good on modern editions, but old books, often stained, and even more often with defective print, give rise to a lot of errors. Often you don't have the choice of a better print. Carlo From shimmin at uiuc.edu Thu Mar 10 08:45:56 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Thu Mar 10 08:46:02 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <200503101109.j2AB9H232437@posso.dm.unipi.it> References: <2b.6e87e0b0.2f60e573@aol.com> <3633467218.20050309171446@noring.name> <200503101109.j2AB9H232437@posso.dm.unipi.it> Message-ID: <423079C4.5010903@uiuc.edu> Carlo Traverso wrote: > I have one, Plustek OpticBook 3600, and I am very much satisfied with > it, but scanning books in book mode trims away at least 1cm. in the > middle, so it can be used only if the margins are generous. To use it you > have to open the book at 90 degrees, usually possible. I use the same model, and am very happy with its speed; for 300 dpi images of 8vo sized books, I have clocked myself at 300 pages per hour on a book with a good binding. I don't know what software you use it with, but if you have Abbyy, you might do what I do and run it through Abbyy's interface rather than its own "book mode" interface. The Abbyy driver should capture the entire platen rather than throwing away the outer cm. My experience is that having the book only 90 degrees open eliminates much of the gutter shadow on its own, and the additional processing that "book mode" does is largely unnecessary. > However, in my experience, the limit is not scanning quality: it is > print quality. OCR quality is pretty good on modern editions, but old > books, often stained, and even more often with defective print, give > rise to a lot of errors. Often you don't have the choice of a better > print. This can't be helped.
However, the other issue that gives problematic raw OCR is that even when character recognition is good, layout detection can be poor, and sidenotes, multi-column text, and the like can be blended in with the main text, while corners might be chopped off, and in older printings where the inter-line spacing might not be exactly constant, whole lines can be elided. If I'm going to exert more effort in getting images and OCR, I've found that the place where it pays off the most is in previewing and correcting the recognition areas before letting the OCR do its work. -- RS From Bowerbird at aol.com Thu Mar 10 22:41:43 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 10 22:42:04 2005 Subject: [gutvol-d] ok, let's wrap this up, folks Message-ID: <199.3ab251fa.2f6297a7@aol.com> kent said: > At the risk of coming into the middle: ain't _that_ the truth! ;+) unless i am prodded further, however, this will be my last post on this thread. and this will also be my last thread before i take a long break from here, with the exception of my final report on "my antonia", and reports on the book that miranda asked me to do... and yes, people, it's a long post, because it's full of detailed thinking and analyses. if that ain't for you, not your cup'o'tea this afternoon, hit the 'delete' key, don't go running off complaining to michael and greg... > My experience is that the time consuming part > of going from book to E-book is the proofreading. ok, let's take a look at what you have to say. > I use a Canon S230 3 megapixel camera um, it is unlikely that's good enough. this very issue of using a digital camera rather than a scanner is being discussed right this second on another listserve, but the people there are talking about 5-megapixel and up, even a 10-megapixel. i seriously doubt a 3-megapixel works well. there are other concerns with a camera too. are you using external lighting on the book? if not, then your images will be substandard. do you use a tripod? do you focus manually? as always, photography can be a tricky thing. > I use a Canon S230 3 megapixel camera > in a copy stand to get about 300 dpi scans. there are also issues with the "copy stand". some stands can be good. others, not at all. are your scans showing curvature problems? if so, that can be a killer to o.c.r. recognition. unevenness in the brightness across the page? that can significantly impair the o.c.r. too. > I use a Canon S230 3 megapixel camera > in a copy stand to get about 300 dpi scans. 300dpi ain't giving your o.c.r. app the best you could. and isn't really creating what you'd want for archives. it's much more time-consuming to scan at 600dpi, and i think it's an open-question whether we want to ask individuals like you to take that extra time, or whether we wait to re-do scans until we have the equipment that will make that process fly by. but if we do take the 300dpi shortcut in scanning, or by using a digital camera rather than a scanner, then we need to do it with the full knowledge that that decision _might_ impact o.c.r. accuracy, which in turn _might_ result in more proofing work, which _might_ end up actually _costing_ us time overall... given the differences obtained from different scanners, and different source-texts, and different o.c.r. 
programs, and even from different _people_ doing the imaging -- if you've looked at a range of scanned books, you'll know that different people exhibit a wide range of variability in how carefully, e.g., straight, they position each page -- it's very difficult to do the research we'd need to do to find out exactly _how_much_ time we're wasting by creating images at less-than-ideal resolution. but we are _certainly_ wasting some time, in some situations -- and perhaps a _lot_ of time in more than we know... i'll put this as plainly as i can: if we use inferior tools, we _will_ get inferior results. if you take care to notice it, my statements about "one evening" are hedged carefully with qualifiers about "the right scanner", "the right manipulations", "the right tools", and of course, "an average book"... a lot of the people who scoff are people who are using inferior tools, and getting inferior results. people once thought heavier-than-air flight impossible. it is, if you do it wrong. if you do _anything_ wrong. and there are lots and lots of things you can do wrong. but do _everything_right_ and flying is certainly possible. now people fly every day in a plane, with no second thought. and, to be clear, i'm talking about the amount of time that it takes _after_ the page-scans are cleaned up. as people have confirmed, the scanning and clean-up will often take a very long time, all by themselves. compared to _that_, proofing should be much faster. before i leave the arena of the image-creation process, i should say there is only _one_ "right" scanner out there currently, in the range of personal affordability anyway. it's that optic3600 that other people have mentioned here. if you're using another scanner, you're wasting your time. maybe you're not wasting a _lot_ of your time, perhaps not enough to consider a $250 scanner as an "investment", but you need to know that you _are_ wasting some time. and if you use inferior tools, you will get inferior results. one more thing, since carlo mentioned that sometimes he gets inferior results because the p-book is shoddy. hey, no question that a bad original will make bad scans. the best answer to that problem, though, is very simple: go find a cleaner copy of the book to get your scans from. _somewhere_ out in the world, there _is_ a cleaner copy. (if not, let that rare book be scanned by a professional!) and if those bad scans are coming from somewhere else? the same answer: go find a cleaner copy and scan _that_. don't waste valuable time dealing with inferior images! jon noring keeps talking about how wonderful it is that distributed proofreaders keeps the scans for their books. and it is. but the truth of the matter is that precious few of those scans can be considered good enough for archival. so those books will have to be rescanned in the future too. let's hope that brewster and/or google are doing it right... > I use Abby FineReader 5.0 v5 won't give you the accuracy that v7 will. that's likely the _main_ reason that proofing is taking you longer than it should. version 7 does a much better job than 5. you will find the upgrade price _is_ an excellent investment, even if your time isn't really worth very much... if you use inferior tools, blah blah blah... > then comes a first pass proofreading, > also fixing headers and footers. > this is often 30 seconds per page. um, no. you're getting way ahead of yourself. after scanning, you _first_ need to clean up the page-scans -- which means deskewing them, standardizing placement, etc.
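For anyone curious what an open-source deskewing routine of the sort mentioned earlier in this thread might look like, here is a minimal projection-profile sketch using Pillow and numpy; it assumes dark text on a light page, and it is an illustration only, not donovan's code or any actual tool:

    import numpy as np
    from PIL import Image

    def deskew(path, max_angle=3.0, step=0.1):
        img = Image.open(path).convert("L")
        small = img.resize((img.width // 4, img.height // 4))   # reduced copy, for speed
        best_angle, best_score = 0.0, -1.0
        for angle in np.arange(-max_angle, max_angle + step, step):
            rows = 255 - np.asarray(small.rotate(angle, fillcolor=255), dtype=float)
            score = rows.sum(axis=1).var()   # text lines align -> row-ink variance peaks
            if score > best_score:
                best_angle, best_score = angle, score
        return img.rotate(best_angle, expand=True, fillcolor=255)

    deskew("page_042.png").save("page_042_deskewed.png")

Real cleanup chains do more than this (cropping, despeckling, the blur-then-resize trick mentioned below), but even skew correction on its own is worth automating.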
almost every page is skewed to some degree. even though this might not be apparent to you without careful analysis, it _is_ a factor with big impact on the o.c.r. accuracy. and furthermore, when a person views page after page of the images, to read 'em, even a small skewing causes a subconscious weirdness to them. as for placement, i mean the left and top margins of each scan are identical. it's another factor effecting reader subconscious. while it's less important to o.c.r. accuracy, it does sometimes exert an impact there too, specifically in regard to the "zoning". (and yes, you _do_ have to zone the pages to get the best o.c.r.) there are a whole slew of other ways to manipulate the images. i don't have any experience with some of them, to discuss them, but there are some people over at distributed proofreaders who seem to know a lot, including one person whose name escapes me, who has formulated his "recipe" for enhancing page-scan images. interestingly, it includes "blurring" the image at one point, which certainly seems counterintuitive, but has the effect of converting the one-pixel dots into two-pixel dots (or some such), which means they don't get deleted in a later step where the image is downsized. (d.p. resizes many scans to a size that works well in their system; that also might be considered a shortcoming in their scan-archive.) now some of the skeptics out there are probably muttering now that adding time to the imaging process to save it on the proofing process isn't really "saving" us any time. and there is a little truth to that. however, many of these image-cleanup steps can be _automated_, so they are great candidates for inclusion in our ideal work-flow. even more importantly, it's vital that we start considering the scans as a product in and of themselves. i fully agree with michael hart that "a picture of a book is not an e-book". i too want raw, editable text. but that doesn't mean a high-quality "picture of a book" isn't useful. indeed, as pointed out here, it's the first step on the way to getting the raw, editable text. and even after that, it continues to be useful. people _will_ -- in the future -- desire to _replicate_ older books. they will want print-outs that "look exactly like" the original book. (_especially_ with books like those by william blake, for instance.) and the best way to fill that demand is to have high-quality scans. tomorrow's low-end printers will be 600dpi (if they aren't already). so that's the resolution that we need to be aiming at with our scans. yes, i fully realize that that is ridiculous in terms of the present, when that kind of resolution overwhelms our memory and bandwidth, as soon as we stop thinking about books at the individual-book level and start thinking about them as collections in the tens of thousands. which is precisely why i tell people now that 300dpi is acceptable, even for the "archive" versions we're building for the here-and-now, just as long as the 300dpi scans give us acceptable o.c.r. recognition. but i give louder applause to the foresight to go to 600dpi right now. (me, though, i'll go 300dpi unless/until i have a high-speed scanner, expecting that _every_ book i'll scan will eventually be rescanned.) > then comes a first pass proofreading, > also fixing headers and footers. > this is often 30 seconds per page. ok, after you've cleaned up the scans, you can start the "proofing". 
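Before moving on to the proofing, here is roughly what the "standardize placement" step just described could look like in code; this is a sketch with made-up canvas, margin, and threshold values, not anyone's actual recipe:

    from PIL import Image

    def normalize_placement(path, canvas=(2500, 3300), margin=150, threshold=200):
        page = Image.open(path).convert("L")
        ink = page.point(lambda p: 255 if p < threshold else 0)   # mask of the printed area
        box = ink.getbbox()                                       # bounding box of the ink
        if box is None:                                           # blank page: leave it alone
            return page
        out = Image.new("L", canvas, 255)
        out.paste(page.crop(box), (margin, margin))               # identical top/left margins on every scan
        return out

    normalize_placement("page_042_deskewed.png").save("page_042_placed.png")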
but there are lots and lots of different ways of "doing the proofing", so let's be perfectly clear about exactly what we we're talking about. my software tool guides you through the processes a certain way, so i'll be discussing that path. like i said, i plan to release my tool in late spring, about the same time that the internet archive begins to release scan-books from their toronto project, so if you prefer, save this post until then, when my tool is out. that's fine with me. on the other hand, if you want to consider my alternative processes, to see which ones you can incorporate into your work-flow, read on. i don't mean to frustrate anyone by saying "i've got a tool to do that" before the tool is released. but if this advance information helps... the first thing to do is a quick check that you got all the scans right. my tool allows you to "thumb through" all of them, from start to end; it displays them 2-up, so they look exactly like a p-book page-spread. on the first pass, you'll just look at each spread, ensure it looks good. on the second pass, you'll be looking at the text instead of the scans. here, the 2-up view shows the text on one side, the scan on the other. (my tool uses this 2-up view -- text next to its scan -- throughout.) in this pass, you'll be formatting the text, to make it match the scan. i'm still in the process of figuring the best way to save o.c.r. output, i hope my tool will do most of the formatting right automatically, but when it doesn't, you will have to do the formatting yourself, manually. "manually" doesn't mean "editing", like you'd do with a word-processor. while that may be necessary on some rare occasions here, in general there will be buttons that you can click to do most of the formatting. for instance, say there's a block-quote that didn't get auto-formatted. you would select the lines of the quote, and hit a "block-quote" button. same for a poem that didn't get indented, or to right-justify an epigraph. if your book is like most -- one boring page after another boring page -- there will be very little for you to do. for "my antonia", for instance, the only real excitement here was with the occasional chapter heading. for books that need heavy formatting, you should save that for later, and move to the next step, which is where the tool starts "proofing". my tool -- and the ones that are being developed by other people too -- takes the o.c.r. results and automatically makes some changes _before_ ever presenting them to you "for proofing". for the most part, these are changes due to known recurring errors in the o.c.r. recognition routines, so a person generally needs to build a list idiosyncratic to their setup. (one person doing this had a list of over 400 rules with his old scanner, but when he bought the optic3600, he was able to drop _half_ of them.) there are also some checks that are generic to all setups. an example would be replacing any "tbe" word with "the". undoubtedly a flyspeck caused that nonsense error, so we would just change it automatically. remember that all of these changes are taking place _before_ the text has even been viewed yet by a human being, so if -- for some reason -- it _really_was_ "tbe" instead of "the" (because, for instance, it was _this_ message that was being scanned), the human can change it back! 
(well, if it actually was _this_ message being scanned, then the change wouldn't be _automatic_, not with my tool anyway, because any "scanno" that is in quotes is _not_ changed automatically, for just that reason. but you get my point: it's safe to make automatic changes at this time, because we know that human beings are still going to review the text.) there are a number of other checks that happen at this time as well, based on analyses of the text. i won't say much about these, because that would give away too much about my program before its release, but some of the obvious ones would include the one to "close up" the spaces that o.c.r. often injects around punctuation. (or which, like in "my antonia", are _really_ right there in the paper-book. an example is on the very first page -- page 3 -- where "hands" is surrounded by such floating quotemarks; it's clearly printed as " hands ". even jon, with his focus on "fidelity", tightened up those floating quotemarks.) this is where the o.c.r. of "mr," and "mrs," -- followed by a comma, instead of a period (which i mentioned before) -- would get fixed. all of these automatic changes are logged to a file, so they can be reviewed by a human. except that review is often a waste of time, because these changes are (or at least should be) totally obvious. and if your review _does_ show an auto-change that was incorrect, and therefore shouldn't have been made, you would seriously consider _the_removal_of_ the rule that was responsible for that auto-change. also, kent, since you specifically mentioned headers and footers, a good tool will let you retain those right up until the last minute. they don't hurt anything -- and they help you keep your bearing -- so there's no need to delete 'em. the tool should de-emphasize them -- mine displays them in gray, which makes 'em unobtrusive _and_ has the benefit of letting you know it identified them correctly -- but they're something you shouldn't have to spend time on in any way. after the automatic changes comes the fun part. at this time, the app does the hard work. again, i don't wanna steal thunder from my tool, but the aim at this point in time is to present to you _each_line_ that will need your attention (accompanied by the page-scan containing it), and _only_ those lines that need your attention (i.e., no false-alarms). that is, the tool seeks to find every line that has an _error_ in it, and present it to you, alongside a page-scan, so you can correct the error; and it seeks to show you _only_ those lines that really have an error, so it doesn't waste your time showing you lines you don't need to fix. that is the "secret sauce" in the tool -- to show you _every_ line that you'll need to fix, and _only_ the lines that need fixing, and no others... of course, that's the _ideal_, and we can only hope to _approach_ that. after all, if the tool knew for certain where each and every error was, we could just tell it to correct the errors itself, while we ate lunch. so we scale our expectations back to something a bit more reasonable, and have the program bring up -- to the best of its ability to do so -- each line for which it has some good reason to think we need to check. to put this into a phrase, we have the tool look for _probable_ errors. some of them might not actually be errors, but we go on probability... we do want to find _all_ the errors, or as many as we reasonably can, so we'll accept _some_ "false alarms". they're preferable to _missing_ an actual error. 
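A minimal sketch of that automatic pre-proofing pass, with the quote guard and the change log described above; the rule list here is a made-up example of the kind of per-setup list being discussed, not anybody's actual 400-rule file:

    import re

    RULES = [(r"\btbe\b", "the"), (r"\bMr,", "Mr."), (r"\bMrs,", "Mrs.")]   # illustrative only

    def auto_clean(lines, log_path="autochanges.log"):
        cleaned, log = [], []
        for n, line in enumerate(lines, 1):
            for pattern, repl in RULES:
                def fix(m, line=line, repl=repl, n=n):
                    # crude guard: skip a hit with an odd number of quote marks
                    # to its left, since it is probably inside quoted matter
                    if line.count('"', 0, m.start()) % 2 == 1:
                        return m.group(0)
                    log.append("line %d: %r -> %r" % (n, m.group(0), repl))
                    return repl
                line = re.sub(pattern, fix, line)
            cleaned.append(line)
        with open(log_path, "w") as f:        # the reviewable log of every automatic change
            f.write("\n".join(log) + "\n")
        return cleaned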
but at the same time, too many of 'em wastes our time. after all, the tool could just show us _every_ line and say "check it"; but that wouldn't be buying us any improved efficiency now, would it? so the closer we get to the ideal -- show us every line we need to see, and not one line that we _don't_ need to see -- the better we like it. and if the tool tells us what is wrong with the line, and suggests the correct fix, with a "yes, fix it" button we can click, so much the better. to use an example from above, let's say that it offered to close up those floating quotemarks around "hands" with just the click of a button. slick! if we get _close_enough_ to the ideal -- where we are shown only lines that have errors, and no others -- then we will have just sat there and button-clicked, while our text became easily and adequately "proofed". once we've corrected every line that needs to be corrected, we are done! but we don't really have to get all the way to the ideal to be successful. again, my "standard" is 1 error every 10 pages. and i expect to do better. but if i attain that rate, i will consider my tool to have been "successful". i should say specifically that _spell-check_ is an important part of this. i find it laughable and ridiculous that distributed proofreaders does _not_ do a spell-check on the o.c.r. results before shipping them off to proofers. your first reaction might be "why do a spell-check, since that is exactly the job proofers are gonna be doing anyway?", plus then go on to point out how much time a spell-check would take, and various other considerations, perhaps even launch into your spiel about "what a distributed process is". (spare me; as a social psychologist, i understand it far better than most.) heck, there is actually some debate over at distributed proofreaders about whether a spell-check must be done _after_ the text comes out of proofing. which explains why some e-texts are actually being posted now that have obvious spelling errors in them that will _not_ pass a spell-check! awful! except i'm talking about a very specific form of limited spell-check, namely an analysis of the text that creates a list of all the words used in the book. again, i won't explain how it works, but the purpose is to compile the words that are _unique_ to the book. the best example is _names_of_characters_, another good example is _words_and_phrases_from_a_foreign_language_. and there are other categories. here are some examples from "my antonia": > kolaches > mamenka > misterioso > patria > tatinek > amour propre > noblesse oblige > Optima dies... prima fugit > palatia Romana > Primus ego in patriam mecum... deducam Musas these words are used to create a _book-specific_spell-check_dictionary_: words not in a normal spell-check dictionary, but which _are_ in the book. i believe that every e-text should include such a word-list in an appendix. first, it's useful, from the standpoint of end-users running a spell-check; once this book-specific word-list is specified as an additional dictionary, the entire file should pass through spell-check without pausing even once. but moreover, it's just plain _fascinating_ to browse this list for a book. it is a quickie road-map to the freakish extremes of that particular book. back to the job at hand... the word-list _is_ very useful to spell-check text right out of o.c.r., and _before_ you commence the job of "proofing". as a good example, remember those character-names?
when you browse an alphabetized version of the word-list, you'll see a name popping up in a variety of variant forms, such as the possessive, the plural, and so on. what you'll _also_ see, though, is an occasional place where the name was misrecognized. boom! my tool allows you to click on it, and then immediately jumps you to it in the text -- right alongside the image -- so you can verify that it's an error, and change to the correct spelling. (my plan is to have a button you can just click to make the correction.) and if the error is obvious enough, you might not even go to the bother of jumping to its location in the text, but rather just fix it immediately. (remember, you can review these changes if you want down the line.) one of the test-books i used to develop my tool, way back when i first started putting it together, was "the hawaiian romance of laieikawai". (some of you know this e-text was in the group issued for dp#5000.) i might've spelled that name wrong; face it, it's a pretty difficult one. and, as you can imagine, the o.c.r. yielded quite a few variations of it! there were literally _dozens_ of 'em, off by a letter or two (or more). and not surprisingly, there were many hawaiian names, long and short, in this text, and the o.c.r. came up with a number of variants on each! although it was a pleasant story, and the o.c.r. was relatively clean for the pages -- remarkably so, considering how bad the scans were -- those difficult names made the task of proofing a terrible nightmare, so this text took a fairly long time to make it through all the rounds. using my tool, however, all of the various scannos on those names were easy to locate, and to correct, and that task was done quickly. thinking about individual proofers, going to the trouble of correcting each of those name scannos, independently, manually, i am appalled! imagine how much of a hassle that was! what a tremendous waste! but the scenario is even worse, at least for proofers who were careful, and took their job seriously, because in order to check _whether_ the name is spelled correctly or not, you must examine _every_instance_. and that process is extremely error-prone. and fatiguing. and boring. if the name was _at_least_ in the spell-check-dictionary for the file, the spell-check on the d.p. page would show it was correctly spelled (when it was) by failing to highlight it. and flag incorrect spellings. but until it's in the dictionary, every occurrence must be scrutinized. think how much of the proofer's time and energy could've been saved if the instructions would have said, "hey, ignore the hawaiian names, we fixed them all in a global operation before you got these pages...". to subject proofers to those difficulties, when such a simpler method isn't being developed and utilized, is almost an abuse of the good-will those fine volunteers are giving you by donating their time and energy along about now, someone will say, "d.p. plans to install the capability for a proofer to add a word to the spell-check dictionary for a book." well, gee, after 6,000 books, i would _hope_ you finally got the idea! and if you did it _right_, you'd create the book-specific dictionary _automatically_, before the first page is sent to the first proofer. i don't mean to sound high-handed and morally indignant and all that, because i fully realize this is an ongoing learning process for everyone, but hey, i guess it's easy to waste volunteer time if you have lots of it. 
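The book-specific word list at the heart of that argument is simple to build; a sketch, assuming a plain-text OCR output and any ordinary word list as the base dictionary (the /usr/share/dict/words path and the file names are just examples):

    import re
    from collections import Counter

    def book_wordlist(text, dict_path="/usr/share/dict/words"):
        with open(dict_path) as f:
            known = set(w.strip().lower() for w in f)
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
        unknown = Counter(w for w in words if w.lower() not in known)
        return unknown.most_common()    # names, foreign phrases, and scannos, by frequency

    # a name seen hundreds of times is a character; a one-letter-off
    # variant seen once or twice is almost certainly a misrecognition of it
    for word, count in book_wordlist(open("laieikawai.txt").read()):
        print("%6d  %s" % (count, word))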
and it would address my concerns _greatly_ if the people-in-charge (and the loudmouths who _act_ like they are) would be _accepting_ when well-intentioned people try to advise them on their processes. but there is an active hostility over there to constructive criticism. and i find that tragic. but i digress... getting back to the matter of an _individual_ doing a book, though, my objective for that situation is to make that person _efficient_. so _this_ is the type of spell-checking that you need to do _first_, one whose essential operating philosophy is a _book-wide_basis_. and then, only after that, yes, if you are an individual doing a book, the next thing to do is a _regular_ old spell-check, the type that goes from one questionable word to the next. the difference here -- and yes, one that my tool facilitates, of course -- is that when you come to a questionable word, the _page-scan_ is shown right there. some people actually say, "you should never do a spell-check, because some words that will pop up are actually as they were in the original, and they need to be left that way. so a spell-check is a waste of time, because what you really need to do instead is a line-by-line comparison." that's poppycock. _of_course_ that situation _can_ happen. sometimes. and that's why you've got the scan there, to check the questionable word. i don't advocate a blind "correction" to each and every questionable word. and you must be able to easily add a word to the book-wide dictionary, if you find that my tool is continually popping up a word that it shouldn't. (but odds are that it would've been put in the dictionary in the prior step.) but _nonetheless_, if you want to find words the o.c.r. _misrecognized_ -- and remember, that's the objective, to isolate _probable_ errors -- the best bet is to look at words that aren't in the spell-check dictionary. all right, so that takes care of spell-check. a final set of checks is then done that looks for anomalous situations; some of these involve punctuation, infrequent juxtapositions, and so on. there are some words that pass spell-check that you still want to view -- they are called "stealth scannos" over at distributed proofreaders -- and they are one of the things that are checked in this final set. and at that point, you're done with the text-cleanup. congratulations. all in all, as well as i can tell from the testing that i've done so far, you can expect the tool will present between 1% and 5% of the lines in the text-file to you for one kind of close examination or another, and perhaps 75% of those will require a "fix" of some kind or another, assuming that you got relatively clean o.c.r. results in the first place. that's a lot better than looking at 100% of the lines to "proof" them. and that, my friends, is how you can do a whole book in a few hours. unless you put aside that heavy markup earlier. if so, it's time to do it. once again, you will page through the book, text and scan side by side, doing whatever editing needs to be done so the text is formatted right. without knowing what kind of formatting you'll need to do, it's hard to tell you how you'll go about doing it. so you'll have to wait until you can get some hands-on experience with the tool to see exactly how it'll work. but it definitely will not be anything like the pseudo-markup over at d.p. -- where, for example, /* and */ are used to bracket poetry and stuff -- and it will most certainly not be any form of x.m.l. or h.t.m.l. markup . it _will_ be z.m.l. 
-- invisible markup that mimics the p-book page. and as my tool gets more and more advanced, it will actually _display_ the text just exactly as it will be shown by the z.m.l. viewer-program. and sooner or later, the two apps will morph into one. (bet on sooner.) how complex can formatting get using z.m.l.? we'll have to see... ;+) so now that you've gone through all the post-o.c.r. cleanup my tool does, and the pages are nicely formatted so they resemble the original p-book, what next? well, it's probably the case now that your text is _already_ clean enough to meet or exceed our standard of 1 error every 10 pages. but i assume that if you're doing this book as an individual, it's because _you_actually_have_an_honest_desire_to_read_or_re-read_this_book._ because _that_ is really the absolute _best_ reason to digitize a p-book. so read it! read it in my tool, which allows you to display the image of the page right alongside the o.c.r. text for that page. keep in mind that you are reading for the express purpose of catching any errors in the text, so read carefully. at the same time, though, read for your enjoyment too! it's only by being engrossed in the story that you'll catch some errors, such as a word or a line inadvertently dropped. so become engrossed! if you find an error, first _log_it_! keep records, to improve the tool. _then_ use your word-processor to search the text for _similar_ errors. if that search yields other instances, see what you can learn from them, and expand your search based on anything you can generalize about them. some errors are flukes -- a coffee-stain on the page, or what have you. but others can be recurrent, and if you can pin down a recurrent error, you will become much more efficient in your efforts to clean up a text. finally, i will mention again that _text-to-speech_ can be _amazing_ in helping you to locate errors in a text you might never have _seen_. my tool will do text-to-speech; it'll even pronounce the punctuation, if you select that option, so you can verify that in your text as well. so i highly recommend that -- rather than reading the text to check it for that final "proof" -- you _listen_ to it instead, via text-to-speech. this has the added benefit that you can do it away from your computer. a lot of people enjoy putting a book onto a walkman, or even an ipod, and listening to it in the car, or at the exercise club, or out jogging. that's fine. (just be conscientious about _remembering_ any errors!) once you have done this final check, your "proofing" job is all finished. say what? does this mean i don't advocate a line-by-line comparison? isn't that what most people, like d.p., consider to _be_ "the proofing"? well, let me put it this way: if you _want_ to do that, by all means, do! do i think it's absolutely necessary? well, in most cases, absolutely not! doesn't a failure to do that mean that you might release a text that has some small errors in it? well, yes, it certainly does, but that is exactly why i build the "continuous proofreading" step into my overall processes. no matter how good a job you might do, certainty requires more eyeballs. so if you're really feeling insecure, have other people read your file too. better yet, have someone else process the book completely independently, and compare their final file to yours. that should catch _every_ error.
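That "process it independently and compare" check at the end is a few lines of standard-library code; a sketch, with illustrative file names:

    import difflib

    def compare_versions(path_a, path_b):
        a = open(path_a).read().split()
        b = open(path_b).read().split()
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
            if tag != "equal":      # print only the words the two transcriptions disagree on
                print("%-7s  yours: %r   theirs: %r" % (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))

    compare_versions("antonia_mine.txt", "antonia_theirs.txt")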
but if an error hides through all of the tools, and withstands a reading by an engrossed human and/or wasn't noticeable during text-to-speech, then that error is insignificant enough that i'm not gonna worry about it. i think it _should_ be corrected, and (due to "continuous proofreading") that it eventually _will_be_ corrected. but i ain't gonna worry about it. and considering the care i put into listserve posts, it's obvious i'm anal. there are 6,272 words in 707 lines in this message. find the typo in it. i circle the mistakes in everything i read, for the sheer fun of doing it. so if i can live with that error, hey, you can probably live with it too... at the point of insignificant errors, our attention is much better spent with a focus on digitizing additional books. i'll repeat, so it sinks in, that if someone _wants_ to do line-by-line comparison, that's _great_. but if we can get texts that are far-and-away error-free without it, then _i_ have far better ways to spend my time, thank you very much. and don't try to make that out that i don't care about finding errors, or that i'm talking about "something different" than what you mean, and that's the only reason i say it can be done in just one evening. because my processes will give just as accurate results as yours. and i'll be happy to prove it by finding the errors in _your_ e-texts. anyway, now you're done _proofing_, but you're not _completely_ done. because there's just one more step before you can send your e-text out. up until now, you might have had the text from each page in its own file. (or maybe you had it all in one file, since my tool can work either way.) but if you had them in separate files, they'll now need to be combined. we also want to get rid of the headers and footers and make it all nice. these are things my tool does for you -- mostly automatically -- but there are a few that do require some input from you, and some others you have to monitor to make sure they are done correctly. one example would be footnotes, which are moved to be end-notes. another example is to make sure all headings are at the right level. and when the end-line hyphenation is removed, you might be asked to make decisions for the tool when it seeks your guidance on that job. but for the most part, the tool will step you through all these tasks. it assumes that you're not an expert at doing this, and it helps you. there isn't that much more for me to explain about this final step, other than to mention that you _might_ want to execute this step before you read through the book or listen to it via text-to-speech. once you've concluded these steps, your file is a bona fide e-book. congratulations! you've moved a book into the realm of cyberspace! you can load your e-text into my z.m.l. viewer-program, and boom!, you'll see that what you created is a high-powered electronic-book! the headings are big and bold! your table-of-contents is hot-linked! words that were italicized in the p-book, which my tool marked with underscores like _this_, are again shown in all their italicized glory! illustrations are displayed on the appropriate page, automatically, and all you did was make sure their file-name was nearby that text. after this step, future versions of my tool might perform conversions of the e-text to other formats, like .html and .pdf and .rtf, if you want. plans in that regard are still fairly tentative, and i might decide that i will leave that matter to the end-reader using my viewer-program. 
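One concrete piece of that final assembly step is the end-line de-hyphenation decision mentioned above; a sketch of the usual heuristic (rejoin the word if the joined form already occurs elsewhere in the book, otherwise ask), with the interactive prompt standing in for the tool asking for guidance:

    import re

    def dehyphenate(text):
        vocab = set(re.findall(r"[a-z]+", text.lower()))
        def join(m):
            left, right = m.group(1), m.group(2)
            if (left + right).lower() in vocab:      # "to-day" vs "today": let the book decide
                return left + right + "\n"
            answer = input("keep the hyphen in %s-%s? [y/N] " % (left, right))
            return (left + "-" + right if answer.lower().startswith("y") else left + right) + "\n"
        return re.sub(r"(\w+)-\n(\w+)", join, text)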
your time might be better allocated by proceeding on to the next book. after all, it was fun to do it, wasn't it? and it only took one evening! > The real problem is my day job is using up most of my available > concentration, so I don't feel up to spending too much time proofing. well, yeah, there's no question that this job does take concentration. there's really no way around that. i will say, however, that my tool helps to _conserve_ your concentration by helping you to _focus_ on the things that require your attention, and not the things that don't. and that's really the big secret in making people more efficient here. indeed, that's what enables you to do an average book in just one evening. anyway, i have exposed enough flaws and gored enough sacred cows in this post that i can feel the vilification efforts building already. like i said, unless i am prodded, this is my last post in this thread. and except for a few final reports on the other threads, i'm all done. if those vilification efforts break out, though, and i am challenged, i _will_ remain here to defend myself, as i stand behind this post... otherwise, i'll be out of here until one of these tools is released, either from me or from one of the other people working on them, or until someone comes on here trying to tell you this job is hard. it ain't, folks. it's easy. and people have been flying for decades... the choice is up to you, people... -bowerbird From tb at baechler.net Thu Mar 10 23:50:50 2005 From: tb at baechler.net (Tony Baechler) Date: Thu Mar 10 23:49:20 2005 Subject: [gutvol-d] No part 2 of newsletter Message-ID: <5.2.0.9.0.20050310234714.01f6ace0@baechler.net> Hi. I'm sure I'm not the only one to notice this, but neither George, Greg or Michael commented, so I'm asking. What happened to part 2 of the newsletter? I got part 1 as always. George said that he would no longer be editing so part 2 would now be automated, but I think something must have happened because I never got anything. I did not fully read part 1 but I think based on length it is too short to contain a full list of new books. Any thoughts? Any idea when it will be sent out? No big rush, but I'm curious to see the apparently new, automated format if there is one. From JBuck814366460 at aol.com Fri Mar 11 00:18:49 2005 From: JBuck814366460 at aol.com (Jared Buck) Date: Fri Mar 11 00:19:20 2005 Subject: [gutvol-d] No part 2 of newsletter In-Reply-To: <5.2.0.9.0.20050310234714.01f6ace0@baechler.net> References: <5.2.0.9.0.20050310234714.01f6ace0@baechler.net> Message-ID: <1110529129.22730.1.camel@lsanca1-ar51-4-42-023-178.lsanca1.dsl-verizon.net> On Thu, 2005-03-10 at 23:50 -0800, Tony Baechler wrote: > Hi. I'm sure I'm not the only one to notice this, but neither George, Greg > or Michael commented, so I'm asking. What happened to part 2 of the > newsletter? I got part 1 as always. George said that he would no longer > be editing so part 2 would now be automated, but I think something must > have happened because I never got anything. I did not fully read part 1 > but I think based on length it is too short to contain a full list of new > books. Any thoughts? Any idea when it will be sent out? No big rush, but > I'm curious to see the apparently new, automated format if there is one. > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d Don't worry, Tony, sometimes it takes a week or so to switch to automated emails from emails that are hand-edited. 
That's been my experience with newsletters I subscribe to that converted to automation from hand-done newsletters. If it doesn't come, let me know, and I'll talk to Michael. Jared From jeroen.mailinglist at bohol.ph Sat Mar 12 08:17:08 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sat Mar 12 08:16:07 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <422F782C.20600@blueyonder.co.uk> References: <1b9.ef334b5.2f609e1d@aol.com> <422F782C.20600@blueyonder.co.uk> Message-ID: <42331604.6070002@bohol.ph> Hi Miranda, May I also claim this wonderful resource, to finish all my obscure Philippine grammars. The last one took 10 months to go through DP, so I am somewhat discouraged to put up more of these. They are absolutely very scarce and very important works, and helpful in reviving interest in those languages, which, although widely spoken, until today lack any official status. I also have some great dictionaries, and while we are at it, I still wish to convert my great Hiligaynon dictionary (over 1000 pages in two columns, small type print) to a nice ebook. Scans are ready for shipping. Loads of accents, single letters in italics that are significant, and so on. I think that resource will also have the time to deal with my wonderful Sanskrit dictionary, written by Monier-Williams, and with 1600 A3 pages in tiny print (three columns, Devanagari and Greek script as a bonus), it should just be a breeze for this powerbird. When they are through, I have some great census books as well... thousands of pages of 6 point letter tables, and we cannot tolerate a single mistake. O yes, he can simply download all stuff from http://www.hti.umich.edu/cgi/t/text/text-idx?c=philamer;cc=philamer;tpl=home.tpl Jeroen Hellingman Miranda van de Heijning wrote: > hi bowerbird, > > This sounds very exciting! I have a book which I want to put online, a > grammar in three languages with loads of accents etc. It is very > difficult and I expect it will take a long time to get through DP, > which will be a shame as it is a very important text. I am encouraged > to hear you can make this into an e-text in one evening! The scans are > done and if you like I will mail you a copy. I'd like to have the > proofed book back before the weekend, if that's not too much trouble. > > Thanks so much! > > Miranda van de Heijning > From traverso at dm.unipi.it Sat Mar 12 09:31:50 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Sat Mar 12 09:31:47 2005 Subject: [gutvol-d] Scanning/OCR tips Message-ID: <200503121731.j2CHVoX03502@posso.dm.unipi.it> In the margin of the BB-Jon discussion, I would like to issue a warning from my experience with FineReader OCR: it is not true that higher resolution scans always provide better OCR. Often they cause minute imperfections of the original print to be recognized as letters, punctuation marks or diacritics. Sometimes it pays to reduce the resolution of the scans to 300DPI and use the reduced images; FineReader seems to expect 300DPI scans. Higher resolution is only better with very small type. I think that this is a bug of FineReader (should not recognize as letters etc. image details that are much smaller than the other characters, or incorrectly placed) but this is something on which we don't have control, except by pre-processing images. Often the higher resolution scans have different errors from reduced resolution scans; procedures to compare the OCR at different resolutions might lead to better global recognition.
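Carlo's resolution tip is easy to automate ahead of the OCR run; a sketch using Pillow, assuming the source resolution is known (600 dpi here) and with illustrative file names:

    from PIL import Image

    def reduce_to_300dpi(path, source_dpi=600):
        img = Image.open(path)
        factor = 300.0 / source_dpi
        small = img.resize((int(img.width * factor), int(img.height * factor)), Image.LANCZOS)
        out_path = path.replace(".png", "_300dpi.png")
        small.save(out_path, dpi=(300, 300))    # hand this copy to the OCR engine, keep the original
        return out_path

    reduce_to_300dpi("page_042_600dpi.png")

Keeping both versions also makes it easy to try his other suggestion, diffing the OCR output of the two resolutions to see where they disagree.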
Globally, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort of developing free OCR software of good quality, in which the knowledge of the source can be used in the recognition process, could be well spent for the needs of PG. Another domain in which a considerable progress could be attained is the spell-checking software, that is much more tuned to typing than to OCR, especially of old texts. It is common experience that the most common OCR errors are down in the list of suggestions. This is however a domain in which free software exists, and the problem is one of a metric tuned for OCR in the corrections space. Carlo From gbnewby at pglaf.org Sun Mar 13 18:48:57 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Mar 13 18:48:59 2005 Subject: [gutvol-d] FWD: converting text into audio for reader format In-Reply-To: <42013639.5000100@sheridanc.on.ca> References: <420122D6.30201@sheridanc.on.ca> <20050202193514.GB29652@pglaf.org> <42013639.5000100@sheridanc.on.ca> Message-ID: <20050314024857.GB12812@pglaf.org> Please see the below - the question is, what eBook readers with text to speech capabilities can input a .txt file (versus .htm etc.) Please copy Donna Woodstock or respond to her directly with any suggestions. Thanks! On Wed, Feb 02, 2005 at 03:21:13PM -0500, Donna Woodstock wrote: > Hi Greg, > > If it's no trouble to forward to the list that would be appreciated. I > tried searching for a html format for Frankenstein...it shows it is > available but it still downloads as a .txt file. > > Cheers! > > Greg Newby wrote: > >On Wed, Feb 02, 2005 at 01:58:30PM -0500, Donna Woodstock wrote: > > > >>I am wondering if it is possible when an ebook is downloaded to be able > >>to open it up in a reader that has audio capabilities. I've tried > >>Microsoft reader but I cannot get it to read the text format. If you > >>can recommend a reader that can do this I would greatly appreciate it. > > > > > >Hi, Donna. We've had people using products like ViaVoice and > >other text-to-speech programs. I don't know anything about Microsoft > >Reader's audio capabilities - it might be it's not capable > >of processing .txt files. Perhaps you could try one of our > >titles in HTML (see http://gutenberg.org/find)? Or, it might > >be necessary to transform a .txt to the proprietary Reader > >format. I know people can do this, but we don't have any > >information about the tools. If you're still stuck, I can > >forward your note to the gutvol-d list (http://lists.pglaf.org) > >to see whether people can provide some more specific guidance. > > > >Sorry this isn't too helpful... > > -- Greg Newby > From shimmin at uiuc.edu Mon Mar 14 06:18:05 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Mon Mar 14 06:18:11 2005 Subject: [gutvol-d] Scanning/OCR tips In-Reply-To: <200503121731.j2CHVoX03502@posso.dm.unipi.it> References: <200503121731.j2CHVoX03502@posso.dm.unipi.it> Message-ID: <42359D1D.6050205@uiuc.edu> Carlo Traverso wrote: > In margin of the BB-Jon discussion, I would like to issue a warning > from my experience with FineReader OCR: it is not true that higher > resolution scans always provide better OCR. Often they cause minute > imperfections of the original print to be recognized as letters, > punctuations or diacritics. Sometimes it pays to reduce the resolution > of the scans to 300DPI and use the reduced images; FineReader seems to > expect 300DPI scans. Higher resolution is only better with very small > type. Agreed. 
Higher resolution only improves recognition of small type. Once the resolution is high enough that the thinnest parts of the letters are reliably one pixel thick, if the software misrecognizes the character, it will misrecognize a larger character of the same shape. At 300 dpi, "normal" sized roman fonts seem to usually have thick stems 3-4 pixels wide, thin stems 1 pixel wide, and serifs also 1 pixel wide. Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black / white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to make use of a B/W version. My usual scanning practice for DP is to do 300 dpi B/W scans for text, and 300 or 600 dpi greyscale scans for illustrations. > Globally, and not unexpectedly, the OCR seems well tuned to recent > print in contemporary language. My impression is that an effort of > developing free OCR software of good quality, in which the knowledge > of the source can be used in the recognition process, could be well > spent for the needs of PG. But it can be trained to recognize other fonts with some success. I have trained Abbyy 5 Pro on blackletter with not stellar, but not exactly embarrassing, results. There is an (unfortunately out of most people's price range) version of Abbyy 7 that is designed with oldstyle fonts in mind. If this software is the Abbyy 7 engine, specially trained on old text, it suggests that we might do well to set up a place to share our pre-trained user patterns for old printing styles. -- RS From vze3rknp at verizon.net Mon Mar 14 07:12:27 2005 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Mon Mar 14 07:12:30 2005 Subject: [gutvol-d] Scanning/OCR tips In-Reply-To: <42359D1D.6050205@uiuc.edu> References: <200503121731.j2CHVoX03502@posso.dm.unipi.it> <42359D1D.6050205@uiuc.edu> Message-ID: <4235A9DB.5010001@verizon.net> Robert Shimmin wrote: > Also, greyscale images do not appear to improve OCR with Abbyy either. > Although I'm not privy to their algorithms, certain aspects of the > user interface suggest to me that the software only operates on black > / white values, and even if you take greyscale scans, the software > thresholds them for the purposes of recognition. You have the > greyscales to save for whatever other purposes you wish to put them > to, but the software itself seems to make use of a B/W version. > I have found, using Finereader 6.0 Corporate, that for certain kinds of material I do get substantially better recognition results from greyscale. The best examples are some old medical journals from the 1820's that are severely foxed. Finereader is able to recognize most of the text on these in greyscale, where B&W scanning produced images that even humans can't read. In sizing these down for proofing at DP, I found I could not go to B&W but had to go to 2-bit greyscale, and even then there were a few pages that needed the full 8-bit greyscale to be legible. I always scan at 600 dpi B&W with the sheet-fed high-speed scanner because that slows it down enough for me to hand feed it (which is often necessary with the old paper). It doesn't seem to change the recognition quality much either way.
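Since both of these posts come back to thresholding, here is a bare-bones sketch of the black/white conversion an OCR engine effectively performs on greyscale input; the per-page cutoff heuristic is illustrative and far cruder than what FineReader actually does:

    import numpy as np
    from PIL import Image

    def binarize(path, cutoff=None):
        grey = np.asarray(Image.open(path).convert("L"))
        if cutoff is None:
            cutoff = np.median(grey) - 40            # a little below the paper tone of this page
        bw = np.where(grey > cutoff, 255, 0).astype(np.uint8)
        return Image.fromarray(bw, mode="L")

    binarize("foxed_journal_p017.png").save("foxed_journal_p017_bw.png")

Choosing the cutoff per page from the greyscale data is essentially why the foxed journals fare better scanned in greyscale than in straight B/W.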
JulietS From traverso at dm.unipi.it Mon Mar 14 09:58:14 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Mon Mar 14 09:57:52 2005 Subject: [gutvol-d] Scanning/OCR tips In-Reply-To: <4235A9DB.5010001@verizon.net> (message from Juliet Sutherland on Mon, 14 Mar 2005 10:12:27 -0500) References: <200503121731.j2CHVoX03502@posso.dm.unipi.it> <42359D1D.6050205@uiuc.edu> <4235A9DB.5010001@verizon.net> Message-ID: <200503141758.j2EHwEN08451@posso.dm.unipi.it> I confirm that FineReader stores the images internally as monochrome. Probably grayscale works better because of an optimized thresholding algorithm; but in general the quality of the B/W scans very much depend on the quality of the scanning software: my B/W scans with the Plustek OpticBook are very much better that the scans of a (low-end) Epson. Carlo From marcello at perathoner.de Mon Mar 14 11:06:05 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 14 11:05:47 2005 Subject: [gutvol-d] Another PG 'clone' Message-ID: <4235E09D.70102@perathoner.de> How do we like this one? http://www.gutenberg.com -- Marcello Perathoner webmaster@gutenberg.org From miranda_vandeheijning at blueyonder.co.uk Mon Mar 14 11:25:53 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Mon Mar 14 11:26:04 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <4235E09D.70102@perathoner.de> References: <4235E09D.70102@perathoner.de> Message-ID: <4235E541.6060406@blueyonder.co.uk> This is what it says: "Project Gutenberg is a wonderful project that has been going on for several decades, making public domain books available to people for free. We support the work of Project Gutenberg. We also believe, however, that as more and more value is added to books, even public domain books, that people will pay reasonable prices for these new information forms. So, whether a book is $1 or free is not a big issue to us at Gutenberg.com, but it is an issue if that additional $1 allows for newer and better services to be offered. Project Gutenberg and Gutenberg.com are not affiliated and if you look at the About page on Gutenberg.com, you will see that ebooks, and within that context free ebooks, will be a portion of this site. And we plan to have many places where free ebooks can be found, including Project Gutenberg. We hope you will join us here at Gutenberg.com as your home for the next phase of books, publishing, ebooks, and so on and so forth. This list below is from Project Gutenberg's site, and we are putting this up here to see how people like it." [followed by our PG Top 100 with links to the books] It's obviously making money on PGs reputation but very clear about the fact they are not directly affiliated with PG. It is free PR, but would this still be considered abuse of the PG trademark? Marcello Perathoner wrote: > How do we like this one? > > http://www.gutenberg.com > > From joshua at hutchinson.net Mon Mar 14 11:29:08 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon Mar 14 11:29:16 2005 Subject: [gutvol-d] Another PG 'clone' Message-ID: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> The site makes it very clear on the front page that they are not affliated with PG, so I've got no problem with it. I'd prefer a different domain name, but obviously we don't own the domain name, so it was free for anyone to take, I suppose. At least this one is actually something to do with eBooks and not a porn site or something... 
:) Josh ----- Original Message ----- From: "Marcello Perathoner" To: "Project Gutenberg volunteer discussion" Subject: [gutvol-d] Another PG 'clone' Date: Mon, 14 Mar 2005 20:06:05 +0100 > > How do we like this one? > > http://www.gutenberg.com > > > -- Marcello Perathoner > webmaster@gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From marcello at perathoner.de Mon Mar 14 11:33:53 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 14 11:33:32 2005 Subject: [gutvol-d] Announce: web site directory switch Message-ID: <4235E721.8@perathoner.de> We are switching directories on the web site as announced on 02/23. If you have changed any content since then you should test it now. The old directories are still there if you forgot to copy things over. -- Marcello Perathoner webmaster@gutenberg.org From hacker at gnu-designs.com Mon Mar 14 11:34:50 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Mon Mar 14 11:36:20 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: > The site makes it very clear on the front page that they are not > affliated with PG, so I've got no problem with it. I'd prefer a > different domain name, but obviously we don't own the domain name, > so it was free for anyone to take, I suppose. At least this one is > actually something to do with eBooks and not a porn site or > something... :) But aren't they using the Gutenberg name to drive banner ad revenue to them, instead of to the "real" Gutenberg sites and pages? With a bit of creative SEO, they could end up with a PR8 or higher, knocking you off the SERPS for Google and Yahoo hits. They may say they're not affiliated with Project Gutenberg, but if you look at their meta keywords, they certainly are making that direct association, because they mention 'project gutenberg' a few times. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From ian at babcockbrown.com Mon Mar 14 11:40:34 2005 From: ian at babcockbrown.com (Ian Stoba) Date: Mon Mar 14 11:40:46 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: Here's the information from whois about the domain registrant. Does anyone know Chris Andrews? According to his web site (http://www.chrisandrews.com), he was an early participant in multimedia CD-ROMs. He has a page on his site about digital publishing, but there seems to be very little information there. Registrant: Chris Andrews (GUTENBERG-DOM) po box 1330 los altos, CA 84024 US Domain Name: GUTENBERG.COM Administrative Contact: Andrews, Chris (30036170I) chris@chrisandrews.com Chris Andrews PO Box 3550 Los Altos, CA 94024 US 650-599-3747 fax: 650-599-3747 Technical Contact: Network Solutions, LLC. (HOST-ORG) customerservice@networksolutions.com 13200 Woodland Park Drive Herndon, VA 20171-3025 US 1-888-642-9675 fax: 571-434-4620 Record expires on 02-Mar-2012. Record created on 01-Mar-1995. Database last updated on 14-Mar-2005 14:36:53 EST. 
Domain servers in listed order: NS41.WORLDNIC.COM 216.168.228.23 NS42.WORLDNIC.COM 216.168.225.172 On Mar 14, 2005, at 11:29 AM, Joshua Hutchinson wrote: > [snip] From jhowse at nf.sympatico.ca Mon Mar 14 16:44:54 2005 From: jhowse at nf.sympatico.ca (JHowse) Date: Mon Mar 14 12:13:36 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: <5.1.0.14.0.20050314164355.00a62dc0@pop1.nf.sympatico.ca> At 02:34 PM 14/03/05 -0500, you wrote: > But aren't they using the Gutenberg name to drive banner ad >revenue to them, instead of to the "real" Gutenberg sites and pages? >With a bit of creative SEO, they could end up with a PR8 or higher, >knocking you off the SERPS for Google and Yahoo hits. > > They may say they're not affiliated with Project Gutenberg, >but if you look at their meta keywords, they certainly are making that >direct association, because they mention 'project gutenberg' a few >times. and with their top ebooks list, they are actually linking to Project Gutenberg. JH ================================================================================ "I'm not likely to write a great novel or compose a song or save a baby from a burning building...but I can help make sure that there is an electronic library of free knowledge available for future people to access."--jhutch. Preserving History One Page at a Time!!
Celebrating our 6000th book posted to Project Gutenberg Join Project Gutenberg's Distributed Proofreaders http://www.pgdp.net/c/ ================================================================================ From jeroen.mailinglist at bohol.ph Mon Mar 14 15:51:03 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Mon Mar 14 15:50:35 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <4235E09D.70102@perathoner.de> References: <4235E09D.70102@perathoner.de> Message-ID: <42362367.10202@bohol.ph> Hi All, I've registered www.gutenberg.ph today, but that will be the home of a Philippines-oriented Project Gutenberg, and I have asked Michael beforehand. Initially, it will contain many pointers back to the original PG, but we are planning to add more materials that cannot be cleared in the US (the Philippines is a life+50 country). Anyway, I'm mentioning it now, before people discover it and start asking questions. Jeroen Hellingman. Marcello Perathoner wrote: > How do we like this one? > > http://www.gutenberg.com > > From tb at baechler.net Tue Mar 15 00:21:35 2005 From: tb at baechler.net (Tony Baechler) Date: Tue Mar 15 00:19:58 2005 Subject: [gutvol-d] FWD: converting text into audio for reader format In-Reply-To: <20050314024857.GB12812@pglaf.org> References: <42013639.5000100@sheridanc.on.ca> <420122D6.30201@sheridanc.on.ca> <20050202193514.GB29652@pglaf.org> <42013639.5000100@sheridanc.on.ca> Message-ID: <5.2.0.9.0.20050315001811.03978ec0@baechler.net> Hi. Here is a partial, although not necessarily good or recommended, solution. You can get various older versions of the DEC-Talk software demo. They will work with text files or content pasted from the clipboard. Unfortunately the ones I know of require Windows. Also, there is a size limit on how much text it will process at once, but I don't know what it is. Another and probably better option is to get a free Linux text-to-speech system such as FreeTTS or Festival and use that. I know that FreeTTS can be downloaded at freetts.sf.net but I don't have links for anything else at the moment. Contact me if you need a link for the DEC-Talk demo and I'll find it. At 06:48 PM 3/13/2005 -0800, you wrote: >Please see the below - the question is, what >eBook readers with text to speech capabilities >can input a .txt file (versus .htm etc.) > >Please copy Donna Woodstock >or respond to her directly with any suggestions. Thanks! From schultzk at uni-trier.de Tue Mar 15 00:43:43 2005 From: schultzk at uni-trier.de (Keith J.Schultz) Date: Tue Mar 15 00:49:24 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: Hi Everybody, They are definitely using PG for what they can get. Their disclaimer is a good thing, but they are VIOLATING netiquette: to download their so-called free books they link to the PG site inside a frame, with their navbar at the bottom!! This should not be done!! They are not affiliated, and I doubt they have permission to use the PG site in such a manner. Furthermore, they are using PG resources to make money, that is: linking directly to the PG site and using PG disk space for their service. They should either fork over some money or link in a new window!! That is the way to do it!!! Just my 0 Euro cents worth: 2 Euro cents value added tax deducted Keith.
Am 14.03.2005 um 20:29 schrieb Joshua Hutchinson: > The site makes it very clear on the front page that they are not > affliated with PG, so I've got no problem with it. I'd prefer a > different domain name, but obviously we don't own the domain name, so > it was free for anyone to take, I suppose. At least this one is > actually something to do with eBooks and not a porn site or > something... :) > > Josh > > ----- Original Message ----- > From: "Marcello Perathoner" > To: "Project Gutenberg volunteer discussion" > Subject: [gutvol-d] Another PG 'clone' > Date: Mon, 14 Mar 2005 20:06:05 +0100 > >> >> How do we like this one? >> >> http://www.gutenberg.com >> >> >> -- Marcello Perathoner >> webmaster@gutenberg.org >> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d@lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From miranda_vandeheijning at blueyonder.co.uk Tue Mar 15 01:46:41 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Tue Mar 15 01:47:13 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: <4236AF01.2030607@blueyonder.co.uk> I think we should be greatly suspicious of anyone who refers to George W. Bush as the leader of the free world, but that's an entirely different matter. All in all it looks like a bit of a crap website, from a creator who is too lazy to make his own eBooks or to do his own marketing. Keith J.Schultz wrote: > Hi Everybody, > > They are definately using PG for what they can get. Their > disclaimer is a good thing, > but they are VIOLATING Nettique: > To download there so-called free books they link up to the PG > site in a Frame, where > their navbar is at the bottom!! This should not be done !! As > they are not affilliated > not I doubt have permission to the PG site in such a manner. > Furthermore they are using > the PG resources to make money, that is: linking directly to > the PG site and using the PG Disk > space for their service. They should either fork over some > money or link to a new window!! > That is the way to do it !!! > > > Just my 0 Euro cents worth: 2 Euro cents added value tax deducted > > > Keith. > > Am 14.03.2005 um 20:29 schrieb Joshua Hutchinson: > >> The site makes it very clear on the front page that they are not >> affliated with PG, so I've got no problem with it. I'd prefer a >> different domain name, but obviously we don't own the domain name, so >> it was free for anyone to take, I suppose. At least this one is >> actually something to do with eBooks and not a porn site or >> something... :) >> >> Josh >> >> ----- Original Message ----- >> From: "Marcello Perathoner" >> To: "Project Gutenberg volunteer discussion" >> Subject: [gutvol-d] Another PG 'clone' >> Date: Mon, 14 Mar 2005 20:06:05 +0100 >> >>> >>> How do we like this one? 
>>> >>> http://www.gutenberg.com >>> >>> >>> -- Marcello Perathoner >>> webmaster@gutenberg.org >>> >>> _______________________________________________ >>> gutvol-d mailing list >>> gutvol-d@lists.pglaf.org >>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >> >> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d@lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d >> > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > From kouhia at nic.funet.fi Thu Mar 17 12:19:52 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Thu Mar 17 12:20:02 2005 Subject: [gutvol-d] Scanner vs. digital camera Message-ID: [ Continuing the thread under subject "wiki...". ] Scanners have pros but because they are dangerous to use, I would prefer digital camera. At least I got fed up to lifting up the book for changing pages. A 600 pages book is quite weighty. Another annoyance was that the scanner collected dust (both from book and from room). Scanner was also slow. A couple of days ago I borrowed a tourist range digital camera. I could digitize 8 pages per minute. It was as fast and easy as I had predicted. The digitization speed was limited only by image transfer technology, not by speed of my fingers. "Easy" is the keyword here. I have invented a couple of camera features which would help in book digitization. Anyone would know how to contact camera manufacturers? Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From hacker at gnu-designs.com Thu Mar 17 12:55:36 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Thu Mar 17 12:57:25 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: References: Message-ID: > I have invented a couple of camera features which would help in book > digitization. Anyone would know how to contact camera manufacturers? You've "invented" camera features? What hardware did you use when building these features into your camera? What camera model did you use as a base unit? David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From hart at pglaf.org Fri Mar 18 10:53:24 2005 From: hart at pglaf.org (Michael Hart) Date: Fri Mar 18 10:53:26 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: <20050301192815.88274.qmail@web52310.mail.yahoo.com> References: <20050301192815.88274.qmail@web52310.mail.yahoo.com> Message-ID: On Tue, 1 Mar 2005, maitri venkat-ramani wrote: > Technological progress reaching end users in developing countries makes > me so happy! They bear a lot of the brunt for our wellbeing. Is there > any way we can get PG books to this school and others like it? Do we > have any African contacts? I emailed my Africa contact from the UN, no reply. > > Thanks, > Maitri > > ============================================================ > > Kenyan school turns to handhelds > By Julian Siddle > BBC Go Digital > > At the Mbita Point primary school in western Kenya students click away > at a handheld computer with a stylus. > They are doing exercises in their school textbooks which have been > digitised. > > It is a pilot project run by EduVision, which is looking at ways to use > low cost computer systems to get up-to-date information to students who > are currently stuck with ancient textbooks. 
> [snip]
From joshua at hutchinson.net Fri Mar 18 12:56:06 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Fri Mar 18 12:56:15 2005 Subject: [gutvol-d] Online TEI page and the recent server move Message-ID: <20050318205606.EDA2A2FAB9@ws6-3.us4.outblaze.com> I know Marcello has been talking about some server moves that have been taking place recently. Starting this week, the online TEI conversion tools at: http://www.gutenberg.org/tei/services/tei-online have quit working. I thought at first it might be my local firewall or an overloaded server (it does cause problems once in a while). However, it has been down all week for me, so I'm starting to think it may be due to the server move. My question is: Is this the cause, and is this something that is going to get fixed in time? Should I just be patient? Also, as a sidenote question, since the PG server can get overwhelmed sometimes, would this be better served over on the DP server? (It seems to fit better with that workload anyway, at least in my opinion.) Josh From marcello at perathoner.de Fri Mar 18 12:10:44 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Mar 18 13:18:31 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: References: <20050301192815.88274.qmail@web52310.mail.yahoo.com> Message-ID: <423B35C4.90607@perathoner.de> Michael Hart wrote: >> At the Mbita Point primary school in western Kenya students click away >> at a handheld computer with a stylus. >> They are doing exercises in their school textbooks which have been >> digitised. >> >> It is a pilot project run by EduVision, which is looking at ways to use >> low cost computer systems to get up-to-date information to students who >> are currently stuck with ancient textbooks. >> >> Matthew Herren from EduVision told the BBC programme Go Digital how the >> non-governmental organisation uses a combination of satellite radio and >> handheld computers called E-slates. Do we want African nations to get into an educational dependency on satellite links and such high-tech stuff? Maybe textbooks are just right for these students. A textbook will not need a new battery pack in a couple of years. It will not stop working if the school can't get new battery packs because the publicity value of the project has died away. Reminds me very much of the shipping of wheat into nations that are used to eating maize. Ship free wheat, thus ruin the local industry that produces cheap maize, then ship pricey wheat. >> "Why in this age when most people do most research using the internet >> are students still using textbooks? The fact that we are doing this in >> a rural developing country is very exciting - as they need it most." And -- as a side effect -- maximizes the publicity Return On Investment. -- Marcello Perathoner webmaster@gutenberg.org From hart at pglaf.org Sat Mar 19 12:51:45 2005 From: hart at pglaf.org (Michael Hart) Date: Sat Mar 19 12:51:47 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: <423B35C4.90607@perathoner.de> References: <20050301192815.88274.qmail@web52310.mail.yahoo.com> <423B35C4.90607@perathoner.de> Message-ID: On Fri, 18 Mar 2005, Marcello Perathoner wrote: > Michael Hart wrote: > >>> At the Mbita Point primary school in western Kenya students click away >>> at a handheld computer with a stylus.
>>> They are doing exercises in their school textbooks which have been >>> digitised. >>> >>> It is a pilot project run by EduVision, which is looking at ways to use >>> low cost computer systems to get up-to-date information to students who >>> are currently stuck with ancient textbooks. >>> >>> Matthew Herren from EduVision told the BBC programme Go Digital how the >>> non-governmental organisation uses a combination of satellite radio and >>> handheld computers called E-slates. > > Do we want African nations to get into an educational dependency on > satellite links and such high-tech stuff? Maybe textbooks are just right for > these students. A textbook will not need a new battery pack in a couple of > years. It will not stop working if the school can't get new battery packs > because the publicity value of the project has died away. Personally, I think cell phones have already made the satellites obsolete for distributing eBooks. Africa has the fastest growing cell phone base in the world. > Reminds me very much of the shipping of wheat into nations that are used to > eating maize. Ship free wheat, thus ruin the local industry that produces cheap > maize, then ship pricey wheat. Sounds like something the World Bank or International Monetary Fund would do. >>> "Why in this age when most people do most research using the internet >>> are students still using textbooks? The fact that we are doing this in >>> a rural developing country is very exciting - as they need it most." > > And -- as a side effect -- maximizes the publicity Return On Investment. As long as anyone can send their own eBooks, things should be ok, but that requires freedom of expression. . . . On the other hand, it's harder to get rid of an eBook, once published, than the paper editions. Michael From marcello at perathoner.de Sat Mar 19 17:52:08 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat Mar 19 17:51:53 2005 Subject: [gutvol-d] Online TEI page and the recent server move In-Reply-To: <20050318205606.EDA2A2FAB9@ws6-3.us4.outblaze.com> References: <20050318205606.EDA2A2FAB9@ws6-3.us4.outblaze.com> Message-ID: <423CD748.1080707@perathoner.de> Joshua Hutchinson wrote: > I know Marcello has been talking about some server moves that have > been taking place recently. > > Starting this week, the online TEI conversion tools at: > > http://www.gutenberg.org/tei/services/tei-online > > have quit working. I thought at first it might be my local firewall > or an overloaded server (it does cause problems once in a while). > However, it has been down all week for me, so I'm starting to think > it may be due to the server move. Many small things stopped working with the recent file server move, the online TEI conversion being one of them. I will try to fix them when they come to my notice. The TEI conversion should now be online again. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Tue Mar 22 13:44:33 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue Mar 22 13:44:15 2005 Subject: [gutvol-d] Slashdot on Google Print Message-ID: <424091C1.4030003@perathoner.de> Discussion about Google Print mentions PG too. http://slashdot.org/articles/05/03/21/1237243.shtml -- Marcello Perathoner webmaster@gutenberg.org From JBuck814366460 at aol.com Tue Mar 22 13:59:56 2005 From: JBuck814366460 at aol.com (Jared Buck) Date: Tue Mar 22 14:00:07 2005 Subject: [gutvol-d] Spam on PG lists?
In-Reply-To: <424091C1.4030003@perathoner.de> References: <424091C1.4030003@perathoner.de> Message-ID: <4240955B.3020103@aol.com> Hey, Is it me, or are we getting a lot of spam on a lot of the PG (and PGLAF) lists? We need to stop the spam coming, or the list is going to get overwhelmed before we know it. I've already banned receiving messages from known spammers on the list, it may help, or it may not. Jared From servalan at ar.com.au Tue Mar 22 14:07:02 2005 From: servalan at ar.com.au (Pauline) Date: Tue Mar 22 14:07:42 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <4240955B.3020103@aol.com> References: <424091C1.4030003@perathoner.de> <4240955B.3020103@aol.com> Message-ID: <42409706.3080609@ar.com.au> Jared Buck wrote: > Hey, > > Is it me, or are we getting a lot of spam on a lot of the PG (and PGLAF) > lists? We need to stop the spam coming, or the list is going to get > overwhelmed before we know it. It's not just you. I'm a little annoyed as the DP posts email address which is supposed to be used only by our volunteers to notify the site admins of posted projects is available via a google search of the PG mailing list archives (& only from there) & is getting spam. I doubt it is fully fixable now - but it would be great if the PG mailman archives can be protected from future email address harvesters. I suspect other volunteers are also receiving spam via this path. Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From hacker at gnu-designs.com Tue Mar 22 14:08:19 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Tue Mar 22 14:09:46 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <4240955B.3020103@aol.com> References: <424091C1.4030003@perathoner.de> <4240955B.3020103@aol.com> Message-ID: > Is it me, or are we getting a lot of spam on a lot of the PG (and > PGLAF) lists? We need to stop the spam coming, or the list is going > to get overwhelmed before we know it. Isn't the list open to subscribers-only? If not, I suggest moving it to that model. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From JBuck814366460 at aol.com Tue Mar 22 14:27:20 2005 From: JBuck814366460 at aol.com (JBuck814366460@aol.com) Date: Tue Mar 22 14:27:36 2005 Subject: [gutvol-d] Spam on PG lists? Message-ID: > Isn't the list open to subscribers-only? If not, I suggest > moving it to that model. I agree, if it isn't subscriber-only, it should be as soon as possible. The spam is very annoying and doesn't belong on the list. Jared -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050322/087de180/attachment.html From hacker at gnu-designs.com Tue Mar 22 14:55:18 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Tue Mar 22 14:56:47 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: Message-ID: > > Isn't the list open to subscribers-only? If not, I suggest moving > > it to that model. > I agree, if it isn't subscriber-only, it should be as soon as > possible. The spam is very annoying and doesn't belong on the list. Honestly, I haven't seen a single spam on either list since I've been a subscriber (a year?). Then again, I run dspam on my MTA, and its probably catching and quarantining them so I never even see them. David A. 
Desrosiers desrod@gnu-designs.com http://gnu-designs.com From servalan at ar.com.au Tue Mar 22 15:03:44 2005 From: servalan at ar.com.au (Pauline) Date: Tue Mar 22 15:04:24 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: Message-ID: <4240A450.6040207@ar.com.au> David A. Desrosiers wrote: > Honestly, I haven't seen a single spam on either list since > I've been a subscriber (a year?). Then again, I run dspam on my MTA, > and its probably catching and quarantining them so I never even see > them. From my quick peek - it's only the posted list (posted@pglaf.org) archive which is visible. So anyone submitting projects to PG will have a visible email address to email harvesters. The gutvol* lists are OK. I hope this helps, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From tb at baechler.net Tue Mar 22 23:42:11 2005 From: tb at baechler.net (Tony Baechler) Date: Tue Mar 22 23:40:25 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: Message-ID: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> Hello. What is dspam? How hard is it to set up? Is it similar to Spam Assassin? I'm running qmail under Linux and had an extremely hard time setting up spam filtering, so I eventually gave up. I have not heard of that antispam package before. More information would be appreciated. Thanks. To stay on topic, I have received no spam from the pglaf.org lists and I do not run a spam filter locally. From gbnewby at pglaf.org Wed Mar 23 11:22:52 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Wed Mar 23 11:22:53 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> Message-ID: <20050323192252.GA564@pglaf.org> On Tue, Mar 22, 2005 at 11:42:11PM -0800, Tony Baechler wrote: > Hello. What is dspam? How hard is it to set up? Is it similar to Spam > Assassin? I'm running qmail under Linux and had an extremely hard time > setting up spam filtering, so I eventually gave up. I have not heard of > that antispam package before. More information would be appreciated. I did a very informal comparison of dspam to Spam Assassin, and found them to be about the same. They have some different features, but basically both "learn" based on your mail patterns. dspam takes a little longer to get trained, and is tuned to have a very low portion of false positives (that is, it very seldom flags non-spam as spam). With any spam filter, though, it's important to periodically check the logs or spam folders, to see what messages were misidentified as spam. > To stay on topic, I have received no spam from the pglaf.org lists and I do > not run a spam filter locally. If people could forward spam items to me that were distributed via the lists.pglaf.org server, I can look into how they got to the list. I'll also look into obfuscating email addresses in the logs (via transforming the @ or similar techniques). This is sometimes done automatically with Pipermail (which manages our Mailman archives, I believe), but doesn't seem to be happening. Sorry about that.... I'm still looking for a volunteer to manage the mailing lists, by the way. It takes just a few minutes per day (every day). 
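A minimal sketch of the "transforming the @" idea mentioned above, assuming the archive pages are plain HTML or text files that can be rewritten in place; the regular expression and the "user at host" rewrite are illustrative choices, not Pipermail's actual mechanism:

import re

ADDRESS = re.compile(r'\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b')

def obfuscate(text):
    # Rewrite user@host as "user at host" so simple address harvesters miss it.
    return ADDRESS.sub(r'\1 at \2', text)

# Example: obfuscate("Send completed projects to posted@pglaf.org")
# returns "Send completed projects to posted at pglaf.org"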
-- Greg From traverso at dm.unipi.it Wed Mar 23 11:43:35 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Wed Mar 23 11:43:19 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> (message from Greg Newby on Wed, 23 Mar 2005 11:22:52 -0800) References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <200503231943.j2NJhZa05915@pico.dm.unipi.it> I don't filter the lists, (I apply the filters after accepting pglaf lists) and I don't receive any spam on the lists (a lot outside). Consider the possibility of forged sender address. Carlo From mattsen at arvig.net Wed Mar 23 12:35:57 2005 From: mattsen at arvig.net (Chuck MATTSEN) Date: Wed Mar 23 12:36:08 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <20050323143557.198b0fb5@localhost.localdomain> On Wed, 23 Mar 2005 11:22:52 -0800 Greg Newby typed: > On Tue, Mar 22, 2005 at 11:42:11PM -0800, Tony Baechler wrote: > > Hello. What is dspam? How hard is it to set up? Is it similar to > > Spam Assassin? I'm running qmail under Linux and had an extremely > > hard time setting up spam filtering, so I eventually gave up. I > > have not heard of that antispam package before. More information > > would be appreciated. > > I did a very informal comparison of dspam to Spam Assassin, and found > them to be about the same. They have some different features, but > basically both "learn" based on your mail patterns. dspam takes a > little longer to get trained, and is tuned to have a very low portion > of false positives (that is, it very seldom flags non-spam as spam). > With any spam filter, though, it's important to periodically check > the logs or spam folders, to see what messages were misidentified as > spam. Another alternative tool is POPFile (or any of the other Bayesian filters) ... http://popfile.sourceforge.net/ ... also free, open source, cross-platform. It has the advantage of being very fast in its processing of incoming mail (POP3 included), and it "learns" very quickly what the user considers spam and "not spam" ... actually, one could set up any number of different categories and, with time, it would learn to sort things however one wished. I get about 10,000 e- mails per months and POPFile has been running at about 99.81% accuracy for me with respect to false-positives, etc. > > To stay on topic, I have received no spam from the pglaf.org lists > > and I do not run a spam filter locally. Nor have I received any.... -- Chuck MATTSEN / mattsen at arvig dot net / Mahnomen, MN, USA Mandrakelinux release 10.2 (Cooker) for i586 kernel 2.6.10-3.mm.5mdk RLU #346519 / MT Lookup: http://eot.com/~mattsen/mtsearch.htm Random Thought/Quote for this Message: From listening comes wisdom, from speaking, repentance. From JBuck814366460 at aol.com Wed Mar 23 15:35:58 2005 From: JBuck814366460 at aol.com (Jared Buck) Date: Wed Mar 23 15:36:13 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <4241FD5E.4070500@aol.com> Hi Greg, Sure, I wouldn't mind managing the lists for a couple minutes a day. I can't promise it will be as soon as I get up (I tend to sleep more than the average person) but it will be once a day. 
I'll forward you copies of the spam I'm getting on the list as I receive them, then you can figure out how to ban the senders' IPs to keep that mail from getting on the list and interfering with perfectly good discussions. Jared From hacker at gnu-designs.com Wed Mar 23 16:08:44 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Wed Mar 23 16:10:08 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: > I did a very informal comparison of dspam to Spam Assassin, and > found them to be about the same. They are so dramatically different, I can't believe you even would suggest they're "about the same". SpamAssassin is written in Perl, and is significantly slower than dspam. SpamAssassin also relies on static rulesets, not the "quality" of the mail received. You can't do per-user filtering with SA. With dspam, if one user prefers seeing lots of HTML advertisements, they can. Another user on the same system can reject those as spam. In my case, I was using SpamAssassin for about 2 years, trained down to a threshhold of 2, with 13 RBLs in place, and my users were still getting 20-30 spams per-week. SpamAssassin's accuracy under that configuration after 2 years was about 90%. In 1 month of using dspam, we were over 98% accuracy, AND I no longer had to manage mail. The users get their own quarantine and they can manage their own mail "quality" themselves, I don't _ever_ have to get involved. > They have some different features, but basically both "learn" based > on your mail patterns. dspam takes a little longer to get trained, > and is tuned to have a very low portion of false positives (that is, > it very seldom flags non-spam as spam). You probably didn't read the docs. Did you load it with the SA corpus first? Did you train it with that corpus? It took about an hour for me to train it to a level where it was accurately catching and quarantining mail. Getting dspam configured properly is no small task, and you have to be _very_ careful about using conflicting algorithms when you configure and build it. Also, were you using TOE? TEFT? TUM? Each of these has VERY different usages and specific conditions where they work well, or horrible. > With any spam filter, though, it's important to periodically check > the logs or spam folders, to see what messages were misidentified as > spam. And with dspam, this is all handled completely seamlessly, no need to "check logs" or "spam folders" at all. Users simply forward their false positives to spam-$USER@domain.com, and it gets marked as spam. When more emails come in that match similar tokens, those are marked as spam also. > I'm still looking for a volunteer to manage the mailing lists, by > the way. It takes just a few minutes per day (every day). I host quite a few mailing lists here for SourceFubar.Net, and I'd be happy to take over management of the lists for you, if you wish. We don't have any spam on the lists we host, and everything works as it should. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From sly at victoria.tc.ca Wed Mar 23 21:00:37 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Mar 23 21:00:54 2005 Subject: [gutvol-d] Humanities Computing conference Message-ID: I'm looking for feedback from other PG volunteers. 
There will be a four-day "humanities computing Summer Institute" taking place in my city in June, as described here: http://web.uvic.ca/hrd/institute/ It looks as if its main focus will be on digitizing texts using the tei dtd. Any ideas on how worthwhile it would be participating in this? Andrew From felix.klee at inka.de Thu Mar 24 03:16:04 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Thu Mar 24 03:17:13 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: References: Message-ID: <87sm2lz43v.wl%felix.klee@inka.de> At Thu, 17 Mar 2005 22:19:52 +0200, Juhana Sadeharju wrote: > A couple of days ago I borrowed a tourist range digital camera. I > could digitize 8 pages per minute. It was as fast and easy as I had > predicted. The digitization speed was limited only by image transfer > technology, not by speed of my fingers. "Easy" is the keyword here. How did OCR'ing go? I wonder because the resolution of cheap digital cameras is quite low for scanning. For example, to scan an A4 page (aspect ratio: sqrt(2)) with a usual digital camera (aspect ratio of images: 4:3) in 300DPI, you need a camera with more than nine mega-pixels. -- Felix E. Klee From bruce at zuhause.org Thu Mar 24 07:54:20 2005 From: bruce at zuhause.org (Bruce Albrecht) Date: Thu Mar 24 07:54:25 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <16962.58028.410879.644627@celery.zuhause.org> David A. Desrosiers writes: > > > I did a very informal comparison of dspam to Spam Assassin, and > > found them to be about the same. > > They are so dramatically different, I can't believe you even > would suggest they're "about the same". > > SpamAssassin is written in Perl, and is significantly slower > than dspam. SpamAssassin also relies on static rulesets, not the > "quality" of the mail received. You can't do per-user filtering with > SA. With dspam, if one user prefers seeing lots of HTML > advertisements, they can. Another user on the same system can reject > those as spam. I don't want this to turn this mailing list into a dspam vs Spam Assassin war, but I think your information about SA is out of date. SA v3 supports multi-tiered (e.g., global, domain, user) configurations, and has bayesian filtering as one of several rules for determining spam. I'd also like to point out that being written in Perl does not imply that something is always much slower than C, especially when large amounts of regular expression pattern matching is involved. Perl developers have spent a lot of time optimizing its pattern matching. The SA Wiki suggests that if you find that SA is slow, you should examine the rule set you're using, and disable inappropriate rules (for example, ones requiring DNS lookups). Bruce From hacker at gnu-designs.com Thu Mar 24 08:15:13 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Thu Mar 24 08:17:02 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <16962.58028.410879.644627@celery.zuhause.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> <16962.58028.410879.644627@celery.zuhause.org> Message-ID: > I don't want this to turn this mailing list into a dspam vs Spam > Assassin war, but I think your information about SA is out of date. You're right, my information is a bit out of date, dspam is quite a bit ahead of SA now, further than I originally surmised (see further down). 
But I agree, let's not turn this into a religious war. > SA v3 supports multi-tiered (e.g., global, domain, user) > configurations, and has bayesian filtering as one of several rules > for determining spam. Does SA support allowing the user to configure their own mail preferences via a simple web interface? Does it support adding and revoking tokens by simply sending the false-positives back through email, without involving a mail administrator? Sure, those things can be written, but do they come as part of the core package? Does that capability exist in the base engine? Incidentally, dspam supports the following, out of the box: - Bayesian filtering - Graham Bayes - Burton Bayes - Noise Reduction - Robinson Geometric Mean calculation - Fisher-Robinson Inverse Chi-Square calculation - Robinson Combined P-Values - Chained Tokens - Neural Networking - Message Innoculation ..and quite a bit more for filtering mail. Does SpamAssassin v3? I'm glad that SA is now beginning to incorporate some of these things now, and they've got a good base project to learn from. I've been very disappointed with SA, and dspam has already trounced it in our case, so we have no need to de-evolve to something that doesn't suit our needs. Less than 10 spam messages total in any user's mailbox in over a year now (that we've been told about), and only a small handful of innocent messages were caught as spam, but were really ham. With the web interface, the user just sends them on to their normal account, and dspam scores them lower, so future versions aren't caught. Works great, and I don't have to be involved in the mail management process _at all_ anymore. > I'd also like to point out that being written in Perl does not imply > that something is always much slower than C, especially when large > amounts of regular expression pattern matching is involved. True, poorly-written C can definately be worse than Perl, but well-written C is ALWAYS going to be faster than equivalently written Perl. I don't think I've ever seen SA process 100 messages/sec., but dspam has no problem doing the same thing, every day. > Perl developers have spent a lot of time optimizing its pattern > matching. The SA Wiki suggests that if you find that SA is slow, you > should examine the rule set you're using, and disable inappropriate > rules (for example, ones requiring DNS lookups). You're preaching to the choir here, I'm a very heavy user and supporter of Perl, and I use it for 99% of my tasks... but there are some cases where an interpreted language just can't compete with a natively-compiled object code. Anyway, good discussions all around. Use whatever tool fits your needs. In my case (heavy mail use from very disparate sources), dspam easily beat what SA could do, hands-down in terms of quality and speed and flexibility. The added benefit is that now I don't have to micro-manage mail, whitelists, or rulesets anymore. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From grythumn at gmail.com Thu Mar 24 09:19:18 2005 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu Mar 24 09:19:27 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: <87sm2lz43v.wl%felix.klee@inka.de> References: <87sm2lz43v.wl%felix.klee@inka.de> Message-ID: <15cfa2a5050324091950b5eab3@mail.gmail.com> On Thu, 24 Mar 2005 12:16:04 +0100, Felix E. Klee wrote: > How did OCR'ing go? I wonder because the resolution of cheap digital > cameras is quite low for scanning. 
For example, to scan an A4 page > (aspect ratio: sqrt(2)) with a usual digital camera (aspect ratio of > images: 4:3) in 300DPI, you need a camera with more than nine > mega-pixels. Let's try something more realistic. Typical book size that I scan is under 8.5x11". Typical page is about 8.5x5.5"; typical text area is 6.5x4" to 7x4.5". So if focused solely on the text area, one would need about 2.2-2.8 megapixels / page, or for a full page impression, about 4.2. Most books do not lie flat enough to get two full page scans from straight up; you're better off doing each page at a time. So a 4 MP camera, with good optical zoom / focus, should be fine. This won't be cheap, but it's not in the same realm as a 9 MP camera. R C From jenzed at gmail.com Thu Mar 24 09:21:25 2005 From: jenzed at gmail.com (Jen Zed) Date: Thu Mar 24 09:21:33 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: References: Message-ID: <7d5745970503240921d601da@mail.gmail.com> The relevance of the workshops and conference depend mostly on what James has planned for the UniBook back-end. James, are you planning to implement TEI / XSL / FO? (Actually, any info about UniBook would be really useful to me, as I've started to think about the site front-end, but can't go very far unless I know what the back-end looks like.) At work, I'm doing a DocBook XSL implementation right now. The issues are similar enough that I might be able to swing a seminar and conference attendance on the company's tab. (DocBook is like TEI, only it's optimized for generating printed reference books.) Too bad we don't have a little pot of money we could use to send people to events like these. Can I hope (request) that getting our non-profit status established is on the agenda for the upcoming meeting in Toronto? jen. On Wed, 23 Mar 2005 21:00:37 -0800 (PST), Andrew Sly wrote: > I'm looking for feedback from other PG volunteers. > > There will be a four-day "humanities computing Summer Institute" > taking place in my city in June, as described here: > > http://web.uvic.ca/hrd/institute/ > > It looks as if its main focus will be on digitizing > texts using the tei dtd. > > Any ideas on how worthwhile it would be participating > in this? > > Andrew > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From sly at victoria.tc.ca Thu Mar 24 10:17:18 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu Mar 24 10:17:25 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: <7d5745970503240921d601da@mail.gmail.com> References: <7d5745970503240921d601da@mail.gmail.com> Message-ID: Just to avoid confusing other PG volunteers too much, I'll state that most of Jen's message was regarding issues for the slowly emerging PG Canada. Any general feedback on the value of a conference such as I mentioned would still be welcome... Andrew On Thu, 24 Mar 2005, Jen Zed wrote: > The relevance of the workshops and conference depend mostly on what [snip] From Bowerbird at aol.com Thu Mar 24 10:34:13 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 24 10:34:28 2005 Subject: [gutvol-d] Humanities Computing conference Message-ID: <1d5.38d9b6e5.2f746225@aol.com> andrew said: > Just to avoid confusing other PG volunteers too much wouldn't want anyone to be confused... as for the conference, i say go for it. it would be nice if _someone_ here could answer questions about t.e.i. 
-bowerbird From felix.klee at inka.de Thu Mar 24 10:41:34 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Thu Mar 24 10:42:27 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: <15cfa2a5050324091950b5eab3@mail.gmail.com> References: <87sm2lz43v.wl%felix.klee@inka.de> <15cfa2a5050324091950b5eab3@mail.gmail.com> Message-ID: <87mzsszy1t.wl%felix.klee@inka.de> At Thu, 24 Mar 2005 12:19:18 -0500, Robert Cicconetti wrote: > > How did OCR'ing go? I wonder because the resolution of cheap > > digital cameras is quite low for scanning. For example, to scan an > > A4 page (aspect ratio: sqrt(2)) with a usual digital camera (aspect > > ratio of images: 4:3) in 300DPI, you need a camera with more than > > nine mega-pixels. > > Let's try something more realistic. Admittedly, for most book scanning tasks the requirements are not as high as I illustrated. However, a simple camera wouldn't fit the need of people that frequently have to create quality scans of pages whose size is around A4 (I'm one of these people). IOW: An ordinary flatbed scanner is probably still the best and cheapest solution for most people. A dream for scanning books, of course, is the BookEye series of scanners that one can sometimes find in some public libraries. -- Felix E. Klee From jlinden at projectgutenberg.ca Thu Mar 24 11:18:40 2005 From: jlinden at projectgutenberg.ca (James Linden) Date: Thu Mar 24 11:22:55 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: <7d5745970503240921d601da@mail.gmail.com> References: <7d5745970503240921d601da@mail.gmail.com> Message-ID: <42431290.4090406@projectgutenberg.ca> Jen Zed wrote: > The relevance of the workshops and conference depend mostly on what > James has planned for the UniBook back-end. James, are you planning to > implement TEI / XSL / FO? TEI will be implemented as an input/output format, yes. It will have nothing to do with the internal workings of the system. XSL isn't needed - the application doesn't rely on transformations of any kind. > (Actually, any info about UniBook would be > really useful to me, as I've started to think about the site > front-end, but can't go very far unless I know what the back-end looks > like.) I'm still working on the tech docs -- the 6 pages of docs that we put on the wiki took me almost two weeks -- tech docs are going to be about 6 pages - per section! > At work, I'm doing a DocBook XSL implementation right now. The issues > are similar enough that I might be able to swing a seminar and > conference attendance on the company's tab. (DocBook is like TEI, only > it's optimized for generating printed reference books.) My demo app (on ibiblio) has an experimental docbook output... when it comes time, I know who I'm going to ask for help to implement that module. :-) > Too bad we don't have a little pot of money we could use to send > people to events like these. Can I hope (request) that getting our > non-profit status established is on the agenda for the upcoming > meeting in Toronto? Meeting in Toronto? What meeting? -- James From jenzed at gmail.com Thu Mar 24 13:06:40 2005 From: jenzed at gmail.com (Jen Zed) Date: Thu Mar 24 13:06:49 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: References: <7d5745970503240921d601da@mail.gmail.com> Message-ID: <7d57459705032413062adba9f9@mail.gmail.com> My apologies, I didn't notice that Andrew's original post was on the PG (as opposed to the PG Canada) list. jen. 
On Thu, 24 Mar 2005 10:17:18 -0800 (PST), Andrew Sly wrote: > > > Just to avoid confusing other PG volunteers too much, I'll state > that most of Jen's message was regarding issues for the slowly > emerging PG Canada. > > Any general feedback on the value of a conference such as I > mentioned would still be welcome... > > Andrew > > On Thu, 24 Mar 2005, Jen Zed wrote: > > > The relevance of the workshops and conference depend mostly on what > > [snip] > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From webmaster at gutenberg.org Sun Mar 27 09:04:12 2005 From: webmaster at gutenberg.org (Marcello Perathoner) Date: Sun Mar 27 09:03:51 2005 Subject: [gutvol-d] [Fwd: Thought on http://www.gutenberg.org/faq/C-18.php] Message-ID: <4246E78C.7010503@gutenberg.org> -------- Original Message -------- Subject: Thought on http://www.gutenberg.org/faq/C-18.php Date: Sun, 27 Mar 2005 17:24:47 +0100 (BST) From: Nick Burch To: webmaster@gutenberg.org Hi I'm not sure if you're the right person on the guttenberg team to send this to, but hopefully if not you're close. I happened across http://www.gutenberg.org/faq/C-18.php from a discussion on slashdot, and I had a thought that there is something you can try. It should be possible to use a copyright repository to prove a book is out of copyright, without having to use the old (and hard to find) edition. I live in Oxford, and we have one of the UK's three copyright repositories, in the form of the Bodleian library. Most people can get temporary access to it for research, and I believe the same is true of the other two libraries. These libraries hold most books published in the UK. So, the steps for a book which you think is out of copyright would be: 1) Get a copy of the new version of the book 2) Find your nearest copyright library 3) Check to see if they have a copy of an older version - the search of the collection should be available online, eg http://library.ox.ac.uk/ 4) Arrange temporary membership of the library 5) Turn up, request the book, and go away for a few hours while someone retrieves it from the stack (most books aren't on open shelves) 6) Compare lots of pages to ensure the text is the same 7) Photocopy a few pages (including the copyright info to be sure) 8) Head home, and set to work on the new version I hope the above makes some sense, and might be of use Nick -- Marcello Perathoner webmaster@gutenberg.org From gbnewby at pglaf.org Sun Mar 27 10:07:40 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Mar 27 10:07:41 2005 Subject: [gutvol-d] Re: [Fwd: Thought on http://www.gutenberg.org/faq/C-18.php] In-Reply-To: <4246E78C.7010503@gutenberg.org> References: <4246E78C.7010503@gutenberg.org> Message-ID: <20050327180740.GA25403@pglaf.org> On Sun, Mar 27, 2005 at 07:04:12PM +0200, Marcello Perathoner wrote: > > > -------- Original Message -------- > Subject: Thought on http://www.gutenberg.org/faq/C-18.php > Date: Sun, 27 Mar 2005 17:24:47 +0100 (BST) > From: Nick Burch > To: webmaster@gutenberg.org > > Hi > > I'm not sure if you're the right person on the guttenberg team to send > this to, but hopefully if not you're close. > > I happened across http://www.gutenberg.org/faq/C-18.php from a discussion > on slashdot, and I had a thought that there is something you can try. It > should be possible to use a copyright repository to prove a book is out of > copyright, without having to use the old (and hard to find) edition. Hi, Nick. 
Thanks for your suggestion. In fact, this is our procedure. We should probably mention it more prominently in our FAQ & Copyright HOWTO. -- Greg > > I live in Oxford, and we have one of the UK's three copyright > repositories, in the form of the Bodleian library. Most people can get > temporary access to it for research, and I believe the same is true of the > other two libraries. These libraries hold most books published in the UK. > > So, the steps for a book which you think is out of copyright would be: > 1) Get a copy of the new version of the book > 2) Find your nearest copyright library > 3) Check to see if they have a copy of an older version - the search of > the collection should be available online, eg http://library.ox.ac.uk/ > 4) Arrange temporary membership of the library > 5) Turn up, request the book, and go away for a few hours while someone > retrieves it from the stack (most books aren't on open shelves) > 6) Compare lots of pages to ensure the text is the same > 7) Photocopy a few pages (including the copyright info to be sure) > 8) Head home, and set to work on the new version > > > I hope the above makes some sense, and might be of use > > Nick > > > > > -- > Marcello Perathoner > webmaster@gutenberg.org From kouhia at nic.funet.fi Tue Mar 29 08:38:17 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Tue Mar 29 08:38:28 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera Message-ID: >From: "David A. Desrosiers" > > You've "invented" camera features? What hardware did you use >when building these features into your camera? What camera model did >you use as a base unit? Inventions nor patentions require physical hardware because the inventions can be readily described in the text. Patent office does not require inventors to send the hardware to them, anymore. I invented, but I have not patented. It basically does not matter which camera gets the features first, but I favor Canon EOS 300D, Nikon D70, and equivalent competitors. I'm curious why you were not interested in the features itself. They are basically public domain, but manufacturers could be interested in them more, if such features appears first in their camera. The competition is now on the camera features. Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From felix.klee at inka.de Tue Mar 29 14:04:01 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Tue Mar 29 14:04:14 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera In-Reply-To: References: Message-ID: <87is3a5ctq.wl%felix.klee@inka.de> At Tue, 29 Mar 2005 19:38:17 +0300, Juhana Sadeharju wrote: > I'm curious why you were not interested in the features itself. Now I'm curious: Could you tell us about the features? ... especially since I think that hardware features are probably not needed that much: Software can automatically detect page borders and correct distortions. As an example have a look at the Bookeye software: It has a crappy user interface but mostly it does a good job. To improve automatic detection of distortions it may be interesting to experiment with generation and interpretation of stereo photos of book pages, but that's probably overkill. -- Felix E. Klee From joshua at hutchinson.net Tue Mar 29 14:17:48 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Tue Mar 29 14:17:57 2005 Subject: [gutvol-d] Re: Scanner vs. 
From joshua at hutchinson.net Tue Mar 29 14:17:48 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Tue Mar 29 14:17:57 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera Message-ID: <20050329221748.6C8751099C0@ws6-4.us4.outblaze.com> I think the original poster was sarcastically making fun of the notion that "invention" is simply a matter of coming up with an original idea. While current patent practice seems to support that view, it is ridiculous to most people. Hence the saying, "Invention is 1% inspiration, 99% perspiration." In other words, just coming up with an idea is the easy part. Josh ----- Original Message ----- From: "Juhana Sadeharju" To: gutvol-d@lists.pglaf.org Subject: [gutvol-d] Re: Scanner vs. digital camera Date: Tue, 29 Mar 2005 19:38:17 +0300 > > > > From: "David A. Desrosiers" > > > > You've "invented" camera features? What hardware did you use when building > > these features into your camera? What camera model did you use as a base > > unit? > > Neither inventions nor patents require physical hardware, because the > inventions can be readily described in text. The patent office > does not require inventors to send in the hardware anymore. > I invented, but I have not patented. It basically does not matter > which camera gets the features first, but I favor the Canon EOS 300D, > Nikon D70, and equivalent competitors. > > I'm curious why you were not interested in the features themselves. > They are basically public domain, but manufacturers could be > more interested in them if such features appear first in > their camera. The competition is now on camera features. > > Juhana > -- > http://music.columbia.edu/mailman/listinfo/linux-graphics-dev > for developers of open source graphics software > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From felix.klee at inka.de Wed Mar 30 02:13:43 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Wed Mar 30 02:14:07 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera In-Reply-To: <20050329221748.6C8751099C0@ws6-4.us4.outblaze.com> References: <20050329221748.6C8751099C0@ws6-4.us4.outblaze.com> Message-ID: <87oed14f1k.wl%felix.klee@inka.de> At Tue, 29 Mar 2005 17:17:48 -0500, Joshua Hutchinson wrote: > In other words, just coming up with an idea is the easy part. Certainly. AFAIK, Switzerland was one of the last countries that required you to hand in working prototypes of devices to be patented. That requirement was overturned by Germany threatening to increase customs duties [1]. Then there's the upcoming threat of software patents - I've been active in that area for quite some time already. Nevertheless Joshua might have some good ideas concerning camera design that he wants to share with us. Seems like I escaped his subtle sarcasm. [1] http://www.sffo.de/machlup1.htm -- Felix E. Klee From nwolcott at dsdial.net Wed Mar 30 05:29:50 2005 From: nwolcott at dsdial.net (N Wolcott) Date: Wed Mar 30 05:30:14 2005 Subject: [gutvol-d] More PG spam being spread around Message-ID: <000a01c5352c$9790a780$0c9495ce@gw98> Resellers of PG books have taken on a new target, Lulu.com. Lulu offers POD publishing at zero up-front cost, thus luring those looking for free advertising for their spam. The postings I have seen so far both imply that PG and Lulu are supporting their spam. They advertise the quality of their texts as being from PG. One of them admits there may be errors. There is probably nothing for PG to do except to get Lulu to take the PG name off their customers' postings. If they want to host 15,000 books on their computers for free, that is their business.
I quote my post to the Lulu forum. I have posted 2 books to Lulu at a 15 cent royalty, with added content to the PG text, and I do not mention PG in the blurb. My "quality" book may soon be submerged in a flood of Lulu spam. Posting follows: ------------------------- Lulu offers a good service for self-publishers who provide "content added" material. This allows the publisher to continually upgrade the product until it is in final form and then market it through Lulu's various mechanisms. However, recently public domain texts lifted from Project Gutenberg have been appearing on Lulu. The accompanying blurb states that www.lulu.com and Project Gutenberg have joined forces to offer you these long out-of-print books. The implication is that somehow Lulu and PG are supporting this effort. PG is trademarked and there is no right to use the name in advertising; enforcing the trademark is another thing, however, for an all-volunteer organization. Software exists to move PG texts to a number of formats (iPod, ebook, etc.), including Lulu. So there is a real possibility that most of the 15,000 PG books could end up being hosted on Lulu. No review copy would ever be required, so the posting would be free for the converter. Lulu could end up hosting the entire PG corpus for free in a kind of publishing spam. The books are listed with a royalty of $1 to $2. One is published with a $1.59 royalty, and claims that $1 will be contributed to PG for every book sold. This leaves only 27 cents for the seller. In one case the publisher had re-copyrighted the book and in the other had listed it as Public Domain. Nothing wrong with this, but the copyright only applies to "new material" and certainly not the entire book. In one case an ISBN number was listed, so Lulu might have gotten some revenue from that if the ISBN is real. One of the books was listed as 5000 in sales, so I imagine that is how many Lulu has in its archive. It may soon get 14,999 more! Another feature of Lulu is that you never know who is selling the book. Lulu distributes it, but the real seller is someone else, unknown. This may raise legal issues about ultimate responsibility. People like myself who provide added content at no or minimal royalty will be unhappy to see our listing efforts buried in an avalanche of Lulu spam. At the very least Lulu should require permission before violating trademark laws. To see the books in this post, search for "Verne" on Lulu. The additional cost of hosting all these books could end up forcing up-front charges on Lulu providers or radically restructuring the way Lulu operates, neither of which is desirable in my humble opinion. I mention this as a discussion topic, as I feel it is an emerging problem. --------------------- N Wolcott nwolcott2@post.harvard.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050330/00d64f22/attachment.html From kouhia at nic.funet.fi Thu Mar 31 06:26:22 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Thu Mar 31 06:26:31 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera Message-ID: >From: "Felix E. Klee" > >How did OCR'ing go? I wonder because the resolution of cheap digital >cameras is quite low for scanning. Well, I did not test OCR'ing at all. :-) I store digitizations only as images, which are also used for reading. Please test it yourself and report the results to the list. ftp://ftp.funet.fi/pub/sci/audio/devel/books/ The first few images are various tests.
The digitization sequence test starts at the image 1438. Remember, it is a tourist camera with lens distortions and poor focus control. I used a plain ceiling light, not better movable lights. The book is on a chair and the photographed page points directly up -- which is wrong. Yes, one page per image is better because the page bends when the book is laid wide open. The book and camera stand could be designed so that the book rests in a V-shaped holder and the camera faces the page perpendicularly. That is, the camera would not be above the book and would not face down. (A scanner which allows the book to rest on the edge of the scanning glass solves the same bending-pages problem. So does a scanning glass wedge.) Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From traverso at dm.unipi.it Thu Mar 31 06:52:39 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Thu Mar 31 06:50:59 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera In-Reply-To: (message from Juhana Sadeharju on Thu, 31 Mar 2005 17:26:22 +0300) References: Message-ID: <200503311452.j2VEqdn29068@posso.dm.unipi.it> >>>>> "Juhana" == Juhana Sadeharju writes: >> From: "Felix E. Klee" >> >> How did OCR'ing go? I wonder because the resolution of cheap >> digital cameras is quite low for scanning. Juhana> Well, I did not test OCR'ing at all. :-) I store Juhana> digitizations only as images, which are also used for Juhana> reading. Juhana> Please test it yourself and report the results to the list. Juhana> ftp://ftp.funet.fi/pub/sci/audio/devel/books/ The first few Juhana> images are various tests. The digitization sequence Juhana> test starts at the image 1438. Please, instead of putting a big 72MB tar.gz file there, can you put up some individual images? Probably downloading a couple is enough to say that they are unsuitable for OCR. Indeed, my attempts with a good digital camera (5Mpixels, manual focus, uncompressed tiff output, a special mode for text, a professional tripod, etc.) have been poor. Carlo Traverso
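The quick check Carlo suggests can be scripted. The sketch below is a minimal illustration in Python, assuming the OpenCV and pytesseract packages plus a working Tesseract install; the file name echoes Juhana's image numbering but is an arbitrary placeholder, and the sharpness cutoff is an uncalibrated guess. It shows the idea only, not a tool anyone on the list actually ran.

import cv2
import pytesseract

def ocr_suitability(path):
    """Rough check of whether a camera shot of a book page is worth OCR'ing."""
    image = cv2.imread(path)
    if image is None:
        raise IOError("could not read " + path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Variance of the Laplacian is a common focus/sharpness measure;
    # the 100.0 cutoff is only a rough, uncalibrated guess.
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    text = pytesseract.image_to_string(gray)
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    print("sharpness: %.1f %s" % (sharpness, "(likely too blurry)" if sharpness < 100.0 else ""))
    print("words recognized: %d" % len(words))

ocr_suitability("image1438.jpg")

A near-zero word count on a reasonably sharp image usually points to resolution, lighting, or page curvature as the limiting factor rather than the OCR engine itself.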
From kth at srv.net Thu Mar 31 08:18:42 2005 From: kth at srv.net (Kevin Handy) Date: Thu Mar 31 08:51:58 2005 Subject: [gutvol-d] DP Down? Message-ID: <424C22E2.6040603@srv.net> Is it just me, or is DP down today? All I get is a forbidden message. Any news on when it will be available again? From miranda_vandeheijning at blueyonder.co.uk Thu Mar 31 08:57:19 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Thu Mar 31 08:57:28 2005 Subject: [gutvol-d] DP Down? In-Reply-To: <424C22E2.6040603@srv.net> References: <424C22E2.6040603@srv.net> Message-ID: <424C2BEF.9000206@blueyonder.co.uk> DP's ISP has been down today.... The latest news from the DP local bar, aka the chatroom at jabber.org, is that the ISP is back up, but we are still waiting for DP's server to come back. Keep checking! In the meantime, you can visit the European site http://dp.rastko.net/ for all your proofing needs. Best regards, Miranda Kevin Handy wrote: > Is it just me, or is DP down today? All I get is a forbidden message. > Any news on when it will be available again? > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > From servalan at ar.com.au Thu Mar 31 17:54:33 2005 From: servalan at ar.com.au (Pauline) Date: Thu Mar 31 17:55:29 2005 Subject: [gutvol-d] DP Down? In-Reply-To: <424C2BEF.9000206@blueyonder.co.uk> References: <424C22E2.6040603@srv.net> <424C2BEF.9000206@blueyonder.co.uk> Message-ID: <424CA9D9.6030508@ar.com.au> Miranda van de Heijning wrote: > DP's ISP has been down today.... The latest news from the DP local bar, > aka the chatroom at jabber.org, is that the ISP is back up, but we are > still waiting for DP's server to come back. Keep checking! & the DP server is now back up & available. Thanks for your patience. Cheers, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan