From maitriv at yahoo.com Tue Mar 1 11:28:15 2005 From: maitriv at yahoo.com (maitri venkat-ramani) Date: Tue Mar 1 11:28:25 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: Message-ID: <20050301192815.88274.qmail@web52310.mail.yahoo.com> Technological progress reaching end users in developing countries makes me so happy! They bear a lot of the brunt for our wellbeing. Is there any way we can get PG books to this school and others like it? Do we have any African contacts? Thanks, Maitri ============================================================ Kenyan school turns to handhelds By Julian Siddle BBC Go Digital At the Mbita Point primary school in western Kenya students click away at a handheld computer with a stylus. They are doing exercises in their school textbooks which have been digitised. It is a pilot project run by EduVision, which is looking at ways to use low cost computer systems to get up-to-date information to students who are currently stuck with ancient textbooks. Matthew Herren from EduVision told the BBC programme Go Digital how the non-governmental organisation uses a combination of satellite radio and handheld computers called E-slates. "The E-slates connect via a wireless connection to a base station in the school. This in turn is connected to a satellite radio receiver. The data is transmitted alongside audio signals." The base station processes the information from the satellite transmission and turns it into a form that can be read by the handheld E-slates. "It downloads from the satellite and every day processes the stream, sorts through content for the material destined for the users connected to it. It also stores this on its hard disc." Linux link The system is cheaper than installing and maintaining an internet connection and conventional computer network. But Mr Herren says there are both pros and cons to the project. "It's very simple to set up, just a satellite antenna on the roof of the school, but it's also a one-way connection, so getting feedback or specific requests from end users is difficult." The project is still at the pilot stage and EduVision staff are on the ground to attend to teething problems with the Linux-based system. "The content is divided into visual information, textual information and questions. Users can scroll through these sections independently of each other." EduVision is planning to include audio and video files as the system develops and add more content. Mr Herren says this would vastly increase the opportunities available to the students. He is currently in negotiations to take advantage of a project being organised by search site Google to digitise some of the world's largest university libraries. "All books in the public domain, something like 15 million, could be put on the base stations as we manufacture them. Then every rural school in Africa would have access to the same libraries as the students in Oxford and Harvard" Currently the project is operating in an area where there is mains electricity. But Mr Herren says EduVision already has plans to extend it to more remote regions. "We plan to put a solar panel at the school with the base station, have the E-slates charge during the day when the children are in school, then they can take them home at night and continue working." Maciej Sundra, who designed the user interface for the E-slates, says the project's ultimate goal is levelling access to knowledge around the world. 
"Why in this age when most people do most research using the internet are students still using textbooks? The fact that we are doing this in a rural developing country is very exciting - as they need it most." Story from BBC NEWS: http://news.bbc.co.uk/go/pr/fr/-/2/hi/technology/4304375.stm Published: 2005/02/28 11:47:23 GMT __________________________________ Do you Yahoo!? Yahoo! Sports - Sign up for Fantasy Baseball. http://baseball.fantasysports.yahoo.com/ From brandon at corruptedtruth.com Tue Mar 1 14:21:55 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Tue Mar 1 14:22:06 2005 Subject: [gutvol-d] [Fwd: [Public Knowledge] Trouble Locating Copyright Owners? Tell the Copyright Office Your Story] Message-ID: <4224EB03.5080903@corruptedtruth.com> I thought this would be of interest to list members =) -brandon -------------- next part -------------- An embedded message was scrubbed... From: publicknowledge-admin@publicknowledge.org Subject: [Public Knowledge] Trouble Locating Copyright Owners? Tell the Copyright Office Your Story Date: Tue, 01 Mar 2005 17:14:49 -0500 Size: 4497 Url: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050301/2d65269d/PublicKnowledgeTroubleLocatingCopyrightOwnersTelltheCopyrightOfficeYourStory.mht From brandon at corruptedtruth.com Tue Mar 1 21:35:28 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Tue Mar 1 21:35:47 2005 Subject: [gutvol-d] Repost: Public Knowledge Orphaned Works Project Message-ID: <422550A0.5050807@corruptedtruth.com> Sorry for the repost, just noticed the mailing list scrubbed my forward; Are you an artist, author, musician, or filmmaker? Maybe you're a scholar or librarian? If so, have you ever wanted to use a copyrighted work but been unable to locate the owner to clear the rights? It's a problem that happens all too often, and not only does it affect your work, but it also "orphans" the original owner's work. It's an unfortunate side effect of current copyright law that diminishes everyone's ability to create, innovate, and educate. Fortunately, we have good news: The U.S. Copyright Office wants to make it easier to locate copyright holders, and it's asking for the public's help. Before the Copyright Office can *address* the problem, it needs to gather evidence that there *is* a problem. This is where you come in: tell your story to the Copyright Office. Public Knowledge along with a number of other like-minded organizations have created Ophanworks.org: an easy way for you to submit your story to the Copyright Office. Now is your chance to tell the Office what personal difficulties you've had when trying to clear rights. To get started, go to: http://www.orphanworks.org Never tried to clear rights? Maybe you know someone who has. Forward them this message or visit: http://www.orphanworks.org to send them an email. You can always learn more about the problem of "orphan works" and the U.S. Copyright Office's notice, by visiting Public Knowledge's website: http:/www.publicknowledge.org/issues/ow ====================== Public Knowledge collaborated with the EFF to set up orphanworks.org as a resource for everyone to facilitate public participation in copyright policy. If you'd like to support this and future efforts, please make a contribution: http://publicknowledge.org/donate ====================== Thanks for participating! Your friends at Public Knowledge February 28, 2005 ____________________________ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050301/c50c1801/attachment.html From squawker at myrealbox.com Tue Mar 1 21:36:55 2005 From: squawker at myrealbox.com (Doug Adams) Date: Tue Mar 1 21:37:11 2005 Subject: [gutvol-d] quick question re: lack of date and copyright clearance Message-ID: <1109741815.8c22f19csquawker@myrealbox.com> How do i get a work cleared for copyright if it doesn't have a date on the cover page? I am working on a book that I know for a fact is from the nineteenth century. The publisher, however, neglected to include the date. From prosfilaes at gmail.com Tue Mar 1 21:52:48 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 1 21:53:07 2005 Subject: [gutvol-d] quick question re: lack of date and copyright clearance In-Reply-To: <1109741815.8c22f19csquawker@myrealbox.com> References: <1109741815.8c22f19csquawker@myrealbox.com> Message-ID: <6d99d1fd0503012152595273d6@mail.gmail.com> Doug Adams wrote: > How do i get a work cleared for copyright if it doesn't have a date on > the cover page? I am working on a book that I know for a fact is from > the nineteenth century. The publisher, however, neglected to include the date. If you can match it up with an edition in a library catalog (the Library of Congress or the British Library are good for this), it'll probably be clearable. Bonus points if it says it was printed in the US, since then you only have to establish printing pre-1989, not pre-1923 (in most cases.) From kouhia at nic.funet.fi Wed Mar 2 08:42:42 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Wed Mar 2 08:42:55 2005 Subject: [gutvol-d] Re: Enlightened Self Interest Message-ID: Hello. The master format should be the digitized images of the original book pages. No font, nor footnote, nor math, nor any problems in readability, nor in representing the original text. I find the digitized images more pleasant than any ascii, html, word or TeX text. I don't know the reason but perhaps the art of typesetting and printing was better then than it is now! Any other format can be generated from the digitized images. If some conversion between html and TeX (say) does not go well, one can always check against the original typesetting from the images. So, keep archiving the digitized images!! 200 dpi with 32 grey levels starts looking ok but 300 dpi with 256 levels should be enough even for math texts. Forget 1-bit digitizations completely!!! Best regards, Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From jon at noring.name Wed Mar 2 10:00:16 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 10:01:32 2005 Subject: [gutvol-d] Re: Enlightened Self Interest In-Reply-To: References: Message-ID: <2912024375.20050302110016@noring.name> Juhana wrote: > Hello. The master format should be the digitized images of the > original book pages. No font, nor footnote, nor math, nor any > problems in readability, nor in representing the original text. > > I find the digitized images more pleasant than any ascii, html, > word or TeX text. I don't know the reason but perhaps the art of > typesetting and printing was better then than it is now!... > > So, keep archiving the digitized images!! 200 dpi with 32 grey levels > starts looking ok but 300 dpi with 256 levels should be enough > even for math texts. Forget 1-bit digitizations completely!!! 
If the only purpose of scanning books is for OCRing whereupon the scans are either dumped or saved simply for "proving" provenance, then 300 dpi is *usually* sufficient: 8-bit greyscale for black and white, and 24-bit color for color pages. (If some type is very small, such as 5 point and less, then 600 dpi is usually required.) However, in my consultations with experts in the field, and personal experimentation (My Antonia at http://www.openreader.org/myantonia/ ), if the scans are to be used for multiple purposes besides OCR, such as for direct reading and other uses where sharpness is aesthetically important, then it is recommended to scan them at 600 dpi (optical) -- and 1200 dpi (optical) if the print is *very* small. Unfortunately, the resulting scan images become quite large (unless one uses lossy compression, such as DjVu, which is not recommended for the master archiving but alright for end-user delivery.) But if a job is worth doing, it is worth doing right. If there is one area in which DP seems to fall short (let me know if I'm wrong here), it is with respect to page scan resolution and archiving (or lack thereof). It is understandable considering the required disk space and bandwidth requirements (to move the scans around), but IA is a place to donate page scans once proofing is done (maybe this is already being done), and I'm sure others can be found who will gladly set up a terabyte storage box to store DP's 600 dpi page scans -- just post a plea to SlashDot and there'll probably be several volunteers who will step forward with spare terabytes available. Btw, if anyone here has made, or plans to make, 600 dpi (optical) greyscale or color scans of any public domain books including the book covers (and this includes books printed between 1923 and 1963 which may be public domain), I'll gladly accept donations of them on CD-ROM and DVD-ROM. I will also gladly accept the source books themselves, including if they've been chopped. I eventually will build a multi-terabyte hard disk storage system to support various activities including Distributed Scanners. Of course, the scans should be donated to IA as well so they can immediately be made available to the world. Jon Noring From brandon at corruptedtruth.com Wed Mar 2 10:20:21 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Wed Mar 2 10:20:47 2005 Subject: [gutvol-d] Re: Enlightened Self Interest In-Reply-To: <2912024375.20050302110016@noring.name> References: <2912024375.20050302110016@noring.name> Message-ID: <422603E5.6010105@corruptedtruth.com> I for one am both a lurker on here AND a slashdot reader =) How many terabytes do you think we'd need? Putting together a relatively cheap 3/4/5 TB NAS is fairly easy considering the price of 300/400 GB SATA drives has been dropping steadily. This may be something we'd want to talk to iBiblio about though, as they already have the infrastructure in place. No point in re-inventing the wheel. -brandon Jon Noring wrote: >Juhana wrote: > > > >>Hello. The master format should be the digitized images of the >>original book pages. No font, nor footnote, nor math, nor any >>problems in readability, nor in representing the original text. >> >>I find the digitized images more pleasant than any ascii, html, >>word or TeX text. I don't know the reason but perhaps the art of >>typesetting and printing was better then than it is now!... >> >>So, keep archiving the digitized images!! 
200 dpi with 32 grey levels >>starts looking ok but 300 dpi with 256 levels should be enough >>even for math texts. Forget 1-bit digitizations completely!!! >> >> > >If the only purpose of scanning books is for OCRing whereupon the >scans are either dumped or saved simply for "proving" provenance, then >300 dpi is *usually* sufficient: 8-bit greyscale for black and white, >and 24-bit color for color pages. (If some type is very small, such as >5 point and less, then 600 dpi is usually required.) > >However, in my consultations with experts in the field, and personal >experimentation (My Antonia at http://www.openreader.org/myantonia/ ), >if the scans are to be used for multiple purposes besides OCR, such as >for direct reading and other uses where sharpness is aesthetically >important, then it is recommended to scan them at 600 dpi (optical) -- >and 1200 dpi (optical) if the print is *very* small. Unfortunately, >the resulting scan images become quite large (unless one uses lossy >compression, such as DjVu, which is not recommended for the master >archiving but alright for end-user delivery.) But if a job is worth >doing, it is worth doing right. > >If there is one area which DP seems to fall short (let me know if I'm >wrong here) is with respect to page scan resolution and archiving (or >lack thereof). It is understandable considering the required disk >space and bandwidth requirements (to move the scans around), but IA >is a place to donate page scans once proofing is done (maybe this is >already being done), and I'm sure others can be found who will gladly >setup a terabyte storage box to store DP's 600 dpi page scans -- just >post a plea to SlashDot and there'll probably be several volunteers >who will step forward with spare terabytes available. > >Btw, if anyone here has made, and plans to make, 600 dpi (optical) >greyscale or color scans of any public domain books including the book >covers (and this includes books printed between 1923 and 1963 which >may be public domain), I'll gladly accept donations of them on CD-ROM >and DVD-ROM. I will also gladly accept the source books themselves, >including if they've been chopped. I eventually will build a >multi-terabyte hard disk storage system to support various activities >including Distributed Scanners. Of course, the scans should be >donated to IA as well so they can immediately be made available to the >world. > >Jon Noring > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050302/c63161e1/attachment-0001.html From jon at noring.name Wed Mar 2 11:46:00 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 11:46:41 2005 Subject: [gutvol-d] Re: Enlightened Self Interest In-Reply-To: <422603E5.6010105@corruptedtruth.com> References: <2912024375.20050302110016@noring.name> <422603E5.6010105@corruptedtruth.com> Message-ID: <118367750.20050302124600@noring.name> Brandon wrote: > I for one am both a lurker on here AND a slashdot reader =) I how > many terabytes do you think we'd need? Putting together a relatively > cheap 3/4/5 TB NAS is fairly easy considering the price of 300/400 > GB SATA drives has been dropping steadily. This may be something > we'd want to talk to iBiblio about though, as they already have the > infrastructure in place. No point in re-inventing the wheel. 
Since we are talking primarily about pre-1923 public domain books, most of them are black and white, so I'll restrict the analysis to those books. Color substantially adds to disk space requirements. (Also, many of the books published in the 1923-63 time frame, 90% of which are in the public domain, are black and white.) Ideally, we would like to scan the books at 600 dpi (optical), 8-bit greyscale, and store the images in some lossless compressed format (such as PNG). The images should not have gone through any lossy stage to get to this point, such as JPEG, since this adds annoying artifacts to the images. Unfortunately, this results in some pretty large scans. Using the data I have for the "My Antonia" project, a typical 600 dpi (optical) greyscale page saved as PNG occupies about 4.5 megs. So for a typical 300 page book, this works out to about 1.5 gigs per book (rounding up some to cover incidentals.) A terabyte hard disk storage system (optimized for data warehousing, since optimizing for server use increases the hardware cost) would thus hold about 700 books. This is not that many when there are potentially several million public domain books out there (especially if we include the many public domain books in the 1923-1963 range.) What could be done in the next few years, until multi-terabyte hard disk data warehousing systems become dirt cheap, is to backup the lossless greyscale scans onto DVD-ROM (which, granted, is risky), or even press DVDs (requires equipment to do this -- maybe someone will donate access to their DVD presser?) Of course, we should donate copies of the DVDs to IA and to other groups (?iBiblio) and hope they will preserve them, even moving them to hard disk. In the meanwhile, for public access and massive mirroring, we can convert the 600 dpi greyscale to 600 dpi bitonal (2-color black and white -- it is important to manually select the cutoff greyscale value for best quality.) This will save a *lot* of space and will be *minimally* acceptable as archival copies should the original greyscale scans get lost or become unreadable. Using 2-color PNG, a typical page now scrunches down to about 125 Kbytes, or about 40 Mbytes per book (using CCITT lossless compression, which is optimized for bitonal scans of text, it is possible to get the size down to about 60 Kbytes -- but this is an obscure format -- all web browsers will display PNG, but it requires a plugin or a special graphics program to display CCITT TIFFs. There may also be some proprietary problems with CCITT.) This way we can now store about 25,000 books on a terabyte server, which is very doable and will be sufficient for Distributed Scanners (or similar project) for a few years (in the meanwhile, disk space should continue to get cheaper and cheaper to the point we might even begin migrating the biggie-size greyscale scans stored on DVD or other storage medium back to mirrored hard disk servers.) Some of my thinking -- no doubt there's other approaches to consider. Should I start a "Distributed Scanners" discussion group at Yahoo? It seems like there may be enough people interested in this project. 
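As a quick check on the arithmetic above, here is a minimal back-of-the-envelope sketch in Python. The per-page sizes are the rough figures quoted in the message -- about 4.5 MB for a 600 dpi greyscale PNG page and about 125 KB for a 600 dpi bitonal PNG page -- not measurements, and the 10% allowance for covers and incidentals is an assumption.

    # Back-of-the-envelope storage estimates for page-scan archiving.
    # Per-page sizes are the rough figures quoted above, not measurements.
    MB = 1024 ** 2
    TB = 1024 ** 4
    PAGES_PER_BOOK = 300

    def books_per_terabyte(page_bytes, pages=PAGES_PER_BOOK, overhead=1.1):
        """Approximate books per 1 TB, padding 10% for covers and incidentals."""
        book_bytes = page_bytes * pages * overhead
        return TB / book_bytes, book_bytes

    for label, page_bytes in [
        ("600 dpi greyscale PNG, ~4.5 MB/page", 4.5 * MB),
        ("600 dpi bitonal PNG, ~125 KB/page", 125 * 1024),
    ]:
        count, book_bytes = books_per_terabyte(page_bytes)
        print("%s: ~%.0f MB/book, ~%d books/TB" % (label, book_bytes / MB, count))

This reproduces the estimates in the message: roughly 1.5 GB per book and about 700 books per terabyte for the greyscale masters, versus roughly 40 MB per book and about 25,000 books per terabyte for the bitonal derivatives.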
Jon From hart at pglaf.org Wed Mar 2 12:22:07 2005 From: hart at pglaf.org (Michael Hart) Date: Wed Mar 2 12:22:08 2005 Subject: [gutvol-d] quick question re: lack of date and copyright clearance In-Reply-To: <6d99d1fd0503012152595273d6@mail.gmail.com> References: <1109741815.8c22f19csquawker@myrealbox.com> <6d99d1fd0503012152595273d6@mail.gmail.com> Message-ID: If you include the number of pages, physical dimensions, binding type and color, and provide this data to a reference librarian, the odds go way up in identifying the particular edition. mh On Tue, 1 Mar 2005, David Starner wrote: > Doug Adams wrote: >> How do i get a work cleared for copyright if it doesn't have a date on >> the cover page? I am working on a book that I know for a fact is from >> the nineteenth century. The publisher, however, neglected to include the date. > > If you can match it up with an edition in a library catalog (the Library > of Congress or the British Library are good for this), it'll probably > be clearable. Bonus points if it says it was printed in the US, since > then you only have to establish printing pre-1989, not pre-1923 (in > most cases.) > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From squawker at myrealbox.com Wed Mar 2 12:45:02 2005 From: squawker at myrealbox.com (Doug Adams) Date: Wed Mar 2 12:45:11 2005 Subject: [gutvol-d] Re: quick question re: lack of date and copyright Message-ID: <1109796302.333c8dbcsquawker@myrealbox.com> >From: David Starner >If you can match it up with an edition in a library catalog >(the Library of Congress or the British Library are good for >this), it'll probably be clearable. Bonus points if it says it >was printed in the US, since then you only have to >establish printing pre-1989, not pre-1923 (in most cases.) Thanks David! I've found my version in the Library of Congress. The listing says it was published in: Chicago, Belford, Clarke [187-?] So even the LOC doesn't have a date for the book. Now a technical question. How do I submit this to get clearance without the date. Do I need to do it by email to someone. (I've previously used the internet form.) From vze3rknp at verizon.net Wed Mar 2 12:52:17 2005 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Wed Mar 2 12:52:25 2005 Subject: [gutvol-d] Re: quick question re: lack of date and copyright In-Reply-To: <1109796302.333c8dbcsquawker@myrealbox.com> References: <1109796302.333c8dbcsquawker@myrealbox.com> Message-ID: <42262781.7010207@verizon.net> You can submit on the internet form. Just write the word "none" where the date would usually go. You can add a link to the LoC listing in the comments section. Be sure to include scans of both the title page and the verso. JulietS Doug Adams wrote: >>From: David Starner >>If you can match it up with an edition in a library catalog >>(the Library of Congress or the British Library are good for >>this), it'll probably be clearable. Bonus points if it says it >>was printed in the US, since then you only have to >>establish printing pre-1989, not pre-1923 (in most cases.) >> >> > >Thanks David! I've found my version in the Library of Congress. The listing says it was published in: > >Chicago, Belford, Clarke [187-?] > >So even the LOC doesn't have a date for the book. Now a technical question. How do I submit this to get clearance without the date. Do I need to do it by email to someone. (I've previously used the internet form.) 
> >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > From marcello at perathoner.de Wed Mar 2 12:17:08 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 2 12:55:19 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org Message-ID: <42261F44.1000005@perathoner.de> We are ready to migrate the web site to the new fast file server. Also some slight changes were made to the online catalog to make it better cacheable: The dynamic authrec pages have been dropped in favour of the static browse-by-author pages. Browse-by-author now includes all information from the authrec pages. Redirects are in place. The search has been optimized to redirect simple searches (searches for author only, title only) to the appropriate browse-by-author and browse-by-title pages. A preview is online at: www-dev.gutenberg.org Please test and report any oddities. -- Marcello Perathoner webmaster@gutenberg.org From jeroen.mailinglist at bohol.ph Wed Mar 2 13:28:17 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed Mar 2 13:28:02 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <7815066468.20050225202613@noring.name> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> <7815066468.20050225202613@noring.name> Message-ID: <42262FF1.2050504@bohol.ph> I didn't notice this discussion was heading to my favourite subject... TEI. I guess enlightened is on my mental spam filter... Jon Noring wrote: > >For maximum archivability, repurposeability and accessibility, it is >important for the XML markup vocabulary used in the master document to >be wholly structural and semantic. Except where absolutely necessary >(and maybe best solved using SVG and MathML), presentational markup >should be avoided. > > > Since we are reproducing printed works, it is often not possible to reconstruct the intended semantics of the author. This is especially true of books before the mid 19th century, when typographic conventions were not as well established. For many older books the best we can do is capture the typography in some "reduced" way. The good thing about TEI is that it actually supports that. >TEI is primarily structural/semantic, but there are some presentational >components. The base DP-TEI (I envision three levels of DP-TEI), when >it comes into being, should not specify any presentational markup >components. > >I am not familiar with OpenOffice's XML vocabulary, but I would guess >that it, too, is a mix of structural/semantic tags with presentation >tags (I also guess that it is much more presentationally-oriented than >TEI, and doesn't have the structural/semantic richness of TEI.) If >OpenOffice's XML vocabulary is to be used, it should be subsetted (at >least at the base level) to not allow presentational markup. > > > OpenOffice XML has a lot of features geared towards an office application and the nasty details of presentation. It is quite presentational, and I wouldn't recommend it as a long-term archive format. However, it is much better structured than Microsoft .DOC format, and considerably more compact (using zip as it does). 
>I do not recommend DocBook as the primary markup vocabulary for >general books, but certainly it is intriguing to consider it as a >second "blessed" vocabulary for particular types of documents it >is designed for (primarily technical documents.) > > > Reminds me of that old saying about standards, good to have so many to choose from... DocBook is fine for technical manuals written from scratch, not for capturing a nineteenth century novel, or sixteenth century history. Jeroen. > > > From jeroen.mailinglist at bohol.ph Wed Mar 2 13:30:36 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Wed Mar 2 13:30:20 2005 Subject: [gutvol-d] Enlightened Self Interest In-Reply-To: <4220A2E9.7010300@hutchinson.net> References: <16927.48790.533173.228950@celery.zuhause.org> <20050226012008.95779.qmail@web41601.mail.yahoo.com> <20050226020452.GA24272@panix.com> <421FE721.2000004@zytrax.com> <20050226032924.GA29574@panix.com> <4220A2E9.7010300@hutchinson.net> Message-ID: <4226307C.1050109@bohol.ph> Joshua Hutchinson wrote: > > 1 - Converting those texts that come through me from DP into PGTEI > master format. I then use the online PGTEI -> HTML conversion routine > to convert them to HTML for posting to PG. Most of them are not > converted to TEXT simply because someone else at DP did the text > version before I got to them. In other words, I've been mostly > concentrating on the PGTEI format itself and the HTML output that > results from it. > I've been producing all my ebooks as TEI (since 1997), but since Gutenberg can't deal with it, I've hardly ever been able to post them. Please don't convert any text I've submitted before asking me. All my HTML comes from a single stylesheet. Jeroen. From joshua at hutchinson.net Wed Mar 2 13:52:21 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Wed Mar 2 13:52:33 2005 Subject: [gutvol-d] Enlightened Self Interest Message-ID: <20050302215221.9DF079E8FF@ws6-2.us4.outblaze.com> ----- Original Message ----- From: "Jeroen Hellingman (Mailing List Account)" > > Joshua Hutchinson wrote: > > > > > 1 - Converting those texts that come through me from DP into PGTEI master > > format. I then use the online PGTEI -> HTML conversion routine to convert > > them to HTML for posting to PG. Most of them are not converted to TEXT > > simply because someone else at DP did the text version before I got to them. > > In other words, I've been mostly concentrating on the PGTEI format itself > > and the HTML output that results from it. > > > I've been producing all my ebooks as TEI (since 1997), but since Gutenberg > can't deal with it, I've hardly ever been able to post them. Please don't > convert any text I've submitted before asking me. All my HTML comes from a > single stylesheet. > Oh, I don't grab works at random! ;) I've been helping people that don't want to learn HTML, mostly. They send me a finished text version (with the page breaks still intact) and I convert that to TEI. The longest part of the conversion is fixing up image links. Pretty much everything else is handled through RegEx. Right now, the vast majority of the TEI documents come through the conversion process to HTML with near perfection. I just put an inline style section in place of the linked css file and convert the "TeX" style single quotes back into straight single quotes ('). 
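The quote cleanup Josh mentions is a one-line substitution in most scripting languages. Here is a minimal sketch in Python, assuming the TeX-style single quotes survive into the HTML output as backtick/apostrophe pairs (`like this'); the actual characters emitted by the PGTEI toolchain may differ, so the pattern is illustrative rather than definitive:

    import re

    def straighten_quotes(html):
        # `word' -> 'word'  (assumed TeX-style quoting; adjust the pattern
        # to whatever characters the converter actually emits)
        html = re.sub(r"`([^`']*)'", r"'\1'", html)
        # catch any stray backticks left over
        return html.replace("`", "'")

    print(straighten_quotes("He said `hello' and left."))   # He said 'hello' and left.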
Josh From Bowerbird at aol.com Wed Mar 2 16:01:31 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 2 16:01:48 2005 Subject: [gutvol-d] so jon Message-ID: <7EA3EC8D.678C432A.023039A8@aol.com> so jon, are you going to take up my challenge? if not, tell me, and i'll do that o.c.r. myself. we're gonna see what accuracy-level we can get on your nice hi-res scans of "my antonia"... -bowerbird From jon at noring.name Wed Mar 2 16:28:10 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 16:28:25 2005 Subject: [gutvol-d] so jon In-Reply-To: <7EA3EC8D.678C432A.023039A8@aol.com> References: <7EA3EC8D.678C432A.023039A8@aol.com> Message-ID: <8935298093.20050302172810@noring.name> Bowerbird wrote: > so jon, are you going to take up my challenge? I didn't know you issued a challenge. > if not, tell me, and i'll do that o.c.r. myself. > we're gonna see what accuracy-level we can get > on your nice hi-res scans of "my antonia"... I can only ask my friend so much for Abbyy scanning (he effectively pays a per page fee for using Abbyy), and your request does not qualify as anything important enough for me to spend "capital" on. So feel free to go ahead and do what you will with the "My Antonia" scans. That's why they're online (I will need to make some sort of usage statement for them, maybe a Creative Commons license -- but the intent is for the whole world to have ready access to them.) I'm curious to know how well various OCR packages will perform on "My Antonia" since the XHTML version is very accurate to the original -- so it can form sort of a test base. Of course, if you or anyone else finds an error in the XHTML version as a result of the OCR test, I'll appreciate being informed so I can make the correction. Others here who use their own OCR package, feel free to test it out on the My Antonia scans. Go to: http://www.openreader.org/myantonia/ Jon (btw, I plan to soon scan my original edition of Burton's "Kama Sutra", and that will be a much greater challenge to any OCR package, even if it were new, because of very small print, overall poor typesetting, and poor print quality.) From jmdyck at ibiblio.org Wed Mar 2 16:43:26 2005 From: jmdyck at ibiblio.org (Michael Dyck) Date: Wed Mar 2 16:44:22 2005 Subject: [gutvol-d] DP anniversary? Message-ID: <42265DAE.C60930D1@ibiblio.org> In today's PG Weekly newsletter, and in a posting to the Book People mailing list, Michael Hart says: "This is the 4th Anniversary of The Distributed Proofreaders!!!" However, if you go to the DP site , you'll see that it says that DP was founded in 2000. Moreover, Charles Franks posted to the gutvol-d list on April 20, 2000, saying (in part) "I have completed the working beta of a distributed proofreaders website." and giving a link. I'm not sure if that was the first public announcement of DP, but in any case, DP is about 5 years old. Michael Hart appears to be referring to the 4th anniversary of March 13th, 2001, which is when the PG Weekly newsletter says DP completed its first book (PG #3320). However, that book ("Mohammed Ali and His House" by Louise Muhlbach) was actually posted April 2nd, 2001, so it's unclear where the March 13 date comes from. Moreover, a month or so *before* then, the PG newsletter for February 2001[a] says "The Online Distributed Proofreading Team has completed 8 books since mid October 2000!". 
This suggests that DP completed its first book in mid-Oct 2000, which then might have appeared in the list at the bottom of the mid-October "PG needs you" email[b], but I don't see any DP books there. Instead, I believe the first DP book to be posted by PG was #3059 (Homer's "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. [a] http://www.gutenberg.org/newsletter/archive/PGMonthly_2001_02_07.txt [b] http://www.gutenberg.org/newsletter/archive/Other_2000_10_18_Project_Gutenberg_needs_you.txt [c] http://www.gutenberg.org/newsletter/archive/PGMonthly_2000_12_06.txt -Michael Dyck From Bowerbird at aol.com Wed Mar 2 17:11:54 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 2 17:12:14 2005 Subject: [gutvol-d] so jon Message-ID: <319A7C97.25B22035.023039A8@aol.com> jon said: > I didn't know you issued a challenge. i sure did. :+) it came through at 4:53 pacific, on 2005/2/28. i have appended a copy for your convenience... basically, you said "i doubt it" in direct response to my claim that correct processing of the whole process -- from scanning through a few hours of post-o.c.r. work -- could result in an accuracy-rate of 1 error per 10 pages, so i challenged you to a test with your "my antonia" scans. since you don't seem to want to have the o.c.r. done, for understandable reasons, i will do it myself. by the way, i have done some extensive comparisons of the project gutenberg version of "my antonia" and yours. the more deeply i go into it, the more i become convinced most differences are due to intentional edits, and _not_ due to sloppiness in the original preparation of the work. so this appears to be exactly like the "frankenstein" case -- a simple use of a different edition as the source-text. in view of the insinuations you cast against the "accuracy" of the project gutenberg e-text, perhaps you should apologize? -bowerbird Subj: Re: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.) Date: 2/28/2005 7:53:21 PM Eastern Standard Time From: Bowerbird To: gutvol-d@lists.pglaf.org, Bowerbird jon said: > But the bigger issue is not constrained to errors (differences) > with respect to the source text used, as you continue to focus on. i think it was you who made "errors" the issue, revolving around the concept of "trustworthiness". if, once that house of cards falls down, you want to turn the issue to one of "which source-text to use", well then i think that michael's "i'm open to all of 'em" stance covers _that_ quite nicely, thank you very much. if you don't like the version of my antonia that's in the library now, add your own! the same goes for all the versions of "frankenstein". casting aspersions on the edition that _is_ there isn't constructive. provide all the meta-data you want on the version that you furnish; heck, you can even put a pointer in to your project at librarycity.org; these days i see a lot of e-texts referencing an .rtf version in france. > the PG version of Frankenstein, > which now exposes PG to legal liability. i don't agree. but if the lawyers to whom "bantam classics" is paying good money decide to send a cease-and-desist, let 'em. going by results obtained by the "gone with the wind" lawyers, the project gutenberg people will probably fold very quickly; without any money, you can't play poker against deep pockets. 
but hey, i would like to hear the laughter that would resound when bantam's lawyers argued that the way they can _prove_ that this e-text copied their book is because of the _errors_ (map-makers can pull that trick. but book-publishers? ha!) who knows, jon, maybe the project gutenberg lawyers will call _you_ to the stand, to throw your arms in the air and rant about how those terrible mistakes are ruining the fragile public domain, and therefore bantam doesn't _deserve_ the protection of the law. wouldn't that be ironic? :+) > The lack of proper processes, procedures and guidelines well, i don't agree with that either, jon. you might not agree with the procedures, but that doesn't mean there is a "lack" of them. maybe you don't agree with their choice of source-text for frankenstein. but it _was_ good enough for bantam. > is leading to serious questions about the integrity > and trustworthiness of the whole PG library not in my mind. and not in the minds of most people, i don't think. not any more so than with any paper-book i might find in a store. like the "frankenstein" version that was being _sold_ by bantam. > 1) redoing most of the non-DP works using DP, let's find out how many d.p. people want me to go over _their_ work with a fine-tooth comb. go ahead, speak up, i'd _love_ the challenge. > Well, at least you seem to indicate from > your interest in very low error rate OCR > that every etext PG includes in its archive > should be a textually faithful reproduction > of some known source. not necessarily. if someone wants to play editor and combine editions, i don't have any problem with that. in some sense, that's what the public domain is about. i don't see it in black/white terms as something frozen. if you _are_ going to represent something as faithful, i think it should _be_ faithful. but even then, that is _to_the_best_of_your_ability_. as long as you do that, and give your end-users a means of "checking your work", including a solid mechanism for improving it to perfection, then i think you've done your job. so yes, i agree with you, that scans should absolutely be furnished to the end-users, for works that purport to replicate that edition, certainly... however, i understand why they haven't been, up to this point, and so do you -- disk-space just hasn't been affordable enough, even now, if it were not for the largess of ibiblio and brewster, we couldn't even be entertaining the thought of posting the scans. > I doubt this error rate (let's say for even half of the public domain > printings out there) is accomplishable without sentient-level AI. i'm trying to get back off this listserve. i don't like contributing to the discourse in a place where my voice has been muffled before. so let me set up a place where you and i can fight... i mean, discuss... but this doubt of yours is rather easy to dispel, and quickly. you did a pretty good job of scanning that copy of "my antonia". and it looks like you processed (e.g., straightened) the scans well. so now we need to put them through o.c.r., using abbyy finereader; please have that done as follows: save results out to an .rtf file, one for each page; retaining line-breaks and paragraph indentation. do this for 20-50 pages, and zip the output up and e-mail it to me. i will reply to you with feedback on if the o.c.r. was done correctly. then i'll run it through programs that will soon be made available, at no cost, and we'll see what kind of an error-rate we end up with. 
or, if you prefer, follow this same procedure with some other book. then, if you still want to discuss this matter, we'll do it elsewhere. > But if proofreading is to be done anyway by the public, > as is *now done* by DP, what difference is there between > an OCR error of one every 10 pages, and one every page? when i talk about "the public", i mean _end-users_ who are reading the book for the purpose of reading the book, and _not_ specifically to be "proofreading" it per se. for that type of reader, one error on every page is too many, but one error on every tenth page is not. especially since -- if we give them an easy means of checking for errors and reporting them, and then reward readers for finding them -- errors won't persist for very long, and the e-text will instead progress very quickly on its merry way to a state of perfection. in a practical sense, this means that before you turn an e-text loose for download in an all-in-one file, you make it available _page-by-page_ on the web. anyone who might want to read it has to do so in that form. right alongside the text for each page is the image, so the person can easily check any possible errors. you let 'em know you are asking for their help to find mistakes. if they find one, they fill out a form right on the page, and their input is recorded -- wiki-style -- immediately. later readers can either confirm the error, or question it, or make comments. first person to find each error gets a credit in the final e-text. you also give people a viewer-program that allows them to download the appropriate page-image if they suspect an error -- displaying it right there in the viewer-app next to the text -- and which simplifies the process of reporting it if they find one. (by, for instance, filling out an e-mail they can send with a click.) > The key is that for the aspect of building *trust* in > the final product, it is a very good idea to involve > the volunteer proofreaders to go over the texts, > even if *you don't have to*. what i just described does a good job of doing that. this is the system of "continuous proofreading" i outlined on this listserve a very long time ago. you recently mistakenly credited it to james linden. my offer to develop this system was largely snubbed. for _that_, the project gutenberg "people in charge" rightly deserve to be criticized. for the tiny stuff that you have been complaining about, they do not... > Having (and proving to anyone who asks) at least > two independent people who proofed every page, > adds to its trustworthiness. not nearly as well as putting text and image side-by-side, and allowing any number of "volunteer proofreaders" to examine 'em. you might be surprised by the number of errors that "slip by" the proofreaders through two rounds of eyeballing over at d.p. (indeed, many even slip by the "third round" of post-processing and whitewashing, and sit there big and ugly in the final e-text.) even if a dozen people look at a page, an error might _still_ be there. but with eternal transparency, there is always hope it will be fixed. anyway, jon, i hope you take up the friendly challenge i issued here. and if any d.p. people want to call me on the challenge i made to them, you just let me know. in the meantime, i'll let you get in the last word on this thread, jon, because i _really_ need to be going. use it wisely... 
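No such page-by-page reporting system existed at the time; purely as an illustration of the bookkeeping the "continuous proofreading" idea above implies, here is a minimal sketch in Python. All names are hypothetical and do not correspond to any real PG or DP software:

    from dataclasses import dataclass, field

    @dataclass
    class ErrorReport:
        # one suspected error on one page of a posted e-text, recorded wiki-style
        page: int
        reported_by: str              # first reporter gets the credit
        excerpt: str                  # the text as it currently reads
        suggestion: str               # what the reader thinks it should say
        confirmations: list = field(default_factory=list)
        disputes: list = field(default_factory=list)

    reports = [ErrorReport(42, "reader1", "the the house", "the house")]
    reports[0].confirmations.append("reader2")        # a later reader confirms
    print(sorted({r.reported_by for r in reports}))   # ['reader1'] gets credited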
;+) -bowerbird From jon at noring.name Wed Mar 2 20:48:03 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 2 20:48:22 2005 Subject: [gutvol-d] so jon In-Reply-To: <319A7C97.25B22035.023039A8@aol.com> References: <319A7C97.25B22035.023039A8@aol.com> Message-ID: <6550890890.20050302214803@noring.name> Bowerbird wrote: > since you don't seem to want to have the o.c.r. done, > for understandable reasons, i will do it myself. Great! Hopefully others here will run it through their favorite OCR program and share the results with you and with gutvol-d. Please!, others, OCR the scans, which are available at: http://www.openreader.org/myantonia/ > by the way, i have done some extensive comparisons of > the project gutenberg version of "my antonia" and yours. > the more deeply i go into it, the more i become convinced > most differences are due to intentional edits, and _not_ > due to sloppiness in the original preparation of the work. How do we know? We don't know what source edition was used for PG's version of "My Antonia", but I now believe (but cannot prove until someone does the actual comparison) that the source was the "mangled" British edition, as noted below. So, the way to know for sure is to secure a copy of that "mangled" British edition and do the comparison. (Which I won't do because it is futile because the British edition is itself unacceptable.) > so this appears to be exactly like the "frankenstein" case > -- a simple use of a different edition as the source-text. Yes, and this is why I called the PG version of "My Antonia" "mangled", because it is -- it is based on a mangled British edition which Willa Cather herself was very unhappy about regarding the sloppy editing and printing. She was very "painstaking" with regards to her books -- more than the average author (and she had the status to dictate the editing and typography of her books to her publisher -- most lesser authors didn't have this luxury.) Again, my focus on the problems with the PG collection go beyond the error rates from some source -- it goes to the general aspects of trust and using the proper (acceptable) editions as source, to properly identify the source, and to provide means for easier verification the etext faithfully conforms to the source (primarily making the scans available, which is now possible -- I agree with you things were tougher a few years ago vis-a-vis providing page scans online.) For example, if NetWorker's analysis is correct (posted to The eBook Community), it now appears that the edition used for PG's version of "Frankenstein" is based on a 1981 Bantam Classics Edition, which did significant editing of the text (in essence, creating a convenient "fingerprint"), and which NetWorker (who was an attorney at one time, I believe) surmises may border on a copyright infringement (and not just a "sweat of the brow" sort of thing.) Hopefully Bantam will not catch wind of this -- but if they do, they probably won't do anything anyway. Nevertheless, one wonders how many other earlier PG texts, where there's no source information given, were derived from post-1923 emended editions? Could those ebook publishers who today use PG texts be potentially liable because of the lack of source information and a means to verify provenance? 
Even if the title page of a Work was photocopied and sent to PG for copyright clearance, how do we know that the person did not then use an easy-to-obtain and available modern edition for the actual scanning -- and simply photocopied the title page from a non-circulating, non-scannable copy of the rarer original edition? I believe most of those individuals who submitted etexts to PG's collection did it faithfully and followed common sense rules and expectations with regards to sources ---> But *how do we know*, and *how can we know*? We can't -- there's no mechanism to verify these things. This is where having the full source information, and having all the page scans of the source and making them available, builds trust in (and protects from copyright infringement claims) the particular etext and the associated collection it belongs to. It is also the morally right thing to do. > in view of the insinuations you cast against the "accuracy" > of the project gutenberg e-text, perhaps you should apologize? Why? The differences in the PG edition of "My Antonia" likely came from a mangled British edition which Willa Cather apparently was upset about. These changes are, in essence, errors. In addition, we have no idea as to what emendments may have been made to the first and subsequent PG etext editions since (until possibly now) we didn't know what edition was used as the original source! You certainly don't have access to the edition used to generate the PG edition of "My Antonia", do you? If not, then *how do you know* it is accurate to some original source edition? We can't talk about what is an error and what is not an error when we don't have the source information, and better yet page scans to immediately verify. That's why Michael Hart's interest in "correcting" the errors in the non-DP portion of the PG corpus is beyond futile and will not build trust in the collection -- how can one reliably correct an etext when the original source is not known/available to consult with? It's ludicrous, and a complete waste of time. It's better to redo the etexts via DP where the source info is recorded and page scans are (hopefully) available, as well as having the proofing done by a number of independent proofers, rather than just one person. Multiple, independent proofers adds trust to the process, in addition to having the source info and scans available. After all, intentional misspellings are common in many books (e.g., "My Antonia", Mark Twain's books, etc. -- and many pre-19th century books use variant spellings since rigorous spelling was not then an established norm) so how does one know if an "error" is really an error? And there are errors which cannot be caught by simple reading or even programs, such as missing (or added) accented characters, wrong punctuation (such as replacing an em-dash with a colon), and wrong paragraph breaks. (Most of which we see in "My Antonia".) Many of these "not discernable" errors can sometimes tweak the meaning of the etexts. We owe readers, even the casual readers, an excellent product with full disclosure. For example, the poll I'm conducting on this topic at The eBook Community indicates (but not proves -- consider this a preliminary assessment) that a significant percentage of those who read public domain digital texts *prefer* (note carefully this word) the texts they use to come from acceptable, known editions, and be faithful renditions of those editions. This only makes common sense. 
To dismiss this is essentially saying that the vast majority of people don't give a damn about whether the public domain texts they spend hours and hours of their valuable time reading are reasonably faithful to the original. Does anyone want to make the claim that the vast majority of people (99% as it seems like PG's online info says) don't care one whit? And trying to prove that claim by pointing to the large number of people using PG texts is not proof since I believe most people have innocent blind faith that PG did things correctly. Furthermore, anyone doing a major effort in delivering the public domain to the public has a moral responsibility to do it correctly and to state in sufficient detail the provenance and any edits of the texts. If it is a heavily emended text, then it should be specified to the public with sufficient detail *in that etext, not elsewhere* so the reader *knows* a text they are reading has been emended (one doesn't have to list the edits item by item, but it should be made clear the text has been substantially edited and to give a general overview of the types of edits done.) I've explained this on TeBC in more detail. This is a *responsibility*, which places restrictions on how PG and similar groups should conduct themselves. This is a serious endeavor: digitally transferring and preserving the public domain. This is not child's play. It is true that the Public Domain exists for anyone to do anything with it as they see fit, but like any freedom, there are associated responsibilities. Full disclosure is one of them, and is a common sense responsibility. Trying to be faithful in transcribing texts is another one when no disclaimers are given in the texts themselves since people assume the texts they are reading are reasonably faithful to the original. Jon From sly at victoria.tc.ca Wed Mar 2 21:08:10 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Mar 2 21:08:25 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <42261F44.1000005@perathoner.de> References: <42261F44.1000005@perathoner.de> Message-ID: On Wed, 2 Mar 2005, Marcello Perathoner wrote: > We are ready to migrate the web site to the new fast file server. > > Also some slight changes were made to the online catalog to make it > better cacheable: > I have an issue with the way that following an author name from a bibrec page leads to an anchor in an "author by first letter of last name" page. To me, this does not look like a long-term solution. It could work for a while, but as the collection continues to grow, these files will inevitably get too large to be easily useful for general browsing. Take a look at the New General Catalog of Old Books and Authors where Phillip has begun to break some of the files of author records into smaller sub-groupings. We could certainly do something like that here as well, but that would create extra work to identify which files are largest, and what the best way to split them up would be. Andrew From gbdavis at harborside.com Wed Mar 2 22:48:19 2005 From: gbdavis at harborside.com (George Davis) Date: Wed Mar 2 22:48:41 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <20050303011217.817C08C8EC@pglaf.org> References: <20050303011217.817C08C8EC@pglaf.org> Message-ID: <4226B333.3070908@harborside.com> Michael Dyck wrote: > Subject: > [gutvol-d] DP anniversary? 
> From: > Michael Dyck > Date: > Wed, 02 Mar 2005 16:43:26 -0800 > To: > gutvol-d > > To: > gutvol-d > > > In today's PG Weekly newsletter, and in a posting to the Book People > mailing list, Michael Hart says: > "This is the 4th Anniversary of The Distributed Proofreaders!!!" > > However, if you go to the DP site , you'll see > that it says that DP was founded in 2000. Moreover, Charles Franks > posted to the gutvol-d list on April 20, 2000, saying (in part) "I have > completed the working beta of a distributed proofreaders website." and > giving a link. I'm not sure if that was the first public announcement > of DP, but in any case, DP is about 5 years old. > > Michael Hart appears to be referring to the 4th anniversary of > March 13th, 2001, which is when the PG Weekly newsletter says DP > completed its first book (PG #3320). However, that book ("Mohammed > Ali and His House" by Louise Muhlbach) was actually posted April 2nd, > 2001, so it's unclear where the March 13 date comes from. > > Moreover, a month or so *before* then, the PG newsletter for February > 2001[a] says "The Online Distributed Proofreading Team has completed 8 > books since mid October 2000!". This suggests that DP completed its first > book in mid-Oct 2000, which then might have appeared in the list at the > bottom of the mid-October "PG needs you" email[b], but I don't see any DP > books there. Checking the "Completed Gold E-texts" page, in ascending order by submission date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: 1) "Mohammed Ali and His House", L. Muhlbach () Uploaded: Tuesday, March 13th, 2001 The link for that etext is to #3320. The above was relied upon in arriving at the March 13th date. Which has run in the newsletter since June 16, 2004, with no notice of correction. > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. Are in you for a surprise: DP didn't do that one. From the etext: "This etext was prepared by Sandra Stewart and Jim Tinsley " #3320 has a credit to C.F. and D.P. > [a] http://www.gutenberg.org/newsletter/archive/PGMonthly_2001_02_07.txt > [b] http://www.gutenberg.org/newsletter/archive/Other_2000_10_18_Project_Gutenberg_needs_you.txt > [c] http://www.gutenberg.org/newsletter/archive/PGMonthly_2000_12_06.txt > > -Michael Dyck Hopefully, someone from D.P. will step up and provide more meaningful activity updates for inclusion in the newsletter. People outside of DP would be interested in seeing what's going on over there, and why not include such in the weekly PG newsletter? And, boy, I'd like to be a fly on the wall in Jim's office when he reads this! [eorge] -- No virus found in this outgoing message. Checked by AVG Anti-Virus. 
Version: 7.0.300 / Virus Database: 266.5.5 - Release Date: 3/1/2005 From sly at victoria.tc.ca Wed Mar 2 23:05:22 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Mar 2 23:05:39 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <4226B333.3070908@harborside.com> References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> Message-ID: It may not be relevant, but to see a bit of history, the old PG volunteer web board is still in place: http://promo.net/pg/vol/wwwboard/index.html Here are two particular messages that mention DP: http://promo.net/pg/vol/wwwboard/messages/1063.html http://promo.net/pg/vol/wwwboard/messages/1557.html Andrew On Wed, 2 Mar 2005, George Davis wrote: > > Checking the "Completed Gold E-texts" page, in ascending order by submission > date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: > > 1) "Mohammed Ali and His House", L. Muhlbach () > Uploaded: Tuesday, March 13th, 2001 > > The link for that etext is to #3320. > > The above was relied upon in arriving at the March 13th date. Which has run in > the newsletter since June 16, 2004, with no notice of correction. > > > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's > > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. > > Are in you for a surprise: DP didn't do that one. From the etext: > > "This etext was prepared by Sandra Stewart > and Jim Tinsley " > > #3320 has a credit to C.F. and D.P. > > > [a] http://www.gutenberg.org/newsletter/archive/PGMonthly_2001_02_07.txt > > [b] http://www.gutenberg.org/newsletter/archive/Other_2000_10_18_Project_Gutenberg_needs_you.txt > > [c] http://www.gutenberg.org/newsletter/archive/PGMonthly_2000_12_06.txt > > > > -Michael Dyck > > Hopefully, someone from D.P. will step up and provide more meaningful activity > updates for inclusion in the newsletter. People outside of DP would be > interested in seeing what's going on over there, and why not include such in the > weekly PG newsletter? > From jmdyck at ibiblio.org Thu Mar 3 02:11:35 2005 From: jmdyck at ibiblio.org (Michael Dyck) Date: Thu Mar 3 02:17:52 2005 Subject: [gutvol-d] re: DP Anniversary References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> Message-ID: <4226E2D7.B6C16F40@ibiblio.org> George Davis wrote: > > Checking the "Completed Gold E-texts" page, in ascending order by submission > date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: > > 1) "Mohammed Ali and His House", L. Muhlbach () > Uploaded: Tuesday, March 13th, 2001 Ah, so it does. Mind you, it also says that books 2 through 160 were uploaded on January 1st, 2002, which is pretty implausible. The bottom line is, don't trust the dates on that page. (A little thing affects them. A slight disorder of the projects table makes them cheats.) > The above was relied upon in arriving at the March 13th date. Which has > run in the newsletter since June 16, 2004, with no notice of correction. Until now. > > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's > > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. > > Are in you for a surprise: DP didn't do that one. From the etext: > > "This etext was prepared by Sandra Stewart > and Jim Tinsley " Sorry, no surprise -- I read that attribution before I posted my earlier message. The lack of mention of DP didn't convince me that the text hadn't gone through DP. We'll see what Jim says. > Hopefully, someone from D.P. 
will step up and provide more meaningful > activity updates for inclusion in the newsletter. Perhaps someone will. I'm not sure I share your hope though. > People outside of DP would be interested in seeing what's going on > over there, and why not include such in the weekly PG newsletter? If people want to know what's going on, they're welcome to visit the DP website and see for themselves. (They may need to register -- it depends what they want to see.) -Michael Dyck From jtinsley at pobox.com Thu Mar 3 04:39:53 2005 From: jtinsley at pobox.com (Jim Tinsley) Date: Thu Mar 3 04:40:19 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <4226E2D7.B6C16F40@ibiblio.org> References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> <4226E2D7.B6C16F40@ibiblio.org> Message-ID: <20050303123953.GA17119@panix.com> On Thu, Mar 03, 2005 at 02:11:35AM -0800, Michael Dyck wrote: >George Davis wrote: >> >> Checking the "Completed Gold E-texts" page, in ascending order by submission >> date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: >> >> 1) "Mohammed Ali and His House", L. Muhlbach () >> Uploaded: Tuesday, March 13th, 2001 > >Ah, so it does. Mind you, it also says that books 2 through 160 were >uploaded on January 1st, 2002, which is pretty implausible. The bottom >line is, don't trust the dates on that page. (A little thing affects >them. A slight disorder of the projects table makes them cheats.) > >> The above was relied upon in arriving at the March 13th date. Which has >> run in the newsletter since June 16, 2004, with no notice of correction. > >Until now. > >> > Instead, I believe the first DP book to be posted by PG was #3059 (Homer's >> > "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. >> >> Are in you for a surprise: DP didn't do that one. From the etext: >> >> "This etext was prepared by Sandra Stewart >> and Jim Tinsley " > >Sorry, no surprise -- I read that attribution before I posted my earlier >message. The lack of mention of DP didn't convince me that the text >hadn't gone through DP. We'll see what Jim says. > Jim has already said more than plenty on the DP Forums when the question came up there. I ransacked my old e-mails, and you can see the whole thread at http://www.pgdp.net/phpBB2/viewtopic.php?t=5726 The Lang Iliad was an unusual case. Sandra and I had the same translation, but in different printings -- I had very small pages, a kind of pocket book; she had normal sized ones. She was typing from the _end_ of the book backwards by chapter; I was scanning and OCRing from the start forward. We were going to meet in the middle. Charlz' site came up, and, IIRC, I fed it the middle bit that neither of us had covered yet. It was the first text submitted, and there was no concept at the time of a credit for the site itself or the page-proofers thereat. Which, I suspect, bothered me, because I added one in the Pope Odyssey, which was next on my list. I can't find an e-mail from those days discussing credit lines for the site, but the first three posted books were: Lang Iliad: No mention of DP Pope Odyssey: This etext was prepared by Jim Tinsley with much help from the proofers at http://charlz.dynip.com/gutenberg Irish Race: This etext was produced by Charles Franks and the Distributed Proofreaders Team. and Charlz' formula is the one, more or less, that has been used since. >> Hopefully, someone from D.P. will step up and provide more meaningful >> activity updates for inclusion in the newsletter. 
> >Perhaps someone will. I'm not sure I share your hope though. > >> People outside of DP would be interested in seeing what's going on >> over there, and why not include such in the weekly PG newsletter? > >If people want to know what's going on, they're welcome to visit the DP >website and see for themselves. (They may need to >register -- it depends what they want to see.) > >-Michael Dyck > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d From hart at pglaf.org Thu Mar 3 09:31:08 2005 From: hart at pglaf.org (Michael Hart) Date: Thu Mar 3 09:31:10 2005 Subject: [gutvol-d] re: DP Anniversary In-Reply-To: <20050303123953.GA17119@panix.com> References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> <4226E2D7.B6C16F40@ibiblio.org> <20050303123953.GA17119@panix.com> Message-ID: I'll be glad to put revised dates in the Newsletter if/when and "official" date is picked, along with any other items that should be included. Thanks! Michael On Thu, 3 Mar 2005, Jim Tinsley wrote: > On Thu, Mar 03, 2005 at 02:11:35AM -0800, Michael Dyck wrote: >> George Davis wrote: >>> >>> Checking the "Completed Gold E-texts" page, in ascending order by submission >>> date (http://www.pgdp.net/c/list_etexts.php?x=g&sort=4), it shows: >>> >>> 1) "Mohammed Ali and His House", L. Muhlbach () >>> Uploaded: Tuesday, March 13th, 2001 >> >> Ah, so it does. Mind you, it also says that books 2 through 160 were >> uploaded on January 1st, 2002, which is pretty implausible. The bottom >> line is, don't trust the dates on that page. (A little thing affects >> them. A slight disorder of the projects table makes them cheats.) >> >>> The above was relied upon in arriving at the March 13th date. Which has >>> run in the newsletter since June 16, 2004, with no notice of correction. >> >> Until now. >> >>>> Instead, I believe the first DP book to be posted by PG was #3059 (Homer's >>>> "The Iliad", trans. Andrew Lang) which was posted by December 6, 2000[c]. >>> >>> Are in you for a surprise: DP didn't do that one. From the etext: >>> >>> "This etext was prepared by Sandra Stewart >>> and Jim Tinsley " >> >> Sorry, no surprise -- I read that attribution before I posted my earlier >> message. The lack of mention of DP didn't convince me that the text >> hadn't gone through DP. We'll see what Jim says. >> > > Jim has already said more than plenty on the DP Forums when the question > came up there. I ransacked my old e-mails, and you can see the whole > thread at > > http://www.pgdp.net/phpBB2/viewtopic.php?t=5726 > > The Lang Iliad was an unusual case. Sandra and I had the same translation, > but in different printings -- I had very small pages, a kind of pocket > book; she had normal sized ones. She was typing from the _end_ of the > book backwards by chapter; I was scanning and OCRing from the start > forward. We were going to meet in the middle. Charlz' site came up, and, > IIRC, I fed it the middle bit that neither of us had covered yet. > > It was the first text submitted, and there was no concept at the > time of a credit for the site itself or the page-proofers thereat. > Which, I suspect, bothered me, because I added one in the Pope > Odyssey, which was next on my list. 
> > I can't find an e-mail from those days discussing credit lines for > the site, but the first three posted books were: > > Lang Iliad: No mention of DP > > Pope Odyssey: This etext was prepared by Jim Tinsley > with much help from the proofers at http://charlz.dynip.com/gutenberg > > Irish Race: This etext was produced by Charles Franks and the > Distributed Proofreaders Team. > > and Charlz' formula is the one, more or less, that has been used since. > > >>> Hopefully, someone from D.P. will step up and provide more meaningful >>> activity updates for inclusion in the newsletter. >> >> Perhaps someone will. I'm not sure I share your hope though. >> >>> People outside of DP would be interested in seeing what's going on >>> over there, and why not include such in the weekly PG newsletter? >> >> If people want to know what's going on, they're welcome to visit the DP >> website and see for themselves. (They may need to >> register -- it depends what they want to see.) >> >> -Michael Dyck >> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d@lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Thu Mar 3 09:46:29 2005 From: hart at pglaf.org (Michael Hart) Date: Thu Mar 3 09:46:30 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: > On Wed, 2 Mar 2005, Marcello Perathoner wrote: > >> We are ready to migrate the web site to the new fast file server. >> >> Also some slight changes were made to the online catalog to make it >> better cacheable: >> I got an email from one person who suggested that how to volunteer should be listed up with the donation finromation in addition to where it is in the "In Depth" section [marked <<< below]. Apparently some people don't read "In Depth" until they are already involved, and this person just wanted to know how volunteer. + Donate. How to make a donation to Project Gutenberg. + News and Events. The news. + Contacts. How to get in touch. + Partners, Affiliates and Resources. A collection of links. + Credits. Thanks to our most prominent volunteers. * In Depth Information. All you ever wanted to know about Project <<< Gutenberg. + Volunteering. How you can help Project Gutenberg. <<< From hart at pglaf.org Thu Mar 3 09:47:41 2005 From: hart at pglaf.org (Michael Hart) Date: Thu Mar 3 09:47:42 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: I suppose while these updates are going on, we should also update 13,000 to 15,000 in the opening: Project Gutenberg is the oldest producer of free electronic books (eBooks or etexts) on the Internet. Our collection of more than 13.000 <<< eBooks was produced by hundreds of volunteers. 
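That opening figure could, in principle, be generated from the catalog database at page-build time instead of being edited by hand. The sketch below is only an illustration of that idea, assuming a PostgreSQL catalog with a hypothetical "books" table; the real gutenberg.org schema is not shown here and may differ.

    # Hedged sketch: regenerate the "more than N eBooks" blurb from the catalog.
    # The database name, user, and "books" table are assumptions for illustration.
    import psycopg2

    conn = psycopg2.connect(dbname="gutenberg", user="pgweb")
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM books")
    (total,) = cur.fetchone()
    cur.close()
    conn.close()

    rounded = (total // 1000) * 1000   # "more than 15,000" reads better than "15,231"
    print("Project Gutenberg is the oldest producer of free electronic books "
          "(eBooks or etexts) on the Internet. Our collection of more than "
          "{:,} eBooks was produced by hundreds of volunteers.".format(rounded))

As Marcello points out further down, the harder question is what should count as one ebook, which no query by itself settles.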
From brandon at corruptedtruth.com Thu Mar 3 09:53:50 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Thu Mar 3 09:53:59 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: <42274F2E.8010000@corruptedtruth.com> It's too bad we can't make that dynamic, feeding off of a database =) -brandon Michael Hart wrote: > > I suppose while these updates are going on, we should also update > 13,000 to 15,000 in the opening: > > Project Gutenberg is the oldest producer of free electronic books > (eBooks or etexts) on the Internet. Our collection of more than > 13.000 <<< > eBooks was produced by hundreds of volunteers. > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > >
From kouhia at nic.funet.fi Thu Mar 3 10:39:39 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Thu Mar 3 10:39:53 2005 Subject: [gutvol-d] Re: Enlightened Self Interest Message-ID: >From: Jon Noring > >Btw, if anyone here has made, and plans to make, 600 dpi (optical) >greyscale or color scans of any public domain books including the book >covers (and this includes books printed between 1923 and 1963 which >may be public domain), I'll gladly accept donations of them on CD-ROM >and DVD-ROM. I have scanned about 3400 pages of math text, at 300 dpi only, as that looked good enough. The images are on CDs, and my CD-ROM drive has been broken for three months already. Four of the books are journal volumes (600 pages per book) of Mathematische Annalen. Random pages of them were also scanned at 600 dpi because I wanted to extract all the fonts. Unfortunately, I decided to wait for a digital camera, because one would be needed for good-quality fonts (600 and 1200 dpi on a scanner look blurry compared to a camera). Unfortunately so, because our library sold two of the four books as part of a standard book-clearing procedure. In the four books there were rare letters which appeared on only one page of one book. If my CDs lose their data, I cannot rescan them. The images are in zip files, which are themselves very fragile. E.g., removing two bytes from the end, or removing the TOC of the zip, makes the whole archive unusable. And libraries want more taxpayers' money! For what? Why should libraries which destroy books be supported at all? Better to give the money to institutions that preserve history. Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software
From Bowerbird at aol.com Thu Mar 3 11:39:49 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 3 11:40:13 2005 Subject: [gutvol-d] new thread for noring Message-ID: well, jon, i'd have thought you could have used "the last word" in that thread a bit more wisely. because i believe that you ain't gonna have a leg to stand on once my results come in... but that won't be until next week, so please enjoy your brief reprieve... :+) before i get too deep into the o.c.r./correction process for "my antonia", though, i'd like to know how much time you spent, jon, on (a) scanning and (b) image-manipulation. because my general working rule-of-thumb will be that people should spend _less_ time on the post-o.c.r. steps than they did on the scanning and image-manipulation steps. now, until i get all my procedures hitting on all cylinders, that might be a pipe-dream, but that's my rule-of-thumb...
i'd estimate that you spent at least 4 hours on the project, jon. (probably more, since you were still learning the curve, but if you had to repeat the whole thing, you could do it in 4.) that's for the scanning _as_well_as_ the image-manipulation. if i'm badly wrong, in either direction, do please let me know. otherwise, i will give myself a time-limit of 4 hours on this, and we'll see what i can come up with... and jon, please allow me to say a few nice things to you... ;+) first of all, you did a bang-up job on the "my antonia" scans. even though the world doesn't really have a place yet for high-resolution scans like these, it's very good to do them. you can always downsample to lower-resolution, if need be. i understand why many places aren't yet doing high-resolution -- like internet archive, distributed proofreaders, and google -- and i absolutely do _not_ fault them for the practical decision. at the same time, though, i applaud people doing high-resolution. it's not as if what you've done is unprecedented. bennett kobb, for instance, has high-res scans of _nearly_one_hundred_books_, (http://fax.libs.uga.edu) making your single one pale in comparison. (his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html) but nonetheless, your quality output is rare enough to merit applause. second, the image-manipulation you did on the scans is first-rate, as far as i can tell from cursory examination. the scans look great! they are straight! and their positioning is standardized very well! (these last two factors are _very_ important in getting good o.c.r.) there is no question in my mind that we'll get good o.c.r. out of 'em. third, you used a reasonable naming-scheme for your image-files! the scan for page 3, for instance, is named 003.png! fantastic! and when you had a blank page, your image-file says "blank page"! please pardon me for making a big deal out of something so trivial -- and i'm sure some lurkers wrongly think i'm being sarcastic -- but most people have no idea how uncommon this common sense is! when you're working with hundreds of files, it _really_ helps you if you _know_ that 183.png is the image of page 183. immensely. even the people over at distributed proofreaders, in spite of their immense experience, haven't learned this first-grade lesson yet. (well, a few of 'em have, and won't go back to that stupidity, but an amazing number of others will even _argue_ with you about it!) what this means, for those of you reading along at home, is that when you scan, start scanning at page 1. (and if the text starts on page 3, like "my antonia" did, then start 2 pages before that.) scan the blank pages. if there are picture "plates" in the book or other unnumbered pages, _skip_'em_, so numbers stay in sync; then do them later, at the _end_ of the regular numbered pages. that's also when you'll do the cover, and all of the front-matter. (this includes a forward, preface, anything with roman numerals.) fourth, jon, you scanned the headers and footers! again, bravo! some people don't, when they scan, and that is a big mistake. let the post-o.c.r. processing software eliminate them later. for now, they are worthwhile to keep in your master images; also later, if you view the images as a book, they're a nice touch. they aren't really necessary, in most cases, but why delete 'em? fifth, your dedication in driving the text to perfection is exemplary. you put together a team of a half-dozen people dedicated to the task, and it shows. 
while i don't think this approach can scale very well -- your team might well burn itself out after doing a couple books, while page-a-day people at distributed proofreaders go on and on, and an even better approach is to turn readers into proofreaders -- i do think that, as a special effort, what you've done is admirable. drawing attention to the importance of error-free e-texts is great. and setting a positive example, as you've done with your own file, is far superior to the vacuous criticism you make against p.g. files. you've put your time and energy where your mouth is, and i approve. sixth, i understand that you are motivated by good intentions, and i respect your courage in standing up for them while some people (including myself) are kicking you in the teeth, because we disagree. (and _their_ intentions and motivations are just as good as yours.) in case you haven't noticed, i have the exact same type of fortitude, and whenever i see it in other people, i hold it in very high esteem. seventh, i can't think of anything else, but i like to have 7 points, rather than 6, and i'm sure i'll think of the other when i hit "send". anyway, i hope i haven't embarrassed you, saying nice things and all... *** oh yeah, one more thing, just so nobody else wastes any time: jon suggested that people with a range of o.c.r. packages could run it on his scans. i do not think that's necessary, not at all. there's a ton of o.c.r. expertise here, all pointing the same: abbyy finereader v7.x is superior to any other o.c.r. program. combined with proper post-o.c.r. processing, its recognition gives a level of accuracy that is as good as can be expected. until other o.c.r. programs can deliver to us near-perfection, or results equivalent to abbyy's for free, they waste our time. *** anyway, off i go. i'll let you know when i have some results... :+) -bowerbird From jmdyck at ibiblio.org Thu Mar 3 12:12:25 2005 From: jmdyck at ibiblio.org (Michael Dyck) Date: Thu Mar 3 12:13:39 2005 Subject: [gutvol-d] re: DP Anniversary References: <20050303011217.817C08C8EC@pglaf.org> <4226B333.3070908@harborside.com> <4226E2D7.B6C16F40@ibiblio.org> <20050303123953.GA17119@panix.com> Message-ID: <42276FA9.86E8BB7C@ibiblio.org> Jim Tinsley wrote: > > Jim has already said more than plenty on the DP Forums when the question > came up there. I ransacked my old e-mails, and you can see the whole > thread at > > http://www.pgdp.net/phpBB2/viewtopic.php?t=5726 Thanks for that link -- lots of good information there. I must have missed it when it happened originally (probably because I was deep into copyright renewals at the time). -Michael From marcello at perathoner.de Thu Mar 3 09:03:37 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu Mar 3 13:29:55 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: <42274369.7020102@perathoner.de> Andrew Sly wrote: > I have an issue with the way that following an author name from a > bibrec page leads to an anchor in a "author by first letter of last name" > page. > > To me, this does not look like a long-term solution. It could work > for a while, but as the collections continues to grow, these files > will inevitably get too large to be easily useful for general browsing. The old author pages had the problem that they were too many to generate statically (5000+) and very database-intensive to generate on-the-fly. 
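The per-letter pages Marcello describes below amount to paying the database cost once per rebuild instead of once per robot visit. A minimal sketch of that batch step, with made-up rows and file layout (the real generator works from the live catalog and uses a list of regexes):

    # Sketch only: group author rows into one static page per initial letter,
    # so each bibrec page can link to "<letter-page>#a<author_id>".
    from collections import defaultdict

    rows = [                        # stand-in for one pass over the authors table
        (53, "Austen, Jane"),
        (125, "Balzac, Honore de"),
        (408, "Bronte, Charlotte"),
        (35, "Verne, Jules"),
    ]

    pages = defaultdict(list)
    for author_id, name in rows:
        initial = name[:1].upper()
        pages[initial if initial.isalpha() else "other"].append((author_id, name))

    for letter, authors in sorted(pages.items()):
        with open(letter.lower() + ".html", "w") as out:
            for author_id, name in authors:
                out.write('<p id="a%d">%s</p>\n' % (author_id, name))

If one letter's page grows too large, the same grouping key can simply be widened to the first two letters for that letter only.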
We have a fair share of obnoxious robots visiting us (kids on a dsl line that want to grab everything and don't respect robots.txt) and every such visit costs us 5000+ heavy database hits. (The bibrec pages are much lighter on database resources.) I'll try this way to see how it performs. The script uses a list of regexes to fill the pages with authors. If the "B" page (currently 219 KB) gets too big, we'll split it into "BA" and "BM". Also, modern browser will request compression, so the 219 KB page will boil down to a ~50 KB transmission. Many other web sites use images that big. Here are the actual sizes. -rw-r--r-- 1 marcello pgweb 120171 Mar 2 15:47 a.php -rw-r--r-- 1 marcello pgweb 219237 Mar 2 15:47 b.php -rw-r--r-- 1 marcello pgweb 168136 Mar 2 15:48 c.php -rw-r--r-- 1 marcello pgweb 124726 Mar 2 15:48 d.php -rw-r--r-- 1 marcello pgweb 54900 Mar 2 15:48 e.php -rw-r--r-- 1 marcello pgweb 68002 Mar 2 15:49 f.php -rw-r--r-- 1 marcello pgweb 93415 Mar 2 15:49 g.php -rw-r--r-- 1 marcello pgweb 182640 Mar 2 15:50 h.php -rw-r--r-- 1 marcello pgweb 17617 Mar 2 15:50 i.php -rw-r--r-- 1 marcello pgweb 61671 Mar 2 15:50 j.php -rw-r--r-- 1 marcello pgweb 52031 Mar 2 15:50 k.php -rw-r--r-- 1 marcello pgweb 132947 Mar 2 15:50 l.php -rw-r--r-- 1 marcello pgweb 184111 Mar 2 2005 m.php -rw-r--r-- 1 marcello pgweb 29596 Mar 2 2005 n.php -rw-r--r-- 1 marcello pgweb 38429 Mar 2 2005 o.php -rw-r--r-- 1 marcello pgweb 9530 Mar 2 03:20 other.php -rw-r--r-- 1 marcello pgweb 110174 Mar 2 03:19 p.php -rw-r--r-- 1 marcello pgweb 11253 Mar 2 03:19 q.php -rw-r--r-- 1 marcello pgweb 85506 Mar 2 03:19 r.php -rw-r--r-- 1 marcello pgweb 195736 Mar 2 03:19 s.php -rw-r--r-- 1 marcello pgweb 88693 Mar 2 03:19 t.php -rw-r--r-- 1 marcello pgweb 29340 Mar 2 03:20 u.php -rw-r--r-- 1 marcello pgweb 148515 Mar 2 03:20 v.php -rw-r--r-- 1 marcello pgweb 139151 Mar 2 03:20 w.php -rw-r--r-- 1 marcello pgweb 7759 Mar 2 03:20 x.php -rw-r--r-- 1 marcello pgweb 18127 Mar 2 03:20 y.php -rw-r--r-- 1 marcello pgweb 15734 Mar 2 03:20 z.php -- Marcello Perathoner webmaster@gutenberg.org From jon at noring.name Thu Mar 3 14:11:28 2005 From: jon at noring.name (Jon Noring) Date: Thu Mar 3 14:11:42 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: References: Message-ID: <3027651234.20050303151128@noring.name> Bowerbird wrote: > well, jon, i'd have thought you could have used > "the last word" in that thread a bit more wisely. laugh. > because i believe that you ain't gonna have > a leg to stand on once my results come in... Well, I hope you get an error rate that is one per ten pages for the "My Antonia" scans. And even if you do, I still believe a DP-like process is necessary to catch errors that OCR can't handle, and for someone to properly assemble the pages, structure the document, etc., after the OCRing/proofing is complete. I don't quite put the same level of faith in OCR as you seem to. Btw, I believe as you do that an error reporting system is a good idea so readers may submit errors they find in the texts they use -- sort of an ongoing post-DP proofing process. Obviously, it is necessary to make available the page scans of the source document to aid in this process. How can an error be properly verified and corrected when the source work is not available? > i'd estimate that you spent at least 4 hours on the project, > jon. (probably more, since you were still learning the curve, > but if you had to repeat the whole thing, you could do it in 4.) > that's for the scanning _as_well_as_ the image-manipulation. 
> if i'm badly wrong, in either direction, do please let me know. > otherwise, i will give myself a time-limit of 4 hours on this, > and we'll see what i can come up with... Scanning took quite a while (much more than four hours) since all I have at the moment is a flat bed scanner (an el cheapo and slow Microtek ScanMaker X6EL to be exact), so I had to hand place each page on the flat bed. Of course, 600 dpi optical resolution increases the per page scanning time (4 times as many pixels to capture, which slows everything down.) It would have gone a *lot faster* had I used a high-quality sheet feed scanner since I took apart the book to free the pages so as to get high quality, flat scans. Someday... > first of all, you did a bang-up job on the "my antonia" scans. Thanks! > even though the world doesn't really have a place yet for > high-resolution scans like these, it's very good to do them. > you can always downsample to lower-resolution, if need be. Exactly. It is my vision for Distributed Scanners that it should achieve at least this quality. > i understand why many places aren't yet doing high-resolution > -- like internet archive, distributed proofreaders, and google -- > and i absolutely do _not_ fault them for the practical decision. > at the same time, though, i applaud people doing high-resolution. > it's not as if what you've done is unprecedented. bennett kobb, > for instance, has high-res scans of _nearly_one_hundred_books_, > (http://fax.libs.uga.edu) making your single one pale in comparison. > (his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html) > but nonetheless, your quality output is rare enough to merit applause. Funny that I forgot about the UGA work. Quite an interesting and eclectic list of mostly 19th century works. Will need to contact Bennett one of these days. > third, you used a reasonable naming-scheme for your image-files! > the scan for page 3, for instance, is named 003.png! fantastic! > and when you had a blank page, your image-file says "blank page"! > please pardon me for making a big deal out of something so trivial > -- and i'm sure some lurkers wrongly think i'm being sarcastic -- > but most people have no idea how uncommon this common sense is!... Yes, I deemed it important for processing purposes that the name of the image contain semantic information of what it represents, and that naming be consistent for file sorting purposes. As an aside, it is interesting that in my copy of "My Antonia", which is a first edition, the Introduction starts on page 3. There is no page 1 and 2 -- at all. I carefully took the book apart (cutting the sewing) before scanning and proved by this process (plus referring to other info) that pages 1 and 2 never existed. The publisher simply chose to start at page 3. Was this common? (Hmmm, I probably need to take a trip to Utah University's library to check their first edition copy of My Antonia to make sure that there wasn't an inserted page, maybe of an illustration -- but the UNL online Cather edition shows nothing. Maybe there was an intent to insert a page there, which after typesetting it was decided not to.) > fourth, jon, you scanned the headers and footers! again, bravo! > some people don't, when they scan, and that is a big mistake. > let the post-o.c.r. processing software eliminate them later. > for now, they are worthwhile to keep in your master images; > also later, if you view the images as a book, they're a nice touch. > they aren't really necessary, in most cases, but why delete 'em? 
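For anyone wondering what "let the post-o.c.r. processing software eliminate them later" can look like in practice: running heads repeat almost verbatim at the top of most pages, so a simple pass over the per-page OCR output can spot and report them. The file layout and the threshold below are illustrative guesses, not anyone's actual tool.

    # Sketch: flag probable running heads in per-page OCR text files.
    import glob
    from collections import Counter

    def normalize(line):
        # Drop digits so "MY ANTONIA 118" and "MY ANTONIA 120" compare equal.
        return "".join(ch for ch in line if not ch.isdigit()).strip()

    first_lines = {}
    for path in sorted(glob.glob("ocr/p*.txt")):      # assumed layout
        with open(path, encoding="utf-8") as f:
            lines = [ln.strip() for ln in f if ln.strip()]
        if lines:
            first_lines[path] = lines[0]

    counts = Counter(normalize(line) for line in first_lines.values())

    for path, line in sorted(first_lines.items()):
        if counts[normalize(line)] >= 10:             # threshold is a guess
            print("probable running head in %s: %r" % (path, line))

The same idea, applied to the last line of each page, catches printed footers.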
It was my intent to reproduce each page for direct reading purposes -- that is, if somebody wanted to read the book as it was printed, then they could. I attempted *archival scanning*, not *scanning only for OCR*. That OCR benefits from archival quality scanning, though, is obvious. > fifth, your dedication in driving the text to perfection is exemplary. > you put together a team of a half-dozen people dedicated to the task, > and it shows. while i don't think this approach can scale very well > -- your team might well burn itself out after doing a couple books, > while page-a-day people at distributed proofreaders go on and on, > and an even better approach is to turn readers into proofreaders -- It is not my intent to proof the way we did -- I still believe in the DP approach for proofing. But we had to get something out the door for demo purposes and did not have the time to submit it to the DP process. Maybe we should have. Hindsight is 20-20. And thanks for the rest of your comments. Jon From miranda_vandeheijning at blueyonder.co.uk Thu Mar 3 14:39:43 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Thu Mar 3 14:39:58 2005 Subject: [gutvol-d] 500th French book--Sodome et Gomorrhe In-Reply-To: <421B0080.8060402@blueyonder.co.uk> References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk> <421B0080.8060402@blueyonder.co.uk> Message-ID: <4227922F.1050201@blueyonder.co.uk> Hi guys, Just to keep you all updated on progress: We are at 496 French books at the moment. Marcel Proust's Sodome et Gomorrhe 1 will come out of DP shortly and should be in time to become the official number 500, if that's okay with the rest of the PG community. Kind regards, Miranda Miranda van de Heijning wrote: > My intention is to continue A la recherche du temps perdu on DP-EU and > hopefully, one of the other PG sites will be able to publish them. > > After that, we just need to wait for US copyright to move along a few > years and then PG-US will have the full lot as well. :-) > > Miranda > > > Michael Hart wrote: > >> >> Don't forget, all of Proust can be posted at Project Gutenberg sites >> with "life +50" and +70 copyrights, since he died so long ago. >> >> Michael >> >> >> On Sun, 20 Feb 2005, Miranda van de Heijning wrote: >> >>> Hi all, >>> >>> I have just looked through the download info which Marcello very >>> kindly compiled for me and I would like to suggest we post as the >>> 500th book part 1 of 'Sodome et Gomorrhe'. >>> >>> It is part of Proust's classic A la recherche du temps perdu and the >>> only remaining volume which we can actually post to PG. This is >>> because the other parts of the series were published after his >>> death, between 1923 and 1927. We already have Sodomo et Gomorrhe 2. >>> >>> Sodome et Gomorrhe 1 is close to finishing proofing at Distributed >>> Proofreaders (162 pages to go in round 2) so I expect it will be >>> available for post-processing/posting soon. >>> >>> Or are there any other suggestions? >>> >>> Miranda >>> >>> >>> >>> Greg Newby wrote: >>> >>>> On Sat, Feb 19, 2005 at 10:23:03AM +0000, Miranda van de Heijning >>>> wrote: >>>> >>>>> Hi guys, >>>>> >>>>> There are 485 French books in PG at the moment, so we will be >>>>> reaching 500 pretty soon. Has any thought been given yet about >>>>> what could be the 500th book? 
If no decision has been made, there >>>>> are quite a few George Sand's coming up from DP and they may be >>>>> suitable, considering that we are working on providing her >>>>> complete works. >>>>> >>>> >>>> I don't think anyone has suggested one yet. Sands sounds >>>> like a good choice. We also have a nice array of Jules Verne >>>> and Victor Hugo, and I've noticed some Shakespeare translations. >>>> >>>> >>>>> Secondly, are there any statistics on which are the most popular >>>>> French books? I know that Le Kama Soutra is quite a crowdpleaser, >>>>> but what about the rest? >>>>> >>>> >>>> There's a "top 100" list at http://gutenberg.org/catalog >>>> There is also a non-public analysis of the download >>>> statistics. Both of these are for ibiblio only, so while they're >>>> useful they don't represent other download sources (notably, >>>> our many mirrors). >>>> >>>> You'd need to look through the download list "by hand" to spot the >>>> French titles. Email if if you want the URL & username+password, >>>> and I'll dig it up. >>>> -- Greg >>>> >>>> >>>> >>>>> Michael Hart wrote: >>>>> >>>>> >>>>>> I sent the address, >>>>>> unless someone has a better one. >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> On Thu, 17 Feb 2005, Alex Wilson wrote: >>>>>> >>>>>> >>>>>>> About a month ago Greg Newby offered to get me in touch with David >>>>>>> Wyllie--who provided the English translation of Kafka's >>>>>>> Metamorphosis for >>>>>>> PG--and I haven't heard from him since. I'm thinking Greg's >>>>>>> emails or mine >>>>>>> are ending up in a junk mail folder, so I'm wondering if anyone >>>>>>> here knows >>>>>>> how I can get in touch with Mr. Wyllie. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> Alex. >>>>>>> >>>>>>> http://www.telltaleweekly.org - Funding a Free Audiobook Library >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gutvol-d mailing list >>>>>>> gutvol-d@lists.pglaf.org >>>>>>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> gutvol-d mailing list >>>>>> gutvol-d@lists.pglaf.org >>>>>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>>>>> >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> gutvol-d mailing list >>>>> gutvol-d@lists.pglaf.org >>>>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>>>> >>>> >>>> >>>> >>>> >>> >>> _______________________________________________ >>> gutvol-d mailing list >>> gutvol-d@lists.pglaf.org >>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >>> >> >> >> > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > From Bowerbird at aol.com Thu Mar 3 15:03:14 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 3 15:03:31 2005 Subject: [gutvol-d] new thread for noring Message-ID: jon said: > I hope you get an error rate that is one per ten pages i'll do my best. :+) > even if you do, I still believe a DP-like process > is necessary to catch errors that OCR can't handle human readers will _always_ be necessary. (and easy enough to find. if no one wants to read a book, there's little call to digitize it.) thus a system of "continuous proofreading" will be quite good enough if we can make the computer-guided processing accurate enough. > and for someone to properly assemble the pages, > structure the document, etc., after the OCRing/proofing > is complete. that's part of what i include in "post-o.c.r. processing". 
> I don't quite put the same level of > faith in OCR as you seem to. except that once you see the evidence i lay out, you will realize "faith" has nothing to do with it. as i've been saying all along, for professionally typeset books, the structure is _in_ the presentation. so o.c.r. gives you all the information you need, if you know how to look for it, and do so diligently. > Btw, I believe as you do that an error reporting system > is a good idea so readers may submit errors they find > in the texts they use -- sort of an ongoing > post-DP proofing process. post-d.p.? i see it _replacing_ d.p. for most books. and good thing, too. once the coming avalanche of scanned-books engulfs us, it'll be the only way most books have a chance to surface. that will take the pressure off distributed proofreaders, and they'll be able to focus on the books that _need_ them. > Obviously, it is necessary to make available > the page scans of the source document to aid in this process. > How can an error be properly verified and corrected > when the source work is not available? i've always said i think that page-scans should be publicly available. particularly if your mission is _transcribing_an_existing_edition_. (although, to remind people again, copyism is _not_ the mission that michael hart chose to embed within his project gutenberg.) but even in the case of project gutenberg's "amalgamated" e-texts, i believe that a page-image graphic-version should be made available. this would allow people to view it on a dvd-player, just as an example. > Scanning took quite a while (much more than four hours) that doesn't surprise me. nonetheless, i'll limit myself to 4 hours. that's quite enough time to devote to it. and to prove the point too. > I deemed it important for processing purposes that > the name of the image contain semantic information > of what it represents, and that > naming be consistent for file sorting purposes. as one improvement, i would suggest _not_ using "001.png", etc. instead, preface each one with a string that will make it _unique_, such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app -- you just type it in one time -- and it's an unmistakable stamp. and of course, if you're going to do hundreds or thousands of books, you want to cook up a naming convention that conveys information. on big multimedia projects, it is not at all uncommon to have one _full-time_ employee dedicated _solely_ to maintaining filenames. because if things go wrong, it can waste a whole lot of man-hours. oh yeah, one more suggestion. your front-matter filenames were prefaced with an "r". my typical recommendation is that they be prefaced with an "f", and that the regular pages be named with a "p", so the front-matter files will sort _on_top_of_ the regular pages. i want to be able to depend on the operating-system filename sort to give me pages in the exact order they appear in the book itself. so i use a "q" on back-matter files, so they will drop to the bottom. for illustration plates, i use a name that sorts _them_ correctly; for instance, if an illustration page is between pages 168 and 169, name it "p168a.png". (and don't forget the blank verso side either!, which you will name "p168b.png".) > The publisher simply chose to start at page 3. Was this common? it's not uncommon. oftentimes there is a "title-page", consisting of nothing more than the name of the book, which is considered "page 1", with its blank verso being "page 2", so chapter 1 starts on "page 3". 
sometimes chapter 1 starts on page 7. or page 11. publishers are weird. > Maybe there was an intent to insert a page there, > which after typesetting it was decided not to.) sometimes that happens too, yep. an "unnecessary" page gets dropped when the typesetter realizes they didn't plan the signatures correctly. or when the preface runs two pages longer than was originally intended. or any number of other snafus spring up. shit happens. > It was my intent to reproduce each page for direct reading purposes -- > that is, if somebody wanted to read the book as it was printed, > then they could. yeah, and sometimes people want to do exactly that. which is why the page-images should be made available. for many illustrated books, the text alone is not enough. you want to be able to see the pages as they were printed. my viewer-program will work with either, text or images. it'll even work in "hybrid" mode, so you can display the text in one of the 2-up pages, and the page-image on the other side. (and of course that is the mode which is used for proofreading.) that's why things like _blank_pages_ are so important to include. because if you toss them out, you screw up the left/right sequence. a convention of paper-books is that odd pages always go on the right. screw that up and you make yourself look silly. anyway, that's all for now. -bowerbird From jon at noring.name Thu Mar 3 17:15:58 2005 From: jon at noring.name (Jon Noring) Date: Thu Mar 3 17:16:32 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: References: Message-ID: <838721171.20050303181558@noring.name> Bowerbird wrote: > as one improvement, i would suggest _not_ using "001.png", etc. > instead, preface each one with a string that will make it _unique_, > such as "ma2005feb001.png". it's easy to tell that to the o.c.r. app > -- you just type it in one time -- and it's an unmistakable stamp. Yes, a very good suggestion, and one that is being planned. I held off because we are still thinking through the exact syntax of the book identifier, although it *might* be based somewhat on the WEMI (Work/ Expression/Manifestation/Item) principle. The LibraryCity ID used at the current "My Antonia" site is just a quick improvisation of the WEMI principle. For example: Work: "Frankenstein" by Mary Shelley Expression: Second edition (which differs a lot from the First) Manifestation: 1895 printing edited by John Doe (just a dummy example) Item: XHTML So in Trusted Editions, filed under the WorkID for "Frankenstein", we could have multiple Expressions each with its own ExprID, e.g. First Edition, Second Edition, a lost manuscript for a third edition, etc. (many books will have only Expression since they did not become popular and no author manuscript exists.) Under Manifestation we could have several (with ManfID's) based on later edited editions as well as a modern "Michael Hart" style amalgamated/edited edition. And then for each Manifestation we can have several formats (Items, ItemID -- yeah, this is a small twist on WEMI as it officially exists since 'item' in the pbook world usually refers to a particular printed copy of a Manifestation, with coffee stains and page rips and all -- but this works well for ebooks/etexts where each item is a duplicatable digital format derived from the paper Manifestation. This is not yet etched in concrete -- it is still in the idea stage.) So, as an example, we might have for Identifiers: WorkID: 00000000025 (enough for 100 billion general Works.) 
ExprID: 02 ManfID: 03 ItemID: 008 (referring to some standardized list which expands over time) So the overall ID for a particular format of a particular source paper book might be: 00000000025-02-03-008 (yeah, it's long) Page scans only need the WEM portion of the ID for prefixing on the filename: 00000000025-02-03-p295.png (If we only care about 100 million Works, then we may have: 00000025-02-03-p295.png ) Of course, the WEM-ID itself does not contain any metadata other than identifiers, but that would mesh with a database. It is very problematic to include any Dublin Core type of metadata within an identifier. It is understandable maybe using the two first letters associated with the first two words of the title (ignoring articles), such as MA for "My Antonia", but that's as far as I'd go. > and of course, if you're going to do hundreds or thousands of books, > you want to cook up a naming convention that conveys information. > on big multimedia projects, it is not at all uncommon to have one > _full-time_ employee dedicated _solely_ to maintaining filenames. > because if things go wrong, it can waste a whole lot of man-hours. Every scanned image is a unique digital object, so it needs to have a unique identifier in the object's file name, applied when it is created, along with a metadata record somewhere to describe and keep track of it. The catalogers will take care of the identifers and metadata, which go hand in hand. > oh yeah, one more suggestion. your front-matter filenames were > prefaced with an "r". my typical recommendation is that they be > prefaced with an "f", and that the regular pages be named with a "p", > so the front-matter files will sort _on_top_of_ the regular pages. > i want to be able to depend on the operating-system filename sort > to give me pages in the exact order they appear in the book itself. > so i use a "q" on back-matter files, so they will drop to the bottom. > for illustration plates, i use a name that sorts _them_ correctly; > for instance, if an illustration page is between pages 168 and 169, > name it "p168a.png". (and don't forget the blank verso side either!, > which you will name "p168b.png".) Also an excellent suggestion. The 'r' stands for "Roman", but I noticed in sorting that the pages are not ordered, so the front-/body-/end-matter approach makes sense. Too bad 'b' comes before 'f', as you noted. >> It was my intent to reproduce each page for direct reading purposes -- >> that is, if somebody wanted to read the book as it was printed, then >> they could. > that's why things like _blank_pages_ are so important to include. > because if you toss them out, you screw up the left/right sequence. > a convention of paper-books is that odd pages always go on the right. > screw that up and you make yourself look silly. Definitely! I will certainly need to relook at what I did to make sure it's all there. Handling inserted illustrations is a problem name-wise since in "My Antonia", the illustrations were inserts between numbered pages. So for naming/sorting purposes that will need to be worked out. Thanks for the ideas. Jon From Bowerbird at aol.com Thu Mar 3 23:10:47 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 3 23:11:08 2005 Subject: [gutvol-d] bob bob bobbing along Message-ID: <25.5a67ce0d.2f5963f7@aol.com> i said: > bennett kobb, for instance, has > high-res scans of _nearly_one_hundred_books_, "bennett kobb" is actually "bob kobres". 
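To make Jon's identifier sketch above concrete: a small helper that builds page-scan names from the WEM portion plus the f/p/q page prefixes suggested earlier. The field widths follow the example IDs given above, and the function itself is hypothetical, since the scheme is still at the idea stage.

    def scan_filename(work_id, expr_id, manf_id, section, page, insert=""):
        """Sketch of a WEM-prefixed page-scan name, e.g. 00000000025-02-03-p295.png.

        section: "f" front matter, "p" body pages, "q" back matter, so a plain
        filename sort reproduces the reading order of the book.
        insert: "a", "b", ... for unnumbered plates (and their blank versos)
        bound in after the given page.
        """
        wem = "%011d-%02d-%02d" % (work_id, expr_id, manf_id)
        return "%s-%s%03d%s.png" % (wem, section, page, insert)

    print(scan_filename(25, 2, 3, "p", 295))        # 00000000025-02-03-p295.png
    print(scan_filename(25, 2, 3, "p", 168, "a"))   # plate between pages 168 and 169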
:+) -bowerbird From blondeel at clipper.ens.fr Fri Mar 4 02:53:46 2005 From: blondeel at clipper.ens.fr (Sebastien Blondeel) Date: Fri Mar 4 02:54:10 2005 Subject: 500th French book: Rimbaud? (Re: [gutvol-d] 500th French book--Sodome et Gomorrhe) In-Reply-To: <4227922F.1050201@blueyonder.co.uk> References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk> <421B0080.8060402@blueyonder.co.uk> <4227922F.1050201@blueyonder.co.uk> Message-ID: <20050304105346.GA28659@clipper.ens.fr> I have been reading this 500th book discussion without connecting it to a book I have in PP and that may be eligible. It is a very famous book, maybe the best or second best know book of French poetry: Rimbaud's Les Illuminations, Une Saison en Enfer (projectID3fbe0069d630e from PGDP US) I finish it over the week/end if that I needed. From Bowerbird at aol.com Fri Mar 4 12:53:12 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 12:53:24 2005 Subject: [gutvol-d] march forth Message-ID: <84.409159b4.2f5a24b8@aol.com> well, that was a lot easier than i thought it would be... :+) i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent. after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text. except for exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the string of words are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location. most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone? i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.) another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such the file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.) i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.) i also changed high-bit characters to low, to ease comparison, so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary. 
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it. so, now back to the quality of the recognition... almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of know scannos, and the variants appear just 5 times, total, so it's an easy test to do. most of the errors were of two types -- periods and quote-marks. both these error-types are easy to program routines to check them, even if they aren't flagged in spellcheck -- many of them would be. it's relatively easy to detect sentences, so as to check for periods. and quote-marks are usually nested in pairs, and thus easy to check. but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r. on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half. it's not surprising to me that my tools would find all the errors here. this is a relatively straightforward text, with very few complications. total time to do the o.c.r. on this book, once i know what i'm doing? i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images. i'm guessing that those 2 hours of o.c.r. and post-o.c.r. work would make the accuracy level about 1 error for every 50-100 pages or so. and those errors would be in the less-serious arena of punctuation. i won't be able to say for sure until i've done the second-half test, of course, but given the highly accurate recognition of the words that i found on this half, i feel rather safe making that prediction. in this half, of 200+ pages, the only errors that i might have missed -- but found because i had noring's version to compare it to -- were "layout/lay out" and "fairy tale/fairytale". i _might_ have caught "fairytale", because it's not contained in my spellcheck dictionary in its joined variant, and the split variant _is_ in the book (twice). i probably would not have caught "layout", since it's in my dictionary. (but i should take it out of the dictionary for checking older books. old-time typographers _did_ layout, but they didn't _call_ it that.) either way, i'm sure you'll agree that those two errors are trivial. if all the errors in our books were that meaningless, it'd be great. wait, i might have even caught _those_ errors, as they are _right_ in the _project_gutenberg_ e-text, which has been out for years! well, that wraps up my report. for those who might be curious, i'll be releasing my post-o.c.r. tool in the late spring. look for it! anyway... i believe this makes it very clear that i am correct when i say that if you do the scanning carefully, manipulate those scans correctly, use abbyy finereader v7.x to do the o.c.r., and subject its results to a good post-o.c.r. program, it is relatively quick and easy to process an o.c.r. text to the state where it can become a high-powered e-book. the notion that these procedures are difficult or time-consuming is just plain wrong. wrong, wrong, wrong. in one word -- _untrue_. 
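For the curious, here is a hedged sketch of two of the routines described above: flagging words that sit on a stealth-scanno list, and flagging paragraphs whose double quotes do not pair up. The scanno list and the paragraph rule are illustrative only, not the actual forthcoming tool.

    # Sketch of two post-o.c.r. checks: stealth scannos and unbalanced quotes.
    import re

    STEALTH_SCANNOS = {"cur": "our", "oven": "over", "arid": "and"}  # sample pairs only

    def check_text(text):
        reports = []
        for m in re.finditer(r"[A-Za-z']+", text):
            word = m.group().lower()
            if word in STEALTH_SCANNOS:
                reports.append("possible scanno %r (for %r?) at offset %d"
                               % (m.group(), STEALTH_SCANNOS[word], m.start()))
        for i, para in enumerate(text.split("\n\n"), 1):
            if para.count('"') % 2:                   # quote-marks should nest in pairs
                reports.append("unbalanced quotes in paragraph %d" % i)
        return reports

    sample = 'the snow blew in cur faces.\n\n"they still come? he asked.'
    for report in check_text(sample):
        print(report)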
-bowerbird p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+) p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other error, which might or might not be an air, i'll sack that one. ----------------------------------------------------------------- 524 a group of people stood huddied 524 a group of people stood huddled 2442 of Jacob whom He loved. SelahP 2442 of Jacob whom He loved. Selah." 4562 grandmother's hand. The oldest son, Ambro2, 4562 grandmother's hand. The oldest son, Ambroz@, 5564 up like a hare. "Tatinek, Tatinekl" 5564 up like a hare. "Tatinek, Tatinek!" 6344 grumbled, but realized it was Important 6344 grumbled, but realized it was important 10749 was fixed for me by chance; 10749 was fixed for me by chance, 12887 the familiar road. "They still come? "he 12887 the familiar road. "They still come?" he 13132 they were always unfortunate. When PavePs 13132 they were always unfortunate. When Pavel's 16303 "You not mind my poor mamenka> 16303 "You not mind my poor mamenka, 17531 probably, in some deep Bohemian forest..... 17531 probably, in some deep Bohemian forest... 17718 would be lost ten times oven 17718 would be lost ten times over. 18300 the talking tree of the fairytale; 18300 the talking tree of the fairy tale; 21478 Ambrosch found him." "Krajiek could V 21478 Ambrosch found him." "Krajiek could 'a' 23282 that, too, Jelinek. But we beiieve 23282 that, too, Jelinek. But we believe 25309 and I went into the Shimerdas9 25309 and I went into the Shimerdas' 25594 of his long, shapely hands layout 25594 of his long, shapely hands lay out 26036 which is also Thy mercy seat," 26036 which is also Thy mercy seat." 26157 While the tempest still is high." 26157 While the tempest still is high."... 27674 milk like what your grandpa s&y. 27674 milk like what your grandpa say. 29075 in a spiteful, crowing -- "Jake-y, 29075 in a spiteful, crowing voice: -- "Jake-y 29418 kept for hot applications when cur 29418 kept for hot applications when our 30016 Shimerda dropped the rope, ran aftet 30016 Shimerda dropped the rope, ran after 36061 misfortune, his wife, "Crazy Mary," iried 36061 misfortune, his wife, "Crazy Mary," tried 36513 fine, making eyes at the men!?.." 36513 fine, making eyes at the men!..." 37226 given him one of Tiny SoderbalPs 37226 given him one of Tiny Soderball's 38713 pump water for the cattle. '"Oh, 38713 pump water for the cattle. "'Oh, From marcello at perathoner.de Fri Mar 4 11:58:33 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Mar 4 13:26:26 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: References: <42261F44.1000005@perathoner.de> Message-ID: <4228BDE9.1010007@perathoner.de> Michael Hart wrote: > I got an email from one person who suggested that how to volunteer > should be listed up with the donation finromation in addition to where > it is in the "In Depth" section [marked <<< below]. Apparently some > people don't read "In Depth" until they are already involved, and this > person just wanted to know how volunteer. > > > + Donate. How to make a donation to Project Gutenberg. 
> + News and Events. The news. > + Contacts. How to get in touch. > + Partners, Affiliates and Resources. A collection of links. > + Credits. Thanks to our most prominent volunteers. > * In Depth Information. All you ever wanted to know about Project <<< > Gutenberg. > + Volunteering. How you can help Project Gutenberg. <<< Duplicating menu entries just creates confusion. We could move "Volunteering" into the "About" section, but I think it's better placed in the "In Depth" section. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Fri Mar 4 12:09:59 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Mar 4 13:26:30 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <42274F2E.8010000@corruptedtruth.com> References: <42261F44.1000005@perathoner.de> <42274F2E.8010000@corruptedtruth.com> Message-ID: <4228C097.8030803@perathoner.de> Brandon Galbraith wrote: > > I suppose while these updates are going on, we should also update >> 13,000 to 15,000 in the opening: > > It's too bad we can't make that dynamic, feeding off of a database =) Not worth the trouble ... First, we had to agree on what counts as an ebook in its own right. Eg. we have a Bible in the collection, where every chapter got its own ebook number. Also, many books are posted in parts, and every part got its own number besides the complete book. To get a meaningful count of ebooks we first had to get rid of such shameless stuffings. -- Marcello Perathoner webmaster@gutenberg.org From miranda_vandeheijning at blueyonder.co.uk Fri Mar 4 13:38:57 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Fri Mar 4 13:39:08 2005 Subject: 500th French book: Rimbaud? (Re: [gutvol-d] 500th French book--Sodome et Gomorrhe) In-Reply-To: <20050304105346.GA28659@clipper.ens.fr> References: <42171387.5020807@blueyonder.co.uk> <20050220054956.GB30309@pglaf.org> <42187065.4060107@blueyonder.co.uk> <421B0080.8060402@blueyonder.co.uk> <4227922F.1050201@blueyonder.co.uk> <20050304105346.GA28659@clipper.ens.fr> Message-ID: <4228D571.5030106@blueyonder.co.uk> Hi Sebastien, Thanks, this sounds like a very appropriate suggestion as well. PG is currently holding back Proust awaiting book #499, so it looks like we now have #501 as well. I will leave it up to PG which one to post as which number. It is great to have such great classics to mark this wonderful milestone! Miranda Sebastien Blondeel wrote: >I have been reading this 500th book discussion without connecting it to >a book I have in PP and that may be eligible. > >It is a very famous book, maybe the best or second best known book of >French poetry: Rimbaud's > >Les Illuminations, Une Saison en Enfer > >(projectID3fbe0069d630e from PGDP US) > >I will finish it over the weekend if that is needed. 
>_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > > From gbnewby at pglaf.org Fri Mar 4 14:29:09 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Fri Mar 4 14:29:11 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <4228C097.8030803@perathoner.de> References: <42261F44.1000005@perathoner.de> <42274F2E.8010000@corruptedtruth.com> <4228C097.8030803@perathoner.de> Message-ID: <20050304222909.GA32543@pglaf.org> On Fri, Mar 04, 2005 at 09:09:59PM +0100, Marcello Perathoner wrote: > Brandon Galbraith wrote: > > >> I suppose while these updates are going on, we should also update > >>13,000 to 15,000 in the opening: > > > >It's too bad we can't make that dynamic, feeding off of a database =) > > Not worth the trouble ... First, we had to agree on what counts as an > ebook in its own right. > > Eg. we have a Bible in the collection, where every chapter got its own > ebook number. Also, many books are posted in parts, and every part got > its own number besides the complete book. > > To get a meaningful count of ebooks we first had to get rid of such > shameless stuffings. That's an unwarranted poke, Marcello. We do have a count, and it's eBook #s as used as the primary access point to our files. Agreeing on what counts as an eBook is not necessary. We know how many eBook #s we have, even if there is disagreement on what counts as an eBook. There are plenty of words (in GUTINDEX.ALL and elsewhere) to augment this simplistic number. -- Greg From jon at noring.name Fri Mar 4 14:48:52 2005 From: jon at noring.name (Jon Noring) Date: Fri Mar 4 14:49:17 2005 Subject: [gutvol-d] march forth In-Reply-To: <84.409159b4.2f5a24b8@aol.com> References: <84.409159b4.2f5a24b8@aol.com> Message-ID: <726605875.20050304154852@noring.name> Bowerbird wrote: > i did o.c.r. on half of jon noring's page-scans for "my antonia", > using abbyy finereader v7.x. the results were quite excellent. Great! > after doing a small number of global corrections to the o.c.r., > i checked it against noring's "trustworthy" version of the text. What type of global corrections were these? One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word. > except for exceptions i'll discuss right after this paragraph, > most results are given below; each pair of lines represents a > difference found between abbyy and noring, with the last word > in each line being the point of difference. the number listed at > the start of each line is the word-number in the file, and the > string of words are the ones preceding the point-of-difference > in the file, so that you can easily pinpoint the correct location. Great! > most of the o.c.r. errors were on _punctuation_, not _letters_. > in particular, there were many instances where a _period_ was > misrecognized as a comma. i did not bother to list these cases, > mostly to avoid clutter. i do not know what caused these errors. > i don't know if it's a _typical_ misrecognition that abbyy makes, > if jon's manipulation of the images somehow caused confusion, > if i set one of the options incorrectly, or what. help, anyone? I originally scanned the pages at 600 dpi (optical) 24-bit color (which in the future I won't do for b&w works since I determined it is unnecessary overkill.) 
Then for the online scans they were reduced as follows: original --> 600 dpi bitonal --> 120 dpi greyscale antialiased I'm not sure which set of scans you used (you don't have the original since they occupy 5 gigs of space.) Hopefully you used the 600 dpi bitonal which should OCR the best. Antialiasing actually causes problems (notwithstanding the much lower resolution.) One thing you could do is to look at the 600 dpi pages at 100% size for which the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is. Regardless, punctuation is a toughie for OCR to exactly get right, from what I understand. 600 dpi *helps* resolve the fine detail of punctuation. 300 dpi is marginal for a lot of punctuation because the characters are so small and don't occupy enough pixels (while letters retain enough pixels to better identify them.) > i have also not listed differences found in hyphenation, since > i don't have the time to write a decent routine to check them. > (i just accepted the dehyphenation abbyy did automatically.) Ah, ok (answering my comments at the beginning.) Resolving this usually requires a human being to go over, especially for Works from the 18th and 19th century where compound words with dashes were much more common than today (e.g., "to-morrow".) Sometimes one has to see what the author did elsewhere in the text. In a few cases a guess is necessary based on understanding what the author did in similar cases in the text. Some of this can be automated. In other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here. > another set of differences not listed here is the _n't_ words. > words like "couldn't" and "shouldn't" were set with the _n't_ > part distinct from the first part. jon's version retained this. > abbyy did not. personally, i find it an unnecessary distraction; > the first thing i'd do with such the file is to change it globally. > jon probably considers that "tampering with trustworthiness". > i think it's common sense recognition of a changed convention. > if you prefer jon's way, use his text. if not, you can use mine. > (the change is global throughout the book, so it is easy to do.) Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version. My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection because the original faithful rendition is *already* there. The only requirements I would place (and this applies in general for any Work) are 1) the original textually faithful etext version has already been done and is in the collection, and 2) the type of modernizations done for the modern parallel editions are noted in the texts themselves (such as within an Editor's Introduction.) > i note that jon _did_ close up some "there's" where the "'s" was > set off distinctively, so he was a bit inconsistent in this arena. > (i didn't check to see if there were other apostrophe-s words > that were set apart, because i would've closed 'em up myself.) I spent some time looking at the " 's " issue last week. In many cases in the original print edition the spacing between the preceding word and the apostrophe s is quite small -- and for the same combination elsewhere was larger -- indicating this was more of a typesetter's convention rather than something Cather specified. 
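As an aside, the reduction steps described earlier in this message (a full-resolution bitonal image for OCR, plus a small antialiased greyscale for the web) are straightforward to script. Here is a minimal sketch using the Pillow imaging library; the threshold value and output names are assumptions for illustration, not the settings actually used for these scans.

    from PIL import Image

    def reduce_scan(src_path):
        """From a 600 dpi colour scan, derive (a) a 600 dpi bitonal image
        for OCR and (b) a 120 dpi antialiased greyscale image for the web.
        Threshold and file names are illustrative only."""
        grey = Image.open(src_path).convert("L")

        # (a) bitonal at full resolution -- the best input for OCR
        bitonal = grey.point(lambda p: 255 if p > 160 else 0).convert("1")
        bitonal.save(src_path + ".bitonal.png", dpi=(600, 600))

        # (b) one-fifth size, Lanczos-filtered (antialiased) greyscale for viewing
        w, h = grey.size
        grey.resize((w // 5, h // 5), Image.LANCZOS).save(
            src_path + ".web.png", dpi=(120, 120))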
[note] In addition, the UNL Cather Edition closed off all the apostrophe s (no spaces), but kept the space for many of " n't" words. So here again I followed the UNL Cather Edition. (Btw, I found quite a few errors in the online UNL Cather Edition of "My Antonia" which have been forwarded to the team overseeing it -- sadly the professor overseeing the online project passed away a few months ago. We are in touch with other Cather scholars.) But I've put the "'s" issue on my "to look at again" list. [note] Cather wanted the line length to be fairly short, so this puts extra pressure on typesetters who will either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page, and whether certain words can be hyphenated or not. > i also changed high-bit characters to low, to ease comparison, You mean accented characters? > so those differences are not listed. yes, the book _did_ print > "antonia" with a squiggle over the "a". to me, it's unnecessary. But that's what is in the original, the "A acute". The squigly is an 'acute', btw. :^) Accented characters are *always* important to preserve under all situations. There's no need anymore, in these days of Unicode and the like to stick with 7-bit ASCII. I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing, something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying. Preserving accented characters are important. > (but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) > whichever way you like it, it is just one more global change. > that's one beauty of plain-text -- it's so easy to manipulate it. Unicode is plain text. Just more characters to play with. :^) Btw, for those who are interested, here's the "non-Basic Latin" (non-ASCII) alphabetic characters used in "My Antonia": A acute AE ligature ae ligature e acute e circumflex i umlaut n tilde small z with caron > almost all of the words were correctly identified. the ones that > were not would be flagged by a simple spell-check, with merely > 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". > i imagine that these pairs are on the lists of know scannos, and > the variants appear just 5 times, total, so it's an easy test to do. > > most of the errors were of two types -- periods and quote-marks. Which makes sense. But these are the toughest to correct sometimes, and punctuation changes can sometimes subtly affect the meaning. They are hopefully caught by human proofers/readers when grammar checkers don't (I do use Word to help find both spelling and punctuation errors -- when they find something, I then manually check it in the page scans and the master XML.) > both these error-types are easy to program routines to check them, > even if they aren't flagged in spellcheck -- many of them would be. They are "sometimes" easy to spot. Other times the automatic routines will not catch errors (e.g. ":" vs. ";") > it's relatively easy to detect sentences, so as to check for periods. Usually true, but there are some rare exceptions where an abbreviation can be mistaken for an end of a sentence. Then there's the ellipsis issue where sometimes an ellipsis is at the end of the sentence and sometimes it is not (and incorrectly used.) > and quote-marks are usually nested in pairs, and thus easy to check. 
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.) Also, apostrophes are sometimes confused with single right quote marks. Here's a fictional example (imagine the straight quotes and apostrophe marks being represented in print with the appropriate "curly" marks): "And Harry told me, 'the voters' confidence in the candidate waned.' To which I replied to Harry, 'I don't believe so.'" With a smart enough grammar and parser, the above might be properly parsed and the apostrophe correctly differentiated from the single right quote mark. But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process. > but my routines for checking these two items are still back in my > prototyping test-app, awaiting migration to the current version; > that's why i didn't bother doing o.c.r. on the second half of the scans; > once i've incorporated the routines, i'll refine 'em on the first half, > and then do a solid test of them using the text from the second half. great! > total time to do the o.c.r. on this book, once i know what i'm doing? OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part. > i'd estimate it at about an hour. and for all post-o.c.r. processing? > i'd estimate that about the same. total time for the book -- 2 hours. > that's much less time than it took to scan and manipulate the images. Yes. > p.s. although jon's highly accurate version of the text gave us > little opportunity to find errors in his work, we _did_ find two. > (one is an error in the text, i'd say, but jon did not preserve it.) > if michael would like to have another "my antonia" in the library, > i'll submit the _entirely_ correct version to project gutenberg, > and maybe jon can use it to find the error that eluded his team. :+) Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it. I would like to receive error reports as well for this text, since Brewster wants highly proofed texts for some experiments he plans to run similar to yours. But if I have to use the version you donate to PG, so be it. :^) > p.p.s. i _did_ just drop a hint. one i can use later to show that > i did indeed find the one error that is non-equivocal. as for the > other error, which might or might not be an air, i'll sack that one. Oh, a clue. :^) Anyway, great work! Jon (p.s., I did find one error in my text based on the list you gave. Thanks. There should be a comma after the first "Jake-y" in "Jake-y Jake-y". So that's been corrected already in the online and archive version. I rechecked the PG edition, and they get the comma right in the text, which I oddly missed doing my "diff" (probably because there were quite a few differences to pour over.) But then they enclose the surrounding sentence within a single quote mark (following the British convention), while the original first edition uses a double quote mark. The PG edition seems to be inconsistent with regards to quotation marks and to British/American spelling, which is why I surmise the PG edition is based on some non-Cather-approved British edition and might have subsequently been selectively and inconsistently edited in trying to "re-Americanize" it. 
I assume you discovered the several different paragraph breaks in the PG edition?) From Bowerbird at aol.com Fri Mar 4 15:23:39 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 15:23:54 2005 Subject: 500th French book: Rimbaud? (Re: [gutvol-d] 500th French book--Sodome et Gomorrhe) Message-ID: <29.6e499476.2f5a47fb@aol.com> miranda said: > PG is currently holding back Proust awaiting book #499, > so it looks like we now have #501 as well. i'd use "une saison" as #499. it'd be a shame for someone to come along later and say, "what?, you did 500 books _before_ that one?" ;+) -bowerbird From donovan at abs.net Fri Mar 4 17:08:42 2005 From: donovan at abs.net (D Garcia) Date: Fri Mar 4 17:10:08 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: References: Message-ID: <200503042008.42482.donovan@abs.net> > > The publisher simply chose to start at page 3. Was this common? > > it's not uncommon. oftentimes there is a "title-page", consisting of > nothing more than the name of the book, which is considered "page 1", Those are called "half-titles," btw. From Bowerbird at aol.com Fri Mar 4 17:14:14 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 17:14:34 2005 Subject: [gutvol-d] re: march forth Message-ID: <8.638ae14b.2f5a61e6@aol.com> jon said: > What type of global corrections were these? the type that is made easy by my tool. that's all i'll say for now. > One area is how to handle hyphenation, > and whether there was a short dash in the compound word > in the first place before the typesetter hyphenated the word. as i said, i ignored the issue of hyphenation for the time being. my tool will give a number of ways to deal with hyphenation, but the routines haven't been brought into the current version. but i can give a general overview. end-line hyphenation is removed. the hyphen in compound words is retained. to tell the difference, when there is ambiguity, you look at the rest of the text, to see if the word was handled consistently there. if it was, you match that. if not, you have more work to do. that's where it gets interesting. to go any further is to give too much information for here and now. > Hopefully you used the 600 dpi bitonal which should OCR the best. i did. > Antialiasing actually causes problems > (notwithstanding the much lower resolution.) right. i first thought the periods misrecognized as commas were the effect of anti-aliasing, but i used the 600-dpi scans. so it must be something else causing that problem. > One thing you could do is to look at the 600 dpi pages at 100% size > for which the punctuation was not correctly discerned. You probably > will see some errant pixels that fooled the OCR into thinking > it was some other punctuation mark than it is. i didn't care that much, really. the post-o.c.r. software can solve the problem well enough. i mentioned it for the record, for the sake of full disclosure, and to see if anybody knew why. > punctuation is a toughie for OCR to exactly get right, even if the recognition is admittedly somewhat difficult, i expect abbyy to correct "mr," and "mrs,", for instance. but even if abbyy doesn't, that's easy for me to program. > Resolving this usually requires a human being to go over, > especially for Works from the 18th and 19th century > where compound words with dashes were much more common if you want to retain those arcane spellings, it's difficult. if you wanna update them, the computer does it very easily. "to-day" and "to-morrow" become "today" and "tomorrow". instantly. 
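The "look at the rest of the text" rule sketched above reduces to a few lines of code. A minimal sketch in Python; the function name and the review marker are illustrative and not from any released tool.

    import re

    def join_linebreak_hyphen(head, tail, full_text):
        """Decide how to rejoin a word split across a line end as 'head-' + 'tail'.
        If the book elsewhere uses the hyphenated compound, keep the hyphen;
        if it uses the closed-up form, close it up; otherwise leave a marker
        so a human makes the call."""
        hyphened, closed = head + "-" + tail, head + tail
        if re.search(r"\b" + re.escape(hyphened) + r"\b", full_text):
            return hyphened
        if re.search(r"\b" + re.escape(closed) + r"\b", full_text):
            return closed
        return closed + "{?}"    # ambiguous -- flag for review

    # e.g. join_linebreak_hyphen("to", "morrow", book_text) keeps "to-morrow"
    # in an older text that spells it that way elsewhere; modernizing it to
    # "tomorrow" is then a separate, deliberate global change.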
> Sometimes one has to see what the author did > elsewhere in the text. is there some reason you think the computer can't do that? > In a few cases a guess is necessary based on understanding > what the author did in similar cases in the text. oh, i see. it takes "understanding". one of those rare precious human-being things. well then, i guess there's no way to program it. > Some of this can be automated. In other cases > it requires a human being to make a final decision. > I followed the UNL Cather Edition here. it's always easier to let other people make the decision, isn't it? ;+) > Whether or not it is an "unnecessary" distraction, > it is better to preserve the original text in the master etext version. well see, jon, that's where i differ with you. and other people do too. but like i said, as long as it's just one global change away, no big deal. i see lots of other cases, as well, where you diverge from the paper. a good many of the quotation-marks are set apart from their words. you're making editorial decisions whether you acknowledge it or not. > My thinking is that if someone wants to produce > a derivative "modern reader" edition of "My Antonia", > they are welcome to do so and add it to the collection > because the original faithful rendition is *already* there. whose "collection" are we talking about here jon? yours? do you have any intention of adding more "my antonia" editions? specifically a "derivative modern reader"? if so, i will submit mine. but surely you don't mean michael hart's project gutenberg collection? because, according to you anyway, he doesn't have a "faithful" rendition in his library, not even one, not *already* anyway. just a mangled one. another difference between your collection and michael's is you have 1 book in your collection and he has 10-15 thousand in his collection, depending on who is in charge of defining how the official counting is tabulated these days, it appears. whether you like it or not, that's a comment on the philosophies. > indicating this was more of a typesetter's convention > rather than something Cather specified. well that's a convenient dodge, isn't it? and of course you have no real _evidence_ that this is the case, do you? so you _really_ should enter each case as it _appears_, shouldn't you? at least if you want to stick to your philosophy? > In addition, the UNL Cather Edition closed off all the apostrophe s > (no spaces), but kept the space for many of " n't" words. > So here again I followed the UNL Cather Edition. and that's the difficulty with following an authority, ain't it? there are often so many, it's hard to know which one to follow! i know i can't keep up even with the editions of this one book! so how would a person possibly keep up on tens of thousands! and before you know it, you're having arguments about _that_! and not reading the book, or digitizing it, or playing at the park. and i don't know about you, jon, i don't think you're being consistent. you said you were reproducing what is right there in black-and-white on the page itself, even made high-resolution scans to prove it to us, and now you're making judges that are easy to spot. and to justify it, you're quoting some other figure of "authority". that's inconsistent. but heck, i have to be honest here. even if you _were_ consistent, and kept all of those quirks from the paper-book that _i_ consider to be distracting, the first thing i'm gonna do is global-change 'em. so all that hard work you did was for no good purpose to me. 
> Cather wanted the line length to be fairly short, > so this puts extra pressure on typesetters > who will either have to extend character spacing > for a particular line or scrunch it up more than usual, > depending upon the situation with the rest of the typesetting > on the page, and whether certain words can be hyphenated or not. oh!, hold it!, wait!, did i just hear you say what you just said? i think i did! yes, i'm quite sure i did! "cather wanted the line length to be fairly short". wow. you mean author-intent can go to _the_length_of_lines_? do you realize how significant that is to your philosophy, jon? it means you will need to respect willa's wishes on the matter. none of the long lines you might get in a web-browser! no sir! willa wanted short lines! (is that why the book looks so narrow?) > You mean accented characters? if they aren't in the lower-128 of the ascii range ("true ascii"), yes. > Accented characters are *always* important > to preserve under all situations. according to you, maybe. according to me, it depends. in this case, i say no. that's my prerogative as an editor. (and i _do_ consider myself an editor, not just a copyist.) > There's no need anymore, in these days of > Unicode and the like to stick with 7-bit ASCII. until unicode works flawlessly on every machine used by all the people i know, for texts like this that have only the occasional character outside the lower-127, where the meaning isn't changed, i'll stick to plain ascii. > I sense that you don't want to > properly deal with accented characters first of all, jon, i define what "properly" means for me, you don't. you can define it for yourself. but i won't let you define it for me. > I sense that you don't want to > properly deal with accented characters > since this poses extra problems with OCRing and proofing, nope. it's just that i see them as _unnecessary_ to this book. if a reader thinks it _is_ necessary, make the global-change. > something you are trying to avoid in your zeal to get everything > to automagically work. To me, that's going too far in simplifying. i'm not "simplifying". i'm consciously making a choice to use something that will work on the broad range of machines out there, as opposed to something that -- in far too many cases -- fails badly. it's a pragmatic decision based on real-life knowledge of the actual infrastructure of machines that exist out here in our real world. it's the same pragmatic decision that michael made when he crafted the philosophy guiding the building of this library of 10,000+ e-texts, in sharp contrast to your philosophy, which has built a 1-book library. > Preserving accented characters are important. in some cases, i'd agree with you. in others, not. in this case, not. > punctuation changes can sometimes subtly affect the meaning. you know, as a writer, i'd really like to think that's possible. as a person who uses a lot of commas, i _want_ to believe it. but i'll be darned if i can think of that many good examples. if you can, i would _love_ to hear them. and if you can show me _any_ in "my antonia", any at all, i'd give you extra bonus points. as it is, though, i just have to resign myself to the position that o.c.r punctuation errors are a distraction, but make no difference. i'll still root them out, due to my sense of professionalism, but i sure wish it felt _fun_, instead of feeling like _doing_chores_. and to the extent that i can automate the chores, i'll be _happy_. 
> They are hopefully caught by human proofers/readers > when grammar checkers don't (I do use Word to > help find both spelling and punctuation errors -- > when they find something, I then manually > check it in the page scans and the master XML.) oh, so you _do_ use an assist from your tools at times. that's good. > They are "sometimes" easy to spot. > Other times the automatic routines will not catch errors maybe the automatic routines you are using are just inferior. use my tool. if it doesn't spot something it should, let me know. > Usually true, but there are some rare exceptions where > an abbreviation can be mistaken for an end of a sentence. not if your routines are as smart as mine are. > Then there's the ellipsis issue i'm three-dozen layers deep on some of these issues, and you want to talk about level 2. i'm not interested. use my tool. if it doesn't give you the results you want, let me know. > This is also true, but as found in "My Antonia", > there are exceptions to pure nesting, such as > when a quotation spills over into several paragraphs > where the intermediate paragraphs are not terminated > by an end quotation mark (whether single or double.) is it really your considered opinion that i don't know this? that i haven't factored it into my thinking _and_ my tools? maybe you're grandstanding to the lurkers, but my goodness, jon, do you really think that _they_ are that stupid too? > Also, apostrophes are sometimes confused with single right quote marks. ditto. > With a smart enough grammar and parser, > the above might be properly parsed and the blah blah blah. use my tool. if it doesn't figure out your stuff, let me know. > But still, real-world texts tend to throw a lot of curve balls > that are sometimes hard to correctly machine process. i know how to hit 87 different pitches, from both sides of the plate, and you're telling me to "watch out for the curve balls". i laugh at you. > OCR is quite fast. It's making and cleaning up the scans > which is the human and CPU intensive part. wait! i thought you said _proofreading_ and _mark-up_ were the steps that take up the most time. didn't you? or do i have you confused with someone else? > Well, not all of the pages have been doubly proofed. > The team is not finished, and I plan to post a plea > somewhere for more eyeballs to go over it. have you heard about distributed proofreaders? might be able to find some people there... (ok, now you see what it feels like.) > I would like to receive error reports as well for this text, i'll tell you the same thing i told michael about project gutenberg: set up a system for the checking, reporting, correction, and logging of errors, a system that is transparent to the general public, and i will be more than happy to report errors to you, and help you out. otherwise, you waste my time, as i figure someone else can do it. which, by the way, is what everyone else is thinking. which is why errors in the texts are not being reported at nearly the frequency that they should be being reported. but i've got another message sitting here waiting to be sent where i discuss that topic in more detail, so i'll stop here now. > since Brewster wants highly proofed texts > for some experiments he plans to run similar to yours. i'll have to ask him about his tests. > But if I have to use the version you donate to PG, so be it. :^) probably, yep. if michael wants it. they say he'll take just about anything... > I did find one error in my text based on the list you gave. Thanks. you're welcome. 
but that's not the one i was talking about. :+) > I assume you discovered > the several different paragraph breaks in the PG edition? nope. i didn't even evoke the routines to examine paragraph-breaks. i considered doing so, once you said that there were differences, but decided it was just too inconsequential to even bother with it. it's another one of those things i would very much like to see a case where it made a difference, because i'd love to believe it _could_, but in the absence of a case (or even an _imaginary_ possibility, which i confess i can't come up with, not off the top of my head), i am forced to relegate it to the "too trivial to think about" pile. as above, i'll make the corrections, but i ain't gonna sweat 'em... -bowerbird From Bowerbird at aol.com Fri Mar 4 17:16:08 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Mar 4 17:16:22 2005 Subject: [gutvol-d] new thread for noring Message-ID: <13e.e482be3.2f5a6258@aol.com> david said: > Those are called "half-titles," btw. oh cool, thanks. i figured they had a name, just didn't know it. -bowerbird From sly at victoria.tc.ca Fri Mar 4 23:04:12 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri Mar 4 23:04:32 2005 Subject: [gutvol-d] new thread for noring In-Reply-To: <13e.e482be3.2f5a6258@aol.com> References: <13e.e482be3.2f5a6258@aol.com> Message-ID: On Fri, 4 Mar 2005 Bowerbird@aol.com wrote: > david said: > > Those are called "half-titles," btw. > > oh cool, thanks. i figured they had a name, just didn't know it. > One site that I bookmarked describing the "Anatomy of a Book" can be found here: http://www.bibliophilegroup.com/biblio/other/school/anatomy.html Andrew From Bowerbird at aol.com Sat Mar 5 05:12:02 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sat Mar 5 05:12:08 2005 Subject: [gutvol-d] re: Anatomy of a Book Message-ID: <12b.581be762.2f5b0a22@aol.com> andrew said: > http://www.bibliophilegroup.com/biblio/other/school/anatomy.html what a marvelous page! funny and informative at the same time! a winner! -bowerbird From jon at noring.name Sat Mar 5 10:35:52 2005 From: jon at noring.name (Jon Noring) Date: Sat Mar 5 10:36:07 2005 Subject: [gutvol-d] re: march forth In-Reply-To: <8.638ae14b.2f5a61e6@aol.com> References: <8.638ae14b.2f5a61e6@aol.com> Message-ID: <15697825812.20050305113552@noring.name> Bowerbird wrote: > jon said: >> Some of this can be automated. In other cases it requires a human >> being to make a final decision. I followed the UNL Cather Edition >> here. > it's always easier to let other people make the decision, isn't it? ;+) It's always *smarter* to leverage the experience and knowledge of others. The idea behind the "Trusted Edition" concept is to mobilize the help of both professional scholars and amateur enthusiasts, using community- oriented tools and processes, to assist with understanding the specific and unique bibliographic details of any particular Work. (Interestingly, when it comes to the more obscure public domain Works, which is the vast majority of them, they were only published once in one printing, so with respect to figuring out which edition is "acceptable" or "authoritative", it is pretty cut-and-dried. It's the famous classics, especially the much older ones which are written in some archaic fashion or in another language, where it can get quite complicated as to what is/are the acceptable editions to use as source(s). 
Neverthess, for the classics most of this has already been hashed out, and where there is no agreement between any two, do *both* of them!) >> Whether or not it is an "unnecessary" distraction, it is better to >> preserve the original text in the master etext version. > well see, jon, that's where i differ with you. and other people do too. And there are people who also agree, at least in general, with my position. I'm not alone on this. You make it out to be like I'm alone on this, like a "John the Baptist" in the desert. > but like i said, as long as it's just one global change away, no big deal. The problem is that sometimes global changes are easy to do in one direction, and much harder to do the other. When information is removed, such as converting accented characters to 7-bit ASCII with no traceback information, it is harder to go in the other direction because information has been lost. > i see lots of other cases, as well, where you diverge from the paper. > a good many of the quotation-marks are set apart from their words. > you're making editorial decisions whether you acknowledge it or not. There's not lots, but a few. The focus is to produce a *textually* accurate rendition which is presentationally-agnostic wherever possible. We took to heart a lot of the information provided by the UNL Cather Edition online information because that is the smart thing to do. We *are* in contact with a couple scholars of Willa Cather's works besides the UNL folks. To ignore expert advice is, to put it bluntly, stupid. And we are putting together a preliminary list of the top 500/1000 classic public domain works, and should the project launch, we plan to get these rigorously converted along the lines of "My Antonia", and to mobilize the help of the professional *and amateur* enthusiasts to help guide the process. >> My thinking is that if someone wants to produce a derivative >> "modern reader" edition of "My Antonia", they are welcome to do so >> and add it to the collection because the original faithful >> rendition is *already* there. > whose "collection" are we talking about here jon? > > yours? The collection (of one so far, and it is essentially a working demo for learning purposes) does not state to be "Jon Noring's" collection. Go to http://www.openreader.org/myantonia/ and tell me what it says there, and if it prominently mentions my name. There's another name given to it. Just because I'm the most visible person with regards to it here, does not mean it is mine. It is not. It is part of a fairly visible project mobilizing a group of people (but not visible on the particular forums you frequent, and not by the specific name, which doesn't matter.) Should this project go into production mode, what is produced will belong to the world. It's not going to be elitist or exclusive as some other etext projects are (I'm not talking about PG obviously) -- all work product will be made publicly available, as it should be since it is from the Public Domain. Anyway, what is this strange obsession with ownership and competition? Why do you keep talking about PG being Michael Hart's (more on this below)? > do you have any intention of adding more "my antonia" editions? > specifically a "derivative modern reader"? if so, i will submit mine. Sure. So long as the changes from the original acceptable source are sufficiently noted in the text file, such as an "Editor's Introduction", some boilerplate, or whatever you want to add. 
I'm not sure if you have an interest in taking the time to provide such editorial information, but we'll be happy to take your edited version and mark it the "Bowerbird Modernized Edition" or whatever. I am thinking of providing my own modernized edition as well (which will have very few changes in the case of "My Antonia".) ("Sufficiently noted" does not mean to spell in gory detail each and every change, but enough info so the reader will have a good general idea of how it was "modernized". Readers will appreciate the thoroughness expended to modernize a text for them, and will have warm fuzzies that it is "accurate" when the editor *takes the time* to explain what they did. This builds *trust* with the reader.) > but surely you don't mean michael hart's project gutenberg collection? So? What do you care? Is there a law saying any digital text version of a public domain work *must* be submitted to PG? Does PG have a government monopoly on the Public Domain? Of course not. And about this strange fixation you have on "ownership", PG is no longer "Michael Hart's". You seem to fail to understand that PG now belongs to the hundreds/thousands who have materially contributed in building it. (DP has greatly increased the ownership of PG several fold by its cool way at mobilizing thousands of volunteers.) Michael Hart is the pioneer and founder of the PG idea, but PG has gone well beyond him. He can die tomorrow (hopefully not!), and what he has started will continue unabated. If it were still his, it may die with him. [An outside example is the World Wide Web -- does it still "belong" to Tim Berners-Lee because he invented the general idea and some of the early standards and tools for it? If Tim Berners-Lee dies tomorrow, will the plug be pulled on the Web?] When you produce this magical "toolset" of yours and give it away to others to use (or do you plan to sell it?), it will no longer be "yours". So, should you die tomorrow (hopefully not!), will there be a community of people who will take all your ideas and code and continue on where you are now? Or will it die with you? So much for the benefits of ownership and control. This is why just about everything we've done for "My Antonia" is *already* online and downloadable, even though it is still an early beta/demo shake out things. There is more to put up. The Bible mentions "casting one's bread upon the waters, and it will be returned to you." The complementary logic of this is that those who develop their tools in secret, who don't strive to build partnerships with other like-minded folk, who are not transparent, etc., etc., are not casting their bread upon the water, and thus may not find the kinds of rewards they seek. Interestingly, Michael Hart cast his bread upon the water, and it has returned more than a hundred-fold. Of all the great contributions Michael Hart has made, it is to inspire a volunteer movement. I do have problems with how the earlier PG collection has been assembled (which DP has mostly, but not completely, resolved), but I recognize that Michael Hart has accomplished a lot *because he cast his bread upon the waters*. He not did do his thing in secret, and he welcomed volunteers from the beginning. DP is a result of his vision, of his casting his bread upon the waters. Even his PGII concept (which I think is ill-conceived for various reasons not germane to this particular discussion) is an attempt to expand the PG collection by embracing other collections into one big happy tent. 
And he talks about giving away trillions and trillions of etexts for free. I like this attitude. He is giving away, not taking. He is open and transparent -- he does not keep everything secret. If he were developing software, he would immediately open source it and ask for others to help write it. It will be free for all from the start. He does not keep his light hidden underneath a blanket. So if Michael Hart is your hero, then consider emulating his example. I think you catch the drift. That's why I keep asking when you plan to start a SourceForge or similar open source project to develop your system. > because, according to you anyway, he doesn't have a "faithful" > rendition in his library, not even one, not *already* anyway. just a > mangled one. With respect to PG's current "My Antonia". Yes, it is mangled. More importantly, it is not trustworthy, which goes beyond just errors or differences. I discussed this on TeBC, which I know you've read (either from an anonymous account or a friend who forwards messages. I don't really care.) And of course, in my discussion of the whole PG corpus, I carefully differentiate between the DP and the non-DP portions of it -- I've done this from the beginning. How convenient you ignore this important fact. > another difference between your collection and michael's is > you have 1 book in your collection and he has 10-15 thousand > in his collection, depending on who is in charge of defining > how the official counting is tabulated these days, it appears. > whether you like it or not, that's a comment on the philosophies. Hmmm. Sounds a lot like a school yard taunt: "Let's compare yours and mine and we'll see whose is bigger -- drop your pants..." So what? How many etexts did Michael have in "his" collection in 1991? Every journey starts with the first step. And why do you say "my collection" (in reference to the LibraryCity "My Antonia" project)? Why this obsession with possession and ownership: "My tool", "My idea", "My whatever"? And why do you view everything in a competitive color, rather than complementary and collaborative? In these days of open source development, collaborative efforts, etc., your approach to do everything in secret is really odd and out-of-synch. Why don't you cast your bread upon the waters and see what happens? Or are you afraid your bread won't return to you multiplied? >> indicating this was more of a typesetter's convention rather than >> something Cather specified. > well that's a convenient dodge, isn't it? No. > i know i can't keep up even with the editions of this one book! > so how would a person possibly keep up on tens of thousands! The idea of "Trusted Editions" as an archetype is that it won't rely on any one person. It is part of a bigger picture of building communities around noted etexts. To mobilize people. To not only bring digital texts to people (as PG has been doing), but to also bring people and community to digital texts (which PG is NOT doing now.) But so far I don't see much interest in your "calculus" to understand the important role people play in etexts, from creation to final use. And that the most viable contributions to Mankind come when people are mobilized in a cooperative/community way (either in a non-profit open source approach, or in a private for-profit approach using employees and contractors.) Technology is to provide tools to make a community of people work better together for a common end-goal, not to replace community. 
And the word "trust" is an important core human concept -- society works only when there is sufficient trust between people, and trust in the various products of their labors. So any human endeavor which does not put "trust" as #1 is prone to eventually fail. >> Cather wanted the line length to be fairly short, so this puts >> extra pressure on typesetters who will either have to extend >> character spacing for a particular line or scrunch it up more than >> usual, depending upon the situation with the rest of the >> typesetting on the page, and whether certain words can be >> hyphenated or not. > oh!, hold it!, wait!, did i just hear you say what you just said? > i think i did! yes, i'm quite sure i did! > > "cather wanted the line length to be fairly short". > > wow. you mean author-intent can go to _the_length_of_lines_? > > do you realize how significant that is to your philosophy, jon? *rolls eyes* > it means you will need to respect willa's wishes on the matter. > > none of the long lines you might get in a web-browser! no sir! > > willa wanted short lines! (is that why the book looks so narrow?) You really need to read less selectively. I've used the phrase "textually faithful" many times the last couple weeks for a reason. The reason? Because it is important that texts transcend the visual as much as possible, to become agnostic with respect to presentation type, yet contain sufficient structure and semantics so quite authentic visual presentation is possible. This is necessary not only for accessibility, but repurposeability and usability. (And this helps Michael Hart's long-term vision in universal language translations of digital texts.) With the right style sheet, most of Cather's stated preferences are possible to duplicate. There's a reason why the texts are marked up in XML. With one tiny change in the CSS for our "My Antonia" demo, we can duplicate quite well Willa Cather's apparent preference in visual presentation of her book. Interestingly, the UNL Cather Edition (the print version published by UNL's publishing house) uses longer line lengths and smaller print than Cather specified. They did not deem the exact visual presentation of the content to be as important as much as the textual faithfulness, even though they discuss it on their web site. >> Accented characters are *always* important to preserve under all >> situations. > according to you, maybe. according to me, it depends. > in this case, i say no. that's my prerogative as an editor. > (and i _do_ consider myself an editor, not just a copyist.) Sure, you can call yourself an editor, and do what editors do. But to throw away the richness of the expanded Western character set which many, many public domain books use -- is simply bizarre. This richness is what adds to the aesthetics of the text, and builds a better reading experience. It also *adds* trust because people will see the care you took in doing this -- in sweating out the details. >> There's no need anymore, in these days of Unicode and the like to >> stick with 7-bit ASCII. > until unicode works flawlessly on every machine used > by all the people i know, for texts like this that have > only the occasional character outside the lower-127, > where the meaning isn't changed, i'll stick to plain ascii. I believe this is a copout. You can convert most of the western-based Unicode characters to ISO-8859 (the "8-bit ASCII") if you want, and to other encoding schemes, so you have even more encoding options to handle just about everything everyone uses. 
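Converting down the encoding ladder is indeed easy; what is hard is going back up. Folding accented characters to plain ASCII is trivial, but it cannot be undone later unless a record of what was dropped is kept alongside the text -- which is the point about traceback information. Here is a minimal sketch using Python's standard unicodedata module; the function names and the traceback format are illustrative assumptions, not anyone's actual workflow.

    import unicodedata

    def fold_to_ascii(text):
        """Fold accents away ('Antonia' with an acute A becomes plain 'Antonia'),
        returning the folded text plus a traceback list of
        (position, original character). Without that list, 'naive' alone
        no longer tells you the page printed the dotted 'i' with a diaeresis."""
        out, trace, pos = [], [], 0
        for ch in text:
            plain = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode()
            if not plain:          # no ASCII equivalent at all: keep the original
                plain = ch
            if plain != ch:
                trace.append((pos, ch))
            out.append(plain)
            pos += len(plain)
        return "".join(out), trace

    def restore(folded, trace):
        """Undo the fold using the traceback list (applied back-to-front so
        earlier replacements cannot shift later positions)."""
        chars = list(folded)
        for pos, original in reversed(trace):
            width = len(unicodedata.normalize("NFKD", original)
                        .encode("ascii", "ignore").decode()) or 1
            chars[pos:pos + width] = original
        return "".join(chars)

    # fold_to_ascii("Ántonia") -> ("Antonia", [(0, "Á")]); restore() gets the
    # accented form back only because the traceback was kept.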
Today's web browsers handle Unicode very well. And since you are building your own ebook viewer, you can implement Unicode in it quite trivially (at least be able to handle, to start out with, the Latin-1, Latin-Extended and Greek character sets.) The problem with throwing away the higher-characters is that, contrary to what you say, it is not easy to reinsert them as they appeared in the original, unless you re-OCR the texts and the OCR accurately finds them. I can tell you that OCR, even Abbyy, still has some problems with accented characters, especially those which use very subtle accent marks that can easily be mistaken for serifs. As an example, I'm curious to know if Abbyy 7 will correctly recognize *all* the accented characters in the current "My Antonia" scans -- I listed them in my prior message. If you want, I will be happy to go through and list the actual page numbers they are found on. For example, the umlauted "i" in "naïve" and "naïvety" -- this is a particularly difficult character to recognize (it is often incorrectly recognized as a capital 'Y'), and it is often (as are most accented characters) used in words which will not be found in some lookup dictionary. >> I sense that you don't want to properly deal with accented >> characters > first of all, jon, i define what "properly" means for me, you don't. > you can define it for yourself. but i won't let you define it for me. A lot of people consider accented characters important to preserve. Since, as you say, it is easy to translate from accented characters to non-accented characters (but not vice-versa), then you can meet more people's needs (including those odd few who prefer *not* to read accented characters) by recognizing and preserving these characters. I'd like feedback from the DP folk as to their policy regarding reproducing the non-ASCII characters (Latin 1, Latin Extended, Greek, etc.) It would not surprise me if DP, as a matter of policy, reproduces them. >> I sense that you don't want to properly deal with accented >> characters since this poses extra problems with OCRing and >> proofing, > nope. it's just that i see them as _unnecessary_ to this book. > if a reader thinks it _is_ necessary, make the global-change. How? Unless you somehow record that information on accented characters in some master document, you can't go in the other direction. You are assuming all the words using accented characters are found in some dictionary, which is not true. >> something you are trying to avoid in your zeal to get everything >> to automagically work. To me, that's going too far in simplifying. > i'm not "simplifying". i'm consciously making a choice to use > something that will work on the broad range of machines out there, > as opposed to something that -- in far too many cases -- fails badly. Yes, but this is the fundamental flaw. You appear to be taking short-cuts to try to prove that people don't matter in the process to produce high-quality etexts that are repurposeable and trustworthy. Certainly it is much preferred to have better and more accurate tools, and hopefully the tools you are producing will make life easier for many *people* involved in creating structured digital texts of public domain works. > it's the same pragmatic decision that michael made when he crafted > the philosophy guiding the building of this library of 10,000+ e-texts, > in sharp contrast to your philosophy, which has built a 1-book library. As previously discussed, did Michael immediately go from 1 text to 10,000 etexts in two weeks? 
And did this growth occur solely by his own sweat of the brow? And note that almost half of the PG collection is done mostly right because Distributed Proofreaders *does* follow "my philosophy" fairly closely (or maybe better put I follow their philosophy fairly closely.) There's another purpose behind the "Trusted Editions" project. It is not intended to be a competitor to PG or other text projects, but to further benefit the various users of public domain texts. More options are better than fewer options. >> Preserving accented characters are important. > in some cases, i'd agree with you. in others, not. in this case, not. Can you explain how you decide when accented characters are to be reproduced? Or is this impossible to explain using an unambiguous, objective rule? (And will your toolset handle the full Western portion of the Unicode set? If so, then why not process *all* texts using the full character set? Why the need to reduce some of them to irreversible 7-bit ASCII?) > as it is, though, i just have to resign myself to the position that > o.c.r punctuation errors are a distraction, but make no difference. > i'll still root them out, due to my sense of professionalism, but > i sure wish it felt _fun_, instead of feeling like _doing_chores_. > and to the extent that i can automate the chores, i'll be _happy_. What's interesting is that there are lots of people who *enjoy* doing this. That's what makes DP so successful, because it brings together people with different interests. Does DP do what it does the best possible way at this time? Of course not. Is DP as good as it could ever be? Of course not. Charles himself noted that to me last year. DP is still a "beta" in progress, or maybe a version 1.0. But DP recognizes that mobilizing people is a critical requirement of success. Juliet could talk for hours about how important the people side of producing etexts really is. And note that there are millions of texts that *cannot* be handled by your toolset, such as handwritten records, horribly tabulated data with poor and ambiguous structure, etc. These texts are held by historical and genealogical societies, local governments, etc., etc. DP, or a DP-like process, properly cloned, is the best way to convert these texts to useful structured digital texts. Not only that, these local groups have a lot of enthusiastic supporters who will volunteer to scan and proof these texts. It will be done by people power, enabled by technology, and not solely by machine power -- unless, of course, someone soon invents truly sentient AI machines with real human intelligence, personalities and even emotions. >> They are hopefully caught by human proofers/readers when grammar >> checkers don't (I do use Word to help find both spelling and >> punctuation errors -- when they find something, I then manually >> check it in the page scans and the master XML.) > oh, so you _do_ use an assist from your tools at times. that's good. Of course! I use tools when I can, but I don't blindly use them. Do you think I use 3x5 cards for everything I do? >> They are "sometimes" easy to spot. Other times the automatic >> routines will not catch errors > maybe the automatic routines you are using are just inferior. *Shrug* After all, I put together "My Antonia" for the project by kludging together sub-optimum tools, hardware and processes (e.g., not having a high-quality sheet feed scanner). 
"My Antonia" is simply a pre-beta to test out several (but not all) of the important concepts, to shake down various things for the next stage effort. It is showing us the kind of tools and applications we will need to go into production (this includes the high-quality scanning and image preparation processes.) The discussion, both here, and on TeBC, both critical and supportive, both public and private, has been extremely useful at helping us to better understand various things. This feedback has shown things we've done wrong, things that could be improved, and different ways of looking at the various issues. So your asumption that we've finalized the "formula" and the "process" is incorrect. We feel comfortable in "casting our bread upon the waters", so we can inspire many people, supporters and critics, to provide valuable feedback. We obviously inspired you to reply -- your feedback has been very valuable. >> This is also true, but as found in "My Antonia", there are >> exceptions to pure nesting, such as when a quotation spills over >> into several paragraphs where the intermediate paragraphs are not >> terminated by an end quotation mark (whether single or double.) > is it really your considered opinion that i don't know this? > that i haven't factored it into my thinking _and_ my tools? > > maybe you're grandstanding to the lurkers, but my goodness, > jon, do you really think that _they_ are that stupid too? You seem to have blind faith that you will be able to sufficiently cover most every important "exception" found in most texts, and I don't believe it is yet possible. If you do, that'll be wonderful. But your apparent dismissal of the importance of universal handling of extended character sets is alone a show-stopper, in my opinion. Now if you do plan to soon universally support the Unicode character set (or at least the European subset of it), then I believe it will greatly make your toolset much more valuable. >> Well, not all of the pages have been doubly proofed. The team is >> not finished, and I plan to post a plea somewhere for more eyeballs >> to go over it. > have you heard about distributed proofreaders? > might be able to find some people there... I should have written "to post a plea to a few places", because yes, I plan to post a message to the DP forums about "My Antonia". But I want to do some more preliminary assessments before approaching them. Anyway, I've already posted here for some help, and have done some back channel chatting, so a few DPers already know about "My Antonia". :^) >> I would like to receive error reports as well for this text, > i'll tell you the same thing i told michael about project gutenberg: > set up a system for the checking, reporting, correction, and logging > of errors, a system that is transparent to the general public, and > i will be more than happy to report errors to you, and help you out. Now, I agree with you on this. Part of the community aspect of the bigger vision is a system for follow-on proofing. But we also, for the short-term, want to improve the "My Antonia" text the old-fashioned way of manual error report submissions. Properly designing the error feedback and updating system has to be integrated with the other community aspects of the digital texts since these are inextricably linked -- in addition, the "manual" process helps in better understanding the community-based system. > which, by the way, is what everyone else is thinking. 
> which is why errors in the texts are not being reported > at nearly the frequency that they should be being reported. > but i've got another message sitting here waiting to be sent > where i discuss that topic in more detail, so i'll stop here now. I agree with you on this. And the error reporting system is an important aspect of building user trust in any etext collection. >> since Brewster wants highly proofed texts for some experiments he >> plans to run similar to yours. > i'll have to ask him about his tests. brewster@archive.org Not sure what his current status is on this. >> I did find one error in my text based on the list you gave. Thanks. > you're welcome. but that's not the one i was talking about. :+) *shrug*. It will be found, unless it's something that you believe is an error in how we transcribed the original first edition, and we do not consider it to be an error. You alluded to that in your prior message (such as mentioning the small space that precedes a few question marks -- inspection of a large number of pages where question marks appear strongly supports my contention that this is a typesetting issue and not anything specified by Willa. Anyway, the original communications by Cather on her many preferences for "My Antonia" *exist* and scholars have pored over them with a fine-toothed comb. The UNL Cather Edition does not place any spaces before any question marks, nor do they place a space anywhere before an apostrophe s used in contractions.) However, the " 's" contraction issue is one I'm going to look at again today. One of my proofers noted this to me the other day, so with her feedback and yours, it will be looked at again. See, the system, primitive as it is at present, *is* working (even if it is currently a manual, short-term hack.) Jon From hart at pglaf.org Sat Mar 5 11:17:58 2005 From: hart at pglaf.org (Michael Hart) Date: Sat Mar 5 11:17:59 2005 Subject: [gutvol-d] Please test www-dev.gutenberg.org In-Reply-To: <20050304222909.GA32543@pglaf.org> References: <42261F44.1000005@perathoner.de> <42274F2E.8010000@corruptedtruth.com> <4228C097.8030803@perathoner.de> <20050304222909.GA32543@pglaf.org> Message-ID: We resisted the temptation to divide the Bible and Shakespeare into various sections when others were claiming each of AEsop's Fables as an individual eBook to pad their bibliographies. However, when people started requesting individual Shakespeare plays and books of the Bible for research purposes, we did as they asked, which we nearly always try to do for our readers. I'm sure some people would also try to prevent paper publishers and libraries from publishing individual Shakespeare plays or books of the Bible. BTW, I think we put all the shortest books in one file, at least that was my intention. However, when someone donates a Shakespeare or Bible in their own particular favorite format and breakdown, that's totally up to them, and I'm not about to fight with them about it. . . . If someone wants a verse by verse eBible, I think we should zip it all in one huge file, but still let it unzip in the manner they prefer. mh On Fri, 4 Mar 2005, Greg Newby wrote: > On Fri, Mar 04, 2005 at 09:09:59PM +0100, Marcello Perathoner wrote: >> Brandon Galbraith wrote: >> >>>> I suppose while these updates are going on, we should also update >>>> 13,000 to 15,000 in the opening: >>> >>> It's too bad we can't make that dynamic, feeding off of a database =) >> >> Not worth the trouble ... First, we had to agree on what counts as an >> ebook in its own right. >> >> Eg.
we have a Bible in the collection, where every chapter got its own >> ebook number. Also, many books are posted in parts, and every part got >> its own number besides the complete book. >> >> To get a meaningful count of ebooks we first had to get rid of such >> shameless stuffings. > > That's an unwarranted poke, Marcello. > > We do have a count, and it's eBook #s as used as the primary > access point to our files. > > Agreeing on what counts as an eBook is not necessary. We know > how many eBook #s we have, even if there is disagreement on > what counts as an eBook. There are plenty of words (in > GUTINDEX.ALL and elsewhere) to augment this simplistic number. > -- Greg > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From servalan at ar.com.au Sat Mar 5 15:19:36 2005 From: servalan at ar.com.au (Pauline) Date: Sat Mar 5 15:20:07 2005 Subject: [gutvol-d] Database down? Message-ID: <422A3E88.5010501@ar.com.au> Hiya All, Did I miss an outage notice? The PG server appears to be having hassles: I keep seeing "Could not connect to database server." when I try to access etexts. Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From Bowerbird at aol.com Sat Mar 5 16:58:50 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sat Mar 5 16:59:09 2005 Subject: [gutvol-d] have a nice day Message-ID: <195.3a482828.2f5bafca@aol.com> i believe there's no interest in these threads, other than from jon and myself, so i will reply to him backchannel and be done with it. if anyone else does want a copy, let me know, and i'll send it to you as well. thank you. the proof, as always, is in the pudding. other than that, have a nice day... :+) -bowerbird From prosfilaes at gmail.com Sat Mar 5 19:49:57 2005 From: prosfilaes at gmail.com (David Starner) Date: Sat Mar 5 19:50:13 2005 Subject: [gutvol-d] re: march forth In-Reply-To: <15697825812.20050305113552@noring.name> References: <8.638ae14b.2f5a61e6@aol.com> <15697825812.20050305113552@noring.name> Message-ID: <6d99d1fd05030519495f5f2e91@mail.gmail.com> On Sat, 5 Mar 2005 11:35:52 -0700, Jon Noring wrote: > The collection (of one so far, and it is essentially a working demo > for learning purposes) does not state to be "Jon Noring's" collection. > Go to http://www.openreader.org/myantonia/ and tell me what it says > there, and if it prominently mentions my name. The comment about DjVu and IE6 seems out of place; there's plugins for Netscape there too. It seems like an interesting project. I'm not sure I have the time or ability to help, but I willing to make the offer. > Readers will appreciate the > thoroughness expended to modernize a text for them, and will have warm > fuzzies that it is "accurate" when the editor *takes the time* to > explain what they did. This builds *trust* with the reader.) I got into a bit of a flame war on bookpeople by suggesting that a translation might stand a few words on why. > So? What do you care? Is there a law saying any digital text version > of a public domain work *must* be submitted to PG? Does PG have a > government monopoly on the Public Domain? Of course not. I've cared because a central library makes it easier to find a work, instead of having to search in several places. 
Also, Project Gutenberg has a long history, indicating it will be around tomorrow and the day after that, and it's decentralized, meaning that if it's not, everything won't just disappear. > And the word "trust" is an important core human concept -- society > works only when there is sufficient trust between people, and trust > in the various products of their labors. So any human endeavor which > does not put "trust" as #1 is prone to eventually fail. I don't agree. PG has not put "trust" as an explicit concept, but people being as they are, they trust that the PG works are done competently. When I gave my sister a copy of "A Doll's House", I didn't check editions and quality of translation; I just bought a random copy. You want works to be verifiable, but most people just don't worry about that; they "trust" others to do a good job. > I'd like feedback from the DP folk as to their policy regarding > reproducing the non-ASCII characters (Latin 1, Latin Extended, Greek, > etc.) It would not surprise me if DP, as a matter of policy, > reproduces them. We mangle the Greek via transliteration still, but we always get Latin-1 right, and we more or less get Latin Extended correct. (OE is usually broken, but accents are recorded, and I assume most PMs are aware enough to catch the weird characters.) Hebrew, Arabic and friends are usually, hopefully, handled by the PPer. > > nope. it's just that i see them as _unnecessary_ to this book. > > if a reader thinks it _is_ necessary, make the global-change. Why judge that on a book-by-book basis? In fact, you can't, since your programs don't tend to support "accented" characters in any texts. Certainly, the majority of pre-1850 works have at least one Greek quote that ASCII will horribly and irrevocably mangle. French quotes aren't exactly uncommon in our era of books, either. From marcello at perathoner.de Sun Mar 6 10:57:41 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun Mar 6 11:19:06 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422A3E88.5010501@ar.com.au> References: <422A3E88.5010501@ar.com.au> Message-ID: <422B52A5.3020600@perathoner.de> Pauline wrote: > I keep seeing "Could not connect to database server." when I try to > access etexts. The PG site is just too popular. The database cannot serve more than ~30 requests at one time. Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. -- Marcello Perathoner webmaster@gutenberg.org From donovan at abs.net Sun Mar 6 12:28:42 2005 From: donovan at abs.net (D Garcia) Date: Sun Mar 6 12:30:08 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B52A5.3020600@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <200503061528.43221.donovan@abs.net> On Sunday 06 March 2005 01:57 pm, Marcello Perathoner wrote: > The PG site is just too popular. The database cannot serve more than ~30 > requests at one time. I don't think there's any such thing as PG being too popular. :) But it does sound as if the DB is too anemic for the current (and future) popularity. From servalan at ar.com.au Sun Mar 6 13:27:32 2005 From: servalan at ar.com.au (Pauline) Date: Sun Mar 6 13:28:12 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B52A5.3020600@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <422B75C4.5070705@ar.com.au> Marcello Perathoner wrote: > Pauline wrote: > >> I keep seeing "Could not connect to database server." when I try to >> access etexts.
> > > The PG site is just too popular. The database cannot serve more than ~30 > requests at one time. > > Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. I've been recommending PG texts & now have a bunch of replies saying PG is unusable due to this problem. :( If the server cannot be configured to cope with increased load, please at least consider changing the error message to something more useful for the user. e.g. "Project Gutenberg is too busy at the moment to handle your request, please try again later. Current slack times are 18.00->8.00 EST." Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From tb at baechler.net Sun Mar 6 19:43:02 2005 From: tb at baechler.net (Tony Baechler) Date: Sun Mar 6 19:41:35 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B75C4.5070705@ar.com.au> References: <422B52A5.3020600@perathoner.de> <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <5.2.0.9.0.20050306194202.037e4130@baechler.net> Hi. A slight workaround for this is to refresh the page. In Internet Explorer, Control + F5 does the trick. I got that same error and it refreshed fine. From gbnewby at pglaf.org Sun Mar 6 20:17:44 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Mar 6 20:17:45 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422B52A5.3020600@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> Message-ID: <20050307041744.GA10764@pglaf.org> On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > Pauline wrote: > > >I keep seeing "Could not connect to database server." when I try to > >access etexts. > > The PG site is just too popular. The database cannot serve more than ~30 > requests at one time. > > Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. Marcello, can you tell me what it would take to grow our capacity to handle hits? I know you're also looking at Web site mirrors (I can supply some sites for this, BTW). But if you could come up with some recommendations for what it would take for iBiblio to dramatically grow our capacity, I can try to put something together for them. 30 simultaneous requests to PostgreSQL does not seem like a whole lot, so I'm assuming that contention for resources with other hosted sites is the main problem. It would be nice to do better. I know that iBiblio claims network bandwidth is not an issue, but possibly we need to look at the whole system. Thanks for any ideas you (or others) can provide. -- Greg From brandon at corruptedtruth.com Sun Mar 6 20:20:07 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Sun Mar 6 20:20:23 2005 Subject: [gutvol-d] Database down? In-Reply-To: <20050307041744.GA10764@pglaf.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> Message-ID: <422BD677.203@corruptedtruth.com> Marcello, Could connection pooling fix this? Maybe combined with more concurrent connections to the database server? I'm not sure how big the database box is though. -brandon Greg Newby wrote: >On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > > >>Pauline wrote: >> >> >> >>>I keep seeing "Could not connect to database server." when I try to >>>access etexts. >>> >>> >>The PG site is just too popular. 
The database cannot serve more than ~30 >>requests at one time. >> >>Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. >> >> > >Marcello, can you tell me what it would take to grow our capacity to >handle hits? I know you're also looking at Web site mirrors (I can >supply some sites for this, BTW). But if you could come up with some >recommendations for what it would take for iBiblio to dramatically grow >our capacity, I can try to put something together for them. > >30 simultaneous requests to PostgreSQL does not seem like a whole lot, >so I'm assuming that contention for resources with other hosted sites is >the main problem. It would be nice to do better. > >I know that iBiblio claims network bandwidth is not an issue, but >possibly we need to look at the whole system. > >Thanks for any ideas you (or others) can provide. > -- Greg >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050306/27426d04/attachment.html From jlinden at projectgutenberg.ca Sun Mar 6 21:41:51 2005 From: jlinden at projectgutenberg.ca (James Linden) Date: Sun Mar 6 21:40:51 2005 Subject: [gutvol-d] Database down? In-Reply-To: <20050307041744.GA10764@pglaf.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> Message-ID: <422BE99F.7080302@projectgutenberg.ca> Migrating to MySQL might help -- and it's easier to replicate/mirror on the fly. -- James Greg Newby wrote: > On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > >>Pauline wrote: >> >> >>>I keep seeing "Could not connect to database server." when I try to >>>access etexts. >> >>The PG site is just too popular. The database cannot serve more than ~30 >>requests at one time. >> >>Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. > > > Marcello, can you tell me what it would take to grow our capacity to > handle hits? I know you're also looking at Web site mirrors (I can > supply some sites for this, BTW). But if you could come up with some > recommendations for what it would take for iBiblio to dramatically grow > our capacity, I can try to put something together for them. > > 30 simultaneous requests to PostgreSQL does not seem like a whole lot, > so I'm assuming that contention for resources with other hosted sites is > the main problem. It would be nice to do better. > > I know that iBiblio claims network bandwidth is not an issue, but > possibly we need to look at the whole system. > > Thanks for any ideas you (or others) can provide. > -- Greg > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From hyphen at hyphenologist.co.uk Mon Mar 7 00:50:11 2005 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon Mar 7 00:50:48 2005 Subject: [gutvol-d] Database down? In-Reply-To: <200503061528.43221.donovan@abs.net> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <200503061528.43221.donovan@abs.net> Message-ID: <8a5o21l78j4o9f1sjos6qav3tejkbegvpd@4ax.com> On Sun, 6 Mar 2005 15:28:42 -0500, D Garcia wrote: | On Sunday 06 March 2005 01:57 pm, Marcello Perathoner wrote: | > The PG site is just too popular. The database cannot serve more than ~30 | > requests at one time. 
| I don't think there's any such thing as PG being too popular. :) Agreed. Is there no way of transferring requests which cannot be handled to a mirror site? Cheaper than a bigger server? -- Dave F From brandon at corruptedtruth.com Mon Mar 7 00:54:33 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Mon Mar 7 00:54:53 2005 Subject: [gutvol-d] Database down? In-Reply-To: <8a5o21l78j4o9f1sjos6qav3tejkbegvpd@4ax.com> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <200503061528.43221.donovan@abs.net> <8a5o21l78j4o9f1sjos6qav3tejkbegvpd@4ax.com> Message-ID: <422C16C9.8050000@corruptedtruth.com> Usually, it's cheaper and easier to use a bigger database server than to try to redirect the requests to another site. The only time you'd want to redirect to another site would be in the event the primary site was down. Disclaimer: I'm a sysadmin at a hosting company. -brandon Dave Fawthrop wrote: >On Sun, 6 Mar 2005 15:28:42 -0500, D Garcia wrote: > >| On Sunday 06 March 2005 01:57 pm, Marcello Perathoner wrote: >| > The PG site is just too popular. The database cannot serve more than ~30 >| > requests at one time. >| I don't think there's any such thing as PG being too popular. :) > >Agreed. >Is there no way of transferring requests which cannot be handled to a >mirror site? Cheaper than a bigger server? > > > From bruce at zuhause.org Mon Mar 7 07:21:01 2005 From: bruce at zuhause.org (Bruce Albrecht) Date: Mon Mar 7 07:21:06 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422BE99F.7080302@projectgutenberg.ca> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BE99F.7080302@projectgutenberg.ca> Message-ID: <16940.29021.164180.819016@celery.zuhause.org> There's also clustering available for Postgresql, which might be easier than migrating to MySQL. Either way, it would probably take more human resource time than throwing hardware at it (for example, a dual Opteron system with 4 GB RAM and four 250 GB SATA drives in RAID 10 for about $3300, which might be overkill). James Linden writes: > Migrating to MySQL might help -- and it's easier to replicate/mirror > on the fly. > > -- James > > Greg Newby wrote: > > On Sun, Mar 06, 2005 at 07:57:41PM +0100, Marcello Perathoner wrote: > > > >>Pauline wrote: > >> > >> > >>>I keep seeing "Could not connect to database server." when I try to > >>>access etexts. > >> > >>The PG site is just too popular. The database cannot serve more than ~30 > >>requests at one time. > >> > >>Try accessing the site in the "off" hours. Rush hours are 9 AM - 18 PM EST. > > > > > > Marcello, can you tell me what it would take to grow our capacity to > > handle hits? I know you're also looking at Web site mirrors (I can > > supply some sites for this, BTW). But if you could come up with some > > recommendations for what it would take for iBiblio to dramatically grow > > our capacity, I can try to put something together for them. > > > > 30 simultaneous requests to PostgreSQL does not seem like a whole lot, > > so I'm assuming that contention for resources with other hosted sites is > > the main problem. It would be nice to do better. > > > > I know that iBiblio claims network bandwidth is not an issue, but > > possibly we need to look at the whole system. > > > > Thanks for any ideas you (or others) can provide. From marcello at perathoner.de Mon Mar 7 09:28:02 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:07 2005 Subject: [gutvol-d] Database down?
In-Reply-To: <422BE99F.7080302@projectgutenberg.ca> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BE99F.7080302@projectgutenberg.ca> Message-ID: <422C8F22.7070500@perathoner.de> James Linden wrote: > Migrating to MySQL might help -- and it's easier to replicate/mirror > on the fly. Yuck! MySQL is just a glorified file system with an SQL interface. They barely got transactions working. They still can't do referential integrity, views and triggers. And if you happen to need transactions, they only work with the InnoDB backend, which is slower than Postgres. Postgres replicates very well. Just where should we replicate to? -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Mon Mar 7 10:04:31 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:09 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422BD677.203@corruptedtruth.com> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BD677.203@corruptedtruth.com> Message-ID: <422C97AF.5050003@perathoner.de> Brandon Galbraith wrote: > Could connection pooling fix this? Maybe combined with more concurrent > connections to the database server? I'm not sure how big the database > box is though. I'm not sure why the limit is so low. Maybe the folks at ibiblio have a good reason for it. We have to see what increasing the number of concurrent connections does to query response time. The database box is a dual-processor IBM whatever. I can look up the specs if you want. But this box and his brother are serving all sites hosted at ibiblio. Many of those sites are build with CMS and thus very database intensive. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Mon Mar 7 10:13:58 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:15 2005 Subject: [gutvol-d] Database down? In-Reply-To: <20050307041744.GA10764@pglaf.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> Message-ID: <422C99E6.8090000@perathoner.de> Greg Newby wrote: > Marcello, can you tell me what it would take to grow our capacity to > handle hits? I know you're also looking at Web site mirrors (I can > supply some sites for this, BTW). But if you could come up with some > recommendations for what it would take for iBiblio to dramatically grow > our capacity, I can try to put something together for them. We have doubled our page hits over the last year. We are now serving nearly 200.000 pages a day. Just recently we became a top 5000 internet site. See Alexa stats starting at: http://www.alexa.com/data/details/traffic_details?range=3m&size=large&compare_sites=gutenberg.net,promo.net&y=t&url=gutenberg.org To handle the ever increasing load we could implement one of the following solutions: 1) An array of on-site squids at ibiblio. But ibiblio isn't adding squids for the vhosted sites. At least that's what I was told. 2) Make ibiblio throw more hardware at us (all hosted sites). This may not be possible with the limited budget. They recently got a faster file server. 3) One or more dedicated squids for PG co-located at ibiblio. (Make ibiblio pay for the bandwidth.) Somebody had to donate us a server. Needs fast disks, lots of ram, average cpu, linux, ssh. 4) Big time solution. A hierarchy of squids distributed around the world. 
We would have a squid hierarchy like this:

www.gutenberg.org (apache)
 + us1.cache.gutenberg.org (squid)
 + us2.cache.gutenberg.org (squid)
 + au.cache.gutenberg.org (squid)
 + eu.cache.gutenberg.org (squid)
 + de.cache.gutenberg.org (squid)
 + en.cache.gutenberg.org (squid)
 + fr.cache.gutenberg.org (squid)

To do that we need squid 2.5 with the rproxy patch. I'm still exploring that solution, but if anybody has any experience please chime in. We need service providers to donate us (or co-locate our) servers and donate the bandwidth. Also we need to explore the legal implications of offering PG services outside the US. The PG web site w/o file downloads averages 5 GB of traffic / day. (The file downloads are 100 GB / day, but we ain't going to thrash the squids with the files.) > 30 simultaneous requests to PostgreSQL does not seem like a whole lot, > so I'm assuming that contention for resources with other hosted sites is > the main problem. It would be nice to do better. I just asked ibiblio to double that. I'm not sure why the limit is so low. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Mon Mar 7 11:22:10 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 7 11:23:25 2005 Subject: [gutvol-d] Shakespeare's Birthday April 23rd Message-ID: <422CA9E2.3090409@perathoner.de> I got a request from a proof-reader to celebrate Shakespeare's birthday with a banner on the site. (The original request being to celebrate St. George's Day, but I don't think that one qualifies.) Any ideas? -- Marcello Perathoner webmaster@gutenberg.org From brandon at corruptedtruth.com Mon Mar 7 11:41:27 2005 From: brandon at corruptedtruth.com (Brandon Galbraith) Date: Mon Mar 7 11:41:39 2005 Subject: [gutvol-d] Database down? In-Reply-To: <422C97AF.5050003@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BD677.203@corruptedtruth.com> <422C97AF.5050003@perathoner.de> Message-ID: <422CAE67.30909@corruptedtruth.com> Marcello, Maybe it's time to talk about doing a master/slave replication configuration of postgres to handle the database load.
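(A rough illustration, assuming Python and psycopg2 on the web side, of the connection pooling Brandon asks about earlier in this thread: keep a few backend connections open and reuse them rather than opening a fresh one per page hit and running into the ~30-connection cap. The DSN, pool sizes, and table/column names are invented for the example; this is not the actual gutenberg.org code.)

import psycopg2
import psycopg2.pool

# A small shared pool: at most 10 backend connections no matter how many
# page requests arrive, comfortably under a ~30-connection server cap.
# The DSN below is a placeholder, not PG's real configuration.
POOL = psycopg2.pool.ThreadedConnectionPool(
    2, 10, "dbname=gutenberg user=web host=localhost")

def fetch_title(ebook_no):
    """Look up one title, borrowing and then returning a pooled connection."""
    conn = POOL.getconn()
    try:
        cur = conn.cursor()
        try:
            # Table and column names are invented for illustration.
            cur.execute("SELECT title FROM books WHERE ebook_no = %s", (ebook_no,))
            row = cur.fetchone()
            return row[0] if row else None
        finally:
            cur.close()
    finally:
        POOL.putconn(conn)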
If you want, > contact me off list and I'd be willing to help any way I can. First ibiblio will have to host another database server and dedicate that server to PG. A dedicated database server would probably solve our problem. But we'll have to talk the ibiblio people into doing that for us. It means money, more maintenance hassles and maybe problems from other sites hosted at ibiblio, who want a faster server too. OTOH a dedicated squid for PG would help too and be much cheaper. Replication to an external server will not help much, as the latency will be too big. Our current database server is: IBM Netfinity 6000R Quad Xeon PIII 700 MHz 4.5 GB RAM 108 GB storage iblinux But this one is shared with other sites hosted at ibiblio. See also: http://www.ibiblio.org/systems/hardware-details.html -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Mon Mar 7 12:03:10 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Mar 7 12:03:25 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <15b.4c4a3125.2f5e0d7e@aol.com> lest the main message be missed in all the minutiae... you can take an average p-book from scans to e-book in one evening. one evening. the people who want to convince you that it's difficult are _wrong_. the fastest and _easiest_ way to get a million p-books digitized is for one million people to convert one book in the next month or two. *** also, for the record, all of the global changes i made to "my antonia" are completely reversible, if you're smart enough to know what you're doing. -bowerbird From kth at srv.net Mon Mar 7 11:33:24 2005 From: kth at srv.net (Kevin Handy) Date: Mon Mar 7 12:07:55 2005 Subject: [gutvol-d] Database down? In-Reply-To: <16940.29021.164180.819016@celery.zuhause.org> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BE99F.7080302@projectgutenberg.ca> <16940.29021.164180.819016@celery.zuhause.org> Message-ID: <422CAC84.1050703@srv.net> Bruce Albrecht wrote: >There's also clustering available for Postgresql, which might be >easier than migrating to MySQL. Either way, it would probably take >more human resource time than throwing hardware at it (for example, a >dual Opteron system with 4 GB RAM 4 250 GB SATA drive in 10 RAID for >about $3300, which might be overkill). > > > what is the number of connections for postmaster (-N). You may just need to up this value. From joshua at hutchinson.net Mon Mar 7 12:13:24 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon Mar 7 12:13:32 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> ----- Original Message ----- From: Bowerbird@aol.com > > you can take an average p-book from scans to e-book in one evening. > one evening. HAHAHAHAHAHAHA *gasp* *wheeze* HAHAHAHAHAHAHA That's the most laughable thing I've read in a long time. If laughter helps us live longer, you just added 5 years to my life. Josh From gbnewby at pglaf.org Mon Mar 7 13:31:15 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Mon Mar 7 13:31:16 2005 Subject: [gutvol-d] Database down? 
In-Reply-To: <422CB1C9.1020809@perathoner.de> References: <422A3E88.5010501@ar.com.au> <422B52A5.3020600@perathoner.de> <20050307041744.GA10764@pglaf.org> <422BD677.203@corruptedtruth.com> <422C97AF.5050003@perathoner.de> <422CAE67.30909@corruptedtruth.com> <422CB1C9.1020809@perathoner.de> Message-ID: <20050307213115.GB4465@pglaf.org> On Mon, Mar 07, 2005 at 08:55:53PM +0100, Marcello Perathoner wrote: > Brandon Galbraith wrote: > > >Maybe it's time to talk about doing a master/slave replication > >configuration of postgres to handle the database load. If you want, > >contact me off list and I'd be willing to help any way I can. > > First ibiblio will have to host another database server and dedicate > that server to PG. A dedicated database server would probably solve our > problem. But we'll have to talk the ibiblio people into doing that for > us. It means money, more maintenance hassles and maybe problems from > other sites hosted at ibiblio, who want a faster server too. > > OTOH a dedicated squid for PG would help too and be much cheaper. > > Replication to an external server will not help much, as the latency > will be too big. > > Our current database server is: > > IBM Netfinity 6000R > Quad Xeon PIII 700 MHz > 4.5 GB RAM > 108 GB storage > iblinux > > But this one is shared with other sites hosted at ibiblio. Thanks, and also for the info about squids. I will pitch ibiblio on the idea of a new system - we'll see what the response is. A current top-end quad Xeon system from someplace like asaservers.com is ~$20,000, which is a little steep for PG to pay for. -- Greg > See also: > > http://www.ibiblio.org/systems/hardware-details.html > > > > -- > Marcello Perathoner > webmaster@gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From prosfilaes at gmail.com Mon Mar 7 16:11:05 2005 From: prosfilaes at gmail.com (David Starner) Date: Mon Mar 7 16:11:17 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <15b.4c4a3125.2f5e0d7e@aol.com> References: <15b.4c4a3125.2f5e0d7e@aol.com> Message-ID: <6d99d1fd05030716113ca7a9e@mail.gmail.com> On Mon, 7 Mar 2005 15:03:10 EST, Bowerbird@aol.com wrote: > also, for the record, all of the global changes i made to "my antonia" are > completely reversible, if you're smart enough to know what you're doing. If you're "smart enough", you could just retype the book from memory. Completely reversible, provided that you're already familiar with the work (which you have to be, else you wouldn't know that Antonia needs an accent), is a pretty lousy standard. From Bowerbird at aol.com Mon Mar 7 17:36:16 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Mar 7 17:36:34 2005 Subject: [gutvol-d] lest the message be missed Message-ID: josh said: > That's the most laughable thing I've read in a long time. evidently josh missed the message. and why does that not surprise me? nonetheless, i invite skepticism. when i do the entire "my antonia" -- sometime later in the week -- i'll log my time and document all of the changes i make on the file, and wipe all that skepticism away. *** david said: > If you're "smart enough", you could just retype the book from memory. i'm not that smart. are you? > Completely reversible, > provided that you're already familiar with the work > (which you have to be, else you wouldn't know that > Antonia needs an accent), > is a pretty lousy standard. yeah, that _would_ be "a pretty lousy standard". 
which is obviously why it's not the one i'm using. for those who are smart enough to think about it a bit, "completely reversible" in this type of situation means that once you change it, you can change it back any time. it doesn't mean you have to magically know what to change. (if you know that, _every_ change is completely reversible.) -bowerbird p.s. note to the list subscribers: i usually don't respond to david, since his points are too often paper bags that cannot hold water -- he's always clever enough to find a fault, but seemingly never clever enough to realize why it doesn't apply, or to find its obvious solution -- just like this post, but since this _was_ in regard to something that i was putting "on the record", i am compelled to respond to it. having done it once, though, i probably won't bother doing it again... From prosfilaes at gmail.com Mon Mar 7 17:49:01 2005 From: prosfilaes at gmail.com (David Starner) Date: Mon Mar 7 17:49:19 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: References: Message-ID: <6d99d1fd05030717497adc722c@mail.gmail.com> On Mon, 7 Mar 2005 20:36:16 EST, Bowerbird@aol.com wrote: > it doesn't mean you have to magically know what to change. > (if you know that, _every_ change is completely reversible.) You've stripped the accents; how am I supposed to know which accents to put back? Take a look at this text, from Garnett's translation of Elene that's currently going through DP. I have removed the accent; replace it. "Lo! that we heard through holy books, That the Lord to you gave blameless glory, 365 The Maker, mights' Speed, to Moses said How the King of heaven ye should obey, His teaching perform. Of that ye soon wearied, And counter to right ye had contended; Ye shunned the bright Creator of all, 370 > find its obvious solution But it's always too much work to show us this obvious solution. Show me the pudding; put the accent back into the above text. From Bowerbird at aol.com Mon Mar 7 22:14:23 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Mar 7 22:14:44 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1f8.5bc37bd.2f5e9cbf@aol.com> codepoints for macroman encoding: > 142 é Bénédictine > 142 é naïveté > 144 ê crêche > 149 ï naïve > 149 ï naïveté. > 150 ñ cañon > 170 ™ Edition™ > 174 Æ Æneid > 190 æ antennæ > 231 Á Ántonia -bowerbird From traverso at dm.unipi.it Tue Mar 8 01:04:48 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Tue Mar 8 01:02:54 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1f8.5bc37bd.2f5e9cbf@aol.com> (Bowerbird@aol.com) References: <1f8.5bc37bd.2f5e9cbf@aol.com> Message-ID: <200503080904.j2894mo11253@posso.dm.unipi.it> To bowerbird: How do you manage words that are written in the same way, except the accent? This is quite common in french and italian. And of course you find both in the same book, and sometimes in the same quotation that you can find in an english book. "Il a dit a toi" (he said to you). Which a has an accent? From shimmin at uiuc.edu Tue Mar 8 07:07:10 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Tue Mar 8 07:07:22 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <15b.4c4a3125.2f5e0d7e@aol.com> References: <15b.4c4a3125.2f5e0d7e@aol.com> Message-ID: <422DBF9E.20904@uiuc.edu> Bowerbird@aol.com wrote: > lest the main message be missed in all the minutiae... > > you can take an average p-book from scans to e-book in one evening. > > one evening.
With some of the OCR I've seen lately, this is probably about 90% right, for 90% of books, provided you can get good scans, and provided you are willing to let a few hard-to-detect classes of error go until post-production. -- RS From Bowerbird at aol.com Tue Mar 8 10:04:18 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 10:04:29 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1a4.33384a22.2f5f4322@aol.com> carlo said: > How do you manage words that are written in the same way, > except the accent?
i don't know how crafty _you_ might be, but my fellow poets are _highly_ skilled at hijacking office machinery for our own nefarious purposes... ;+) (heck, that's the only reason some of us get a job at all.) finally, there are millions of home computers out there that were sold with an all-in-one printer/scanner/fax. and, as before, the quickest and easiest way to get to a million scan-books is for a million people to scan _one_. so, it's fairly easy to predict that, from one or more of the above factors, there will soon be an _avalanche_ of books that have been scanned and need to be converted into text. and every scanned-book will have at least _one_ person who will want its text badly enough to do a little work. what we need to do is _give_that_person_a_good_tool_ that enables their little bit of work to get good results. that's what i intend to give them. all i ask of you is that you stop telling people that this job is difficult. it's not. > and provided you are willing to let a few > hard-to-detect classes of error you'll have to explain what you mean by "hard-to-detect". in my experience, if an error is serious (in any meaningful way), then it'll be detected by a person who's actually reading the book. some errors are unforgivable, such as an incorrect word that won't even pass spellcheck. those should _always_ be caught. trivial punctuation errors, like a missing comma, are... well, trivial. (although i haven't mentioned anything about it until now, a great way to catch some errors is to have the computer speak the text aloud to you as you follow along reading it; stealth scannos, for instance, are handily exposed by this.) > and provided you are willing to let a few > hard-to-detect classes of error > go until post-production. "post-production" has no meaning in my scenario. i repeat myself, again and again, by saying that once a person gets the error-level on an e-text down to 1 error in 10 pages, we can make it available via "continuous proofreading" and let readers-from-the-general-public zoom it to perfection. 1. scan. 2. "fix" the scans. 3. do the o.c.r. on them. 4. run the post-o.c.r. tool. 5. do quasi-public "continuous proofreading". 6. consolidate the corrections into a public release. 7. release the e-text out to the public as a single file. 8. continue doing a full-public "continuous proofreading". 9. take error-reports from the people reading the file offline. i will also state, for the record, that i think step #4 can do _far_ better than 1 error in 10 pages if we sharpen our tools. look at the "my antonia" example. jon had a team of _seven_ proofreading it. i don't know how many looked at each page, but, according to my analysis, they took the error-rate down to about 1 every 70 pages. when i used my bag of tricks on it, i removed 3 errors from the 210 pages i subjected to scrutiny. (which leads us to predict 3 more errors in the second half.) to the best of my knowledge, there are no errors in my file, i.e., the first half of the book. (full book by friday, hopefully.) i'm not saying it _is_ free of errors even now, since someone with a different set of tricks in _their_ bag might be able to locate 2 more remaining errors i couldn't find, but i will say that it is more than clean enough to turn loose on the public... (that is, i think we could skip steps #5 and #6 on this e-text.) 
-bowerbird From jon at noring.name Tue Mar 8 10:40:39 2005 From: jon at noring.name (Jon Noring) Date: Tue Mar 8 10:40:48 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1a4.33384a22.2f5f4322@aol.com> References: <1a4.33384a22.2f5f4322@aol.com> Message-ID: <19715944156.20050308114039@noring.name> Bowerbird wrote: > carlo said: >> How do you manage words that are written in the same way, >> except the accent? This is quite common in french and italian. > i wouldn't strip away high-bit characters on a french or ltalian book; > they are an essential part of those languages. i've said that repeatedly. Yes you have said that -- repeatedly. But I believe it is also essential to preserve all accented Latin and non-accented characters found in *all* books. This is where the differences of view arise. Throwing them out because they are "inconvenient" (which seems to be your motive, but I'm not sure) is not a valid excuse. Since your tool set (and viewing software) can handle any character set you want, then not supporting the non-ASCII characters is even more confusing. > jon noring seems to turn every listserve into an endless merry-go-round. Jon From Bowerbird at aol.com Tue Mar 8 11:33:12 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 11:33:29 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <86.236d71ad.2f5f57f8@aol.com> jon said: > But I believe it is also essential to preserve > all accented Latin and non-accented characters > found in *all* books. once again, the minutiae is being brought to the surface. why doesn't anyone here respond to the main message? because you have no response, that's why. the main point doesn't correspond to your petty-politics of throwing mud at michael, so y'all continue to try to shift the emphasis. > But I believe it is also essential to preserve > all accented Latin and non-accented characters > found in *all* books. we know that's what you believe, jon. you've said it over and over and over. and i have said, over and over and over, that _i_ believe it is _not_ essential, not in *all* books. so there we have it. i will do things my way, and i expect that you will do things your way. fine! let's leave the other people here alone! as usual, you look only at the _benefits_, without factoring _costs_ into the equation. the _cost_ of including high-bit characters is the e-text then _breaks_ for some users, ones who are using viewer-programs that are not encoding-savvy, or who don't have all of the correct fonts on their computer. or other reasons i haven't come across yet. if the unicode people had done their job right, and made unicode follow the mac philosophy -- "it just works" -- i would be up there on the unicode bandwagon with you and your friends. but it doesn't "just work", not for everyone -- not yet -- and until it does, i don't want to talk about it. and _after_ it does, i don't want to talk about it _then_, either, i just wanna use it and have it work. for everyone. wanna do something useful? _make_it_work_! not just on the new machines, with certain browsers and not any other viewer-programs -- on _every_ machine, with _every_ program. but until then, just stop bugging all of us about it. we've heard it, too often, and we are unconvinced. and buddy, you are _not_ going to convince us by repeating the same old argument _again_, or by asserting your beliefs again and again... 
with all the time i've wasted discussing this stupid topic for the 829th time, i could have cleaned up the rest of that "my antonia" text. go away. oh never mind, i will... -bowerbird From fvandrog at scripps.edu Tue Mar 8 11:43:57 2005 From: fvandrog at scripps.edu (Frank van Drogen) Date: Tue Mar 8 11:43:46 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <86.236d71ad.2f5f57f8@aol.com> References: <86.236d71ad.2f5f57f8@aol.com> Message-ID: <6.2.0.8.0.20050308114121.01e91f68@mail.scripps.edu> At 11:33 AM 3/8/2005, you wrote: >jon said: > > But I believe it is also essential to preserve > > all accented Latin and non-accented characters > > found in *all* books. > >once again, the minutiae is being brought to the surface. > >why doesn't anyone here respond to the main >message? Maybe because it got lost between all the other stuff you wrote? Ah, I see you mean: >you can take an average p-book from scans to e-book in one evening. Well that's great, so start going:) Frank From jon at noring.name Tue Mar 8 12:33:46 2005 From: jon at noring.name (Jon Noring) Date: Tue Mar 8 12:33:59 2005 Subject: [gutvol-d] Accented characters are important to reproduce in PG texts (was: lest the message be missed) In-Reply-To: <86.236d71ad.2f5f57f8@aol.com> References: <86.236d71ad.2f5f57f8@aol.com> Message-ID: <16622730312.20050308133346@noring.name> Bowerbird wrote: > jon said: >> But I believe it is also essential to preserve >> all accented Latin and non-accented characters >> found in *all* books. > once again, the minutiae is being brought to the surface. The devil is in the details. > as usual, you look only at the _benefits_, > without factoring _costs_ into the equation. On the other hand, there are certain minimum requirements for every project. As a corollary of an adage I've given earlier: "If a job is to be done, it is to be done right." > the _cost_ of including high-bit characters > is the e-text then _breaks_ for some users, > ones who are using viewer-programs that > are not encoding-savvy, or who don't have > all of the correct fonts on their computer. All web browsers today, and most more advanced formats, such as PDF, support the full Unicode set. That's the future. Embrace it, don't fight it. There's a saying: "I focus on the future since that's where I'm going to spend the rest of my life." > if the unicode people had done their job right, > and made unicode follow the mac philosophy > -- "it just works" -- i would be up there on the > unicode bandwagon with you and your friends. This is a specious argument. The Unicode working group is doing their job right because before Unicode things were a *real* mess and were NOT working. There is a clear need to unify the world's character sets and to create universal text encoding formats (e.g. UTF-8) There is still some controversy regarding some Han scripts, but by and large Unicode has been successful at its stated goals. > wanna do something useful? _make_it_work_! > not just on the new machines, with certain > browsers and not any other viewer-programs > -- on _every_ machine, with _every_ program. Throwing out important accented characters is unacceptable. Period. The author/publisher considered it important enough to spend the $$$ to include these characters (in the 19th century it took more effort to print books with accented and foreign characters.) It adds richness to the text, and it is hard to argue that the characters are not somehow an integral part of the text. 
Anyway, it is trivial, as *you said yourself*, to autoconvert text with accented characters to 7-bit ASCII text. So you *can* make your system work for the folk using legacy systems. It is far better to do the job right for the long-term future, than to compromise it for the short-term (legacy hardware and software that is rapidly becoming obsolete.) > but until then, just stop bugging all of us about it. > we've heard it, too often, and we are unconvinced. Who's "we"? It would not surprise me if the majority of PG and DP volunteers consider it important (or at least a very good idea) to reproduce the full character set in all Public Domain texts, especially now that it is easy to do (both by UTF-8/16 encoding, and using character entities in XML/XHTML/TEI.) Hopefully a few of the PGers and DPers will give their thoughts on this particular topic. > and buddy, you are _not_ going to convince us > by repeating the same old argument _again_, > or by asserting your beliefs again and again... Who's "us"? > with all the time i've wasted discussing this > stupid topic for the 829th time, i could have > cleaned up the rest of that "my antonia" text. If it weren't important *to you*, you would not have replied. I can only interpret your vociferous replies to mean that you consider permanently dumping accented characters to be an *important* requirement to implement your system. That's why I have used the word "inconvenient" since that's the only reason I can think of. But if you have another reason why you believe it o.k. to dump accented characters for most English language PG texts, let us know. You've not given a good reason why they should not be reproduced. (The argument of meeting legacy needs is not a compelling argument since, as you said and I'm repeating what I said above, one can autoconvert a Master document with accented characters to 7-bit ASCII for use by legacy-users. Thus, you can meet the needs of these people *and* the needs and preferences of future generations by preserving the non-ASCII characters. Instead, you inexplicably want to permanently remove accented characters from the digital *Master* versions of most public domain English-language digital texts.) There's a lot of aspects to Public Domain texts that are "inconvenient" which prevent easy digitizing. We figure out how to overcome these "inconveniences" and produce a high-quality product, not make short-term short-cuts so we can avoid dealing with them. Distributed Proofreaders is one example of not giving in to the "convenient", but rather to figure out how to do it right in a reasonably efficient way. Anyway, why the rush to digitize (make structured digital texts) out of page scans, to the point you are willing to sacrifice textual accuracy and quality? So long as the page scans are available for posterity, they can be transcribed any time, and done more carefully and thoughtfully. To me, the most critical thing is to make archival- quality scans of public domain texts and get them online via IA and similar organizations. In the meanwhile, the most popular of these texts can be carefully and methodically converted to Structured Digital Texts (SDT). There are about 1000 very classic Public Domain works (part of the pre-DP PG collection) that should be redone to at least the quality of the "My Antonia" demo project (for those who have not seen it, it is at: http://www.openreader.org/myantonia/ It is still an early "beta", but it's been a real learning experience for several of us working on it.) 
Jon From Bowerbird at aol.com Tue Mar 8 13:08:08 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 13:08:21 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <129.584c0d53.2f5f6e38@aol.com> jon said: > I can only interpret your vociferous replies > to mean that you consider permanently dumping accented characters "permanently"? as i said, as soon as unicode works everywhere, i will embrace it. those of us out here in the real world know that time is not yet here. in the meantime, my viewer-tool will actually _display_ all those accented characters, even when they are not present in the e-text, if the user chooses that option. (it's all about user-choice for me.) if you want to help with that, create a list of such accented words. > to be an *important* requirement to implement your system. "my" system? michael's philosophy of having the e-texts work on all machines, specifically including trailing-edge machinery, is _the_ factor that has made his e-library the premier one in all of cyberspace, thank you very much. you got a high-tech solution? fine, use it. and watch it wither, just like every other one before it has... as to my tools in particular, they will support unicode fully, long before unicode works with all the other tools out there, so you're barking up the wrong tree, buster. i'm done with this stupid thread! done, done, done! aarrgghh! :+) -bowerbird From Bowerbird at aol.com Tue Mar 8 13:10:40 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 13:11:01 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1a3.2eed3a51.2f5f6ed0@aol.com> frank said: > Maybe because it got lost between all the other stuff you wrote? no, i think it "got lost" because it was buried under minutiae. and it is _still_ being subjected to attempts to bury it... > Ah, I see you mean: > > you can take an average p-book from scans to e-book in one evening. well, a better phrasing of that main point might be that "_any_average_person_ can do an average book in an evening..." and the follow-on point would be this: "...so let's start informing the people who might wanna do a book, once the avalanche of scanned-books arrives, so they'll realize it is within the realm of possibility; let's stop spreading the false meme that it's difficult." > Well that's great, so start going:) it's wiser for me to build the tool that enables people to do a book in an evening, rather than spend my time doing books... but yes, i will "start going" on that, right away! just as soon as i help jon find the rest of his errors. (as i'm sure you realize, those two go hand-in-hand.) but really, there won't be a need for that tool _until_ after the avalanche of scanned-books becomes available. most people just won't be motivated enough to do a book until there is a scanned-book they want to have as text. until then, i'm in no hurry. been working on this tool for well over a year now. no reason to rush things. just waiting for jon to get some more books scanned... ;+) -bowerbird From prosfilaes at gmail.com Tue Mar 8 16:52:14 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 8 16:52:27 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1a4.33384a22.2f5f4322@aol.com> References: <1a4.33384a22.2f5f4322@aol.com> Message-ID: <6d99d1fd050308165275501188@mail.gmail.com> On Tue, 8 Mar 2005 13:04:18 EST, Bowerbird@aol.com wrote: > carlo said: > > How do you manage words that are written in the same way, > > except the accent? 
This is quite common in french and italian. > > i wouldn't strip away high-bit characters on a french or ltalian book; > they are an essential part of those languages. i've said that repeatedly. Then why take the time to remove the high-bit characters in an English book? There's lots of books that have important French quotations or use accents to denote unpredictable stress; why strip them from the books that you can, just because? It's not hard at all to deal with the handful of accents the average English book has. > if you want me to estimate some hard numbers, > i'd say 75% of the e-texts in the library now > could be done to our standard in one evening. Not by someone new to the job. To do a book in an evening requires that you be experianced with the job and the tools. And I get real tired of you using the average book in PG as a metric. The average book in PG was chosen because it was relatively easy to do. Out of the three floors of books in the library I'm sitting (excluding the governmental depository), the basement is full of science, math or technology books, and will require complex graphical work or mathematical work. Of the remaining 66% (probably more like 55% or 60%, since the third floor is small), many of them are art or music books or dictionaries and grammars, or archiac languages that OCR doesn't handle well, or archiac fonts that OCR doesn't handle well. > here's where the value of distributed proofreaders > will most come into play in the future, in my opinion, > being able to "fix up" the work of independent proofers. It's funny that if the value of DP is so limited, that the percentage of texts that have come in from DP is so high. Why don't we have more people doing books by hand alone? > that's what i intend to give them. all i ask of you is that > you stop telling people that this job is difficult. it's not. When you upload books to PG, what name do you put on them? For all your words, I can't recall ever seeing a book credited to you. I have no samples of what you've worked on alone and what your quality standards are to judge by. From prosfilaes at gmail.com Tue Mar 8 16:56:45 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 8 16:57:00 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <129.584c0d53.2f5f6e38@aol.com> References: <129.584c0d53.2f5f6e38@aol.com> Message-ID: <6d99d1fd05030816563202ce2e@mail.gmail.com> On Tue, 8 Mar 2005 16:08:08 EST, Bowerbird@aol.com wrote: > jon said: > > I can only interpret your vociferous replies > > to mean that you consider permanently dumping accented characters > > "permanently"? > > as i said, as soon as unicode works everywhere, i will embrace it. > those of us out here in the real world know that time is not yet here. The reason why Unicode doesn't work places is because idiots like you aren't bothering to support it. You're being part of the problem, and having the audicity to complain about _other_ people causing the problem. > in the meantime, my viewer-tool will actually _display_ all those > accented characters, even when they are not present in the e-text, How? You still haven't put the accent back in the sample from Elene. You're throwing the baby out with the bathwater and keep telling us how easy it is to refill the bathtub with water. 
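Both sides of this exchange agree on one technical point: a master text that keeps its accented characters can be downconverted to 7-bit ASCII automatically for readers on legacy systems. Below is a minimal sketch of that autoconversion, assuming Python 3 and its standard unicodedata module; the function name is made up for illustration, the override table is illustrative rather than exhaustive, and the example words are ones that come up later in this thread.

import unicodedata

# Characters that NFKD decomposition cannot reduce to plain ASCII by itself.
# Illustrative only -- a real table would be longer.
OVERRIDES = {
    "\u00e6": "ae", "\u00c6": "AE",   # ae / AE ligatures
    "\u0153": "oe", "\u0152": "OE",   # oe / OE ligatures
    "\u00df": "ss",                   # German sharp s
    "\u2122": "(TM)",                 # trademark sign
}

def to_ascii(text):
    """Derive a 7-bit ASCII rendering from a Unicode master text."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in OVERRIDES:
            out.append(OVERRIDES[ch])
        else:
            # NFKD splits an accented letter into base letter + combining
            # marks; encoding to ASCII with errors="ignore" drops the marks.
            base = unicodedata.normalize("NFKD", ch)
            out.append(base.encode("ascii", "ignore").decode("ascii"))
    return "".join(out)

print(to_ascii("My \u00c1ntonia, na\u00efvet\u00e9, ca\u00f1on, \u00c6neid"))
# prints: My Antonia, naivete, canon, AEneid

Because the conversion is one-way, it only makes sense to run it on a derived copy for legacy readers, never on the master itself.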
From brad at chenla.org Tue Mar 8 19:14:35 2005 From: brad at chenla.org (Brad Collins) Date: Tue Mar 8 19:17:10 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <129.584c0d53.2f5f6e38@aol.com> (Bowerbird@aol.com's message of "Tue, 8 Mar 2005 16:08:08 EST") References: <129.584c0d53.2f5f6e38@aol.com> Message-ID: Bowerbird@aol.com writes: > in the meantime, my viewer-tool will actually _display_ all those > accented characters, even when they are not present in the e-text, > if the user chooses that option. (it's all about user-choice for me.) > if you want to help with that, create a list of such accented words. Such a feature never would have occured to me. Toggle accents that aren't in a text? How many users have told you that they want to toggle accents? If you knowlingly strip out accents you give up any claim to have created a faithful and accurate edition of a text. Sorry but that's blown your credibility right there and drops your text down to the level of a bootleg Harry Potter translation[1]. Why is this so important? It's the old game of whispering a sentence into someone's ear and then they repeat it to someone else etc. After passing through a few people the sentence get's mangled. Unicode is very much ready for prime time. Hell, Unicode is even supported by Xterm. Man pages on Red Hat Linux use Unicode. If the command line in a unix terminal window uses Unicode, it's everywhere. b/ Footnotes: [1] BTW. Usually the Harry Potter translations come out a good few months after the English version so there is a real market for quicky translations for people who can't bear to wait and can't read the English. My wife can't read English very well, and she bought a bootleg translation of the second book in Thai. We compared the first few pages with the English edition and she said it was so horrible that she could wait for the official Thai translation which is quite good. I also saw a bootleg of Goblet of Fire in Chinese which came out a week after the English edition was published! From the look of it, it had been done in Shanghai. That's a 636 page book translated, printed and shipped to where I found it in the dingy dark dusty market stalls in Beijing in a week! Looking through the book you could see very distinct shifts in writing style and vocabulary every few pages. Even the translation of the names of some of the characters changed slighlty a couple of times in the book. They must have chopped up the book and split the translation between scores of translaters to do it in a day or two. -- Brad Collins , Bangkok, Thailand From Bowerbird at aol.com Tue Mar 8 20:18:59 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 20:19:26 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <1ad.33150c38.2f5fd333@aol.com> brad said: > How many users have told you that they want to toggle accents? toggle them on? or toggle them off? :+) according to some people here, there is a great desire out there to see the accents. so i'll try to reintroduce them when possible so far i have these words in my lookup-table: > 142 ? B?n?dictine -- Benedictine > 142 ? na?vet? -- naivete > 144 ? cr?che -- creche (my spellchecker gives another accent?) > 149 ? na?ve -- naive > 149 ? na?vet? -- naivete > 150 ? ca?on -- canon > 170 ? ? -- (TM) > 174 ? ?neid -- Aeneid > 190 ? antenn? -- antenna > 231 ? ?ntonia -- Antonia please do feel free to send me more... 
:+) > If you knowlingly strip out accents you give up any claim to > have created a faithful and accurate edition of a text. contrary to what some people would like for you to believe, that doesn't have to be the only objective, or even the main one. for me, _mass_usability_ reigns supreme. call me a heathen... > and drops your text down to the level of > a bootleg Harry Potter translation[1]. my understanding is that some harry potter pirate digitizations have attained an extremely high level of fidelity to the source... i'd be leery of the crummy ones. as would most people. which is probably why one writer organizations has been advised to put out crappy "pirate" digitizations, so as to sour people on underground editions. so perhaps the bad version your wife got was planted by the publisher? if so, it seems to have had exactly the desired effect, eh? (although it sounds like it was a paper-book as well?) i heard one "potter" e-book was finished _within_24_hours_ of the release of the paper-book. and when my tool is released, i expect that the pirates will become even _more_ efficient! > Why is this so important? i don't know. but to hear some people talk, you'd think it's a matter of life-and-death! > It's the old game of whispering a sentence into someone's ear > and then they repeat it to someone else etc. > After passing through a few people the sentence get's mangled. hence the importance of complete reversibility, already mentioned. > After passing through a few people the sentence get's mangled. like the way the word "gets" got mangled in your sentence? :+) or the way "knowingly" turned into "knowlingly" up above? ;+) -bowerbird From traverso at dm.unipi.it Tue Mar 8 22:11:33 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Tue Mar 8 22:09:33 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <1ad.33150c38.2f5fd333@aol.com> (Bowerbird@aol.com) References: <1ad.33150c38.2f5fd333@aol.com> Message-ID: <200503090611.j296BXj22745@posso.dm.unipi.it> > > so far i have these words in my lookup-table: > > > 142 ?? B??n??dictine -- Benedictine > > 142 ?? na??vet?? -- naivete > > 144 ?? cr??che -- creche (my spellchecker gives another accent?) > > 149 ?? na??ve -- naive > > 149 ?? na??vet?? -- naivete > > 150 ?? ca??on -- canon > > 170 ??? ??? -- (TM) > > 174 ?? ??neid -- Aeneid > > 190 ?? antenn?? -- antenna > > 231 ?? ??ntonia -- Antonia > Would you replace a four-voices canon with a four-voices ca??on ? Carlo From Bowerbird at aol.com Tue Mar 8 23:41:28 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Mar 8 23:41:50 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <75.40ab08d7.2f6002a8@aol.com> carlo said: > Would you replace a four-voices canon > with a four-voices ca??on ? with a _what_? i see a capital "a" with a curvy squiggle above it, followed by a plus-or-minus sign. what you sent me back is _not_ what i sent you; it's been changed; just like that game of "telephone" that brad was telling us about. so it looks like our software here isn't handling the encoding correctly. which is precisely my point. and in the cases like this, it's great to give the user the option to go back to the 7-bit letters, so there is a semblance of normality. because we _know_ that "canon" is always going to be "canon". 
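For what it is worth, the lookup-table idea quoted above, and the limit Carlo is pointing at, both fit in a few lines. Here is a minimal sketch, assuming Python; the accented spellings are reconstructed from the ASCII equivalents in bowerbird's list (the list archive has reduced the accented forms themselves to "?"), the function name is made up for illustration, and words that are legitimate English either way -- canon/cañon, antenna/antennæ -- are deliberately excluded, since a bare word list cannot decide them.

import re

# Per-book restore table for an ASCII "My Antonia" text, reconstructed from
# the page/word pairs listed above.  Illustrative, not complete.
RESTORE = {
    "Benedictine": "B\u00e9n\u00e9dictine",
    "naivete":     "na\u00efvet\u00e9",
    "naive":       "na\u00efve",
    "creche":      "cr\u00e8che",
    "Aeneid":      "\u00c6neid",
    "Antonia":     "\u00c1ntonia",
}

# Words that are valid English both with and without the accent need a
# per-book or per-sentence rule, so they are flagged rather than touched.
AMBIGUOUS = {"canon", "antenna"}

def restore_accents(line):
    def fix(match):
        word = match.group(0)
        return word if word in AMBIGUOUS else RESTORE.get(word, word)
    return re.sub(r"[A-Za-z]+", fix, line)

print(restore_accents("Antonia kept a creche beside the canon."))
# prints: Ántonia kept a crèche beside the canon.

Whether this is genuinely reversible depends entirely on how many words end up in the ambiguous set, which is the point the following messages turn on.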
-bowerbird From prosfilaes at gmail.com Tue Mar 8 23:52:40 2005 From: prosfilaes at gmail.com (David Starner) Date: Tue Mar 8 23:52:59 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <75.40ab08d7.2f6002a8@aol.com> References: <75.40ab08d7.2f6002a8@aol.com> Message-ID: <6d99d1fd050308235215046ff@mail.gmail.com> On Wed, 9 Mar 2005 02:41:28 EST, Bowerbird@aol.com wrote: > and in the cases like this, it's great to give the user the option to > go back to the 7-bit letters, so there is a semblance of normality. > because we _know_ that "canon" is always going to be "canon". Nice strawman. Everyone wants to give the users the option to go back to the 7-bit letters; it's whether we throw away the information at the start, so nobody has it, or at the point the users want it thrown away. BTW, for your list of accents, that -> th?t was the change in the section of Elene. Well, _one_ of the that's had an accent. From Bowerbird at aol.com Wed Mar 9 00:15:11 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 00:15:36 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <19f.2f0941df.2f600a8f@aol.com> carlo said: > Would you replace a four-voices canon > with a four-voices ca??on ? but to answer your question, there is only one "canon" in "my antonia". and yes, i will do the conversion on a book-by-book basis in cases where that is necessary. (how many of these terms do you think you can find -- with a non-accented and an accented version -- where _both_ are listed in an english dictionary? look away, my friends, because every one you find is one that makes my lookup-table more extensive.) and even then, if a change is not completely reversible, you'll need to give me the entire sentence in each case where the change is to be made. (or, if it's easier to do it the other way -- each sentence where the change is _not_ to be made -- you can do it that way instead. the only requirement is an absolute non-ambiguity.) you can _count_ on the fact that i have thought things through _well_past_ the first exception to everything. you will have to burrow down to a _much_ deeper layer if you really want to trip me up. and i dare you to try... -bowerbird From jon at noring.name Wed Mar 9 07:53:49 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 9 07:54:02 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <75.40ab08d7.2f6002a8@aol.com> References: <75.40ab08d7.2f6002a8@aol.com> Message-ID: <1783410265.20050309085349@noring.name> Bowerbird wrote: > and in the cases like this, it's great to give the user the option to > go back to the 7-bit letters, so there is a semblance of normality. > because we _know_ that "canon" is always going to be "canon". The closest American-English equivalent of ca?on is 'canyon', not canon. Interestingly in "My Antonia" Willa Cather used both variants: "canyons" on page xi, and "ca?on" on page 124. However, one can forgive Cather on this since page xi is part of the Introduction, spoken by the character Jake, and page 124 is part of the main story as told (in a "written" manuscript) by Jim. Jon From marcello at perathoner.de Wed Mar 9 09:24:06 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 9 09:23:56 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <86.236d71ad.2f5f57f8@aol.com> References: <86.236d71ad.2f5f57f8@aol.com> Message-ID: <422F3136.8060206@perathoner.de> Bowerbird@aol.com wrote: > once again, the minutiae is being brought to the surface. ... 
the minutiae *are* brought to the surface. If we are going to show off in Latin better get our numeri right. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Wed Mar 9 09:26:45 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 09:27:04 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: jon said: > ca?on i came across one project gutenberg e-text that used "ca?on" throughout, including a word-capped reference to "the grand ca?on" -- you know, the one in arizona with all the pretty colors. anyway, guys, call me when unicode works on all apps on all machines. until then, i have put this issue to bed. by the way, it is _still_ so easy to digitize the average book that, once you have the scans, an average person can do it in one evening. -bowerbird From Bowerbird at aol.com Wed Mar 9 09:48:42 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 09:48:53 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1e6.36eb827a.2f6090fa@aol.com> marcello said: > If we are going to show off in Latin better get our numeri right. a constructive comment from marcello! wow! that's a first! thanks! yes, my latin has been a bit rusty, for quite a while now... -bowerbird From hart at pglaf.org Wed Mar 9 09:49:26 2005 From: hart at pglaf.org (Michael Hart) Date: Wed Mar 9 09:49:27 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> References: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> Message-ID: On Mon, 7 Mar 2005, Joshua Hutchinson wrote: > ----- Original Message ----- > From: Bowerbird@aol.com >> >> you can take an average p-book from scans to e-book in one evening. >> one evening. > > > HAHAHAHAHAHAHA > > *gasp* *wheeze* > > HAHAHAHAHAHAHA > > That's the most laughable thing I've read in a long time. > > If laughter helps us live longer, you just added 5 years to my life. > > Josh Of course we should not forget people such as David Widger, who has produced nearly 3,000 eBooks, about one per day, over a period of years, nor David Price who sent us one eBook per week for years, or several others who prefer to remain anonymous. mh From Bowerbird at aol.com Wed Mar 9 10:44:45 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 10:44:58 2005 Subject: [gutvol-d] lest the message be missed Message-ID: <1b9.ef334b5.2f609e1d@aol.com> michael said: > Of course we should not forget people such as David Widger, > who has produced nearly 3,000 eBooks, about one per day, > over a period of years, nor David Price who sent us > one eBook per week for years, or > several others who prefer to remain anonymous. that's right! :+) of course, david widger is super-human, not "an average person". ;+) but -- with the right tool -- now an _average_ person can do it too! -bowerbird From marcello at perathoner.de Wed Mar 9 09:42:56 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 9 11:37:00 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <129.584c0d53.2f5f6e38@aol.com> References: <129.584c0d53.2f5f6e38@aol.com> Message-ID: <422F35A0.9060808@perathoner.de> Bowerbird@aol.com wrote: > in the meantime, my viewer-tool will actually _display_ all those > accented characters, even when they are not present in the e-text, > if the user chooses that option. Balderdash. You think you can sneak by using a word list? 
Then tell me how your forever-announced reader program is going to distinguish between the Italian words: e (meaning: and) è (meaning: is) Now put the accents back: La grappa e buona e la carne e cattiva. Don't be irritated by the fact that you don't understand the text. Your program also has to put the accents back without understanding the text. > if you want to help with that, create a list of such accented words. Get ispell or aspell or any other open-source spellchecker. They all have multilingual wordlists included. > michael's philosophy of having the e-texts work on all machines, > specifically including trailing-edge machinery, is _the_ factor > that has made his e-library the premier one in all of cyberspace, Prove this assertion. > as to my tools in particular, they will support unicode fully, > long before unicode works with all the other tools out there, > so you're barking up the wrong tree, buster. Your tools so far supported only your endless blabbing about them, buster. They never even got so mature as to print the greeting screen without crashing. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Wed Mar 9 11:58:26 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 11:58:43 2005 Subject: [gutvol-d] stop changing the message-headers Message-ID: <87.2301818c.2f60af62@aol.com> marcello said: > Prove this assertion. history is the proof. study it. -bowerbird From marcello at perathoner.de Wed Mar 9 12:21:49 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Mar 9 12:21:27 2005 Subject: [gutvol-d] stop changing the message-headers In-Reply-To: <87.2301818c.2f60af62@aol.com> References: <87.2301818c.2f60af62@aol.com> Message-ID: <422F5ADD.8010802@perathoner.de> Bowerbird@aol.com wrote: >>> michael's philosophy of having the e-texts work on all machines, >>> specifically including trailing-edge machinery, is _the_ factor >>> that has made his e-library the premier one in all of cyberspace, >> >> Prove this assertion. > > history is the proof. study it. What can I reply to this blinding proof of bowerbirds superior argumentative powers? I bow to the Great Master. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Wed Mar 9 12:56:26 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 12:56:45 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting Message-ID: <66.5297a9a9.2f60bcfa@aol.com> here's one from last week that never got mailed out... i'll be leaving here again very shortly, since i have been reminded just why i had stayed away, because this place can be so negative and destructive and poisonous... ick! *** jon, you said the scanning took "much more than four hours". so how long _did_ it take? and if you were to do it again, with your present scanner, how long would it take you? also, how long did it take you to manipulate the images? and how did you do that? what specific steps did you take, in what order, and what program did you use to do all that? is there anything of all that which you'd do differently now? *** jon said: > OCR is quite fast. It's making and cleaning up the scans > which is the human and CPU intensive part. well, it all depends, jon, it all depends... with the right hardware -- like office-level machinery -- 60 pages a minute can get swallowed by the gaping maw. that's right. one page per second. that seems fast to me. that means your 450-page scan-job would take 7.5 minutes.
probably took you more time than that to cut the cover off. and the machine will automatically straighten those pages, o.c.r., and upload to the net, while you stare dumbfounded... likewise with the kirtas 1200, geared to scanning books. http://www.kirtas-tech.com/ it does "only" 20 pages a minute, but hey, 1000 pages/hour ain't nothing to sneeze at. they estimate that in a full-scale production environment, the price-per-scan is 3 cents a page. sounds like brewster should buy a half-dozen of these babies. so it all depends. the bottom line, though, is that if a person has experience, good equipment, solid software, and a concentrated focus, they can open a paper-book to start scanning it and move it all the way through to finished, high-power, full-on e-book in one evening, maybe two. *** i said: > third, you used a reasonable naming-scheme for your image-files! > the scan for page 3, for instance, is named 003.png! fantastic! > and when you had a blank page, your image-file says "blank page"! > please pardon me for making a big deal out of something so trivial > -- and i'm sure some lurkers wrongly think i'm being sarcastic -- > but most people have no idea how uncommon this common sense is! > when you're working with hundreds of files, it _really_ helps you > if you _know_ that 183.png is the image of page 183. immensely. > even the people over at distributed proofreaders, in spite of their > immense experience, haven't learned this first-grade lesson yet. i forgot to mention earlier that my processing tool can automatically rename your image and text-files, based on the page-numbers that it finds right in the text-files (which it extends in sequence for those files without a page-number -- usually the section-heading pages). so even if you're dealing with someone else's scans, and _they_ didn't name their files wisely, you don't have to deal with the consequences. *** jon said: > I believe as you do that an error reporting system is a good idea > so readers may submit errors they find in the texts they use -- > sort of an ongoing post-DP proofing process. i didn't elaborate earlier that it goes much deeper than that. a very important point here is that an error-reporting system -- over and above the obvious effect of getting errors fixed -- will actively incorporate readers into the entire infrastructure, making them active participants cumulating a world of e-books. if you have ever edited a page on a wiki, you're likely aware that the experience gives a very strong feeling of _empowerment_ -- because you can "leave your mark" right on a page, quite literally. if we set up a wiki-page to collect the error-reports for an e-text, in a system allowing people to check the text against a page-image, they'll be much more motivated to report errors than they are now, with the "send an e-mail" system. the feedback is more immediate, and compelling, with a wiki. furthermore, by collecting the reports, in the change-log right on the wiki, you can avoid duplicate reports. you can also give rational for rejecting any submitted error-reports, and/or engage people in a discussion about whether to act on a report. all of this makes your readers feel _responsible_ for the e-texts. a lifetime of experience with printed matter has made people very _passive_ about typographic errors. there's no reason to "report" an error they find in a newspaper, for instance, because hey, it's already been printed. the same with a magazine or a printed book. water under the bridge. 
and they translate that same attitude over to e-books, even though it _does_ do good to report errors there. so we need to do something to shake them out of their passivity, something to make them feel _responsible_ for helping fix errors. (just for the record, although i use the term "wiki", i don't mean it literally. what i have in mind is more of a "guestbook" type method, where people can _add_ their text to the page, but not necessarily _delete_ what other people have added. it's thus more like a blog, where everyone can add their comments to the bottom of the page, but the top part stays constant, to list the "official" information. but i'll still use the term "wiki" to connote a free-flowing attitude.) in addition to the wiki, you can build an error-reporting capability into the viewer-program that you give people to display the e-texts. if they doubt something in the e-text, they click a button and boom!, that page-image is downloaded into the program so they can see it. if they have indeed found an error, they copy the line in its bad form, correct it to its good form, and then click another button and boom!, the error-report is e-mailed right off to the proper e-mail address. this symbolic (and real!) incorporation of readers into our processes is a rad thing to do. but it's not the _only_ benefit of such a system; it also facilitates the automation of the error-correction procedures. the error-report can be formatted such that your software can automatically summon the e-text _and_ the relevant page-scan. so you see a screen with the page-scan _and_ the error-report. you check its merit, and if it's good, click the "approve" button and the e-text is automatically edited. further, the change-log is updated right on the wiki-page for that e-text, and anyone who requested error-notification gets an e-mail describing the change. auxiliary versions of the e-text -- like the .html and .pdf files -- are automatically updated. and all you did was click one button... face it, if you're dealing with 15,000+ e-texts, doing it manually is a sure-fire way to burn yourself out. who needs that hassle? i mocked up a demo up this, using a simple a.o.l. guestbook script. i'm sure you versatile script-kiddies here could do something that was much more sophisticated, but my version will give you the idea: http://users.aol.com/bowerbird/proof_wiki.html -bowerbird From prosfilaes at gmail.com Wed Mar 9 14:15:30 2005 From: prosfilaes at gmail.com (David Starner) Date: Wed Mar 9 14:15:40 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: References: <20050307201324.B8A8C2F915@ws6-3.us4.outblaze.com> Message-ID: <6d99d1fd05030914151ee0afd5@mail.gmail.com> On Wed, 9 Mar 2005 09:49:26 -0800 (PST), Michael Hart wrote: > Of course we should not forget people such as David Widger, > who has produced nearly 3,000 eBooks, about one per day, > over a period of years, nor David Price who sent us one > eBook per week for years, or several others who prefer > to remain anonymous. I'm sure it gets a lot easier after your hundredth book. For all those people doing thousands of books, a large group of books can be done in an evening. But the vast majority of people helping PG, those who sign up and proof a few hundred pages at DP and quit, or produce one or two books and wander off, don't have the skills and expertise to do a book in an evening. In any case, this applies to novels and simple non-fiction. You aren't doing many of the books I currently have up for proofing in one night. 
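Setting the one-evening debate aside for a moment: the error-report-and-approve workflow bowerbird sketches above (a reader submits the bad line and the corrected line, an approver checks it against the page scan, one click applies the fix and updates the change-log) is small enough to prototype. A minimal sketch, assuming Python; the file layout and field names are hypothetical, not anything PG or DP actually uses.

import json
import time
from pathlib import Path

def submit_report(queue, etext, page, bad, good):
    """Append a reader's correction to a guestbook-style, append-only queue."""
    report = {"etext": etext, "page": page, "bad": bad, "good": good,
              "submitted": time.strftime("%Y-%m-%d %H:%M"), "status": "pending"}
    with Path(queue).open("a", encoding="utf-8") as f:
        f.write(json.dumps(report) + "\n")

def approve(report, text_file, changelog):
    """One-click approval: apply the fix once and record it in the change-log."""
    path = Path(text_file)
    text = path.read_text(encoding="utf-8")
    if report["bad"] not in text:
        return False   # already fixed, or the report does not match the e-text
    path.write_text(text.replace(report["bad"], report["good"], 1),
                    encoding="utf-8")
    with Path(changelog).open("a", encoding="utf-8") as log:
        log.write("%s p.%s: %r -> %r\n" % (report["submitted"], report["page"],
                                           report["bad"], report["good"]))
    return True

Duplicate reports fall out of this naturally: once the first one has been applied, later reports for the same line no longer match and are rejected.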
From miranda_vandeheijning at blueyonder.co.uk Wed Mar 9 14:26:52 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Wed Mar 9 14:27:04 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <1b9.ef334b5.2f609e1d@aol.com> References: <1b9.ef334b5.2f609e1d@aol.com> Message-ID: <422F782C.20600@blueyonder.co.uk> hi bowerbird, This sounds very exciting! I have a book which I want to put online, a grammar in three languages with loads of accents etc. It is very difficult and I expect it will take a long time to get through DP, which will be a shame as it is a very important text. I am encouraged to hear you can make this into an e-text in one evening! The scans are done and if you like I will mail you a copy. I'd like to have the proofed book back before the weekend, if that's not too much trouble. Thanks so much! Miranda van de Heijning Bowerbird@aol.com wrote: >michael said: > > >> Of course we should not forget people such as David Widger, >> who has produced nearly 3,000 eBooks, about one per day, >> over a period of years, nor David Price who sent us >> one eBook per week for years, or >> several others who prefer to remain anonymous. >> >> > >that's right! :+) > >of course, david widger is super-human, not "an average person". ;+) > >but -- with the right tool -- now an _average_ person can do it too! > >-bowerbird >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > > From Bowerbird at aol.com Wed Mar 9 14:38:25 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 14:38:41 2005 Subject: [gutvol-d] hey marcello Message-ID: <1a3.2f05cf2e.2f60d4e1@aol.com> hey marcello, since you recently noted that some of the e-texts are subsets of other e-texts -- like the separate e-texts for books of the bible -- how about if you continue your constructive streak and give us a summary of these duplicated e-texts? best would be to delete the subsets and give us just the larger "collection" -- so we would have the smallest possible list of all the unique books in the whole library -- but if it would be easier to do it the other way -- delete the collections -- that would be fine too. whenever you get a chance... thanks... -bowerbird From Bowerbird at aol.com Wed Mar 9 15:00:00 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 15:00:17 2005 Subject: [gutvol-d] lest the message be missed Message-ID: miranda said: > The scans are done and if you like I will mail you a copy. that would be great, miranda! i'd love to help you out! and you don't even have to mail them! just put them online, in one zip file, and let me know where they are. i'll go get 'em. oh yeah, i only have the _english_ module for abbyy finereader, and that won't work well with accented text, so you'll have to do the o.c.r. on the images with the correct language modules, ok? so put the o.c.r. files in a zip file too, so i can grab those. and i should say that my tool facilitates the proofing process, but can't help much if you're proofing a language you don't know, and i only know english. so you might not get the best results from me. so far, i'm just concentrating on doing _english_ books. once i get those down, then i can do work on helping people who speak other languages extend the tool for their purposes as well. oh yeah, one more thing. please include spell-check dictionaries for the languages that are contained in the text, because i only have an english one. 
marcello can probably help you find those... > I'd like to have the proofed book back before the weekend, > if that's not too much trouble. my schedule is full for the next week or more. i can't even get to the second half of "my antonia" until friday at the earliest, and probably next week. so it'll be a couple weeks before i can get to yours. and this sounds like it's not really "an average book", so it might take me two or three evenings, not just one. but i'd still love to take on your project sometime! so put those scans somewhere where i can grab 'em, and i'll get to them at my very first opportunity, ok? oh yeah, i do hope they are 600-dpi scans, like jon's. those were really fine. they gave very clean o.c.r., and they're very pleasant to look at, as well. nice. and let me know if you decide to start doing the project, because there's no reason for us to duplicate our efforts. it won't hurt my feelings if you get impatient waiting! but since there's lots of books over at d.p. for you to do, if you want to just hold off on this one and let me take it, feel free! -bowerbird From jon at noring.name Wed Mar 9 15:00:36 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 9 15:01:08 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <66.5297a9a9.2f60bcfa@aol.com> References: <66.5297a9a9.2f60bcfa@aol.com> Message-ID: <13029016843.20050309160036@noring.name> Bowerbird asked: > jon, you said the scanning took "much more than four hours". > so how long _did_ it take? and if you were to do it again, > with your present scanner, how long would it take you? It took about a minute or so to carefully place each page on the flat bed scanner, close the top, initiate the scanning, open the top, and replace the page with a new one. While one page was being scanned, I could do some related work such as naming and saving the previous scanned images. It got old pretty fast. So with a manual flat bed scanner, with an already chopped book, it took me about ten hours, spread over a few days, to do the 450 or so pages in "My Antonia" (I did it in cracks of time). If I had chosen 300 dpi scanning (rather than 600 dpi), it would have gone faster, but not four times faster -- maybe 20-30% faster as a rough guess. Of course, one goal was archival-quality scans -- I could have cut corners to make it go faster. Obviously, a fairly new model, professional-grade sheet feed scanner would have made life a lot easier. But lots of people, the average Joe, generally only have the el cheapo flat bed scanners which are *slow*, plus they may not have the necessary knowledge on scanning and image processing fundamentals to do a good job. I have a strong background in image processing (plus being an engineer helps in general, as well as an amateur photographer), so I caught on quite fast after talking with a few of the pros on scanner newsgroups. As an aside, I'm used to processing giant images, on the order of 24000x18000 in pixel dimensions (fractal art printing using Kodak LVT -- now it's Durst Lambda and equivalent machines) -- and I did this a few years ago on lower-horsepower PCs. > also, how long did it take you to manipulate the images? One needs enough *horsepower* to manipulate 600 dpi images (300 dpi images are *four* times smaller), plus some knowledge. Fortunately, most of today's basic Win XP boxes and laptops, and latest Mac OS X hardware, have sufficient horsepower (lots of memory helps.) > and how did you do that? 
what specific steps did you take, > in what order, and what program did you use to do all that? > is there anything of all that which you'd do differently now? There are "all-in-one" professional-level application tools that straighten out misaligned images, and crops them accordingly. I did this processing mostly by hand using Paint Shop Pro plus another tool for semi-automated alignment whose name eludes me at the moment (it was a 15-day trial software, and it expired the day after I completed the job -- they want $400 for that sucker. :^( ) For all of the above, this is why I'm advocating a semi-centralized project to scan public domain texts, working in parallel with other scanning projects, such as IA's: 1) We will use volunteers who have access to higher-end scanners (if not ones we supply), plus the knowledge on how to use them properly for books. 2) We probably can get $$$ to buy sheet feed scanners (which are not that expensive, less than 1% the cost of the automated page turning scanners IA is using in Canada, as will be discussed below.) 3) We will be able to afford the professional-level "all-in-one" scan processing software to do the automated alignment, consistent cropping, and image clean-up. 4) We will establish sufficient guidelines, plus QC procedures, to maintain a minimum scanned image quality. >> OCR is quite fast. It's making and cleaning up the scans >> which is the human and CPU intensive part. > with the right hardware -- like office-level machinery -- > 60 pages a minute can get swallowed by the gaping maw. > that's right. one page per second. that seems fast to me. The fairly good quality sheet feed scanners, which are "office- quality", may be able to do 5-7 archival-quality scans per minute (this includes down time due to setting up, stuck pages, etc.) So for scanning alone, not including keeping track of pages, page numbering, and other administrative details associated with scanning, the average 300 page book could be raw-scanned, by someone experienced, in about 45 minutes. This assumes 600 dpi optical (archival quality). It may go a little faster with 300 dpi optical settings -- not sure... > that means your 450-page scan-job would take 7.5 minutes. > probably took you more time than that to cut the cover off. Not possible, unless one bought the *big buck* (above office-level) sheet feed or page turning scanners, or one simply used a photocopy machine, and captured the low-rez images it produces. If you want to increase speed for a given technology, the scan quality (dpi and maybe color depth) has to be reduced. (Well, except maybe for photographic-type scanners, which are coming down in price, where a high-rez snapshot is taken at one moment of each page rather than running a scan head over the page. I see this as the long-term savior to produce archival quality scans, and do it more quickly. It may also be possible to autorotate the book to assure alignment, rather than doing alignment by image processing after-the-fact.) > and the machine will automatically straighten those pages, > o.c.r., and upload to the net, while you stare dumbfounded... The software exists, but this is *expensive*. You are not going to find the average person able to afford to buy the software. However, for the proposed "Distributed Scanners", we'll get the needed hardware and software to speed up the process, plus the book chopper for those books which can be chopped (both Charles and Juliet at DP have these guillotine-type page choppers -- they are quite impressive. 
) > likewise with the kirtas 1200, geared to scanning books. > http://www.kirtas-tech.com/ > it does "only" 20 pages a minute, but hey, 1000 pages/hour > ain't nothing to sneeze at. they estimate that in a full-scale > production environment, the price-per-scan is 3 cents a page. > sounds like brewster should buy a half-dozen of these babies. Brewster is already using something like the Kirtas for the Canada book scanning project. Not sure if it is a Kirtas or some other brand, though. I was told, or read somewhere, that the page turning scanner cost IA about $100,000. This is *major* bucks. Whether such machines will come down a lot in cost remains to be seen -- I doubt they will come down very much. These are fairly complex robotic machines, designed to handle all kinds of variations found in books, and to be very gentle on them -- yet produce a reasonably good image. I don't see a big enough market for these machines to substantially come down in cost by the power of competition. The Kirtas cost quote of 3 cents per page (which I assume includes labor, but unsure whether it includes capital equipment amortization) works out to about $10/book, which is IA's goal, btw. It requires a trained person to operate it. > the bottom line, though, is that if a person has experience, > good equipment, solid software, and a concentrated focus, > they can open a paper-book to start scanning it and move it > all the way through to finished, high-power, full-on e-book > in one evening, maybe two. Yes, but this is not for the average, ordinary Joe working in his basement. This requires a lot of $$$ in upfront investment to get this fancy equipment and software. For books which can be chopped (such as books where the cover is falling off, or very common old printings), then one can use $1000 (or less) sheet feed scanners, which maybe run at an average 5-7 pages per minute. Of course, with a "fleet" of sheet feed scanners, and the right image capture system, it is possible to run them in parallel -- above two machines, though, it probably requires two people to keep the machines properly fed (I don't think one person can operate any more than two sheet feed scanners and keep them occupied -- just a guess.) There's still need for the whiz-bang scan cleanup software, which I know is expensive. It can be done by hand, but it is laborious. (This cleanup could be centralized at one place, but there's the issue of moving the raw scans to the central location.) > i forgot to mention earlier that my processing tool can automatically > rename your image and text-files, based on the page-numbers that it > finds right in the text-files (which it extends in sequence for those > files without a page-number -- usually the section-heading pages). > > so even if you're dealing with someone else's scans, and _they_ didn't > name their files wisely, you don't have to deal with the consequences. Well, yes. However, in "My Antonia" a lot of pages were not numbered at all (such as the last page in each chapter). I had to be especially careful not to mess up and lose which page is which. Of course, with the Kirtas or a sheet feed scanner properly run, it is possible to keep all the scans in the proper order (which for a monoplex sheet feed scanner just run the ordered stack through once, and then once again.) > i didn't elaborate earlier that it goes much deeper than that. 
> > a very important point here is that an error-reporting system > -- over and above the obvious effect of getting errors fixed -- > will actively incorporate readers into the entire infrastructure, > making them active participants cumulating a world of e-books. This is *exactly* what we have in mind for LibraryCity's role in this, Bowerbird. We planned for this at least six months ago, but not implemented anything yet -- we have bigger fish to fry at the moment. But we envision enabling readers to build community around digital texts, and this includes mechanisms for error reporting/correction -- but not limited to just that. > if you have ever edited a page on a wiki, you're likely aware that > the experience gives a very strong feeling of _empowerment_ -- > because you can "leave your mark" right on a page, quite literally. Yes, LibraryCity plans to use wiki, or wiki-like, technology in various of its processes to build community, to enable people to become an integral part of the texts themselves, and to create new content -- to make the old texts come alive. > if we set up a wiki-page to collect the error-reports for an e-text, > in a system allowing people to check the text against a page-image, > they'll be much more motivated to report errors than they are now, > with the "send an e-mail" system. the feedback is more immediate, > and compelling, with a wiki. furthermore, by collecting the reports, > in the change-log right on the wiki, you can avoid duplicate reports. > you can also give rational for rejecting any submitted error-reports, > and/or engage people in a discussion about whether to act on a report. > > all of this makes your readers feel _responsible_ for the e-texts. Yes. This, btw, is also the power of Distributed Proofreaders -- it is an environment which not only increases trust in the work product, but it helps volunteers to feel like they are a part of something big. > in addition to the wiki, you can build an error-reporting capability > into the viewer-program that you give people to display the e-texts. > if they doubt something in the e-text, they click a button and boom!, > that page-image is downloaded into the program so they can see it. > if they have indeed found an error, they copy the line in its bad form, > correct it to its good form, and then click another button and boom!, > the error-report is e-mailed right off to the proper e-mail address. With our XML-based approach, we have the power of XPointer/etc. to enable not only error reporting, but full annotation, interpublication linking and so on. We're going to let the public annotate the books they read (the annotations will point to the XML internally, not alter the documents themselves.) This is just one of many things we are thinking of. (Btw, one has to be careful in how to reconcile error correction of texts with their usefulness in a full hypertext setting -- we don't want error corrections to break the already-established links for annotations, interpublication linking, RDF/topic maps for indexing, and so forth.) > the error-report can be formatted such that your software can > automatically summon the e-text _and_ the relevant page-scan. > so you see a screen with the page-scan _and_ the error-report. > you check its merit, and if it's good, click the "approve" button > and the e-text is automatically edited. further, the change-log > is updated right on the wiki-page for that e-text, and anyone who > requested error-notification gets an e-mail describing the change. 
> auxiliary versions of the e-text -- like the .html and .pdf files -- > are automatically updated. and all you did was click one button... > face it, if you're dealing with 15,000+ e-texts, doing it manually > is a sure-fire way to burn yourself out. who needs that hassle? Hmmm, this is a lot like what James Linden is developing, which may be incorporated into PG of Canada's operations. It is a good idea to maintain change tracking of all texts. And to answer your last point. Doing 15,000 texts, or a million texts, still needs some manual processing. It is also important to produce them correctly and uniformly in the first place, gather full metadata about them and put the metadata into a library acceptable form (e.g., MARC), and for various fields, such as author name, to maintain a single authority database as librarians do. PG's collection has been assembled so ad-hoc that trying to consistently autoprocess the collection is nigh impossible. That's why, to me, it is more important to redo the collection, put it on a common, surer footing (including building trust), before launching into doing a lot more texts. Imagine how difficult it would be to process one million texts if they were produced in the same ad-hoc fashion, without following some common standards. In the meanwhile, while most of the pre-DP portion of the collection is redone, a strong focus can be made on the archival scanning and *public access* of public domain books (including tackling the 1923-63 era in the U.S.) and getting them online as soon as possible (including properly done metadata and copyright clearance). Then, when the next-gen systems are in place to resume major text production, the scans will be there, available, and already online for associating with the SDT versions. And this is where we diverge -- I don't believe the full process can be done totally by machine, there's still need for people to go over every text to make sure the markup for document structure and inline text semantics are correctly done. This is *very* important for the more advanced usages of the digital texts: indexing, interpublication linking, multiple output formats and presentation types, cataloging, data mining, and Michael Hart's dream of eventual language translation. PG's ad hoc approach up to now (which DP has partly fixed), works against making the text collection capable of meeting these very advanced needs. XML (or some other text structuring technology with similar fine granularity) is necessary -- it can't be done using any plain text regularization scheme, unless the scheme is made very complex, whereupon going to XML simply makes sense because it follows the general trends of XML in the publishing workflow. Jon Noring From Bowerbird at aol.com Wed Mar 9 15:49:07 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 15:49:23 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting Message-ID: <2b.6e87e0b0.2f60e573@aol.com> jon said: > Not possible, unless one bought the *big buck* (above office-level) > sheet feed or page turning scanners, or one simply used > a photocopy machine, and captured the low-rez images it produces. my girlfriend's office has a $10,000 lanier just down the hall. that's the kind of machine i was talking about. their website says that their high-end machines can scan 60+ pages an hour. 
but i grant you that a scanning time of a few hours (or more) is much more in line with what most normal people can attain, even those with lots of experience like yourself... > Yes, but this is not for the average, ordinary Joe > working in his basement. This requires a lot of $$$ > in upfront investment to get this fancy equipment > and software. i think you might be surprised in the coming months, jon. > There's still need for the whiz-bang scan cleanup software, > which I know is expensive. donovan was working on some open-source deskewing routines. might want to check that out. and i'm told that abbyy does a fairly good job setting brightness and contrast automatically. so the other thing that needs to be done is to standardize the placement of each scan relative to each other, which isn't hard. (removing curvature is a bear, but the best new scanner out -- the optik? -- lets you lay the book on the edge of the bed, which i understand effectively cures the curvature problems.) > in "My Antonia" a lot of pages were not numbered at all that's not uncommon. > (such as the last page in each chapter). yes, i noticed that. _that_ is a little uncommon. but like i said earlier, publishers can be weird. > I had to be especially careful not to mess up > and lose which page is which. it's _fairly_ easy to do each page in sequence -- just have to pay some attention turning the page -- and then using the auto-increment-name option will ensure that all of the files are named correctly. > Hmmm, this is a lot like what James Linden is developing, > which may be incorporated into PG of Canada's operations. if you check the archives you'll find i'm the one who posted it. i also offered to write all the software. all that was ignored. doesn't matter though, i'm proceeding to build my own system. if james took my post to heart, then he's smart. :+) > Doing 15,000 texts, or a million texts, > still needs some manual processing. if you're manually opening every file, and manually summoning every scan you need to check, you're going to burn yourself out. _plus_ expose yourself to the reality of inadvertent changes. you have to have a system that tracks every change that's made, so you can review the log to make sure it was the correct change, and that nothing else was changed. reviewing the log is "manual", and so is the decision as to _approval/rejection_ of the change, but the change itself should be totally automated. > That's why, to me, it is more important to redo the collection, > put it on a common, surer footing (including building trust), > before launching into doing a lot more texts. the library needs to be _corrected_, yes, but _not_ "redone". and i think you do more damage than good when you talk about e-texts being done "incorrectly", when what you _really_ mean is that an edition was used that you don't happen to approve of, or that metadata isn't included, just to use some most examples. there are _real_ errors in the e-texts. honest-to-goodness mistakes. we need to concentrate on _those_, not on some edition that uses the british spellings instead of american ones. (even if that _was_ silly.) but distributed proofreaders is more interested in doing new books than fixing old ones. they're volunteers who set their own priorities. > Imagine how difficult it would be to process one million texts > if they were produced in the same ad-hoc fashion, > without following some common standards. i don't have to "imagine" it. that's the way the library is now. 
and i made my fair share of efforts to try and convince the powers that that situation needed to be addressed with some standardization. but the difficulty of doing it with the type of heavy-markup that you like has held up that whole darn process. if we would have proceeded with the "zen markup language" that i like, the library would have been clean now. > PG's ad hoc approach up to now (which DP has partly fixed) the d.p. e-texts still exhibit a large degree of inconsistencies. and contrary to what you imply, they are not generally error-free. some are, but others are not. the same is true of earlier e-texts. the quality has improved, yes, surely. it is still not highest quality. but they are volunteers, and thus they set their own bar for quality. and they certainly deliver quality that is high enough that we could use "continuous proofreading" and have the public zoom us to perfect. > it can't be done using any plain text regularization scheme you're wrong. dead wrong. *** anyway, jon, thanks for the information on your scanning experience. i come away from hearing it with an even more firm conclusion that scanning and image-cleanup is indeed the biggest part of the process. -bowerbird From jon at noring.name Wed Mar 9 16:14:46 2005 From: jon at noring.name (Jon Noring) Date: Wed Mar 9 16:15:01 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <2b.6e87e0b0.2f60e573@aol.com> References: <2b.6e87e0b0.2f60e573@aol.com> Message-ID: <3633467218.20050309171446@noring.name> Bowerbird wrote: > jon said: >> Not possible, unless one bought the *big buck* (above office-level) >> sheet feed or page turning scanners, or one simply used >> a photocopy machine, and captured the low-rez images it produces. > my girlfriend's office has a $10,000 lanier just down the hall. > that's the kind of machine i was talking about. their website > says that their high-end machines can scan 60+ pages an hour. But what resolution? With scanners that move something with respect to the page, the higher the resolution, the slower it is. (On the other hand, today's 12 megapixel digital cameras, which for "My Antonia" would produce approximately 600 dpi quality, take a snapshot of the whole page, and can transfer the file in very short time, short than it takes to turn the page.) > but i grant you that a scanning time of a few hours (or more) > is much more in line with what most normal people can attain, > even those with lots of experience like yourself... Well, I'm not an experienced scanner (there's a difference between understanding the principles, and actual experience), but I think by the time I got finished with My Antonia, I gained a few stripes. >> There's still need for the whiz-bang scan cleanup software, >> which I know is expensive. > donovan was working on some open-source deskewing routines. > might want to check that out. O.k., thanks. Open source, high-quality deskewing routines are definitely needed! Now, it's a matter to also get a high-quality open source cropping and normalization application. > and i'm told that abbyy does a > fairly good job setting brightness and contrast automatically. > so the other thing that needs to be done is to standardize the > placement of each scan relative to each other, which isn't hard. > (removing curvature is a bear, but the best new scanner out > -- the optik? -- lets you lay the book on the edge of the bed, > which i understand effectively cures the curvature problems.) 
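On the open-source deskewing routines asked for a few paragraphs up: a serviceable deskew does not have to be elaborate. A minimal sketch, assuming NumPy and a reasonably recent Pillow (library choice, function name, and the page-183 filename are all illustrative): rotate a downsampled, binarized copy of the page through a small range of angles and keep the angle whose row-ink profile is sharpest, i.e. where the text lines sit squarely on pixel rows.

import numpy as np
from PIL import Image

def deskew(path, max_angle=3.0, step=0.1):
    """Return the page image rotated by the angle that best squares its text lines."""
    img = Image.open(path).convert("L")
    small = img.resize((max(1, img.width // 4), max(1, img.height // 4)))
    ink = np.array(small) < 128          # True where there is ink

    def sharpness(angle):
        rotated = Image.fromarray(ink.astype(np.uint8) * 255).rotate(
            angle, expand=False, fillcolor=0)
        rows = np.asarray(rotated, dtype=float).sum(axis=1)
        return float(((rows[1:] - rows[:-1]) ** 2).sum())   # crisper lines score higher

    angles = np.arange(-max_angle, max_angle + step, step)
    best = max(angles, key=sharpness)
    return img.rotate(float(best), expand=True, fillcolor=255)

deskew("183.png").save("183_deskewed.png")

This only handles skew; the gutter curvature discussed in the quoted passage above is a separate and much harder problem.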
Yes, I've heard of these book-oriented scanners which are more gentle on bindings (but even here the binding is stressed.) There's a web site somewhere giving a review of the model you describe, but don't have the URL handy. > but distributed proofreaders is more interested in doing new books > than fixing old ones. they're volunteers who set their own priorities. Yes, that is true. There is a lot of interest in DP to redo a lot of the pre-DP classics in the PG corpus, from what I understand, so it may get done anyway even if PG does not encourage it. Jon From Bowerbird at aol.com Wed Mar 9 17:04:58 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Mar 9 17:05:14 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting Message-ID: <129.58668f74.2f60f73a@aol.com> jon said: > But what resolution? their website tells you. in a $10,000 machine, it better be good. > There's a web site somewhere giving a review > of the model you describe, but don't have the URL handy. we don't need a review on a website, as there's plenty of d.p. people here who'll vouch it's an amazing machine. > Yes, that is true. There is a lot of interest in DP > to redo a lot of the pre-DP classics in the PG corpus, > from what I understand, so it may get done anyway > even if PG does not encourage it. you didn't read what i wrote. it is _distributed_proofreaders_ that -- as a whole -- is more interested in doing new books than re-doing old ones. if they wanted to do it before now, they would have. but they haven't... (a few of 'em have redone old books. including some html versions that jim recently asked them to fix up. but as a course of action, not much.) michael doesn't tell d.p. what to do. he doesn't tell _anyone_ what to do. even if you _ask_ him for guidance, he's usually too stubborn to give it. -bowerbird From fielden3 at aol.com Thu Mar 10 00:23:56 2005 From: fielden3 at aol.com (Kent Fielden) Date: Thu Mar 10 00:24:20 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <129.58668f74.2f60f73a@aol.com> References: <129.58668f74.2f60f73a@aol.com> Message-ID: <4230041C.7060606@aol.com> At the risk of coming into the middle: My experience is that the time consuming part of going from book to E-book is the proofreading. The scanning, cropping, and OCR are probably less than about a quarter of the time. - I use a Canon S230 3 megapixel camera in a copy stand to get about 300 dpi scans. I can do 4-6 pages a minute, without destroying the book. I have been quite happy with using a camera as scanner, but 600 dpi would halve the processing speed. I tried 2 pictures per page, but I did not find any improvement in the OCR quality. - I use Abby FineReader 5.0, which was not that expensive, and it usually finds the right text, flipping pages and cropping automatically. A pass over the pages using FineReader to find basic OCR issues takes about 15 seconds a page. So up to this point, a 250 page book could be done in 2-3 hours of concentrated work. I would guess I have 2-4 errors per page at this point. - then comes a first pass proofreading, also fixing headers and footers. this is often 30 seconds per page. - then a full second pass of proofreading, again about 30 seconds a page. I probably find an error a page in this pass. Then I ship it. I could believe it could be done in a day's elapsed time, but I don't think I can focus that hard all in a single day. 
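A rough check of the resolution numbers being traded in the last few messages, assuming a 6 in x 9 in page; the trim size, the pixel counts, and the helper function are assumptions made for illustration, not figures anyone in the thread states.

def effective_dpi(px_long, px_short, page_long_in=9.0, page_short_in=6.0):
    """Resolution a single camera frame delivers across a whole page."""
    return min(px_long / page_long_in, px_short / page_short_in)

print(round(effective_dpi(2048, 1536)))   # 3 megapixel frame  -> ~228 dpi
print(round(effective_dpi(4000, 3000)))   # 12 megapixel frame -> ~444 dpi
# Framing only the text block (say 4.5 in x 7 in) lifts the 3 MP figure to
# roughly 290 dpi, which is where "about 300 dpi" estimates come from; smaller
# trim sizes likewise bring a 12 MP frame closer to the 600 dpi mentioned above.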
The real problem is that my day job is using up most of my available concentration, so I don't feel up to spending too much time proofing. my 2 cents... Kent Fielden From traverso at dm.unipi.it Thu Mar 10 03:09:17 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Thu Mar 10 03:07:03 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <3633467218.20050309171446@noring.name> (message from Jon Noring on Wed, 9 Mar 2005 17:14:46 -0700) References: <2b.6e87e0b0.2f60e573@aol.com> <3633467218.20050309171446@noring.name> Message-ID: <200503101109.j2AB9H232437@posso.dm.unipi.it> >>>>> "Jon" == Jon Noring writes: Jon> Yes, I've heard of these book-oriented scanners which are Jon> more gentle on bindings (but even here the binding is Jon> stressed.) There's a web site somewhere giving a review of Jon> the model you describe, but don't have the URL handy. I have one, Plustek OpticBook 3600, and I am very much satisfied with it, but scanning books in book mode trims away at least 1cm. in the middle, so it can be used only if the margins are generous. To use it you have to open the book at 90 degrees, usually possible. I am satisfied nevertheless with the speed, the depth of the scan (there is almost no shadow in the gutter), the overall quality for the price. I see it quoted now $239, but it is difficult to find it in online shops: apparently there is much demand. However, in my experience, the limit is not scanning quality: it is print quality. OCR quality is pretty good on modern editions, but old books, often stained, and even more often with defective print, give rise to a lot of errors. Often you don't have the choice of a better print. Carlo From shimmin at uiuc.edu Thu Mar 10 08:45:56 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Thu Mar 10 08:46:02 2005 Subject: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting In-Reply-To: <200503101109.j2AB9H232437@posso.dm.unipi.it> References: <2b.6e87e0b0.2f60e573@aol.com> <3633467218.20050309171446@noring.name> <200503101109.j2AB9H232437@posso.dm.unipi.it> Message-ID: <423079C4.5010903@uiuc.edu> Carlo Traverso wrote: > I have one, Plustek OpticBook 3600, and I am very much satisfied with > it, but scanning books in book mode trims away at least 1cm. in the > middle, so it can be used only if the margins are generous. To use it you > have to open the book at 90 degrees, usually possible. I use the same model, and am very happy with its speed; for 300 dpi images of 8vo sized books, I have clocked myself at 300 pages per hour on a book with a good binding. I don't know what software you use it with, but if you have Abbyy, you might do what I do and run it through Abbyy's interface rather than its own "book mode" interface. The Abbyy driver should capture the entire platen rather than throwing away the outer cm. My experience is that having the book only 90 degrees open eliminates much of the gutter shadow on its own, and the additional processing that "book mode" does is largely unnecessary. > However, in my experience, the limit is not scanning quality: it is > print quality. OCR quality is pretty good on modern editions, but old > books, often stained, and even more often with defective print, give > rise to a lot of errors. Often you don't have the choice of a better > print. This can't be helped.
However, the other issue that gives problematic raw OCR is that even when character recognition is good, layout detection can be poor, and sidenotes, multi-column text, and the like can be blended in with the main text, while corners might be chopped off, and in older printings where the inter-line spacing might not be exactly constant, whole lines can be elided. If I'm going to exert more effort in getting images and OCR, I've found that the place where it pays off the most is in previewing and correcting the recognition areas before letting the OCR do its work. -- RS From Bowerbird at aol.com Thu Mar 10 22:41:43 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 10 22:42:04 2005 Subject: [gutvol-d] ok, let's wrap this up, folks Message-ID: <199.3ab251fa.2f6297a7@aol.com> kent said: > At the risk of coming into the middle: ain't _that_ the truth! ;+) unless i am prodded further, however, this will be my last post on this thread. and this will also be my last thread before i take a long break from here, with the exception of my final report on "my antonia", and reports on the book that miranda asked me to do... and yes, people, it's a long post, because it's full of detailed thinking and analyses. if that ain't for you, not your cup'o'tea this afternoon, hit the 'delete' key, don't go running off complaining to michael and greg... > My experience is that the time consuming part > of going from book to E-book is the proofreading. ok, let's take a look at what you have to say. > I use a Canon S230 3 megapixel camera um, it is unlikely that's good enough. this very issue of using a digital camera rather than a scanner is being discussed right this second on another listserve, but the people there are talking about 5-megapixel and up, even a 10-megapixel. i seriously doubt a 3-megapixel works well. there are other concerns with a camera too. are you using external lighting on the book? if not, then your images will be substandard. do you use a tripod? do you focus manually? as always, photography can be a tricky thing. > I use a Canon S230 3 megapixel camera > in a copy stand to get about 300 dpi scans. there are also issues with the "copy stand". some stands can be good. others, not at all. are your scans showing curvature problems? if so, that can be a killer to o.c.r. recognition. unevenness in the brightness across the page? that can significantly impair the o.c.r. too. > I use a Canon S230 3 megapixel camera > in a copy stand to get about 300 dpi scans. 300dpi ain't giving your o.c.r. app the best you could. and isn't really creating what you'd want for archives. it's much more time-consuming to scan at 600dpi, and i think it's an open-question whether we want to ask individuals like you to take that extra time, or whether we wait to re-do scans until we have the equipment that will make that process fly by. but if we do take the 300dpi shortcut in scanning, or by using a digital camera rather than a scanner, then we need to do it with the full knowledge that that decision _might_ impact o.c.r. accuracy, which in turn _might_ result in more proofing work, which _might_ end up actually _costing_ us time overall... given the differences obtained from different scanners, and different source-texts, and different o.c.r. 
programs, and even from different _people_ doing the imaging -- if you've looked at a range of scanned books, you'll know that different people exhibit a wide range of variability in how carefully, e.g., straight, they position each page -- it's very difficult to do the research we'd need to do to find out exactly _how_much_ time we're wasting by creating images at less-than-ideal resolution. but we are _certainly_ wasting some time, in some situations -- and perhaps a _lot_ of time in more than we know... i'll put this as plainly as i can: if we use inferior tools, we _will_ get inferior results. if you take care to notice it, my statements about "one evening" are hedged carefully with qualifiers about "the right scanner", "the right manipulations", "the right tools", and of course, "an average book"... a lot of the people who scoff are people who are using inferior tools, and getting inferior results. people once thought heavier-than-air flight impossible. it is, if you do it wrong. if you do _anything_ wrong. and there are lots and lots of things you can do wrong. but do _everything_right_ and flying is certainly possible. now people fly every day in a plane, with no second thought. and, to be clear, i'm talking about the amount of time that it takes _after_ the page-scans are cleaned up. as people have confirmed, the scanning and clean-up will often take a very long time, all by themselves. compared to _that_, proofing should be much faster. before i leave the arena of the image-creation process, i should say there is only _one_ "right" scanner out there currently, in the range of personal affordability anyway. it's that optic3600 that other people have mentioned here. if you're using another scanner, you're wasting your time. maybe you're not wasting a _lot_ of your time, perhaps not enough to consider a $250 scanner as an "investment", but you need to know that you _are_ wasting some time. and if you use inferior tools, you will get inferior results. one more thing, since carlo mentioned that sometimes he gets inferior results because the p-book is shoddy. hey, no question that a bad original will make bad scans. the best answer to that problem, though, is very simple: go find a cleaner copy of the book to get your scans from. _somewhere_ out in the world, there _is_ a cleaner copy. (if not, let that rare book be scanned by a professional!) and if those bad scans are coming from somewhere else? the same answer: go find a cleaner copy and scan _that_. don't waste valuable time dealing with inferior images! jon noring keeps talking about how wonderful it is that distributed proofreaders keeps the scans for their books. and it is. but the truth of the matter is that precious few of those scans can be considered good enough for archival. so those books will have to be rescanned in the future too. let's hope that brewster and/or google are doing it right... > I use Abby FineReader 5.0 v5 won't give you the accuracy that v7 will. that's likely the _main_ reason that proofing is taking you longer than it should. version 7 does a much better job than 5. you will find the upgrade price _is_ an excellent investment, even if your time isn't really worth very much... if you use inferior tools, blah blah blah... > then comes a first pass proofreading, > also fixing headers and footers. > this is often 30 seconds per page. um, no. you're getting way ahead of yourself. after scanning, you _first_ need to clean up the page-scans -- which means deskewing them, standardizing placement, etc.
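For anyone curious what an open-source deskewing routine of the sort mentioned earlier in this thread might look like, here is a minimal projection-profile sketch using Pillow and numpy; it assumes dark text on a light page, and it is an illustration only, not donovan's code or any actual tool:

    import numpy as np
    from PIL import Image

    def deskew(path, max_angle=3.0, step=0.1):
        img = Image.open(path).convert("L")
        small = img.resize((img.width // 4, img.height // 4))   # reduced copy, for speed
        best_angle, best_score = 0.0, -1.0
        for angle in np.arange(-max_angle, max_angle + step, step):
            rows = 255 - np.asarray(small.rotate(angle, fillcolor=255), dtype=float)
            score = rows.sum(axis=1).var()   # text lines align -> row-ink variance peaks
            if score > best_score:
                best_angle, best_score = angle, score
        return img.rotate(best_angle, expand=True, fillcolor=255)

    deskew("page_042.png").save("page_042_deskewed.png")

Real cleanup chains do more than this (cropping, despeckling, the blur-then-resize trick mentioned below), but even skew correction on its own is worth automating.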
almost every page is skewed to some degree. even though this might not be apparent to you without careful analysis, it _is_ a factor with big impact on the o.c.r. accuracy. and furthermore, when a person views page after page of the images, to read 'em, even a small skewing causes a subconscious weirdness to them. as for placement, i mean the left and top margins of each scan are identical. it's another factor effecting reader subconscious. while it's less important to o.c.r. accuracy, it does sometimes exert an impact there too, specifically in regard to the "zoning". (and yes, you _do_ have to zone the pages to get the best o.c.r.) there are a whole slew of other ways to manipulate the images. i don't have any experience with some of them, to discuss them, but there are some people over at distributed proofreaders who seem to know a lot, including one person whose name escapes me, who has formulated his "recipe" for enhancing page-scan images. interestingly, it includes "blurring" the image at one point, which certainly seems counterintuitive, but has the effect of converting the one-pixel dots into two-pixel dots (or some such), which means they don't get deleted in a later step where the image is downsized. (d.p. resizes many scans to a size that works well in their system; that also might be considered a shortcoming in their scan-archive.) now some of the skeptics out there are probably muttering now that adding time to the imaging process to save it on the proofing process isn't really "saving" us any time. and there is a little truth to that. however, many of these image-cleanup steps can be _automated_, so they are great candidates for inclusion in our ideal work-flow. even more importantly, it's vital that we start considering the scans as a product in and of themselves. i fully agree with michael hart that "a picture of a book is not an e-book". i too want raw, editable text. but that doesn't mean a high-quality "picture of a book" isn't useful. indeed, as pointed out here, it's the first step on the way to getting the raw, editable text. and even after that, it continues to be useful. people _will_ -- in the future -- desire to _replicate_ older books. they will want print-outs that "look exactly like" the original book. (_especially_ with books like those by william blake, for instance.) and the best way to fill that demand is to have high-quality scans. tomorrow's low-end printers will be 600dpi (if they aren't already). so that's the resolution that we need to be aiming at with our scans. yes, i fully realize that that is ridiculous in terms of the present, when that kind of resolution overwhelms our memory and bandwidth, as soon as we stop thinking about books at the individual-book level and start thinking about them as collections in the tens of thousands. which is precisely why i tell people now that 300dpi is acceptable, even for the "archive" versions we're building for the here-and-now, just as long as the 300dpi scans give us acceptable o.c.r. recognition. but i give louder applause to the foresight to go to 600dpi right now. (me, though, i'll go 300dpi unless/until i have a high-speed scanner, expecting that _every_ book i'll scan will eventually be rescanned.) > then comes a first pass proofreading, > also fixing headers and footers. > this is often 30 seconds per page. ok, after you've cleaned up the scans, you can start the "proofing". 
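Before moving on to the proofing, here is roughly what the "standardize placement" step just described could look like in code; this is a sketch with made-up canvas, margin, and threshold values, not anyone's actual recipe:

    from PIL import Image

    def normalize_placement(path, canvas=(2500, 3300), margin=150, threshold=200):
        page = Image.open(path).convert("L")
        ink = page.point(lambda p: 255 if p < threshold else 0)   # mask of the printed area
        box = ink.getbbox()                                       # bounding box of the ink
        if box is None:                                           # blank page: leave it alone
            return page
        out = Image.new("L", canvas, 255)
        out.paste(page.crop(box), (margin, margin))               # identical top/left margins on every scan
        return out

    normalize_placement("page_042_deskewed.png").save("page_042_placed.png")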
but there are lots and lots of different ways of "doing the proofing", so let's be perfectly clear about exactly what we we're talking about. my software tool guides you through the processes a certain way, so i'll be discussing that path. like i said, i plan to release my tool in late spring, about the same time that the internet archive begins to release scan-books from their toronto project, so if you prefer, save this post until then, when my tool is out. that's fine with me. on the other hand, if you want to consider my alternative processes, to see which ones you can incorporate into your work-flow, read on. i don't mean to frustrate anyone by saying "i've got a tool to do that" before the tool is released. but if this advance information helps... the first thing to do is a quick check that you got all the scans right. my tool allows you to "thumb through" all of them, from start to end; it displays them 2-up, so they look exactly like a p-book page-spread. on the first pass, you'll just look at each spread, ensure it looks good. on the second pass, you'll be looking at the text instead of the scans. here, the 2-up view shows the text on one side, the scan on the other. (my tool uses this 2-up view -- text next to its scan -- throughout.) in this pass, you'll be formatting the text, to make it match the scan. i'm still in the process of figuring the best way to save o.c.r. output, i hope my tool will do most of the formatting right automatically, but when it doesn't, you will have to do the formatting yourself, manually. "manually" doesn't mean "editing", like you'd do with a word-processor. while that may be necessary on some rare occasions here, in general there will be buttons that you can click to do most of the formatting. for instance, say there's a block-quote that didn't get auto-formatted. you would select the lines of the quote, and hit a "block-quote" button. same for a poem that didn't get indented, or to right-justify an epigraph. if your book is like most -- one boring page after another boring page -- there will be very little for you to do. for "my antonia", for instance, the only real excitement here was with the occasional chapter heading. for books that need heavy formatting, you should save that for later, and move to the next step, which is where the tool starts "proofing". my tool -- and the ones that are being developed by other people too -- takes the o.c.r. results and automatically makes some changes _before_ ever presenting them to you "for proofing". for the most part, these are changes due to known recurring errors in the o.c.r. recognition routines, so a person generally needs to build a list idiosyncratic to their setup. (one person doing this had a list of over 400 rules with his old scanner, but when he bought the optic3600, he was able to drop _half_ of them.) there are also some checks that are generic to all setups. an example would be replacing any "tbe" word with "the". undoubtedly a flyspeck caused that nonsense error, so we would just change it automatically. remember that all of these changes are taking place _before_ the text has even been viewed yet by a human being, so if -- for some reason -- it _really_was_ "tbe" instead of "the" (because, for instance, it was _this_ message that was being scanned), the human can change it back! 
(well, if it actually was _this_ message being scanned, then the change wouldn't be _automatic_, not with my tool anyway, because any "scanno" that is in quotes is _not_ changed automatically, for just that reason. but you get my point: it's safe to make automatic changes at this time, because we know that human beings are still going to review the text.) there are a number of other checks that happen at this time as well, based on analyses of the text. i won't say much about these, because that would give away too much about my program before its release, but some of the obvious ones would include the one to "close up" the spaces that o.c.r. often injects around punctuation. (or which, like in "my antonia", are _really_ right there in the paper-book. an example is on the very first page -- page 3 -- where "hands" is surrounded by such floating quotemarks; it's clearly printed as " hands ". even jon, with his focus on "fidelity", tightened up those floating quotemarks.) this is where the o.c.r. of "mr," and "mrs," -- followed by a comma, instead of a period (which i mentioned before) -- would get fixed. all of these automatic changes are logged to a file, so they can be reviewed by a human. except that review is often a waste of time, because these changes are (or at least should be) totally obvious. and if your review _does_ show an auto-change that was incorrect, and therefore shouldn't have been made, you would seriously consider _the_removal_of_ the rule that was responsible for that auto-change. also, kent, since you specifically mentioned headers and footers, a good tool will let you retain those right up until the last minute. they don't hurt anything -- and they help you keep your bearing -- so there's no need to delete 'em. the tool should de-emphasize them -- mine displays them in gray, which makes 'em unobtrusive _and_ has the benefit of letting you know it identified them correctly -- but they're something you shouldn't have to spend time on in any way. after the automatic changes comes the fun part. at this time, the app does the hard work. again, i don't wanna steal thunder from my tool, but the aim at this point in time is to present to you _each_line_ that will need your attention (accompanied by the page-scan containing it), and _only_ those lines that need your attention (i.e., no false-alarms). that is, the tool seeks to find every line that has an _error_ in it, and present it to you, alongside a page-scan, so you can correct the error; and it seeks to show you _only_ those lines that really have an error, so it doesn't waste your time showing you lines you don't need to fix. that is the "secret sauce" in the tool -- to show you _every_ line that you'll need to fix, and _only_ the lines that need fixing, and no others... of course, that's the _ideal_, and we can only hope to _approach_ that. after all, if the tool knew for certain where each and every error was, we could just tell it to correct the errors itself, while we ate lunch. so we scale our expectations back to something a bit more reasonable, and have the program bring up -- to the best of its ability to do so -- each line for which it has some good reason to think we need to check. to put this into a phrase, we have the tool look for _probable_ errors. some of them might not actually be errors, but we go on probability... we do want to find _all_ the errors, or as many as we reasonably can, so we'll accept _some_ "false alarms". they're preferable to _missing_ an actual error. 
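A minimal sketch of that automatic pre-proofing pass, with the quote guard and the change log described above; the rule list here is a made-up example of the kind of per-setup list being discussed, not anybody's actual 400-rule file:

    import re

    RULES = [(r"\btbe\b", "the"), (r"\bMr,", "Mr."), (r"\bMrs,", "Mrs.")]   # illustrative only

    def auto_clean(lines, log_path="autochanges.log"):
        cleaned, log = [], []
        for n, line in enumerate(lines, 1):
            for pattern, repl in RULES:
                def fix(m, line=line, repl=repl, n=n):
                    # crude guard: skip a hit with an odd number of quote marks
                    # to its left, since it is probably inside quoted matter
                    if line.count('"', 0, m.start()) % 2 == 1:
                        return m.group(0)
                    log.append("line %d: %r -> %r" % (n, m.group(0), repl))
                    return repl
                line = re.sub(pattern, fix, line)
            cleaned.append(line)
        with open(log_path, "w") as f:        # the reviewable log of every automatic change
            f.write("\n".join(log) + "\n")
        return cleaned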
but at the same time, too many of 'em wastes our time. after all, the tool could just show us _every_ line and say "check it"; but that wouldn't be buying us any improved efficiency now, would it? so the closer we get to the ideal -- show us every line we need to see, and not one line that we _don't_ need to see -- the better we like it. and if the tool tells us what is wrong with the line, and suggests the correct fix, with a "yes, fix it" button we can click, so much the better. to use an example from above, let's say that it offered to close up those floating quotemarks around "hands" with just the click of a button. slick! if we get _close_enough_ to the ideal -- where we are shown only lines that have errors, and no others -- then we will have just sat there and button-clicked, while our text became easily and adequately "proofed". once we've corrected every line that needs to be corrected, we are done! but we don't really have to get all the way to the ideal to be successful. again, my "standard" is 1 error every 10 pages. and i expect to do better. but if i attain that rate, i will consider my tool to have been "successful". i should say specifically that _spell-check_ is an important part of this. i find it laughable and ridiculous that distributed proofreaders does _not_ do a spell-check on the o.c.r. results before shipping them off to proofers. your first reaction might be "why do a spell-check, since that is exactly the job proofers are gonna be doing anyway?", plus then go on to point out how much time a spell-check would take, and various other considerations, perhaps even launch into your spiel about "what a distributed process is". (spare me; as a social psychologist, i understand it far better than most.) heck, there is actually some debate over at distributed proofreaders about whether a spell-check must be done _after_ the text comes out of proofing. which explains why some e-texts are actually being posted now that have obvious spelling errors in them that will _not_ pass a spell-check! awful! except i'm talking about a very specific form of limited spell-check, namely an analysis of the text that creates a list of all the words used in the book. again, i won't explain how it works, but the purpose is to compile the words that are _unique_ to the book. the best example is _names_of_characters_, another good example is _words_and_phrases_from_a_foreign_language_. and there are other categories. here are some examples from "my antonia": > kolaches > mamenka > misterioso > patria > tatinek > amour propre > noblesse oblige > Optima dies... prima fugit > palatia Romana > Primus ego in patriam mecum... deducam Musas these words are used to create a _book-specific_spell-check_dictionary_: words not in a normal spell-check dictionary, but which _are_ in the book. i believe that every e-text should include such a word-list in an appendix. first, it's useful, from the standpoint of end-users running a spell-check; once this book-specific word-list is specified as an additional dictionary, the entire file should pass through spell-check without pausing even once. but moreover, it's just plain _fascinating_ to browse this list for a book. it is a quickie road-map to the freakish extremes of that particular book. back to the job at hand... the word-list _is_ very useful to spell-check text right out of o.c.r., and _before_ you commence the job of "proofing". as a good example, remember those character-names?
when you browse an alphabetized version of the word-list, you'll see a name popping up in a variety of variant forms, such as the possessive, the plural, and so on. what you'll _also_ see, though, is an occasional place where the name was misrecognized. boom! my tool allows you to click on it, and then immediately jumps you to it in the text -- right alongside the image -- so you can verify that it's an error, and change to the correct spelling. (my plan is to have a button you can just click to make the correction.) and if the error is obvious enough, you might not even go to the bother of jumping to its location in the text, but rather just fix it immediately. (remember, you can review these changes if you want down the line.) one of the test-books i used to develop my tool, way back when i first started putting it together, was "the hawaiian romance of laieikawai". (some of you know this e-text was in the group issued for dp#5000.) i might've spelled that name wrong; face it, it's a pretty difficult one. and, as you can imagine, the o.c.r. yielded quite a few variations of it! there were literally _dozens_ of 'em, off by a letter or two (or more). and not surprisingly, there were many hawaiian names, long and short, in this text, and the o.c.r. came up with a number of variants on each! although it was a pleasant story, and the o.c.r. was relatively clean for the pages -- remarkably so, considering how bad the scans were -- those difficult names made the task of proofing a terrible nightmare, so this text took a fairly long time to make it through all the rounds. using my tool, however, all of the various scannos on those names were easy to locate, and to correct, and that task was done quickly. thinking about individual proofers, going to the trouble of correcting each of those name scannos, independently, manually, i am appalled! imagine how much of a hassle that was! what a tremendous waste! but the scenario is even worse, at least for proofers who were careful, and took their job seriously, because in order to check _whether_ the name is spelled correctly or not, you must examine _every_instance_. and that process is extremely error-prone. and fatiguing. and boring. if the name was _at_least_ in the spell-check-dictionary for the file, the spell-check on the d.p. page would show it was correctly spelled (when it was) by failing to highlight it. and flag incorrect spellings. but until it's in the dictionary, every occurrence must be scrutinized. think how much of the proofer's time and energy could've been saved if the instructions would have said, "hey, ignore the hawaiian names, we fixed them all in a global operation before you got these pages...". to subject proofers to those difficulties, when such a simpler method isn't being developed and utilized, is almost an abuse of the good-will those fine volunteers are giving you by donating their time and energy along about now, someone will say, "d.p. plans to install the capability for a proofer to add a word to the spell-check dictionary for a book." well, gee, after 6,000 books, i would _hope_ you finally got the idea! and if you did it _right_, you'd create the book-specific dictionary _automatically_, before the first page is sent to the first proofer. i don't mean to sound high-handed and morally indignant and all that, because i fully realize this is an ongoing learning process for everyone, but hey, i guess it's easy to waste volunteer time if you have lots of it. 
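The book-specific word list at the heart of that argument is simple to build; a sketch, assuming a plain-text OCR output and any ordinary word list as the base dictionary (the /usr/share/dict/words path and the file names are just examples):

    import re
    from collections import Counter

    def book_wordlist(text, dict_path="/usr/share/dict/words"):
        with open(dict_path) as f:
            known = set(w.strip().lower() for w in f)
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
        unknown = Counter(w for w in words if w.lower() not in known)
        return unknown.most_common()    # names, foreign phrases, and scannos, by frequency

    # a name seen hundreds of times is a character; a one-letter-off
    # variant seen once or twice is almost certainly a misrecognition of it
    for word, count in book_wordlist(open("laieikawai.txt").read()):
        print("%6d  %s" % (count, word))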
and it would address my concerns _greatly_ if the people-in-charge (and the loudmouths who _act_ like they are) would be _accepting_ when well-intentioned people try to advise them on their processes. but there is an active hostility over there to constructive criticism. and i find that tragic. but i digress... getting back to the matter of an _individual_ doing a book, though, my objective for that situation is to make that person _efficient_. so _this_ is the type of spell-checking that you need to do _first_, one whose essential operating philosophy is a _book-wide_basis_. and then, only after that, yes, if you are an individual doing a book, the next thing to do is a _regular_ old spell-check, the type that goes from one questionable word to the next. the difference here -- and yes, one that my tool facilitates, of course -- is that when you come to a questionable word, the _page-scan_ is shown right there. some people actually say, "you should never do a spell-check, because some words that will pop up are actually as they were in the original, and they need to be left that way. so a spell-check is a waste of time, because what you really need to do instead is a line-by-line comparison." that's poppycock. _of_course_ that situation _can_ happen. sometimes. and that's why you've got the scan there, to check the questionable word. i don't advocate a blind "correction" to each and every questionable word. and you must be able to easily add a word to the book-wide dictionary, if you find that my tool is continually popping up a word that it shouldn't. (but odds are that it would've been put in the dictionary in the prior step.) but _nonetheless_, if you want to find words the o.c.r. _misrecognized_ -- and remember, that's the objective, to isolate _probable_ errors -- the best bet is to look at words that aren't in the spell-check dictionary. all right, so that takes care of spell-check. a final set of checks is then done that looks for anomalous situations; some of these involve punctuation, infrequent juxtapositions, and so on. there are some words that pass spell-check that you still want to view -- they are called "stealth scannos" over at distributed proofreaders -- and they are one of the things that are checked in this final set. and at that point, you're done with the text-cleanup. congratulations. all in all, as well as i can tell from the testing that i've done so far, you can expect the tool will present between 1% and 5% of the lines in the text-file to you for one kind of close examination or another, and perhaps 75% of those will require a "fix" of some kind or another, assuming that you got relatively clean o.c.r. results in the first place. that's a lot better than looking at 100% of the lines to "proof" them. and that, my friends, is how you can do a whole book in a few hours. unless you put aside that heavy markup earlier. if so, it's time to do it. once again, you will page through the book, text and scan side by side, doing whatever editing needs to be done so the text is formatted right. without knowing what kind of formatting you'll need to do, it's hard to tell you how you'll go about doing it. so you'll have to wait until you can get some hands-on experience with the tool to see exactly how it'll work. but it definitely will not be anything like the pseudo-markup over at d.p. -- where, for example, /* and */ are used to bracket poetry and stuff -- and it will most certainly not be any form of x.m.l. or h.t.m.l. markup . it _will_ be z.m.l. 
-- invisible markup that mimics the p-book page. and as my tool gets more and more advanced, it will actually _display_ the text just exactly as it will be shown by the z.m.l. viewer-program. and sooner or later, the two apps will morph into one. (bet on sooner.) how complex can formatting get using z.m.l.? we'll have to see... ;+) so now that you've gone through all the post-o.c.r. cleanup my tool does, and the pages are nicely formatted so they resemble the original p-book, what next? well, it's probably the case now that your text is _already_ clean enough to meet or exceed our standard of 1 error every 10 pages. but i assume that if you're doing this book as an individual, it's because _you_actually_have_an_honest_desire_to_read_or_re-read_this_book._ because _that_ is really the absolute _best_ reason to digitize a p-book. so read it! read it in my tool, which allows you to display the image of the page right alongside the o.c.r. text for that page. keep in mind that you are reading for the express purpose of catching any errors in the text, so read carefully. at the same time, though, read for your enjoyment too! it's only by being engrossed in the story that you'll catch some errors, such as a word or a line inadvertently dropped. so become engrossed! if you find an error, first _log_it_! keep records, to improve the tool. _then_ use your word-processor to search the text for _similar_ errors. if that search yields other instances, see what you can learn from them, and expand your search based on anything you can generalize about them. some errors are flukes -- a coffee-stain on the page, or what have you. but others can be recurrent, and if you can pin down a recurrent error, you will become much more efficient in your efforts to clean up a text. finally, i will mention again that _text-to-speech_ can be _amazing_ in helping you to locate errors in a text you might never have _seen_. my tool will do text-to-speech; it'll even pronounce the punctuation, if you select that option, so you can verify that in your text as well. so i highly recommend that -- rather than reading the text to check it for that final "proof" -- you _listen_ to it instead, via text-to-speech. this has the added benefit that you can do it away from your computer. a lot of people enjoy putting a book onto a walkman, or even an ipod, and listening to it in the car, or at the exercise club, or out jogging. that's fine. (just be conscientious about _remembering_ any errors!) once you have done this final check, your "proofing" job is all finished. say what? does this mean i don't advocate a line-by-line comparison? isn't that what most people, like d.p., consider to _be_ "the proofing"? well, let me put it this way: if you _want_ to do that, by all means, do! do i think it's absolutely necessary? well, in most cases, absolutely not! doesn't a failure to do that mean that you might release a text that has some small errors in it? well, yes, it certainly does, but that is exactly why i build the "continuous proofreading" step into my overall processes. no matter how good a job you might do, certainty requires more eyeballs. so if you're really feeling insecure, have other people read your file too. better yet, have someone else process the book completely independently, and compare their final file to yours. that should catch _every_ error.
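That "process it independently and compare" check at the end is a few lines of standard-library code; a sketch, with illustrative file names:

    import difflib

    def compare_versions(path_a, path_b):
        a = open(path_a).read().split()
        b = open(path_b).read().split()
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
            if tag != "equal":      # print only the words the two transcriptions disagree on
                print("%-7s  yours: %r   theirs: %r" % (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))

    compare_versions("antonia_mine.txt", "antonia_theirs.txt")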
but if an error hides through all of the tools, and withstands a reading by an engrossed human and/or wasn't noticeable during text-to-speech, then that error is insignificant enough that i'm not gonna worry about it. i think it _should_ be corrected, and (due to "continuous proofreading") that it eventually _will_be_ corrected. but i ain't gonna worry about it. and considering the care i put into listserve posts, it's obvious i'm anal. there are 6,272 words in 707 lines in this message. find the typo in it. i circle the mistakes in everything i read, for the sheer fun of doing it. so if i can live with that error, hey, you can probably live with it too... at the point of insignificant errors, our attention is much better spent with a focus on digitizing additional books. i'll repeat, so it sinks in, that if someone _wants_ to do line-by-line comparison, that's _great_. but if we can get texts that are far-and-away error-free without it, then _i_ have far better ways to spend my time, thank you very much. and don't try to make that out that i don't care about finding errors, or that i'm talking about "something different" than what you mean, and that's the only reason i say it can be done in just one evening. because my processes will give just as accurate results as yours. and i'll be happy to prove it by finding the errors in _your_ e-texts. anyway, now you're done _proofing_, but you're not _completely_ done. because there's just one more step before you can send your e-text out. up until now, you might have had the text from each page in its own file. (or maybe you had it all in one file, since my tool can work either way.) but if you had them in separate files, they'll now need to be combined. we also want to get rid of the headers and footers and make it all nice. these are things my tool does for you -- mostly automatically -- but there are a few that do require some input from you, and some others you have to monitor to make sure they are done correctly. one example would be footnotes, which are moved to be end-notes. another example is to make sure all headings are at the right level. and when the end-line hyphenation is removed, you might be asked to make decisions for the tool when it seeks your guidance on that job. but for the most part, the tool will step you through all these tasks. it assumes that you're not an expert at doing this, and it helps you. there isn't that much more for me to explain about this final step, other than to mention that you _might_ want to execute this step before you read through the book or listen to it via text-to-speech. once you've concluded these steps, your file is a bona fide e-book. congratulations! you've moved a book into the realm of cyberspace! you can load your e-text into my z.m.l. viewer-program, and boom!, you'll see that what you created is a high-powered electronic-book! the headings are big and bold! your table-of-contents is hot-linked! words that were italicized in the p-book, which my tool marked with underscores like _this_, are again shown in all their italicized glory! illustrations are displayed on the appropriate page, automatically, and all you did was make sure their file-name was nearby that text. after this step, future versions of my tool might perform conversions of the e-text to other formats, like .html and .pdf and .rtf, if you want. plans in that regard are still fairly tentative, and i might decide that i will leave that matter to the end-reader using my viewer-program. 
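One concrete piece of that final assembly step is the end-line de-hyphenation decision mentioned above; a sketch of the usual heuristic (rejoin the word if the joined form already occurs elsewhere in the book, otherwise ask), with the interactive prompt standing in for the tool asking for guidance:

    import re

    def dehyphenate(text):
        vocab = set(re.findall(r"[a-z]+", text.lower()))
        def join(m):
            left, right = m.group(1), m.group(2)
            if (left + right).lower() in vocab:      # "to-day" vs "today": let the book decide
                return left + right + "\n"
            answer = input("keep the hyphen in %s-%s? [y/N] " % (left, right))
            return (left + "-" + right if answer.lower().startswith("y") else left + right) + "\n"
        return re.sub(r"(\w+)-\n(\w+)", join, text)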
your time might be better allocated by proceeding on to the next book. after all, it was fun to do it, wasn't it? and it only took one evening! > The real problem is my day job is using up most of my available > concentration, so I don't feel up to spending too much time proofing. well, yeah, there's no question that this job does take concentration. there's really no way around that. i will say, however, that my tool helps to _conserve_ your concentration by helping you to _focus_ on the things that require your attention, and not the things that don't. and that's really the big secret in making people more efficient here. indeed, that's what enables you to do an average book in just one evening. anyway, i have exposed enough flaws and gored enough sacred cows in this post that i can feel the vilification efforts building already. like i said, unless i am prodded, this is my last post in this thread. and except for a few final reports on the other threads, i'm all done. if those vilification efforts break out, though, and i am challenged, i _will_ remain here to defend myself, as i stand behind this post... otherwise, i'll be out of here until one of these tools is released, either from me or from one of the other people working on them, or until someone comes on here trying to tell you this job is hard. it ain't, folks. it's easy. and people have been flying for decades... the choice is up to you, people... -bowerbird From tb at baechler.net Thu Mar 10 23:50:50 2005 From: tb at baechler.net (Tony Baechler) Date: Thu Mar 10 23:49:20 2005 Subject: [gutvol-d] No part 2 of newsletter Message-ID: <5.2.0.9.0.20050310234714.01f6ace0@baechler.net> Hi. I'm sure I'm not the only one to notice this, but neither George, Greg or Michael commented, so I'm asking. What happened to part 2 of the newsletter? I got part 1 as always. George said that he would no longer be editing so part 2 would now be automated, but I think something must have happened because I never got anything. I did not fully read part 1 but I think based on length it is too short to contain a full list of new books. Any thoughts? Any idea when it will be sent out? No big rush, but I'm curious to see the apparently new, automated format if there is one. From JBuck814366460 at aol.com Fri Mar 11 00:18:49 2005 From: JBuck814366460 at aol.com (Jared Buck) Date: Fri Mar 11 00:19:20 2005 Subject: [gutvol-d] No part 2 of newsletter In-Reply-To: <5.2.0.9.0.20050310234714.01f6ace0@baechler.net> References: <5.2.0.9.0.20050310234714.01f6ace0@baechler.net> Message-ID: <1110529129.22730.1.camel@lsanca1-ar51-4-42-023-178.lsanca1.dsl-verizon.net> On Thu, 2005-03-10 at 23:50 -0800, Tony Baechler wrote: > Hi. I'm sure I'm not the only one to notice this, but neither George, Greg > or Michael commented, so I'm asking. What happened to part 2 of the > newsletter? I got part 1 as always. George said that he would no longer > be editing so part 2 would now be automated, but I think something must > have happened because I never got anything. I did not fully read part 1 > but I think based on length it is too short to contain a full list of new > books. Any thoughts? Any idea when it will be sent out? No big rush, but > I'm curious to see the apparently new, automated format if there is one. > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d Don't worry, Tony, sometimes it takes a week or so to switch to automated emails from emails that are hand-edited. 
That's been my experience with newsletters I subscribe to that converted to automation from hand-done newsletters. If it doesn't come, let me know, and I'll talk to Michael. Jared From jeroen.mailinglist at bohol.ph Sat Mar 12 08:17:08 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sat Mar 12 08:16:07 2005 Subject: [gutvol-d] lest the message be missed In-Reply-To: <422F782C.20600@blueyonder.co.uk> References: <1b9.ef334b5.2f609e1d@aol.com> <422F782C.20600@blueyonder.co.uk> Message-ID: <42331604.6070002@bohol.ph> Hi Miranda, May I also claim this wonderful resource, to finish all my obscure Philippine grammars. The last one took 10 months to go through DP, so I am somewhat discouraged to put up more of these. They are absolutely very scarce and very important works, and helpful in reviving interest in those languages, which, although widely spoken, until today lack any official status. I also have some great dictionaries, and while we are at it, I still wish to convert my great Hiligaynon dictionary (over 1000 pages in two columns, small type print) to a nice ebook. Scans are ready for shipping. Loads of accents, single letters in italics that are significant, and so on. I think that resource will also have the time to deal with my wonderful Sanskrit dictionary, written by Monier-Williams, and with 1600 A3 pages in tiny print (three columns, Devanagari and Greek script as a bonus), it should just be a breeze for this powerbird. When they are through, I have some great census books as well... thousands of pages of 6 point letter tables, and we cannot tolerate a single mistake. O yes, he can simply download all stuff from http://www.hti.umich.edu/cgi/t/text/text-idx?c=philamer;cc=philamer;tpl=home.tpl Jeroen Hellingman Miranda van de Heijning wrote: > hi bowerbird, > > This sounds very exciting! I have a book which I want to put online, a > grammar in three languages with loads of accents etc. It is very > difficult and I expect it will take a long time to get through DP, > which will be a shame as it is a very important text. I am encouraged > to hear you can make this into an e-text in one evening! The scans are > done and if you like I will mail you a copy. I'd like to have the > proofed book back before the weekend, if that's not too much trouble. > > Thanks so much! > > Miranda van de Heijning > From traverso at dm.unipi.it Sat Mar 12 09:31:50 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Sat Mar 12 09:31:47 2005 Subject: [gutvol-d] Scanning/OCR tips Message-ID: <200503121731.j2CHVoX03502@posso.dm.unipi.it> In the margin of the BB-Jon discussion, I would like to issue a warning from my experience with FineReader OCR: it is not true that higher resolution scans always provide better OCR. Often they cause minute imperfections of the original print to be recognized as letters, punctuation marks or diacritics. Sometimes it pays to reduce the resolution of the scans to 300DPI and use the reduced images; FineReader seems to expect 300DPI scans. Higher resolution is only better with very small type. I think that this is a bug of FineReader (should not recognize as letters etc. image details that are much smaller than the other characters, or incorrectly placed) but this is something on which we don't have control, except by pre-processing images. Often the higher resolution scans have different errors from reduced resolution scans; procedures to compare the OCR at different resolutions might lead to better global recognition.
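Carlo's resolution tip is easy to automate ahead of the OCR run; a sketch using Pillow, assuming the source resolution is known (600 dpi here) and with illustrative file names:

    from PIL import Image

    def reduce_to_300dpi(path, source_dpi=600):
        img = Image.open(path)
        factor = 300.0 / source_dpi
        small = img.resize((int(img.width * factor), int(img.height * factor)), Image.LANCZOS)
        out_path = path.replace(".png", "_300dpi.png")
        small.save(out_path, dpi=(300, 300))    # hand this copy to the OCR engine, keep the original
        return out_path

    reduce_to_300dpi("page_042_600dpi.png")

Keeping both versions also makes it easy to try his other suggestion, diffing the OCR output of the two resolutions to see where they disagree.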
Globally, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort of developing free OCR software of good quality, in which the knowledge of the source can be used in the recognition process, could be well spent for the needs of PG. Another domain in which a considerable progress could be attained is the spell-checking software, that is much more tuned to typing than to OCR, especially of old texts. It is common experience that the most common OCR errors are down in the list of suggestions. This is however a domain in which free software exists, and the problem is one of a metric tuned for OCR in the corrections space. Carlo From gbnewby at pglaf.org Sun Mar 13 18:48:57 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Mar 13 18:48:59 2005 Subject: [gutvol-d] FWD: converting text into audio for reader format In-Reply-To: <42013639.5000100@sheridanc.on.ca> References: <420122D6.30201@sheridanc.on.ca> <20050202193514.GB29652@pglaf.org> <42013639.5000100@sheridanc.on.ca> Message-ID: <20050314024857.GB12812@pglaf.org> Please see the below - the question is, what eBook readers with text to speech capabilities can input a .txt file (versus .htm etc.) Please copy Donna Woodstock or respond to her directly with any suggestions. Thanks! On Wed, Feb 02, 2005 at 03:21:13PM -0500, Donna Woodstock wrote: > Hi Greg, > > If it's no trouble to forward to the list that would be appreciated. I > tried searching for a html format for Frankenstein...it shows it is > available but it still downloads as a .txt file. > > Cheers! > > Greg Newby wrote: > >On Wed, Feb 02, 2005 at 01:58:30PM -0500, Donna Woodstock wrote: > > > >>I am wondering if it is possible when an ebook is downloaded to be able > >>to open it up in a reader that has audio capabilities. I've tried > >>Microsoft reader but I cannot get it to read the text format. If you > >>can recommend a reader that can do this I would greatly appreciate it. > > > > > >Hi, Donna. We've had people using products like ViaVoice and > >other text-to-speech programs. I don't know anything about Microsoft > >Reader's audio capabilities - it might be it's not capable > >of processing .txt files. Perhaps you could try one of our > >titles in HTML (see http://gutenberg.org/find)? Or, it might > >be necessary to transform a .txt to the proprietary Reader > >format. I know people can do this, but we don't have any > >information about the tools. If you're still stuck, I can > >forward your note to the gutvol-d list (http://lists.pglaf.org) > >to see whether people can provide some more specific guidance. > > > >Sorry this isn't too helpful... > > -- Greg Newby > From shimmin at uiuc.edu Mon Mar 14 06:18:05 2005 From: shimmin at uiuc.edu (Robert Shimmin) Date: Mon Mar 14 06:18:11 2005 Subject: [gutvol-d] Scanning/OCR tips In-Reply-To: <200503121731.j2CHVoX03502@posso.dm.unipi.it> References: <200503121731.j2CHVoX03502@posso.dm.unipi.it> Message-ID: <42359D1D.6050205@uiuc.edu> Carlo Traverso wrote: > In margin of the BB-Jon discussion, I would like to issue a warning > from my experience with FineReader OCR: it is not true that higher > resolution scans always provide better OCR. Often they cause minute > imperfections of the original print to be recognized as letters, > punctuations or diacritics. Sometimes it pays to reduce the resolution > of the scans to 300DPI and use the reduced images; FineReader seems to > expect 300DPI scans. Higher resolution is only better with very small > type. Agreed. 
Higher resolution only improves recognition of small type. Once the resolution is high enough that the thinnest parts of the letters are reliably one pixel thick, if the software misrecognizes the character, it will misrecognize a larger character of the same shape. At 300 dpi, "normal" sized roman fonts seem to usually have thick stems 3-4 pixels wide, thin stems 1 pixel wide, and serifs also 1 pixel wide. Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black / white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to make use of a B/W version. My usual scanning practice for DP is to do 300 dpi B/W scans for text, and 300 or 600 dpi greyscale scans for illustrations. > Globally, and not unexpectedly, the OCR seems well tuned to recent > print in contemporary language. My impression is that an effort of > developing free OCR software of good quality, in which the knowledge > of the source can be used in the recognition process, could be well > spent for the needs of PG. But it can be trained to recognize other fonts with some success. I have trained Abbyy 5 Pro on blackletter with not stellar, but not exactly embarrassing, results. There is an (unfortunately out of most people's price range) version of Abbyy 7 that is designed with oldstyle fonts in mind. If this software is the Abbyy 7 engine, specially trained on old text, it suggests that we might do well to set up a place to share our pre-trained user patterns for old printing styles. -- RS From vze3rknp at verizon.net Mon Mar 14 07:12:27 2005 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Mon Mar 14 07:12:30 2005 Subject: [gutvol-d] Scanning/OCR tips In-Reply-To: <42359D1D.6050205@uiuc.edu> References: <200503121731.j2CHVoX03502@posso.dm.unipi.it> <42359D1D.6050205@uiuc.edu> Message-ID: <4235A9DB.5010001@verizon.net> Robert Shimmin wrote: > Also, greyscale images do not appear to improve OCR with Abbyy either. > Although I'm not privy to their algorithms, certain aspects of the > user interface suggest to me that the software only operates on black > / white values, and even if you take greyscale scans, the software > thresholds them for the purposes of recognition. You have the > greyscales to save for whatever other purposes you wish to put them > to, but the software itself seems to make use of a B/W version. > I have found, using Finereader 6.0 Corporate, that for certain kinds of material I do get substantially better recognition results from greyscale. The best examples are some old medical journals from the 1820's that are severely foxed. Finereader is able to recognize most of the text on these in greyscale, where B&W scanning produced images that even humans can't read. In sizing these down for proofing at DP, I found I could not go to B&W but had to go to 2-bit greyscale, and even then there were a few pages that needed the full 8-bit greyscale to be legible. I always scan at 600 dpi B&W with the sheet-fed high-speed scanner because that slows it down enough for me to hand feed it (which is often necessary with the old paper). It doesn't seem to change the recognition quality much either way.
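Since both of these posts come back to thresholding, here is a bare-bones sketch of the black/white conversion an OCR engine effectively performs on greyscale input; the per-page cutoff heuristic is illustrative and far cruder than what FineReader actually does:

    import numpy as np
    from PIL import Image

    def binarize(path, cutoff=None):
        grey = np.asarray(Image.open(path).convert("L"))
        if cutoff is None:
            cutoff = np.median(grey) - 40            # a little below the paper tone of this page
        bw = np.where(grey > cutoff, 255, 0).astype(np.uint8)
        return Image.fromarray(bw, mode="L")

    binarize("foxed_journal_p017.png").save("foxed_journal_p017_bw.png")

Choosing the cutoff per page from the greyscale data is essentially why the foxed journals fare better scanned in greyscale than in straight B/W.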
JulietS From traverso at dm.unipi.it Mon Mar 14 09:58:14 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Mon Mar 14 09:57:52 2005 Subject: [gutvol-d] Scanning/OCR tips In-Reply-To: <4235A9DB.5010001@verizon.net> (message from Juliet Sutherland on Mon, 14 Mar 2005 10:12:27 -0500) References: <200503121731.j2CHVoX03502@posso.dm.unipi.it> <42359D1D.6050205@uiuc.edu> <4235A9DB.5010001@verizon.net> Message-ID: <200503141758.j2EHwEN08451@posso.dm.unipi.it> I confirm that FineReader stores the images internally as monochrome. Probably grayscale works better because of an optimized thresholding algorithm; but in general the quality of the B/W scans very much depend on the quality of the scanning software: my B/W scans with the Plustek OpticBook are very much better that the scans of a (low-end) Epson. Carlo From marcello at perathoner.de Mon Mar 14 11:06:05 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 14 11:05:47 2005 Subject: [gutvol-d] Another PG 'clone' Message-ID: <4235E09D.70102@perathoner.de> How do we like this one? http://www.gutenberg.com -- Marcello Perathoner webmaster@gutenberg.org From miranda_vandeheijning at blueyonder.co.uk Mon Mar 14 11:25:53 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Mon Mar 14 11:26:04 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <4235E09D.70102@perathoner.de> References: <4235E09D.70102@perathoner.de> Message-ID: <4235E541.6060406@blueyonder.co.uk> This is what it says: "Project Gutenberg is a wonderful project that has been going on for several decades, making public domain books available to people for free. We support the work of Project Gutenberg. We also believe, however, that as more and more value is added to books, even public domain books, that people will pay reasonable prices for these new information forms. So, whether a book is $1 or free is not a big issue to us at Gutenberg.com, but it is an issue if that additional $1 allows for newer and better services to be offered. Project Gutenberg and Gutenberg.com are not affiliated and if you look at the About page on Gutenberg.com, you will see that ebooks, and within that context free ebooks, will be a portion of this site. And we plan to have many places where free ebooks can be found, including Project Gutenberg. We hope you will join us here at Gutenberg.com as your home for the next phase of books, publishing, ebooks, and so on and so forth. This list below is from Project Gutenberg's site, and we are putting this up here to see how people like it." [followed by our PG Top 100 with links to the books] It's obviously making money on PGs reputation but very clear about the fact they are not directly affiliated with PG. It is free PR, but would this still be considered abuse of the PG trademark? Marcello Perathoner wrote: > How do we like this one? > > http://www.gutenberg.com > > From joshua at hutchinson.net Mon Mar 14 11:29:08 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon Mar 14 11:29:16 2005 Subject: [gutvol-d] Another PG 'clone' Message-ID: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> The site makes it very clear on the front page that they are not affliated with PG, so I've got no problem with it. I'd prefer a different domain name, but obviously we don't own the domain name, so it was free for anyone to take, I suppose. At least this one is actually something to do with eBooks and not a porn site or something... 
:) Josh ----- Original Message ----- From: "Marcello Perathoner" To: "Project Gutenberg volunteer discussion" Subject: [gutvol-d] Another PG 'clone' Date: Mon, 14 Mar 2005 20:06:05 +0100 > > How do we like this one? > > http://www.gutenberg.com > > > -- Marcello Perathoner > webmaster@gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From marcello at perathoner.de Mon Mar 14 11:33:53 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon Mar 14 11:33:32 2005 Subject: [gutvol-d] Announce: web site directory switch Message-ID: <4235E721.8@perathoner.de> We are switching directories on the web site as announced on 02/23. If you have changed any content since then you should test it now. The old directories are still there if you forgot to copy things over. -- Marcello Perathoner webmaster@gutenberg.org From hacker at gnu-designs.com Mon Mar 14 11:34:50 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Mon Mar 14 11:36:20 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: > The site makes it very clear on the front page that they are not > affliated with PG, so I've got no problem with it. I'd prefer a > different domain name, but obviously we don't own the domain name, > so it was free for anyone to take, I suppose. At least this one is > actually something to do with eBooks and not a porn site or > something... :) But aren't they using the Gutenberg name to drive banner ad revenue to them, instead of to the "real" Gutenberg sites and pages? With a bit of creative SEO, they could end up with a PR8 or higher, knocking you off the SERPS for Google and Yahoo hits. They may say they're not affiliated with Project Gutenberg, but if you look at their meta keywords, they certainly are making that direct association, because they mention 'project gutenberg' a few times. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From ian at babcockbrown.com Mon Mar 14 11:40:34 2005 From: ian at babcockbrown.com (Ian Stoba) Date: Mon Mar 14 11:40:46 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: Here's the information from whois about the domain registrant. Does anyone know Chris Andrews? According to his web site (http://www.chrisandrews.com), he was an early participant in multimedia CD-ROMs. He has a page on his site about digital publishing, but there seems to be very little information there. Registrant: Chris Andrews (GUTENBERG-DOM) po box 1330 los altos, CA 84024 US Domain Name: GUTENBERG.COM Administrative Contact: Andrews, Chris (30036170I) chris@chrisandrews.com Chris Andrews PO Box 3550 Los Altos, CA 94024 US 650-599-3747 fax: 650-599-3747 Technical Contact: Network Solutions, LLC. (HOST-ORG) customerservice@networksolutions.com 13200 Woodland Park Drive Herndon, VA 20171-3025 US 1-888-642-9675 fax: 571-434-4620 Record expires on 02-Mar-2012. Record created on 01-Mar-1995. Database last updated on 14-Mar-2005 14:36:53 EST. 
Domain servers in listed order: NS41.WORLDNIC.COM 216.168.228.23 NS42.WORLDNIC.COM 216.168.225.172 On Mar 14, 2005, at 11:29 AM, Joshua Hutchinson wrote: > [snip] From jhowse at nf.sympatico.ca Mon Mar 14 16:44:54 2005 From: jhowse at nf.sympatico.ca (JHowse) Date: Mon Mar 14 12:13:36 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: <5.1.0.14.0.20050314164355.00a62dc0@pop1.nf.sympatico.ca> At 02:34 PM 14/03/05 -0500, you wrote: > But aren't they using the Gutenberg name to drive banner ad >revenue to them, instead of to the "real" Gutenberg sites and pages? >With a bit of creative SEO, they could end up with a PR8 or higher, >knocking you off the SERPS for Google and Yahoo hits. > > They may say they're not affiliated with Project Gutenberg, >but if you look at their meta keywords, they certainly are making that >direct association, because they mention 'project gutenberg' a few >times. and with their top ebooks list, they are actually linking to Project Gutenberg. JH ================================================================================ "I'm not likely to write a great novel or compose a song or save a baby from a burning building...but I can help make sure that there is an electronic library of free knowledge available for future people to access."--jhutch. Preserving History One Page at a Time!!
Celebrating our 6000th book posted to Project Gutenberg Join Project Gutenberg's Distributed Proofreaders http://www.pgdp.net/c/ ================================================================================ From jeroen.mailinglist at bohol.ph Mon Mar 14 15:51:03 2005 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Mon Mar 14 15:50:35 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <4235E09D.70102@perathoner.de> References: <4235E09D.70102@perathoner.de> Message-ID: <42362367.10202@bohol.ph> Hi All, I've registered www.gutenberg.ph today, but that will be the home of a Philippines-oriented Project Gutenberg, and I have asked Michael beforehand. Initially, it will contain many pointers back to the original PG, but we are planning to add more materials that cannot be cleared in the US (the Philippines is a life+50 country). Anyway, I'm mentioning it now, before people discover it and start asking questions. Jeroen Hellingman. Marcello Perathoner wrote: > How do we like this one? > > http://www.gutenberg.com > > From tb at baechler.net Tue Mar 15 00:21:35 2005 From: tb at baechler.net (Tony Baechler) Date: Tue Mar 15 00:19:58 2005 Subject: [gutvol-d] FWD: converting text into audio for reader format In-Reply-To: <20050314024857.GB12812@pglaf.org> References: <42013639.5000100@sheridanc.on.ca> <420122D6.30201@sheridanc.on.ca> <20050202193514.GB29652@pglaf.org> <42013639.5000100@sheridanc.on.ca> Message-ID: <5.2.0.9.0.20050315001811.03978ec0@baechler.net> Hi. Here is a partial, although not necessarily good or recommended, solution. You can get various older versions of the DEC-Talk software demo. They will work with text files or content pasted from the clipboard. Unfortunately the ones I know of require Windows. Also, there is a size limit on how much text it will process at once, but I don't know what it is. Another and probably better option is to get a free Linux text-to-speech system such as FreeTTS or Festival and use that. I know that FreeTTS can be downloaded at freetts.sf.net but I don't have links for anything else at the moment. Contact me if you need a link for the DEC-Talk demo and I'll find it. At 06:48 PM 3/13/2005 -0800, you wrote: >Please see the below - the question is, what >eBook readers with text to speech capabilities >can input a .txt file (versus .htm etc.) > >Please copy Donna Woodstock >or respond to her directly with any suggestions. Thanks! From schultzk at uni-trier.de Tue Mar 15 00:43:43 2005 From: schultzk at uni-trier.de (Keith J.Schultz) Date: Tue Mar 15 00:49:24 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: Hi Everybody, They are definitely using PG for what they can get. Their disclaimer is a good thing, but they are VIOLATING netiquette: to download their so-called free books they link to the PG site inside a frame, with their navbar at the bottom!! This should not be done!! They are not affiliated, and I doubt they have permission to use the PG site in such a manner. Furthermore, they are using PG resources to make money, that is: linking directly to the PG site and using PG disk space for their service. They should either fork over some money or link in a new window!! That is the way to do it!!! Just my 0 Euro cents worth: 2 Euro cents value added tax deducted Keith.
Am 14.03.2005 um 20:29 schrieb Joshua Hutchinson: > The site makes it very clear on the front page that they are not > affliated with PG, so I've got no problem with it. I'd prefer a > different domain name, but obviously we don't own the domain name, so > it was free for anyone to take, I suppose. At least this one is > actually something to do with eBooks and not a porn site or > something... :) > > Josh > > ----- Original Message ----- > From: "Marcello Perathoner" > To: "Project Gutenberg volunteer discussion" > Subject: [gutvol-d] Another PG 'clone' > Date: Mon, 14 Mar 2005 20:06:05 +0100 > >> >> How do we like this one? >> >> http://www.gutenberg.com >> >> >> -- Marcello Perathoner >> webmaster@gutenberg.org >> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d@lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From miranda_vandeheijning at blueyonder.co.uk Tue Mar 15 01:46:41 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Tue Mar 15 01:47:13 2005 Subject: [gutvol-d] Another PG 'clone' In-Reply-To: References: <20050314192908.1AF064F44F@ws6-5.us4.outblaze.com> Message-ID: <4236AF01.2030607@blueyonder.co.uk> I think we should be greatly suspicious of anyone who refers to George W. Bush as the leader of the free world, but that's an entirely different matter. All in all it looks like a bit of a crap website, from a creator who is too lazy to make his own eBooks or to do his own marketing. Keith J.Schultz wrote: > Hi Everybody, > > They are definately using PG for what they can get. Their > disclaimer is a good thing, > but they are VIOLATING Nettique: > To download there so-called free books they link up to the PG > site in a Frame, where > their navbar is at the bottom!! This should not be done !! As > they are not affilliated > not I doubt have permission to the PG site in such a manner. > Furthermore they are using > the PG resources to make money, that is: linking directly to > the PG site and using the PG Disk > space for their service. They should either fork over some > money or link to a new window!! > That is the way to do it !!! > > > Just my 0 Euro cents worth: 2 Euro cents added value tax deducted > > > Keith. > > Am 14.03.2005 um 20:29 schrieb Joshua Hutchinson: > >> The site makes it very clear on the front page that they are not >> affliated with PG, so I've got no problem with it. I'd prefer a >> different domain name, but obviously we don't own the domain name, so >> it was free for anyone to take, I suppose. At least this one is >> actually something to do with eBooks and not a porn site or >> something... :) >> >> Josh >> >> ----- Original Message ----- >> From: "Marcello Perathoner" >> To: "Project Gutenberg volunteer discussion" >> Subject: [gutvol-d] Another PG 'clone' >> Date: Mon, 14 Mar 2005 20:06:05 +0100 >> >>> >>> How do we like this one? 
>>> >>> http://www.gutenberg.com >>> >>> >>> -- Marcello Perathoner >>> webmaster@gutenberg.org >>> >>> _______________________________________________ >>> gutvol-d mailing list >>> gutvol-d@lists.pglaf.org >>> http://lists.pglaf.org/listinfo.cgi/gutvol-d >> >> >> _______________________________________________ >> gutvol-d mailing list >> gutvol-d@lists.pglaf.org >> http://lists.pglaf.org/listinfo.cgi/gutvol-d >> > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > From kouhia at nic.funet.fi Thu Mar 17 12:19:52 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Thu Mar 17 12:20:02 2005 Subject: [gutvol-d] Scanner vs. digital camera Message-ID: [ Continuing the thread under subject "wiki...". ] Scanners have pros but because they are dangerous to use, I would prefer digital camera. At least I got fed up to lifting up the book for changing pages. A 600 pages book is quite weighty. Another annoyance was that the scanner collected dust (both from book and from room). Scanner was also slow. A couple of days ago I borrowed a tourist range digital camera. I could digitize 8 pages per minute. It was as fast and easy as I had predicted. The digitization speed was limited only by image transfer technology, not by speed of my fingers. "Easy" is the keyword here. I have invented a couple of camera features which would help in book digitization. Anyone would know how to contact camera manufacturers? Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From hacker at gnu-designs.com Thu Mar 17 12:55:36 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Thu Mar 17 12:57:25 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: References: Message-ID: > I have invented a couple of camera features which would help in book > digitization. Anyone would know how to contact camera manufacturers? You've "invented" camera features? What hardware did you use when building these features into your camera? What camera model did you use as a base unit? David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From hart at pglaf.org Fri Mar 18 10:53:24 2005 From: hart at pglaf.org (Michael Hart) Date: Fri Mar 18 10:53:26 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: <20050301192815.88274.qmail@web52310.mail.yahoo.com> References: <20050301192815.88274.qmail@web52310.mail.yahoo.com> Message-ID: On Tue, 1 Mar 2005, maitri venkat-ramani wrote: > Technological progress reaching end users in developing countries makes > me so happy! They bear a lot of the brunt for our wellbeing. Is there > any way we can get PG books to this school and others like it? Do we > have any African contacts? I emailed my Africa contact from the UN, no reply. > > Thanks, > Maitri > > ============================================================ > > Kenyan school turns to handhelds > By Julian Siddle > BBC Go Digital > > At the Mbita Point primary school in western Kenya students click away > at a handheld computer with a stylus. > They are doing exercises in their school textbooks which have been > digitised. > > It is a pilot project run by EduVision, which is looking at ways to use > low cost computer systems to get up-to-date information to students who > are currently stuck with ancient textbooks. 
> [snip]
From joshua at hutchinson.net Fri Mar 18 12:56:06 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Fri Mar 18 12:56:15 2005 Subject: [gutvol-d] Online TEI page and the recent server move Message-ID: <20050318205606.EDA2A2FAB9@ws6-3.us4.outblaze.com> I know Marcello has been talking about some server moves that have been taking place recently. Starting this week, the online TEI conversion tools at: http://www.gutenberg.org/tei/services/tei-online have quit working. I thought at first it might be my local firewall or an overloaded server (it does cause problems once in a while). However, it has been down all week for me, so I'm starting to think it may be due to the server move. My question is: Is this the cause, and is this something that is going to get fixed in time? Should I just be patient? Also, as a sidenote question, since the PG server can get overwhelmed sometimes, would this be better served over on the DP server? (It seems to fit better with that workload anyway, at least in my opinion.) Josh From marcello at perathoner.de Fri Mar 18 12:10:44 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Mar 18 13:18:31 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: References: <20050301192815.88274.qmail@web52310.mail.yahoo.com> Message-ID: <423B35C4.90607@perathoner.de> Michael Hart wrote: >> At the Mbita Point primary school in western Kenya students click away >> at a handheld computer with a stylus. >> They are doing exercises in their school textbooks which have been >> digitised. >> >> It is a pilot project run by EduVision, which is looking at ways to use >> low cost computer systems to get up-to-date information to students who >> are currently stuck with ancient textbooks. >> >> Matthew Herren from EduVision told the BBC programme Go Digital how the >> non-governmental organisation uses a combination of satellite radio and >> handheld computers called E-slates. Do we want African nations to get into an educational dependency on satellite links and such high-tech stuff? Maybe textbooks are just right for these students. A textbook will not need a new battery pack in a couple of years. It will not stop working if the school can't get new battery packs because the publicity value of the project has died away. Reminds me very much of the shipping of wheat into nations that are used to eating maize. Ship free wheat, thus ruin the local industry that produces cheap maize, then ship pricey wheat. >> "Why in this age when most people do most research using the internet >> are students still using textbooks? The fact that we are doing this in >> a rural developing country is very exciting - as they need it most." And -- as a side effect -- maximizes the publicity Return On Investment. -- Marcello Perathoner webmaster@gutenberg.org From hart at pglaf.org Sat Mar 19 12:51:45 2005 From: hart at pglaf.org (Michael Hart) Date: Sat Mar 19 12:51:47 2005 Subject: [gutvol-d] Kenyan school turns to handhelds In-Reply-To: <423B35C4.90607@perathoner.de> References: <20050301192815.88274.qmail@web52310.mail.yahoo.com> <423B35C4.90607@perathoner.de> Message-ID: On Fri, 18 Mar 2005, Marcello Perathoner wrote: > Michael Hart wrote: > >>> At the Mbita Point primary school in western Kenya students click away >>> at a handheld computer with a stylus.
>>> They are doing exercises in their school textbooks which have been >>> digitised. >>> >>> It is a pilot project run by EduVision, which is looking at ways to use >>> low cost computer systems to get up-to-date information to students who >>> are currently stuck with ancient textbooks. >>> >>> Matthew Herren from EduVision told the BBC programme Go Digital how the >>> non-governmental organisation uses a combination of satellite radio and >>> handheld computers called E-slates. > > Do we want African nations to get into an educational dependency on > satellite links and such high-tech stuff? Maybe textbooks are just right for > these students. A textbook will not need a new battery pack in a couple of > years. It will not stop working if the school can't get new battery packs > because the publicity value of the project has died away. Personally, I think cell phones have already made the satellites obsolete for distributing eBooks. Africa has the fastest growing cell phone base in the world. > Reminds me very much of the shipping of wheat into nations that are used to > eating maize. Ship free wheat, thus ruin the local industry that produces cheap > maize, then ship pricey wheat. Sounds like something the World Bank or International Monetary Fund would do. >>> "Why in this age when most people do most research using the internet >>> are students still using textbooks? The fact that we are doing this in >>> a rural developing country is very exciting - as they need it most." > > And -- as a side effect -- maximizes the publicity Return On Investment. As long as anyone can send their own eBooks, things should be ok, but that requires freedom of expression. . . . On the other hand, it's harder to get rid of an eBook, once published, than the paper editions. Michael From marcello at perathoner.de Sat Mar 19 17:52:08 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat Mar 19 17:51:53 2005 Subject: [gutvol-d] Online TEI page and the recent server move In-Reply-To: <20050318205606.EDA2A2FAB9@ws6-3.us4.outblaze.com> References: <20050318205606.EDA2A2FAB9@ws6-3.us4.outblaze.com> Message-ID: <423CD748.1080707@perathoner.de> Joshua Hutchinson wrote: > I know Marcello has been talking about some server moves that have > been taking place recently. > > Starting this week, the online TEI conversion tools at: > > http://www.gutenberg.org/tei/services/tei-online > > have quit working. I thought at first it might be my local firewall > or an overloaded server (it does cause problems once in a while). > However, it has been down all week for me, so I'm starting to think > it may be due to the server move. Many small things stopped working with the recent file server move, the online TEI conversion being one of them. I will try to fix them when they come to my notice. The TEI conversion should now be online again. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Tue Mar 22 13:44:33 2005 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue Mar 22 13:44:15 2005 Subject: [gutvol-d] Slashdot on Google Print Message-ID: <424091C1.4030003@perathoner.de> Discussion about Google Print mentions PG too. http://slashdot.org/articles/05/03/21/1237243.shtml -- Marcello Perathoner webmaster@gutenberg.org From JBuck814366460 at aol.com Tue Mar 22 13:59:56 2005 From: JBuck814366460 at aol.com (Jared Buck) Date: Tue Mar 22 14:00:07 2005 Subject: [gutvol-d] Spam on PG lists?
In-Reply-To: <424091C1.4030003@perathoner.de> References: <424091C1.4030003@perathoner.de> Message-ID: <4240955B.3020103@aol.com> Hey, Is it me, or are we getting a lot of spam on a lot of the PG (and PGLAF) lists? We need to stop the spam coming, or the list is going to get overwhelmed before we know it. I've already banned receiving messages from known spammers on the list, it may help, or it may not. Jared From servalan at ar.com.au Tue Mar 22 14:07:02 2005 From: servalan at ar.com.au (Pauline) Date: Tue Mar 22 14:07:42 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <4240955B.3020103@aol.com> References: <424091C1.4030003@perathoner.de> <4240955B.3020103@aol.com> Message-ID: <42409706.3080609@ar.com.au> Jared Buck wrote: > Hey, > > Is it me, or are we getting a lot of spam on a lot of the PG (and PGLAF) > lists? We need to stop the spam coming, or the list is going to get > overwhelmed before we know it. It's not just you. I'm a little annoyed as the DP posts email address which is supposed to be used only by our volunteers to notify the site admins of posted projects is available via a google search of the PG mailing list archives (& only from there) & is getting spam. I doubt it is fully fixable now - but it would be great if the PG mailman archives can be protected from future email address harvesters. I suspect other volunteers are also receiving spam via this path. Thanks, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From hacker at gnu-designs.com Tue Mar 22 14:08:19 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Tue Mar 22 14:09:46 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <4240955B.3020103@aol.com> References: <424091C1.4030003@perathoner.de> <4240955B.3020103@aol.com> Message-ID: > Is it me, or are we getting a lot of spam on a lot of the PG (and > PGLAF) lists? We need to stop the spam coming, or the list is going > to get overwhelmed before we know it. Isn't the list open to subscribers-only? If not, I suggest moving it to that model. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From JBuck814366460 at aol.com Tue Mar 22 14:27:20 2005 From: JBuck814366460 at aol.com (JBuck814366460@aol.com) Date: Tue Mar 22 14:27:36 2005 Subject: [gutvol-d] Spam on PG lists? Message-ID: > Isn't the list open to subscribers-only? If not, I suggest > moving it to that model. I agree, if it isn't subscriber-only, it should be as soon as possible. The spam is very annoying and doesn't belong on the list. Jared -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050322/087de180/attachment.html From hacker at gnu-designs.com Tue Mar 22 14:55:18 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Tue Mar 22 14:56:47 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: Message-ID: > > Isn't the list open to subscribers-only? If not, I suggest moving > > it to that model. > I agree, if it isn't subscriber-only, it should be as soon as > possible. The spam is very annoying and doesn't belong on the list. Honestly, I haven't seen a single spam on either list since I've been a subscriber (a year?). Then again, I run dspam on my MTA, and its probably catching and quarantining them so I never even see them. David A. 
Desrosiers desrod@gnu-designs.com http://gnu-designs.com From servalan at ar.com.au Tue Mar 22 15:03:44 2005 From: servalan at ar.com.au (Pauline) Date: Tue Mar 22 15:04:24 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: Message-ID: <4240A450.6040207@ar.com.au> David A. Desrosiers wrote: > Honestly, I haven't seen a single spam on either list since > I've been a subscriber (a year?). Then again, I run dspam on my MTA, > and its probably catching and quarantining them so I never even see > them. From my quick peek - it's only the posted list (posted@pglaf.org) archive which is visible. So anyone submitting projects to PG will have a visible email address to email harvesters. The gutvol* lists are OK. I hope this helps, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan From tb at baechler.net Tue Mar 22 23:42:11 2005 From: tb at baechler.net (Tony Baechler) Date: Tue Mar 22 23:40:25 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: Message-ID: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> Hello. What is dspam? How hard is it to set up? Is it similar to Spam Assassin? I'm running qmail under Linux and had an extremely hard time setting up spam filtering, so I eventually gave up. I have not heard of that antispam package before. More information would be appreciated. Thanks. To stay on topic, I have received no spam from the pglaf.org lists and I do not run a spam filter locally. From gbnewby at pglaf.org Wed Mar 23 11:22:52 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Wed Mar 23 11:22:53 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> Message-ID: <20050323192252.GA564@pglaf.org> On Tue, Mar 22, 2005 at 11:42:11PM -0800, Tony Baechler wrote: > Hello. What is dspam? How hard is it to set up? Is it similar to Spam > Assassin? I'm running qmail under Linux and had an extremely hard time > setting up spam filtering, so I eventually gave up. I have not heard of > that antispam package before. More information would be appreciated. I did a very informal comparison of dspam to Spam Assassin, and found them to be about the same. They have some different features, but basically both "learn" based on your mail patterns. dspam takes a little longer to get trained, and is tuned to have a very low portion of false positives (that is, it very seldom flags non-spam as spam). With any spam filter, though, it's important to periodically check the logs or spam folders, to see what messages were misidentified as spam. > To stay on topic, I have received no spam from the pglaf.org lists and I do > not run a spam filter locally. If people could forward spam items to me that were distributed via the lists.pglaf.org server, I can look into how they got to the list. I'll also look into obfuscating email addresses in the logs (via transforming the @ or similar techniques). This is sometimes done automatically with Pipermail (which manages our Mailman archives, I believe), but doesn't seem to be happening. Sorry about that.... I'm still looking for a volunteer to manage the mailing lists, by the way. It takes just a few minutes per day (every day). 
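A minimal sketch of the "transforming the @" idea mentioned above, assuming the archive pages are plain HTML or text files that can be rewritten in place; the regular expression and the "user at host" rewrite are illustrative choices, not Pipermail's actual mechanism:

import re

ADDRESS = re.compile(r'\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b')

def obfuscate(text):
    # Rewrite user@host as "user at host" so simple address harvesters miss it.
    return ADDRESS.sub(r'\1 at \2', text)

# Example: obfuscate("Send completed projects to posted@pglaf.org")
# returns "Send completed projects to posted at pglaf.org"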
-- Greg From traverso at dm.unipi.it Wed Mar 23 11:43:35 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Wed Mar 23 11:43:19 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> (message from Greg Newby on Wed, 23 Mar 2005 11:22:52 -0800) References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <200503231943.j2NJhZa05915@pico.dm.unipi.it> I don't filter the lists, (I apply the filters after accepting pglaf lists) and I don't receive any spam on the lists (a lot outside). Consider the possibility of forged sender address. Carlo From mattsen at arvig.net Wed Mar 23 12:35:57 2005 From: mattsen at arvig.net (Chuck MATTSEN) Date: Wed Mar 23 12:36:08 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <20050323143557.198b0fb5@localhost.localdomain> On Wed, 23 Mar 2005 11:22:52 -0800 Greg Newby typed: > On Tue, Mar 22, 2005 at 11:42:11PM -0800, Tony Baechler wrote: > > Hello. What is dspam? How hard is it to set up? Is it similar to > > Spam Assassin? I'm running qmail under Linux and had an extremely > > hard time setting up spam filtering, so I eventually gave up. I > > have not heard of that antispam package before. More information > > would be appreciated. > > I did a very informal comparison of dspam to Spam Assassin, and found > them to be about the same. They have some different features, but > basically both "learn" based on your mail patterns. dspam takes a > little longer to get trained, and is tuned to have a very low portion > of false positives (that is, it very seldom flags non-spam as spam). > With any spam filter, though, it's important to periodically check > the logs or spam folders, to see what messages were misidentified as > spam. Another alternative tool is POPFile (or any of the other Bayesian filters) ... http://popfile.sourceforge.net/ ... also free, open source, cross-platform. It has the advantage of being very fast in its processing of incoming mail (POP3 included), and it "learns" very quickly what the user considers spam and "not spam" ... actually, one could set up any number of different categories and, with time, it would learn to sort things however one wished. I get about 10,000 e- mails per months and POPFile has been running at about 99.81% accuracy for me with respect to false-positives, etc. > > To stay on topic, I have received no spam from the pglaf.org lists > > and I do not run a spam filter locally. Nor have I received any.... -- Chuck MATTSEN / mattsen at arvig dot net / Mahnomen, MN, USA Mandrakelinux release 10.2 (Cooker) for i586 kernel 2.6.10-3.mm.5mdk RLU #346519 / MT Lookup: http://eot.com/~mattsen/mtsearch.htm Random Thought/Quote for this Message: From listening comes wisdom, from speaking, repentance. From JBuck814366460 at aol.com Wed Mar 23 15:35:58 2005 From: JBuck814366460 at aol.com (Jared Buck) Date: Wed Mar 23 15:36:13 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <4241FD5E.4070500@aol.com> Hi Greg, Sure, I wouldn't mind managing the lists for a couple minutes a day. I can't promise it will be as soon as I get up (I tend to sleep more than the average person) but it will be once a day. 
I'll forward you copies of the spam I'm getting on the list as I receive them, then you can figure out how to ban the senders' IPs to keep that mail from getting on the list and interfering with perfectly good discussions. Jared From hacker at gnu-designs.com Wed Mar 23 16:08:44 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Wed Mar 23 16:10:08 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <20050323192252.GA564@pglaf.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: > I did a very informal comparison of dspam to Spam Assassin, and > found them to be about the same. They are so dramatically different, I can't believe you even would suggest they're "about the same". SpamAssassin is written in Perl, and is significantly slower than dspam. SpamAssassin also relies on static rulesets, not the "quality" of the mail received. You can't do per-user filtering with SA. With dspam, if one user prefers seeing lots of HTML advertisements, they can. Another user on the same system can reject those as spam. In my case, I was using SpamAssassin for about 2 years, trained down to a threshhold of 2, with 13 RBLs in place, and my users were still getting 20-30 spams per-week. SpamAssassin's accuracy under that configuration after 2 years was about 90%. In 1 month of using dspam, we were over 98% accuracy, AND I no longer had to manage mail. The users get their own quarantine and they can manage their own mail "quality" themselves, I don't _ever_ have to get involved. > They have some different features, but basically both "learn" based > on your mail patterns. dspam takes a little longer to get trained, > and is tuned to have a very low portion of false positives (that is, > it very seldom flags non-spam as spam). You probably didn't read the docs. Did you load it with the SA corpus first? Did you train it with that corpus? It took about an hour for me to train it to a level where it was accurately catching and quarantining mail. Getting dspam configured properly is no small task, and you have to be _very_ careful about using conflicting algorithms when you configure and build it. Also, were you using TOE? TEFT? TUM? Each of these has VERY different usages and specific conditions where they work well, or horrible. > With any spam filter, though, it's important to periodically check > the logs or spam folders, to see what messages were misidentified as > spam. And with dspam, this is all handled completely seamlessly, no need to "check logs" or "spam folders" at all. Users simply forward their false positives to spam-$USER@domain.com, and it gets marked as spam. When more emails come in that match similar tokens, those are marked as spam also. > I'm still looking for a volunteer to manage the mailing lists, by > the way. It takes just a few minutes per day (every day). I host quite a few mailing lists here for SourceFubar.Net, and I'd be happy to take over management of the lists for you, if you wish. We don't have any spam on the lists we host, and everything works as it should. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From sly at victoria.tc.ca Wed Mar 23 21:00:37 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Mar 23 21:00:54 2005 Subject: [gutvol-d] Humanities Computing conference Message-ID: I'm looking for feedback from other PG volunteers. 
There will be a four-day "humanities computing Summer Institute" taking place in my city in June, as described here: http://web.uvic.ca/hrd/institute/ It looks as if its main focus will be on digitizing texts using the tei dtd. Any ideas on how worthwhile it would be participating in this? Andrew From felix.klee at inka.de Thu Mar 24 03:16:04 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Thu Mar 24 03:17:13 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: References: Message-ID: <87sm2lz43v.wl%felix.klee@inka.de> At Thu, 17 Mar 2005 22:19:52 +0200, Juhana Sadeharju wrote: > A couple of days ago I borrowed a tourist range digital camera. I > could digitize 8 pages per minute. It was as fast and easy as I had > predicted. The digitization speed was limited only by image transfer > technology, not by speed of my fingers. "Easy" is the keyword here. How did OCR'ing go? I wonder because the resolution of cheap digital cameras is quite low for scanning. For example, to scan an A4 page (aspect ratio: sqrt(2)) with a usual digital camera (aspect ratio of images: 4:3) in 300DPI, you need a camera with more than nine mega-pixels. -- Felix E. Klee From bruce at zuhause.org Thu Mar 24 07:54:20 2005 From: bruce at zuhause.org (Bruce Albrecht) Date: Thu Mar 24 07:54:25 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> Message-ID: <16962.58028.410879.644627@celery.zuhause.org> David A. Desrosiers writes: > > > I did a very informal comparison of dspam to Spam Assassin, and > > found them to be about the same. > > They are so dramatically different, I can't believe you even > would suggest they're "about the same". > > SpamAssassin is written in Perl, and is significantly slower > than dspam. SpamAssassin also relies on static rulesets, not the > "quality" of the mail received. You can't do per-user filtering with > SA. With dspam, if one user prefers seeing lots of HTML > advertisements, they can. Another user on the same system can reject > those as spam. I don't want this to turn this mailing list into a dspam vs Spam Assassin war, but I think your information about SA is out of date. SA v3 supports multi-tiered (e.g., global, domain, user) configurations, and has bayesian filtering as one of several rules for determining spam. I'd also like to point out that being written in Perl does not imply that something is always much slower than C, especially when large amounts of regular expression pattern matching is involved. Perl developers have spent a lot of time optimizing its pattern matching. The SA Wiki suggests that if you find that SA is slow, you should examine the rule set you're using, and disable inappropriate rules (for example, ones requiring DNS lookups). Bruce From hacker at gnu-designs.com Thu Mar 24 08:15:13 2005 From: hacker at gnu-designs.com (David A. Desrosiers) Date: Thu Mar 24 08:17:02 2005 Subject: [gutvol-d] Spam on PG lists? In-Reply-To: <16962.58028.410879.644627@celery.zuhause.org> References: <5.2.0.9.0.20050322234007.035a8ba0@baechler.net> <20050323192252.GA564@pglaf.org> <16962.58028.410879.644627@celery.zuhause.org> Message-ID: > I don't want this to turn this mailing list into a dspam vs Spam > Assassin war, but I think your information about SA is out of date. You're right, my information is a bit out of date, dspam is quite a bit ahead of SA now, further than I originally surmised (see further down). 
But I agree, let's not turn this into a religious war. > SA v3 supports multi-tiered (e.g., global, domain, user) > configurations, and has bayesian filtering as one of several rules > for determining spam. Does SA support allowing the user to configure their own mail preferences via a simple web interface? Does it support adding and revoking tokens by simply sending the false-positives back through email, without involving a mail administrator? Sure, those things can be written, but do they come as part of the core package? Does that capability exist in the base engine? Incidentally, dspam supports the following, out of the box: - Bayesian filtering - Graham Bayes - Burton Bayes - Noise Reduction - Robinson Geometric Mean calculation - Fisher-Robinson Inverse Chi-Square calculation - Robinson Combined P-Values - Chained Tokens - Neural Networking - Message Innoculation ..and quite a bit more for filtering mail. Does SpamAssassin v3? I'm glad that SA is now beginning to incorporate some of these things now, and they've got a good base project to learn from. I've been very disappointed with SA, and dspam has already trounced it in our case, so we have no need to de-evolve to something that doesn't suit our needs. Less than 10 spam messages total in any user's mailbox in over a year now (that we've been told about), and only a small handful of innocent messages were caught as spam, but were really ham. With the web interface, the user just sends them on to their normal account, and dspam scores them lower, so future versions aren't caught. Works great, and I don't have to be involved in the mail management process _at all_ anymore. > I'd also like to point out that being written in Perl does not imply > that something is always much slower than C, especially when large > amounts of regular expression pattern matching is involved. True, poorly-written C can definately be worse than Perl, but well-written C is ALWAYS going to be faster than equivalently written Perl. I don't think I've ever seen SA process 100 messages/sec., but dspam has no problem doing the same thing, every day. > Perl developers have spent a lot of time optimizing its pattern > matching. The SA Wiki suggests that if you find that SA is slow, you > should examine the rule set you're using, and disable inappropriate > rules (for example, ones requiring DNS lookups). You're preaching to the choir here, I'm a very heavy user and supporter of Perl, and I use it for 99% of my tasks... but there are some cases where an interpreted language just can't compete with a natively-compiled object code. Anyway, good discussions all around. Use whatever tool fits your needs. In my case (heavy mail use from very disparate sources), dspam easily beat what SA could do, hands-down in terms of quality and speed and flexibility. The added benefit is that now I don't have to micro-manage mail, whitelists, or rulesets anymore. David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com From grythumn at gmail.com Thu Mar 24 09:19:18 2005 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu Mar 24 09:19:27 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: <87sm2lz43v.wl%felix.klee@inka.de> References: <87sm2lz43v.wl%felix.klee@inka.de> Message-ID: <15cfa2a5050324091950b5eab3@mail.gmail.com> On Thu, 24 Mar 2005 12:16:04 +0100, Felix E. Klee wrote: > How did OCR'ing go? I wonder because the resolution of cheap digital > cameras is quite low for scanning. 
For example, to scan an A4 page > (aspect ratio: sqrt(2)) with a usual digital camera (aspect ratio of > images: 4:3) in 300DPI, you need a camera with more than nine > mega-pixels. Let's try something more realistic. Typical book size that I scan is under 8.5x11". Typical page is about 8.5x5.5"; typical text area is 6.5x4" to 7x4.5". So if focused solely on the text area, one would need about 2.2-2.8 megapixels / page, or for a full page impression, about 4.2. Most books do not lie flat enough to get two full page scans from straight up; you're better off doing each page at a time. So a 4 MP camera, with good optical zoom / focus, should be fine. This won't be cheap, but it's not in the same realm as a 9 MP camera. R C From jenzed at gmail.com Thu Mar 24 09:21:25 2005 From: jenzed at gmail.com (Jen Zed) Date: Thu Mar 24 09:21:33 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: References: Message-ID: <7d5745970503240921d601da@mail.gmail.com> The relevance of the workshops and conference depend mostly on what James has planned for the UniBook back-end. James, are you planning to implement TEI / XSL / FO? (Actually, any info about UniBook would be really useful to me, as I've started to think about the site front-end, but can't go very far unless I know what the back-end looks like.) At work, I'm doing a DocBook XSL implementation right now. The issues are similar enough that I might be able to swing a seminar and conference attendance on the company's tab. (DocBook is like TEI, only it's optimized for generating printed reference books.) Too bad we don't have a little pot of money we could use to send people to events like these. Can I hope (request) that getting our non-profit status established is on the agenda for the upcoming meeting in Toronto? jen. On Wed, 23 Mar 2005 21:00:37 -0800 (PST), Andrew Sly wrote: > I'm looking for feedback from other PG volunteers. > > There will be a four-day "humanities computing Summer Institute" > taking place in my city in June, as described here: > > http://web.uvic.ca/hrd/institute/ > > It looks as if its main focus will be on digitizing > texts using the tei dtd. > > Any ideas on how worthwhile it would be participating > in this? > > Andrew > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From sly at victoria.tc.ca Thu Mar 24 10:17:18 2005 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu Mar 24 10:17:25 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: <7d5745970503240921d601da@mail.gmail.com> References: <7d5745970503240921d601da@mail.gmail.com> Message-ID: Just to avoid confusing other PG volunteers too much, I'll state that most of Jen's message was regarding issues for the slowly emerging PG Canada. Any general feedback on the value of a conference such as I mentioned would still be welcome... Andrew On Thu, 24 Mar 2005, Jen Zed wrote: > The relevance of the workshops and conference depend mostly on what [snip] From Bowerbird at aol.com Thu Mar 24 10:34:13 2005 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Mar 24 10:34:28 2005 Subject: [gutvol-d] Humanities Computing conference Message-ID: <1d5.38d9b6e5.2f746225@aol.com> andrew said: > Just to avoid confusing other PG volunteers too much wouldn't want anyone to be confused... as for the conference, i say go for it. it would be nice if _someone_ here could answer questions about t.e.i. 
-bowerbird From felix.klee at inka.de Thu Mar 24 10:41:34 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Thu Mar 24 10:42:27 2005 Subject: [gutvol-d] Scanner vs. digital camera In-Reply-To: <15cfa2a5050324091950b5eab3@mail.gmail.com> References: <87sm2lz43v.wl%felix.klee@inka.de> <15cfa2a5050324091950b5eab3@mail.gmail.com> Message-ID: <87mzsszy1t.wl%felix.klee@inka.de> At Thu, 24 Mar 2005 12:19:18 -0500, Robert Cicconetti wrote: > > How did OCR'ing go? I wonder because the resolution of cheap > > digital cameras is quite low for scanning. For example, to scan an > > A4 page (aspect ratio: sqrt(2)) with a usual digital camera (aspect > > ratio of images: 4:3) in 300DPI, you need a camera with more than > > nine mega-pixels. > > Let's try something more realistic. Admittedly, for most book scanning tasks the requirements are not as high as I illustrated. However, a simple camera wouldn't fit the need of people that frequently have to create quality scans of pages whose size is around A4 (I'm one of these people). IOW: An ordinary flatbed scanner is probably still the best and cheapest solution for most people. A dream for scanning books, of course, is the BookEye series of scanners that one can sometimes find in some public libraries. -- Felix E. Klee From jlinden at projectgutenberg.ca Thu Mar 24 11:18:40 2005 From: jlinden at projectgutenberg.ca (James Linden) Date: Thu Mar 24 11:22:55 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: <7d5745970503240921d601da@mail.gmail.com> References: <7d5745970503240921d601da@mail.gmail.com> Message-ID: <42431290.4090406@projectgutenberg.ca> Jen Zed wrote: > The relevance of the workshops and conference depend mostly on what > James has planned for the UniBook back-end. James, are you planning to > implement TEI / XSL / FO? TEI will be implemented as an input/output format, yes. It will have nothing to do with the internal workings of the system. XSL isn't needed - the application doesn't rely on transformations of any kind. > (Actually, any info about UniBook would be > really useful to me, as I've started to think about the site > front-end, but can't go very far unless I know what the back-end looks > like.) I'm still working on the tech docs -- the 6 pages of docs that we put on the wiki took me almost two weeks -- tech docs are going to be about 6 pages - per section! > At work, I'm doing a DocBook XSL implementation right now. The issues > are similar enough that I might be able to swing a seminar and > conference attendance on the company's tab. (DocBook is like TEI, only > it's optimized for generating printed reference books.) My demo app (on ibiblio) has an experimental docbook output... when it comes time, I know who I'm going to ask for help to implement that module. :-) > Too bad we don't have a little pot of money we could use to send > people to events like these. Can I hope (request) that getting our > non-profit status established is on the agenda for the upcoming > meeting in Toronto? Meeting in Toronto? What meeting? -- James From jenzed at gmail.com Thu Mar 24 13:06:40 2005 From: jenzed at gmail.com (Jen Zed) Date: Thu Mar 24 13:06:49 2005 Subject: [gutvol-d] Humanities Computing conference In-Reply-To: References: <7d5745970503240921d601da@mail.gmail.com> Message-ID: <7d57459705032413062adba9f9@mail.gmail.com> My apologies, I didn't notice that Andrew's original post was on the PG (as opposed to the PG Canada) list. jen. 
On Thu, 24 Mar 2005 10:17:18 -0800 (PST), Andrew Sly wrote: > > > Just to avoid confusing other PG volunteers too much, I'll state > that most of Jen's message was regarding issues for the slowly > emerging PG Canada. > > Any general feedback on the value of a conference such as I > mentioned would still be welcome... > > Andrew > > On Thu, 24 Mar 2005, Jen Zed wrote: > > > The relevance of the workshops and conference depend mostly on what > > [snip] > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From webmaster at gutenberg.org Sun Mar 27 09:04:12 2005 From: webmaster at gutenberg.org (Marcello Perathoner) Date: Sun Mar 27 09:03:51 2005 Subject: [gutvol-d] [Fwd: Thought on http://www.gutenberg.org/faq/C-18.php] Message-ID: <4246E78C.7010503@gutenberg.org> -------- Original Message -------- Subject: Thought on http://www.gutenberg.org/faq/C-18.php Date: Sun, 27 Mar 2005 17:24:47 +0100 (BST) From: Nick Burch To: webmaster@gutenberg.org Hi I'm not sure if you're the right person on the guttenberg team to send this to, but hopefully if not you're close. I happened across http://www.gutenberg.org/faq/C-18.php from a discussion on slashdot, and I had a thought that there is something you can try. It should be possible to use a copyright repository to prove a book is out of copyright, without having to use the old (and hard to find) edition. I live in Oxford, and we have one of the UK's three copyright repositories, in the form of the Bodleian library. Most people can get temporary access to it for research, and I believe the same is true of the other two libraries. These libraries hold most books published in the UK. So, the steps for a book which you think is out of copyright would be: 1) Get a copy of the new version of the book 2) Find your nearest copyright library 3) Check to see if they have a copy of an older version - the search of the collection should be available online, eg http://library.ox.ac.uk/ 4) Arrange temporary membership of the library 5) Turn up, request the book, and go away for a few hours while someone retrieves it from the stack (most books aren't on open shelves) 6) Compare lots of pages to ensure the text is the same 7) Photocopy a few pages (including the copyright info to be sure) 8) Head home, and set to work on the new version I hope the above makes some sense, and might be of use Nick -- Marcello Perathoner webmaster@gutenberg.org From gbnewby at pglaf.org Sun Mar 27 10:07:40 2005 From: gbnewby at pglaf.org (Greg Newby) Date: Sun Mar 27 10:07:41 2005 Subject: [gutvol-d] Re: [Fwd: Thought on http://www.gutenberg.org/faq/C-18.php] In-Reply-To: <4246E78C.7010503@gutenberg.org> References: <4246E78C.7010503@gutenberg.org> Message-ID: <20050327180740.GA25403@pglaf.org> On Sun, Mar 27, 2005 at 07:04:12PM +0200, Marcello Perathoner wrote: > > > -------- Original Message -------- > Subject: Thought on http://www.gutenberg.org/faq/C-18.php > Date: Sun, 27 Mar 2005 17:24:47 +0100 (BST) > From: Nick Burch > To: webmaster@gutenberg.org > > Hi > > I'm not sure if you're the right person on the guttenberg team to send > this to, but hopefully if not you're close. > > I happened across http://www.gutenberg.org/faq/C-18.php from a discussion > on slashdot, and I had a thought that there is something you can try. It > should be possible to use a copyright repository to prove a book is out of > copyright, without having to use the old (and hard to find) edition. Hi, Nick. 
Thanks for your suggestion. In fact, this is our procedure. We should probably mention it more prominently in our FAQ & Copyright HOWTO. -- Greg > > I live in Oxford, and we have one of the UK's three copyright > repositories, in the form of the Bodleian library. Most people can get > temporary access to it for research, and I believe the same is true of the > other two libraries. These libraries hold most books published in the UK. > > So, the steps for a book which you think is out of copyright would be: > 1) Get a copy of the new version of the book > 2) Find your nearest copyright library > 3) Check to see if they have a copy of an older version - the search of > the collection should be available online, eg http://library.ox.ac.uk/ > 4) Arrange temporary membership of the library > 5) Turn up, request the book, and go away for a few hours while someone > retrieves it from the stack (most books aren't on open shelves) > 6) Compare lots of pages to ensure the text is the same > 7) Photocopy a few pages (including the copyright info to be sure) > 8) Head home, and set to work on the new version > > > I hope the above makes some sense, and might be of use > > Nick > > > > > -- > Marcello Perathoner > webmaster@gutenberg.org From kouhia at nic.funet.fi Tue Mar 29 08:38:17 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Tue Mar 29 08:38:28 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera Message-ID: >From: "David A. Desrosiers" > > You've "invented" camera features? What hardware did you use >when building these features into your camera? What camera model did >you use as a base unit? Inventions nor patentions require physical hardware because the inventions can be readily described in the text. Patent office does not require inventors to send the hardware to them, anymore. I invented, but I have not patented. It basically does not matter which camera gets the features first, but I favor Canon EOS 300D, Nikon D70, and equivalent competitors. I'm curious why you were not interested in the features itself. They are basically public domain, but manufacturers could be interested in them more, if such features appears first in their camera. The competition is now on the camera features. Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From felix.klee at inka.de Tue Mar 29 14:04:01 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Tue Mar 29 14:04:14 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera In-Reply-To: References: Message-ID: <87is3a5ctq.wl%felix.klee@inka.de> At Tue, 29 Mar 2005 19:38:17 +0300, Juhana Sadeharju wrote: > I'm curious why you were not interested in the features itself. Now I'm curious: Could you tell us about the features? ... especially since I think that hardware features are probably not needed that much: Software can automatically detect page borders and correct distortions. As an example have a look at the Bookeye software: It has a crappy user interface but mostly it does a good job. To improve automatic detection of distortions it may be interesting to experiment with generation and interpretation of stereo photos of book pages, but that's probably overkill. -- Felix E. Klee From joshua at hutchinson.net Tue Mar 29 14:17:48 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Tue Mar 29 14:17:57 2005 Subject: [gutvol-d] Re: Scanner vs. 
From joshua at hutchinson.net Tue Mar 29 14:17:48 2005 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Tue Mar 29 14:17:57 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera Message-ID: <20050329221748.6C8751099C0@ws6-4.us4.outblaze.com> I think the original poster was sarcastically making fun of the notion that "invention" is simply a matter of coming up with an original idea. While current patent practice seems to support that view, it is ridiculous to most people. Hence the saying, "Invention is 1% inspiration, 99% perspiration." In other words, just coming up with an idea is the easy part. Josh ----- Original Message ----- From: "Juhana Sadeharju" To: gutvol-d@lists.pglaf.org Subject: [gutvol-d] Re: Scanner vs. digital camera Date: Tue, 29 Mar 2005 19:38:17 +0300 > > > > From: "David A. Desrosiers" > > > > You've "invented" camera features? What hardware did you use when building > > these features into your camera? What camera model did you use as a base > > unit? > > Neither inventions nor patents require physical hardware, because the > inventions can be readily described in text. The patent office > does not require inventors to send in the hardware anymore. > I invented, but I have not patented. It basically does not matter > which camera gets the features first, but I favor the Canon EOS 300D, > Nikon D70, and equivalent competitors. > > I'm curious why you were not interested in the features themselves. > They are basically public domain, but manufacturers could be > more interested in them if such features appear first in > their camera. The competition is now on camera features. > > Juhana > -- > http://music.columbia.edu/mailman/listinfo/linux-graphics-dev > for developers of open source graphics software > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From felix.klee at inka.de Wed Mar 30 02:13:43 2005 From: felix.klee at inka.de (Felix E. Klee) Date: Wed Mar 30 02:14:07 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera In-Reply-To: <20050329221748.6C8751099C0@ws6-4.us4.outblaze.com> References: <20050329221748.6C8751099C0@ws6-4.us4.outblaze.com> Message-ID: <87oed14f1k.wl%felix.klee@inka.de> At Tue, 29 Mar 2005 17:17:48 -0500, Joshua Hutchinson wrote: > In other words, just coming up with an idea is the easy part. Certainly. AFAIK, Switzerland was one of the last countries that required you to hand in working prototypes of devices to be patented. That requirement was overturned by Germany threatening to increase customs duties [1]. Then there's the upcoming threat of software patents - I've been active in that area for quite some time already. Nevertheless Joshua might have some good ideas concerning camera design that he wants to share with us. Seems like I escaped his subtle sarcasm. [1] http://www.sffo.de/machlup1.htm -- Felix E. Klee From nwolcott at dsdial.net Wed Mar 30 05:29:50 2005 From: nwolcott at dsdial.net (N Wolcott) Date: Wed Mar 30 05:30:14 2005 Subject: [gutvol-d] More PG spam being spread around Message-ID: <000a01c5352c$9790a780$0c9495ce@gw98> Resellers of PG books have taken on a new target, Lulu.com. Lulu offers POD publishing at zero up-front cost, thus luring those looking for free advertising for their spam. The postings I have seen so far both imply that PG and Lulu are supporting their spam. They advertise the quality of their texts as being from PG. One of them admits there may be errors. There is probably nothing for PG to do except to get Lulu to take the PG name off their customers' postings. If they want to host 15,000 books on their computers for free, that is their business.
I quote my post to the Lulu forum. I have posted 2 books to Lulu at a 15 cent royalty, with added content to the PG text, and I do not mention PG in the blurb. My "quality" book may soon be submerged in a flood of Lulu spam. Posting follows: ------------------------- Lulu offers a good service for self-publishers who provide "content added" material. This allows the publisher to continually upgrade the product until it is in final form and then market it through Lulu's various mechanisms. However, recently public domain texts lifted from Project Gutenberg have been appearing on Lulu. The accompanying blurb states that www.lulu.com and Project Gutenberg have joined forces to offer you these long out-of-print books. The implication is that somehow Lulu and PG are supporting this effort. PG is trademarked and there is no right to use the name in advertising; enforcing the trademark is another thing, however, for an all-volunteer organization. Software exists to move PG texts to a number of formats (iPod, ebook, etc.), including Lulu. So there is a real possibility that most of the 15,000 PG books could end up being hosted on Lulu. No review copy would ever be required, so the posting would be free for the converter. Lulu could end up hosting the entire PG corpus for free in a kind of publishing spam. The books are listed with a royalty of $1 to $2. One is published with a $1.59 royalty, and claims that $1 will be contributed to PG for every book sold. This leaves only 27 cents for the seller. In one case the publisher had re-copyrighted the book and in the other had listed it as Public Domain. Nothing wrong with this, but the copyright only applies to "new material" and certainly not the entire book. In one case an ISBN number was listed, so Lulu might have gotten some revenue from that if the ISBN is real. One of the books was listed as 5000 in sales, so I imagine that is how many Lulu has in its archive. It may soon get 14,999 more! Another feature of Lulu is that you never know who is selling the book. Lulu distributes it, but the real seller is someone else, unknown. This may raise legal issues about ultimate responsibility. People like myself who provide added content at no or minimal royalty will be unhappy to see our listing efforts buried in an avalanche of Lulu spam. At the very least Lulu should require permission before violating trademark laws. To see the books in this post, search for "Verne" on Lulu. The additional cost of hosting all these books could end up forcing up-front charges on Lulu providers or radically restructuring the way Lulu operates, neither of which is desirable in my humble opinion. I mention this as a discussion topic, as I feel it is an emerging problem. --------------------- N Wolcott nwolcott2@post.harvard.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20050330/00d64f22/attachment.html From kouhia at nic.funet.fi Thu Mar 31 06:26:22 2005 From: kouhia at nic.funet.fi (Juhana Sadeharju) Date: Thu Mar 31 06:26:31 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera Message-ID: >From: "Felix E. Klee" > >How did OCR'ing go? I wonder because the resolution of cheap digital >cameras is quite low for scanning. Well, I did not test OCR'ing at all. :-) I store digitizations only as images, which are also used for reading. Please test it yourself and report the results to the list. ftp://ftp.funet.fi/pub/sci/audio/devel/books/ The first few images are various tests.
The digitization sequence test starts at the image 1438. Remember, it is a tourist camera with lens distortions and poor focus control. I used a plain ceiling light, not better movable lights. The book is on a chair and the photographed page points directly up -- which is wrong. Yes, one page per image is better because the page bends when the book is laid wide open. The book and camera stand could be designed so that the book rests in a V-shaped holder and the camera faces the page perpendicularly. That is, the camera would not be above the book and would not face down. (A scanner which allows the book to rest on the edge of the scanning glass solves the same bending-pages problem. So does a scanning glass wedge.) Juhana -- http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software From traverso at dm.unipi.it Thu Mar 31 06:52:39 2005 From: traverso at dm.unipi.it (Carlo Traverso) Date: Thu Mar 31 06:50:59 2005 Subject: [gutvol-d] Re: Scanner vs. digital camera In-Reply-To: (message from Juhana Sadeharju on Thu, 31 Mar 2005 17:26:22 +0300) References: Message-ID: <200503311452.j2VEqdn29068@posso.dm.unipi.it> >>>>> "Juhana" == Juhana Sadeharju writes: >> From: "Felix E. Klee" >> >> How did OCR'ing go? I wonder because the resolution of cheap >> digital cameras is quite low for scanning. Juhana> Well, I did not test OCR'ing at all. :-) I store Juhana> digitizations only as images, which are also used for Juhana> reading. Juhana> Please test it yourself and report the results to the list. Juhana> ftp://ftp.funet.fi/pub/sci/audio/devel/books/ The first few Juhana> images are various tests. The digitization sequence Juhana> test starts at the image 1438. Please, instead of putting a big 72MB tar.gz file there, can you put up some individual images? Probably downloading a couple is enough to say that they are unsuitable for OCR. Indeed, my attempts with a good digital camera (5Mpixels, manual focus, uncompressed tiff output, a special mode for text, a professional tripod, etc.) have been poor. Carlo Traverso
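The quick check Carlo suggests can be scripted. The sketch below is a minimal illustration in Python, assuming the OpenCV and pytesseract packages plus a working Tesseract install; the file name echoes Juhana's image numbering but is an arbitrary placeholder, and the sharpness cutoff is an uncalibrated guess. It shows the idea only, not a tool anyone on the list actually ran.

import cv2
import pytesseract

def ocr_suitability(path):
    """Rough check of whether a camera shot of a book page is worth OCR'ing."""
    image = cv2.imread(path)
    if image is None:
        raise IOError("could not read " + path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Variance of the Laplacian is a common focus/sharpness measure;
    # the 100.0 cutoff is only a rough, uncalibrated guess.
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    text = pytesseract.image_to_string(gray)
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    print("sharpness: %.1f %s" % (sharpness, "(likely too blurry)" if sharpness < 100.0 else ""))
    print("words recognized: %d" % len(words))

ocr_suitability("image1438.jpg")

A near-zero word count on a reasonably sharp image usually points to resolution, lighting, or page curvature as the limiting factor rather than the OCR engine itself.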
From kth at srv.net Thu Mar 31 08:18:42 2005 From: kth at srv.net (Kevin Handy) Date: Thu Mar 31 08:51:58 2005 Subject: [gutvol-d] DP Down? Message-ID: <424C22E2.6040603@srv.net> Is it just me, or is DP down today? All I get is a forbidden message. Any news on when it will be available again? From miranda_vandeheijning at blueyonder.co.uk Thu Mar 31 08:57:19 2005 From: miranda_vandeheijning at blueyonder.co.uk (Miranda van de Heijning) Date: Thu Mar 31 08:57:28 2005 Subject: [gutvol-d] DP Down? In-Reply-To: <424C22E2.6040603@srv.net> References: <424C22E2.6040603@srv.net> Message-ID: <424C2BEF.9000206@blueyonder.co.uk> DP's ISP has been down today.... The latest news from the DP local bar, aka the chatroom at jabber.org, is that the ISP is back up, but we are still waiting for DP's server to come back. Keep checking! In the meantime, you can visit the European site http://dp.rastko.net/ for all your proofing needs. Best regards, Miranda Kevin Handy wrote: > Is it just me, or is DP down today? All I get is a forbidden message. > Any news on when it will be available again? > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > From servalan at ar.com.au Thu Mar 31 17:54:33 2005 From: servalan at ar.com.au (Pauline) Date: Thu Mar 31 17:55:29 2005 Subject: [gutvol-d] DP Down? In-Reply-To: <424C2BEF.9000206@blueyonder.co.uk> References: <424C22E2.6040603@srv.net> <424C2BEF.9000206@blueyonder.co.uk> Message-ID: <424CA9D9.6030508@ar.com.au> Miranda van de Heijning wrote: > DP's ISP has been down today.... The latest news from the DP local bar, > aka the chatroom at jabber.org, is that the ISP is back up, but we are > still waiting for DP's server to come back. Keep checking! & the DP server is now back up & available. Thanks for your patience. Cheers, P -- Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net "Preserving history one page at a time." Set free dead-tree books: http://bookcrossing.com/referral/servalan