From kionon at animemusicvideos.org Sat Mar 1 10:35:08 2008 From: kionon at animemusicvideos.org (Kionon) Date: Sun, 2 Mar 2008 03:35:08 +0900 Subject: [gutvol-d] The Old Fashioned Way... Message-ID: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> List, If I wish to add a public domain book to the project, and I actually desire to type it up by hand, is there any reason why I can't? Very respectfully, Kevin M. Callahan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/8a5513c7/attachment.htm From klofstrom at gmail.com Sat Mar 1 10:56:47 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sat, 1 Mar 2008 08:56:47 -1000 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> Message-ID: <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> On 3/1/08, Kionon wrote: > If I wish to add a public domain book to the project, and I actually desire > to type it up by hand, is there any reason why I can't? Unfortunately, you can. I say "unfortunately" because it is close to certain that you are going to produce a flawed text. Since Project Gutenberg, at present, doesn't have any quality controls, it will accept your flawed text. Why flawed? The more work I do at Distributed Proofreaders -- and in commercial publishing -- the clearer it is that it takes more than one pair of eyes to produce a good text. A second person will catch what the first person missed. No matter how good the first person is. That's why Distributed Proofreaders now subjects most (but not all) of the texts it produces to three rounds of human proofreading. Particularly easy projects may be done in two rounds. DP recently re-did a book that had been done in the early days of Project Gutenberg. 
The post-processor checked for differences between the early, typed-in text and the later DP effort. There were 44 differences, all of them errors in the earlier text.

If you find the type-in process involving and soothing, DP does do "type-in" projects. These are texts that cannot be OCRed, usually very old books in antique typefaces.

Instead of trying to do it on your own, come join the community at Distributed Proofreaders. We make better books and we (usually) have a lot of fun doing it. If you participate in the community forum, the bulletin board, you will meet lots of bright, bookish, and delightfully eccentric people.

-- Zora

From kionon at animemusicvideos.org Sat Mar 1 11:08:26 2008
From: kionon at animemusicvideos.org (Kionon)
Date: Sun, 2 Mar 2008 04:08:26 +0900
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com>
References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com>
	<1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com>
Message-ID: <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com>

> Unfortunately, you can. I say "unfortunately" because it is close to
> certain that you are going to produce a flawed text. Since Project
> Gutenberg, at present, doesn't have any quality controls, it will
> accept your flawed text.

Hrm.

> Why flawed? The more work I do at Distributed Proofreaders -- and in
> commercial publishing -- the clearer it is that it takes more than one
> pair of eyes to produce a good text. A second person will catch what
> the first person missed. No matter how good the first person is.
> That's why Distributed Proofreaders now subjects most (but not all) of
> the texts it produces to three rounds of human proofreading.
> Particularly easy projects may be done in two rounds.

Oh, I am most certainly aware of this. My background is in journalism, politics, and philosophy.
I always made it a habit to read documents backwards, but always had other eyes looking over the text as well.

> If you find the type-in process involving and soothing, DP does do
> "type-in" projects. These are texts that cannot be OCRed, usually very
> old books in antique typefaces.

Well, there certainly is an interest in the process in general. However, I also wanted to do works that were of personal importance to me. There is something far more magical, I would think, about a text that impacted you in such a way that you would wish to do a rather labor-intensive transcription. That, and of course, I lack a scanner, and am not located within a reasonable radius of a place where I could obtain one (I am, in fact, not even in an English-speaking country).

> Instead of trying to do it on your own, come join the community at
> Distributed Proofreaders. We make better books and we (usually) have a
> lot of fun doing it. If you participate in the community forum, the
> bulletin board, you will meet lots of bright, bookish, and
> delightfully eccentric people.

I had intended to do that, certainly, but again I point to the fact that I wish to work on projects that have had a particular impact on me; we are far more wont, it should not surprise you, to preserve that which we are fond of.

Very respectfully,
Kevin M. Callahan

From grythumn at gmail.com Sat Mar 1 11:13:36 2008
From: grythumn at gmail.com (Robert Cicconetti)
Date: Sat, 1 Mar 2008 14:13:36 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> Message-ID: <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote: > That, and of course, I lack a scanner, and am not located within a > reasonable radius of a location with which to obtain one (I am, in > fact, not even in an English speaking country). > You have to have access to a scanner (or find a scan online of the same edition) in order to use the copyright clearance process at copy.pglaf.org. I think there is an older process involving mailing in photocopies of the title/verso to MH, but I have never used it... R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080301/04c9c9c7/attachment.htm From kionon at animemusicvideos.org Sat Mar 1 11:15:19 2008 From: kionon at animemusicvideos.org (Kionon) Date: Sun, 2 Mar 2008 04:15:19 +0900 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> Message-ID: <8893d7a30803011115v5d6f816evfa0a4f73d78f6931@mail.gmail.com> > You have to have access to a scanner (or find a scan online of the same > edition) in order to use the copyright clearance process at copy.pglaf.org. > I think there is an older process involving mailing in photocopies of the > title/verso to MH, but I have never used it... > That would seem to present a problem. Very respectfully, Kevin M. 
Callahan From grythumn at gmail.com Sat Mar 1 11:58:02 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Sat, 1 Mar 2008 14:58:02 -0500 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <8893d7a30803011115v5d6f816evfa0a4f73d78f6931@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> <8893d7a30803011115v5d6f816evfa0a4f73d78f6931@mail.gmail.com> Message-ID: <15cfa2a50803011158l2ff0b112i75af06dc3a7c7399@mail.gmail.com> On Sat, Mar 1, 2008 at 2:15 PM, Kionon wrote: > > You have to have access to a scanner (or find a scan online of the same > > edition) in order to use the copyright clearance process at > copy.pglaf.org. > > I think there is an older process involving mailing in photocopies of > the > > title/verso to MH, but I have never used it... > > > > That would seem to present a problem. > How about access to a digital camera? Even an old webcam might do, if you take pictures of several segments and stitch them together. We're only talking 2-4 pages. R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080301/bdf4c8ed/attachment.htm From ajhaines at shaw.ca Sat Mar 1 12:12:44 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 01 Mar 2008 12:12:44 -0800 Subject: [gutvol-d] The Old Fashioned Way... References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> Message-ID: <001e01c87bd8$9c7e89b0$6501a8c0@ahainesp2400> As long as you've gotten a copyright clearance as per PG's How-To's, you can pretty much do what you want. BUT... If you do this book by hand, be prepared to proof it almost word by word. 
It's too easy for eyes to jump words, skip lines, and make all manner of human mistakes. Type a page, proof a page is a good rule. When you've done an entire chapter, put it aside for a few days so it's stale to your memory of it, then proof it again. Then, get someone else to do another proof. If you haven't already, check out the PG How-To and FAQ links at PG's main page http://www.gutenberg.org/wiki/Main_Page. I don't know what word processor you're using, but try to keep the lines the same length as they are in the original book. If possible, put a soft return at the end of each line (in MS-Word, that's done with Shift-Enter). (But use a hard return at paragraph end.) That way, you can do a line count of each typed page, and if that doesn't match the book, you've done something wrong. The soft returns can be dealt with when the chapter is complete. Save each chapter as a separate file, in two formats--your word processor's native format and as a standard text file. Run Gutcheck, Jeebies, and Gutspell on the text version, and fix problems in the native version. They're available at http://gutcheck.sourceforge.net/ and are invaluable for finding typos, scannos, etc, etc, but they work only on text files. Don't forget to do a spellcheck with whatever word processor you're using. If the book has footnotes, type them at the bottom of their respective page and leave them there until the page is thoroughly proofed. When the chapter is complete, they can be handled as per PG guidelines. (I renumber them sequentially and move them to the end of their chapter.) I speak from a certain amount of experience--several years ago I proofed a 450-page book that someone had spent three years typing by hand. (They had sent me the book to proof against.) The person had done a reasonable job, all things considered, but I still found a number of problems per page. 
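As an editorial aside, the line-count check Al describes (type the lines at book length, then compare the number of lines per typed page against the printed book) is easy to script. This is an illustrative sketch only, not a PG tool; it assumes pagebreaks are marked with a line of dashes, as suggested later in this thread:

```python
# Sketch of Al's line-count check (not an official PG tool): compare the
# number of lines typed for each page against the counts taken from the
# printed book. Pagebreaks are assumed to be marked with a line of dashes.

def count_page_lines(text):
    """Split a type-in on dash-line pagebreaks; return non-blank lines per page."""
    pages = [[]]
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"} and len(stripped) >= 4:
            pages.append([])   # a line of dashes starts a new page
        else:
            pages[-1].append(line)
    # Blank lines are ignored so paragraph spacing doesn't skew the count.
    return [sum(1 for l in page if l.strip()) for page in pages]

def check_counts(text, expected):
    """Return (page number, typed count, book count) for mismatched pages."""
    return [(i + 1, a, e)
            for i, (a, e) in enumerate(zip(count_page_lines(text), expected))
            if a != e]
```

A mismatch pinpoints the page where a line was skipped or doubled, which is exactly the "you've done something wrong" signal Al is after.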
The whole exercise took me about six weeks, and I'm certain that there are still problems in the posted version, just because of the density and complexity of the book. Several months later, I got hold of the book's follow-on volume, which had about the same complexity, and produced it in about two weeks, clearance to submission, with a much cleaner result, simply because I started from a scanned text, not a hand-typed text.

A follow-up on Robert's comment re getting a clearance: Michael Hart's address is in this How-To: http://www.gutenberg.org/wiki/Gutenberg:Copyright_How-To.

Al

From steven at desjardins.org Sat Mar 1 19:47:50 2008
From: steven at desjardins.org (Steven desJardins)
Date: Sat, 1 Mar 2008 22:47:50 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com>
References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com>
	<1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com>
	<8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com>
Message-ID: <41fd8970803011947p7a63350dh8673b7bddec00048@mail.gmail.com>

On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote:
> > Instead of trying to do it on your own, come join the community at
> > Distributed Proofreaders. We make better books and we (usually) have a
> > lot of fun doing it. If you participate in the community forum, the
> > bulletin board, you will meet lots of bright, bookish, and
> > delightfully eccentric people.
>
> I had intended to do that, certainly, but again I point to the fact I
> wish to work on projects that have had a particular impact on me; we
> are far more wont, it should not surprise you, to preserve that which
> we are fond of.

It's possible to make such a contribution at Distributed Proofreaders, even without a scanner or OCR software.
There are several sites, like Google Book Search and the Internet Archive, which have scans of public domain books. If you find one of the books you want to preserve on one of these sites (and you check to make sure it has no missing pages), then you should be able to find someone who will OCR the files for you. At that point, you can take over as Project Manager and shepherd the book through the rounds. When it enters post-processing you can, if you choose, do that step yourself, using software developed at Distributed Proofreaders for exactly that purpose. I guarantee you this will be easier and result in a higher-quality electronic book than trying to type in the whole thing and proofread it yourself. In any case, before trying to do a solo project, I strongly recommend you spend some time at Distributed Proofreaders, get some experience in the proofreading and formatting rounds, and post-process one or two books from DP's pool. Over time, DP has figured out what works pretty well and what doesn't. You may not agree with all of our procedures--if you read this list for more than a few days, you'll see that a lot of people don't--but working through your first few books with a set of carefully established guidelines and a forum full of helpful, experienced folks is the best education I know of. From klofstrom at gmail.com Sat Mar 1 20:21:09 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sat, 1 Mar 2008 18:21:09 -1000 Subject: [gutvol-d] The Old Fashioned Way... 
In-Reply-To: <41fd8970803011947p7a63350dh8673b7bddec00048@mail.gmail.com> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <41fd8970803011947p7a63350dh8673b7bddec00048@mail.gmail.com> Message-ID: <1e8e65080803012021l6c8a47cfp5d4dd196fd3db91e@mail.gmail.com> On 3/1/08, Steven desJardins wrote: > It's possible to make such a contribution at Distributed Proofreaders, > even without a scanner or OCR software. There are several sites, like > Google Book Search and the Internet Archive, which have scans of > public domain books. If you find one of the books you want to preserve > on one of these sites (and you check to make sure it has no missing > pages), then you should be able to find someone who will OCR the files > for you. At that point, you can take over as Project Manager and > shepherd the book through the rounds. When it enters post-processing > you can, if you choose, do that step yourself, using software > developed at Distributed Proofreaders for exactly that purpose. I > guarantee you this will be easier and result in a higher-quality > electronic book than trying to type in the whole thing and proofread > it yourself. Steven knows whereof he speaks; he's one of the more prolific of the content providers at DP. He likes to feel responsible for the finished result, so he usually does just as he says: PMs and then PPs the book. I've done this for a couple of books, but I'm much lazier than Steven and like to just proof, leaving the responsibility to others. I can assure you that if you PM and PP, you will feel that it's YOUR book. It will have your name on it too, in the acknowledgments. -- Zora From Bowerbird at aol.com Sat Mar 1 21:42:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 2 Mar 2008 00:42:19 EST Subject: [gutvol-d] The Old Fashioned Way... 
Message-ID: kevin said: > If I wish to add a public domain book to the project, > and I actually desire to type it up by hand, > is there any reason why I can't? no, there's no reason you can't. and indeed, it is a _wonderful_ way to interact deeply with a book. when a book is truly meaningful to you, you will absorb it into your d.n.a. if you type it by hand... it's very time-consuming, yes, but also rewarding. *** you might look on the net to see if the book has already been digitized... if it has, then you would be doing the world an equally worthy favor if you _proofed_ that existing copy. just an idea for you. *** also, if you need help in finalizing your type-ins, i'd be honored to give you some of my software... i'll even go further and check your work myself... i've submitted to p.g., so i know the requirements. *** steven said: > I guarantee you this will be easier and result in > a higher-quality electronic book than trying to > type in the whole thing and proofread it yourself. and i guarantee you that, with my help, kevin, you will be able to create a higher-quality electronic-book than d.p. *** steven said: > In any case, before trying to do a solo project, > I strongly recommend you spend some time > at Distributed Proofreaders, get some experience > in the proofreading and formatting rounds, and > post-process one or two books from DP's pool. how very ironic that this post should come through today. because just _yesterday_, i got an e-mail from a person... a while back, he had come on the list, just like you, kevin, asking how he might begin the process of digitization... i recommended that he should go over to d.p. and join, proof some pages there, see how the system worked, etc. i told him not to pick up any bad habits over there, because they do a lot of things the wrong way, but to join and learn... well, i guess it didn't work out for him. he's back on his own. so i told him i'd help him out. and he sent me his text-files... 
and gosh, i'm looking at the mess that d.p. visits on a book... i've also been examining one of the "tests" they are running, and i've found it is 10 times more work _undoing_ their mess than it would've been if i'd started with the original materials. so i can no longer in good faith point anyone toward d.p. indeed, i think the best course is to recommend against them.

***

steven said:
> Over time, DP has figured out what works pretty well
> and what doesn't.

i disagree. vehemently. the d.p. workflow is _extremely_ bad. it wastes valuable time and energy of thousands of volunteers. the reason people think it's good is they don't know any better.

> working through your first few books with a set of
> carefully established guidelines and a forum full of
> helpful, experienced folks is the best education I know of.

if you could be immunized against the damage caused by exposure to the d.p. workflow, then you might well benefit from dialog with the volunteers in the forums, should you come across any rough spots when doing your digitization. there are people there with a lot of digitization experience... but as that's probably not possible, i'd advise you to stay away. again, this is a change from what i've recommended up to now. it was wrong to recommend them, so i have changed my mind...

***

and now we come to the advice given to kevin by al haines. al is a _phenomenal_ digitizer -- he has done probably _hundreds_ of books submitted to d.p. -- and is a newly-deputized whitewasher. so it pains me a great deal to have to take issue with some of his points, however minor my disagreement might be. (and it's usually quite minor.) but i must. so i will.

***

al said:
> I don't know what word processor you're using, but try to
> keep the lines the same length as they are in the original book.
> If possible, put a soft return at the end of each line (in MS-Word,
> that's done with Shift-Enter). (But use a hard return at paragraph end.)
> That way, you can do a line count of each typed page,
> and if that doesn't match the book, you've done something wrong.
> The soft returns can be dealt with when the chapter is complete.

ok, so let us begin with the slight disagreement. :+)

don't "try" to keep the lines the same length as in the original book. instead, type the lines exactly _as_is_. even hyphenate the words which were hyphenated when the text hit against the right margin. it's true p.g. wants you to dehyphenate those words, but you can have me (have my software) do that dehyphenation for you, later. some people _want_ the linebreaks just as they were in the p-book, and there's absolutely no reason for you not to make them happy... besides, you will find it easier to get into the rhythm of the type-in if you get yourself in sync with the lines as they appear on the page. and, of course, _proofing_ -- which we all agree has to be done -- is _absolutely_ easier (by orders of magnitude) when the linebreaks in your text-file precisely match the linebreaks of the physical page. so put a hard-return at the end of each line, and 2 hard-returns at the end of each paragraph, and save yourself a ton of later misery...

> Save each chapter as a separate file, in two formats --
> your word processor's native format and as a standard text file.
> Run Gutcheck, Jeebies, and Gutspell on the text version,
> and fix problems in the native version.

you can break the file into separate chapters if you _want_ to. but there's certainly no _need_ to do it. (i'd think it's a hassle.) and for goodness sake, do _not_ use "gutcheck", or "jeebies", or "gutspell", even if you know what they are. (and if you don't, be glad that you don't have to bother learning what they are...) i will correct your text entirely by running checks using my tools.
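As an editorial aside, the "dehyphenate later" step described above can be sketched in a few lines. This is my own illustration, not bowerbird's software; it naively rejoins any word split at a line's right margin, whereas a real tool would also consult a wordlist before removing a possibly genuine hyphen (e.g. "to-day"):

```python
# Naive sketch of dehyphenating a line-for-line type-in. A word split at
# the right margin ends its line with a single hyphen; this pulls the
# second half of the word back up, leaving the p-book linebreaks intact.
# (If the fragment was the whole next line, an empty line is left behind.)

def dehyphenate(lines):
    lines = list(lines)  # work on a copy
    out = []
    for i, line in enumerate(lines):
        if (line.endswith("-") and not line.endswith("--")  # skip "--" dashes
                and i + 1 < len(lines) and lines[i + 1].strip()):
            first, _, rest = lines[i + 1].partition(" ")
            line = line[:-1] + first   # rejoin the broken word
            lines[i + 1] = rest        # remainder stays on its own line
        out.append(line)
    return out
```

Because the linebreaks are preserved, the output still matches the printed page line for line, which keeps the later proofing pass easy.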
you _do_ need to run a spellcheck on your work, most certainly, but anyone who has learned to proof text by reading it backward doesn't need to be told something as basic and elementary as that.

what you might not appreciate, however, at least not _sufficiently_, is the value of creating a specific "dictionary" for your spellchecker for each book. a book has a certain number of words unique to it -- such as names -- which will typically occur with great frequency. you don't want your spellchecker to stop at each and every one, but you might not wanna add the word to your _main_ dictionary either. so if your wordprocessor lets you do so -- and many of them do -- declare a "special" dictionary for each book you do, and use it then. (indeed, if your _main_ spellchecker doesn't allow this, it is worth the trouble to do the spellchecking in a wordprocessor that does.) don't use the "alternate" dictionary, allowed by some wordprocessors. make it a "special" dictionary, one that you'll use solely for that book. that way you can always go back to the book, at any time, and call up its special dictionary. very handy. as an aside, i find it fascinating to examine the special dictionary; gives you insight into the book itself.

> If the book has footnotes, type them at the bottom of their
> respective page and leave them there until the page is
> thoroughly proofed. When the chapter is complete,
> they can be handled as per PG guidelines. (I renumber them
> sequentially and move them to the end of their chapter.)

my programs automatically handle all of that footnote movement... just type 'em at the bottom of the page, exactly like you find them, and let me worry about the rest of it.

oh yeah, _about_ pages... just as you've recorded the linebreaks as they were in the p-book, you'll need to mark the pagebreaks as well.
i suggest you simply use the "pagebreak" command in your wordprocessor, but if you are using one that doesn't have such a command, then just type a line of dashes for a pagebreak. _do_ type the pagenumber too; you can either type it at the _bottom_ of each page, or the _top_, but do it _consistently_ -- even if the p-book did it inconsistently! also, because you want these p-book pagenumbers to be in sync with the pagenumbers as they're figured by your wordprocessor, put the frontmatter in one file, and the body-text in another file... that way, "page 1" in your wordprocessor will be the _real_ "page 1". let's see, is there anything else? text-styling! oh my goodness, i almost forgot. p.g. wants ascii, but you should _definitely_ record any text-styling when present, like italics and bold. use your regular wordprocessor formatting. when you are finished, you can convert it to the p.g. conventions. (what this means is that you do _not_ save your file as a text file.) i can create an .html file and a .pdf out of your work, if you want, so don't think you must do any special work to accomplish that... feel free to send me any of your work once you've started doing it. i'll be happy to give you feedback if you're doing anything wrong... and if google (or anyone else) has scanned the book you're doing, let me know, and i'll take a look at it to see if you need any advice... (i will also o.c.r. it for you, and compare the output to your type-in; that way, you won't even have to do the proofing if you don't want. in contrast to typing, proofing does _not_ imprint upon your d.n.a.) welcome to the world of book-digitizing... -bowerbird p.s. having your computer do text-to-speech on your file can be a _great_ way to do "proofing". the errors really jump out at you... ************** Ideas to please picky eaters. Watch video on AOL Living. 
(http://living.aol.com/video/how-to-please-your-picky-eater/rachel-campos-duffy/ 2050827?NCID=aolcmp00300000002598) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/619674e1/attachment-0001.htm From ajhaines at shaw.ca Sat Mar 1 23:17:56 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 01 Mar 2008 23:17:56 -0800 Subject: [gutvol-d] The Old Fashioned Way... References: Message-ID: <001701c87c35$89af7f30$6501a8c0@ahainesp2400> Correction - I've never used DP for my submissions. Two reasons--one, I wasn't aware of DP until a year or so after I started doing ebooks for PG, and, two, I prefer the immediacy, control, and accuracy I can bring to an ebook by doing everything myself. (I did use DP's harvesting page some months ago to record a couple of harvests from Internet Archive, but did my own work on them, then stopped harvesting in favor of the several hundred books I have that aren't in PG *or* in IA.) Bowerbird - can you supply a list of the books you've submitted to PG? I'd like to have a look a them. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Saturday, March 01, 2008 9:42 PM Subject: Re: [gutvol-d] The Old Fashioned Way... kevin said: > If I wish to add a public domain book to the project, > and I actually desire to type it up by hand, > is there any reason why I can't? no, there's no reason you can't. and indeed, it is a _wonderful_ way to interact deeply with a book. when a book is truly meaningful to you, you will absorb it into your d.n.a. if you type it by hand... it's very time-consuming, yes, but also rewarding. *** you might look on the net to see if the book has already been digitized... if it has, then you would be doing the world an equally worthy favor if you _proofed_ that existing copy. just an idea for you. 
*** also, if you need help in finalizing your type-ins, i'd be honored to give you some of my software... i'll even go further and check your work myself... i've submitted to p.g., so i know the requirements. *** steven said: > I guarantee you this will be easier and result in > a higher-quality electronic book than trying to > type in the whole thing and proofread it yourself. and i guarantee you that, with my help, kevin, you will be able to create a higher-quality electronic-book than d.p. *** steven said: > In any case, before trying to do a solo project, > I strongly recommend you spend some time > at Distributed Proofreaders, get some experience > in the proofreading and formatting rounds, and > post-process one or two books from DP's pool. how very ironic that this post should come through today. because just _yesterday_, i got an e-mail from a person... a while back, he had come on the list, just like you, kevin, asking how he might begin the process of digitization... i recommended that he should go over to d.p. and join, proof some pages there, see how the system worked, etc. i told him not to pick up any bad habits over there, because they do a lot of things the wrong way, but to join and learn... well, i guess it didn't work out for him. he's back on his own. so i told him i'd help him out. and he sent me his text-files... and gosh, i'm looking at the mess that d.p. visits on a book... i've also been examining one of the "tests" they are running, and i've found it is 10 times more work _undoing_ their mess than it would've been if i'd started with the original materials. so i can no longer in good faith point anyone toward d.p. indeed, i think the best course is to recommend against them. *** steven said: > Over time, DP has figured out what works pretty well > and what doesn't. i disagree. vehemently. the d.p. workflow is _extremely_ bad. it wastes valuable time and energy of thousands of volunteers. 
the reason people think it's good is they don't know any better. > working through your first few books with a set of > carefully established guidelines and a forum full of > helpful, experienced folks is the best education I know of. if you could be immunized against the damage caused by exposure to the d.p. workflow, then you might well benefit from dialog with the volunteers in the forums, should you come across any rough spots when doing your digitization. there are people there with a lot of digitization experience... but as that's probably not possible, i'd advise you to stay away. again, this is a change from what i've recommended up to now. it was wrong to recommend them, so i have changed my mind... *** and now we come to the advice given to kevin by al haines. al is a _phenomenal_ digitizer -- he has done probably _hundreds_ of books submitted to d.p. -- and is a newly-deputized whitewasher. so it pains me a great deal to have to take issue with some of his points, however minor my disagreement might be. (and it's usually quite minor.) but i must. so i will. *** al said: > I don't know what word processor you're using, but try to > keep the lines the same length as they are in the original book. > If possible, put a soft return at the end of each line (in MS-Word, > that's done with Shift-Enter). (But use a hard return at paragraph end.) > That way, you can do a line count of each typed page, > and if that doesn't match the book, you've done something wrong. > The soft returns can be dealt with when the chapter is complete. ok, so let us begin with the slight disagreement. :+) don't "try" to keep the lines the same length as in the original book. instead, type the lines exactly _as_is_. even hyphenate the words which were hyphenated when the text hit against the right margin. it's true p.g. wants you to dehyphenate those words, but you can have me (have my software) do that dehyphenation for you, later. 
some people _want_ the linebreaks just as they were in the p-book, and there's absolutely no reason for you not to make them happy... besides, you will find it easier to get into the rhythm of the type-in if you get yourself in sync with the lines as they appear on the page.

and, of course, _proofing_ -- which we all agree has to be done -- is _absolutely_ easier (by orders of magnitude) when the linebreaks in your text-file precisely match the linebreaks of the physical page. so put a hard-return at the end of each line, and 2 hard-returns at the end of each paragraph, and save yourself a ton of later misery...

> Save each chapter as a separate file, in two formats --
> your word processor's native format and as a standard text file.
> Run Gutcheck, Jeebies, and Gutspell on the text version,
> and fix problems in the native version.

you can break the file into separate chapters if you _want_ to. but there's certainly no _need_ to do it. (i'd think it's a hassle.)

and for goodness sake, do _not_ use "gutcheck", or "jeebies", or "gutspell", even if you know what they are. (and if you don't, be glad that you don't have to bother learning what they are...) i will correct your text entirely by running checks using my tools.

you _do_ need to run a spellcheck on your work, most certainly, but anyone who has learned to proof text by reading it backward doesn't need to be told something as basic and elementary as that.

what you might not appreciate, however, at least not _sufficiently_, is the value of creating a specific "dictionary" for your spellchecker for each book. a book has a certain number of words unique to it -- such as names -- which will typically occur with great frequency. you don't want your spellchecker to stop at each and every one, but you might not wanna add the word to your _main_ dictionary either. so if your wordprocessor lets you do so -- and many of them do -- declare a "special" dictionary for each book you do, and use it then.
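The per-book dictionary idea is easy to prototype outside a word processor, too. A minimal sketch (the function name and word lists are invented for illustration, and a real spellchecker handles case variants, possessives, and hyphenation far more carefully): report only words found in neither the main wordlist nor the book's special dictionary.

```python
import re

def unknown_words(text, main_dict, book_dict):
    """Words in `text` found in neither the main dictionary nor the
    book's special dictionary.  Sketch only: comparison is done
    case-insensitively and punctuation other than apostrophes is
    ignored."""
    known = {w.lower() for w in main_dict} | {w.lower() for w in book_dict}
    words = re.findall(r"[A-Za-z']+", text)
    return sorted({w for w in words if w.lower() not in known})
```

With a special dictionary holding a book's character names, only genuine suspects surface, which is exactly the benefit claimed for per-book dictionaries: the checker stops flagging the names that recur on every page.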
(indeed, if your _main_ spellchecker doesn't allow this, it is worth the trouble to do the spellchecking in a wordprocessor that does.)

don't use the "alternate" dictionary, allowed by some wordprocessors. make it a "special" dictionary, one that you'll use solely for that book. that way you can always go back to the book, at any time, and call up its special dictionary. very handy. as an aside, i find it fascinating to examine the special dictionary; gives you insight into the book itself.

> If the book has footnotes, type them at the bottom of their
> respective page and leave them there until the page is
> thoroughly proofed. When the chapter is complete,
> they can be handled as per PG guidelines. (I renumber them
> sequentially and move them to the end of their chapter.)

my programs automatically handle all of that footnote movement... just type 'em at the bottom of the page, exactly like you find them, and let me worry about the rest of it.

oh yeah, _about_ pages... just as you've recorded the linebreaks as they were in the p-book, you'll need to mark the pagebreaks as well. i suggest you simply use the "pagebreak" command in your wordprocessor, but if you are using one that doesn't have such a command, then just type a line of dashes for a pagebreak. _do_ type the pagenumber too; you can either type it at the _bottom_ of each page, or the _top_, but do it _consistently_ -- even if the p-book did it inconsistently!

also, because you want these p-book pagenumbers to be in sync with the pagenumbers as they're figured by your wordprocessor, put the frontmatter in one file, and the body-text in another file... that way, "page 1" in your wordprocessor will be the _real_ "page 1".

let's see, is there anything else? text-styling! oh my goodness, i almost forgot. p.g. wants ascii, but you should _definitely_ record any text-styling when present, like italics and bold. use your regular wordprocessor formatting. when you are finished, you can convert it to the p.g.
conventions. (what this means is that you do _not_ save your file as a text file.)

i can create an .html file and a .pdf out of your work, if you want, so don't think you must do any special work to accomplish that...

feel free to send me any of your work once you've started doing it. i'll be happy to give you feedback if you're doing anything wrong... and if google (or anyone else) has scanned the book you're doing, let me know, and i'll take a look at it to see if you need any advice... (i will also o.c.r. it for you, and compare the output to your type-in; that way, you won't even have to do the proofing if you don't want. in contrast to typing, proofing does _not_ imprint upon your d.n.a.)

welcome to the world of book-digitizing...

-bowerbird

p.s. having your computer do text-to-speech on your file can be a _great_ way to do "proofing". the errors really jump out at you...

**************
Ideas to please picky eaters. Watch video on AOL Living. (http://living.aol.com/video/how-to-please-your-picky-eater/rachel-campos-duffy/2050827?NCID=aolcmp00300000002598)

------------------------------------------------------------------------------
_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080301/14e97ce8/attachment.htm

From hyphen at hyphenologist.co.uk Sat Mar 1 23:38:59 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Sun, 2 Mar 2008 07:38:59 -0000
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com>
References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com> <15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com>
Message-ID: <001801c87c38$7b822a40$72867ec0$@co.uk>

I have a spare scanner in the loft and an unused CDROM of Abbyy Finereader if there is any way of shipping them to you, or anyone else for that matter. I am in the UK.

Dave Fawthrop

From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Robert Cicconetti
Sent: 01 March 2008 19:14
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] The Old Fashioned Way...

On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote:

That, and of course, I lack a scanner, and am not located within a reasonable radius of a location with which to obtain one (I am, in fact, not even in an English speaking country).

You have to have access to a scanner (or find a scan online of the same edition) in order to use the copyright clearance process at copy.pglaf.org. I think there is an older process involving mailing in photocopies of the title/verso to MH, but I have never used it...

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/74bc21be/attachment-0001.htm

From prosfilaes at gmail.com Sun Mar 2 07:54:19 2008
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 2 Mar 2008 10:54:19 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References:
Message-ID: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com>

Let me note, Kionon, that DP has produced 12,000 volumes for Project Gutenberg, and that the tools that Bowerbird dismisses, gutprep, gutcheck and jeebies, have been used on most of those.
They're fairly well documented and have the source code available if you're inclined to dig deeper or change things. On the other hand, Bowerbird has never done a book for Project Gutenberg, rarely shares his tools and never his code. Note also how often he says "i will" instead of "here's this tool that will let you" or "here's a webpage to show you how". That would be very concerning to me, if I were to work on a project.

From Bowerbird at aol.com Sun Mar 2 12:24:48 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 2 Mar 2008 15:24:48 EST
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID:

al said:
> Bowerbird - can you supply a list of the books
> you've submitted to PG? I'd like to have a look at them.

just one, al. "the universe (or nothing)" by meyer moldeven. #18257.

it was actually never published, so it was a "type-in" in the purest sense, which meant that i mostly did the _editing_ on it, not "proofing" per se... meyer is an old guy who wanted the future to have access to his story... when he posted of his intentions, i offered to help him do a submission.

i would _love_ to hear your feedback on my treatment of meyer's book. indeed, if anyone can find anything i did wrong on it, i'd appreciate it...

it is a completely normal book -- just chapter after chapter of text -- so it wasn't like the _formatting_ of it was difficult. but since it hadn't been through the hands of a professional copy-editor at a publisher, it had lots of copy-editing glitches, so i had to write tools to detect 'em.

but there was no one checking _my_ work, to see if i had made errors... so if anyone wants to do that, it would be nice. heck, even if you want to do it so you can poke me in the eye with a mistake, feel free to proceed...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080302/f3f0064a/attachment.htm

From schultzk at uni-trier.de Mon Mar 3 00:10:06 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon, 3 Mar 2008 09:10:06 +0100
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com>
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com>
Message-ID: <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>

Hi Everybody,

Just followed this thread and I ask how ignorant can people get??

1) Somebody is willing to support and contribute to the project !!
2) The project mongers say no way José you are tooo backwards !!?
3) The work you are going to do will be bad !!???
4) You need different hardware or we do not want you !!??

If somebody wants to contribute let them. If they want to do them by hand then all the more power to them.

As for typing it in by HAND the old fashioned way. I would trust my girl friend more than any old (or newer) scanner/OCR system. Why, you ask? She is a professional secretary and she will outdo any scanner on the first several pages. She does not need to correct anything. No time wasted. I can even dictate to her and it goes into the computer!!!

That is proof enough that single persons can be proficient enough. Please do not bang on those who are willing to do good OLD FASHIONED HANDY WORK !!

Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG

regards
Keith.

From Catenacci at Ieee.Org Mon Mar 3 04:07:05 2008
From: Catenacci at Ieee.Org (Onorio Catenacci)
Date: Mon, 3 Mar 2008 07:07:05 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID:

On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. wrote:
> Hi Everybody,
>
> Just followed this thread and I ask how ignorant can people get??
>
> 1) Somebody is willing to support and contribute to the project !!
> 2) The project mongers say no way José you are tooo backwards !!?
> 3) The work you are going to do will be bad !!???
> 4) You need different hardware or we do not want you !!??
>
> If somebody wants to contribute let them. If they want to do them by hand then all the more power to them.
>
> As for typing it in by HAND the old fashioned way. I would trust my girl friend more than any old (or newer) scanner/OCR system. Why, you ask? She is a professional secretary and she will outdo any scanner on the first several pages. She does not need to correct anything. No time wasted. I can even dictate to her and it goes into the computer!!!
>
> That is proof enough that single persons can be proficient enough. Please do not bang on those who are willing to do good OLD FASHIONED HANDY WORK !!
>
> Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG

Hi Keith,

There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.

--
Onorio Catenacci III

From joshua at hutchinson.net Mon Mar 3 05:26:23 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Mon, 3 Mar 2008 13:26:23 +0000 (GMT)
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID: <683955382.169941204550783529.JavaMail.mail@webmail03>

Don't be ignorant yourself.

No one told him he COULDN'T do it. They merely told him why it is much much more likely to have problems.

They also suggested seeing how DP does things to get a better idea of some "best practices".

He even got suggestions on how to work around the problem that you need to have scans of the title and verso for clearance.

Josh

On Mar 3, 2008, schultzk at uni-trier.de wrote:

Hi Everybody,

Just followed this thread and I ask how ignorant can people get??

1) Somebody is willing to support and contribute to the project !!
2) The project mongers say no way José you are tooo backwards !!?
3) The work you are going to do will be bad !!???
4) You need different hardware or we do not want you !!??

If somebody wants to contribute let them. If they want to do them by hand then all the more power to them.

As for typing it in by HAND the old fashioned way. I would trust my girl friend more than any old (or newer) scanner/OCR system. Why, you ask? She is a professional secretary and she will outdo any scanner on the first several pages. She does not need to correct anything. No time wasted. I can even dictate to her and it goes into the computer!!!

That is proof enough that single persons can be proficient enough. Please do not bang on those who are willing to do good OLD FASHIONED HANDY WORK !!

Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG

regards
Keith.

From joshua at hutchinson.net Mon Mar 3 05:28:05 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Mon, 3 Mar 2008 13:28:05 +0000 (GMT)
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID: <433504138.170011204550885503.JavaMail.mail@webmail03>

Robert,

That's a fairly common thing. A *huge* majority of our books don't have scans to check against. Most would prefer we had them, but especially for our older stuff, we don't have anything.

Josh

On Mar 3, 2008, Catenacci at Ieee.Org wrote:

Hi Keith,

There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.

--
Onorio Catenacci III

From hyphen at hyphenologist.co.uk Mon Mar 3 07:55:23 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Mon, 3 Mar 2008 15:55:23 -0000
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID: <001601c87d47$01878db0$0496a910$@co.uk>

Back in the *old* days proofing was done by typing in two copies and then "diffing" the two.

Dave F

-----Original Message-----
From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Onorio Catenacci
Sent: 03 March 2008 12:07
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] The Old Fashioned Way...

There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.

From schultzk at uni-trier.de Mon Mar 3 09:26:13 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon, 3 Mar 2008 18:26:13 +0100
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID: <9D3EFA16-00DC-4AA8-8E7F-6554F847089C@uni-trier.de>

Hi,

I wonder if this is really a problem. Did it not use to be so? Like I said it is up to Kionon. To me it is just one more book.

regards
Keith

On 03.03.2008, at 13:07, Onorio Catenacci wrote:

> On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. wrote:
>> Hi Everybody,
[snip, snip]
>> Kionon go for it. Do not be stopped by the ignorant. There IS NOTHING stopping you from doing it and getting it contributed to PG. DP maybe, but then again DP is not PG
>
> Hi Keith,
>
> There's one little issue that would prevent Kionon from contributing--that being how will anyone be able to check his electronic text against the original without scans? Unless someone else owns a copy of the book and they're willing to proof it, that seems like a fairly major problem to me.
>

From schultzk at uni-trier.de Mon Mar 3 09:45:31 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon, 3 Mar 2008 18:45:31 +0100
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <683955382.169941204550783529.JavaMail.mail@webmail03>
References: <683955382.169941204550783529.JavaMail.mail@webmail03>
Message-ID: <0AE6D9CB-3E6B-4F00-B1B8-A8C95B6726B6@uni-trier.de>

Hi Joshua,

I knew you would BITE! But talking about ignorance:

The Facts:

1) Karen Lofstrom wrote on 1 March:

> On 3/1/08, Kionon wrote:
>
>> If I wish to add a public domain book to the project, and I actually desire to type it up by hand, is there any reason why I can't?
>
> Unfortunately, you can. I say "unfortunately" because it is close to certain that you are going to produce a flawed text. Since Project Gutenberg, at present, doesn't have any quality controls, it will accept your flawed text.
>

De facto she is saying do not do it because ... !!!

2) Robert wrote on 1 March

> On Sat, Mar 1, 2008 at 2:08 PM, Kionon wrote:
>
> That, and of course, I lack a scanner, and am not located within a reasonable radius of a location with which to obtain one (I am, in fact, not even in an English speaking country).
>
> You have to have access to a scanner (or find a scan online of the same edition) in order to use the copyright clearance process at copy.pglaf.org. I think there is an older process involving mailing in photocopies of the title/verso to MH, but I have never used it...
>

De facto if you do NOT have a scan ... No you can not do it. Yes, he was told how he might find a scan. Yet he MUST HAVE a scanned version.

Well Joshua, Kevin asked if there is any reason he can offer his work to PG. He did not ask if it is what DP wants. I personally do not like DP, even though it does good work and is the largest contributor to PG. DP does not have a monopoly. Though from your reaction one gets such a feeling.

regards
Keith.

On 03.03.2008, at 14:26, Joshua Hutchinson wrote:

> Don't be ignorant yourself.
>
> No one told him he COULDN'T do it. They merely told him why it is much much more likely to have problems.
>
> They also suggested seeing how DP does things to get a better idea of some "best practices".
>
> He even got suggestions on how to work around the problem that you need to have scans of the title and verso for clearance.
>
> Josh
>
> [snip, snip]

From ebooks at ibiblio.org Mon Mar 3 10:59:46 2008
From: ebooks at ibiblio.org (Jose Menendez)
Date: Mon, 03 Mar 2008 13:59:46 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To:
References:
Message-ID: <47CC4AA2.5050706@ibiblio.org>

On March 2, 2008, Bowerbird wrote:

> al said:
> > Bowerbird - can you supply a list of the books
> > you've submitted to PG? I'd like to have a look at them.
>
> just one, al. "the universe (or nothing)" by meyer moldeven. #18257.
I take it you mean this ebook, "The Universe -- or Nothing."

http://www.gutenberg.org/files/18257/18257.txt

> it was actually never published, so it was a "type-in" in the purest sense,
> which meant that i mostly did the _editing_ on it, not "proofing" per se...
> meyer is an old guy who wanted the future to have access to his story...
> when he posted of his intentions, i offered to help him do a submission.

I vaguely remembered when Mr. Moldeven posted about it on the Book People mailing list, and a quick search turned up his post in the BP archive.

"Seeking Internet Archive" (29 Jun 2005)
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-06-29,6

You replied to him the next day, saying "meyer, i converted your science-fiction piece into z.m.l. format a while back, so it should be acceptable in that form to project gutenberg..."

"re: Seeking Internet Archive" (30 Jun 2005)
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-06-30,2

Note the date of your BP post, June 30, 2005. Now, if we look at the PG ebook, we see "Release Date: April 25, 2006." That's nearly *ten* full months after your BP post. It's a good thing you don't use those inefficient DP workflows you're always criticizing. ;)

> indeed, if anyone can find anything i did wrong on it, i'd appreciate it...
>
> it is a completely normal book -- just chapter after chapter of text --
> so it wasn't like the _formatting_ of it was difficult. but since it hadn't
> been through the hands of a professional copy-editor at a publisher,
> it had lots of copy-editing glitches, so i had to write tools to detect 'em.
>
> but there was no one checking _my_ work, to see if i had made errors...

Well, back in late January of 2006, you asked me to check it, but I turned you down.

> so if anyone wants to do that, it would be nice. heck, even if you want to
> do it so you can poke me in the eye with a mistake, feel free to proceed...
A quick check of a simple word frequency list was enough to find a mistake. For instance, the ebook contains two occurrences of "accello-net" (whatever that may be) and one "accello-nets," but there's also one "accelo-nets." There's an "l" missing in that one.

The word frequency list also revealed a number of hyphenation inconsistencies, for example, "interregional" vs. "inter-regional," "mine-layer" vs. "minelayers," "multicolored" vs. "multi-colored," etc.

Now some may say that I'm just nitpicking about those inconsistencies, but I have a good reason. Back in mid-January of 2006, I made an ebook of Willa Cather's "My Ántonia." Since Bowerbird had often taunted Jon Noring publicly about how long it was taking him to finish his version of the same book, I emailed Bowerbird, Jon, and David Rothman about mine. In the ensuing discussion, Bowerbird criticized my version because I had retained similar hyphenation inconsistencies that were in the original paper book, e.g. "grain-sack" vs. "grainsack" and "oil-cloth" vs. "oilcloth." Bowerbird told me forcefully and at great length that I should fix those inconsistencies. He even said that he'd fixed them in his own version of "My Ántonia." So I was surprised to see so many similar inconsistencies in this ebook that he submitted to PG, especially since he submitted it *after* that lengthy discussion about the hyphenated words in Cather's book.

Jose Menendez

P.S. A few checks also revealed a number of punctuation errors. For example:

"What now?", Zolan asked.
That should be
"What now?" Zolan asked.

"Not much choice." Brad replied in a whisper.
That should be
"Not much choice," Brad replied in a whisper.

"Don't count on it." Ram replied grimly.
That should be
"Don't count on it," Ram replied grimly.

From Catenacci at Ieee.Org Mon Mar 3 11:32:31 2008
From: Catenacci at Ieee.Org (Onorio Catenacci)
Date: Mon, 3 Mar 2008 14:32:31 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <433504138.170011204550885503.JavaMail.mail@webmail03>
References: <433504138.170011204550885503.JavaMail.mail@webmail03>
Message-ID:

On Mon, Mar 3, 2008 at 8:28 AM, Joshua Hutchinson wrote:
> Robert,
>
> That's a fairly common thing. A *huge* majority of our books don't have scans to check against. Most would prefer we had them, but especially for our older stuff, we don't have anything.
>

Ah. My bad for making an unjustified assumption. :-)

--
Onorio Catenacci III

From Bowerbird at aol.com Mon Mar 3 12:24:19 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 3 Mar 2008 15:24:19 EST
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID:

oh gee, lookee here, _jose_menendez_ has made an appearance! great to see you jose! even though i know you're here to razz me.

***

jose said:
> Note the date of your BP post, June 30, 2005.
> Now, if we look at the PG ebook, we see "Release Date: April 25, 2006."
> That's nearly *ten* full months after your BP post.
> It's a good thing you don't use those inefficient DP workflows
> you're always criticizing. ;)

notice the smiley there, folks. that means jose is "just kidding". but he has a good point. just exactly why _did_ it take so long?

well, the answer to that is pretty simple. meyer was still making _changes_ to his book. like many authors, he kept rewriting it... however, _unlike_ most editors, i didn't impose a deadline on him. so that accounted for a good chunk of that time.

at some point, though, he did tire of the rewriting, and "finished". of course, by that time, i had other things on my plate, so it took me a little while to get back to it. and then we did copy-editing. and then we did more copy-editing. and then we did even more. if you've ever copy-edited a "raw" book, you know it takes time...

and then, when _that_ was done, i fully intended to demonstrate the .pdf and .html conversion possibilities, but still had to program them.
i'm not one of those disciplined programmers, who can make myself code "on-demand". i have to wait for "the inspiration". and it wasn't all that forthcoming. so finally meyer wrote me, after a heart-attack, saying "i don't know if i'm long for this world; can we post my book?", so i did. thank goodness, as far as i know, he's still alive and kicking...

and _that's_ why it took 10 months. actually, i would have guessed that he waited at least that long just for me to do the programming, so if that was the _total_ time, then i'm a little bit surprised...

> Well, back in late January of 2006, you asked me to check it,

because you're one of the best at finding errors, jose, and i know it. so i figured that if you couldn't find an error, then _nobody_ could...

> but I turned you down.

yeah. you never were one to do me a favor, were you? ;+)

> A quick check of a simple word frequency list
> was enough to find a mistake. For instance,

notice that "for instance", folks. translated into jose, that means "here's one of the errors i found." the _unspoken_ part of that is that he has found _more_, he just ain't gonna tell you, not yet...

> the ebook contains two occurrences of "accello-net"
> (whatever that may be) and one "accello-nets," but there's
> also one "accelo-nets." There's an "l" missing in that one.

i don't even know what an "accello-net" is. _or_ an "accelo-net". or the difference between them. or if there is a difference. :+)

> The word frequency list also revealed a number of hyphenation
> inconsistencies, for example, "interregional" vs. "inter-regional,"
> "mine-layer" vs. "minelayers," "multicolored" vs. "multi-colored," etc.

another good catch! as i said, meyer did a lot of rewriting on this. so it's obvious that i needed to do the hyphenation checks again... but it's not like i didn't do them a half-dozen times before that... or maybe my hyphenation checks just weren't too good back then. who knows, maybe they're not even too good _now_. or maybe so.
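A hyphenation check of the kind being discussed is simple to sketch: build a word-frequency list and flag every word that occurs both hyphenated and closed up. This is an illustrative sketch (the function name is invented), not bowerbird's tool or DP's, and it deliberately misses singular/plural mismatches like "mine-layer" vs. "minelayers", which would need stemming on top.

```python
import re
from collections import Counter

def hyphenation_inconsistencies(text):
    """Pairs like ('multi-colored', 'multicolored') where both forms
    occur in the text.  Case-folded; purely a frequency-list check, so
    inflected mismatches are not caught."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    freq = Counter(w.lower() for w in tokens)
    return sorted((w, w.replace('-', '')) for w in freq
                  if '-' in w and w.replace('-', '') in freq)
```

Run over a whole text, this surfaces exactly the "interregional" vs. "inter-regional" class of inconsistency from a single pass, which is why a word-frequency list is such a cheap first-line check.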
> Now some may say that I'm just nitpicking about those inconsistencies

people who said that would be _wrong_, as far as i'm concerned... and _i'm_ certainly not going to be one of those people who says it. i _want_ to know about any inconsistencies, and eliminate them. so i don't consider it to be "nitpicking", jose, not in the slightest... they might not be a _serious_ error -- ok, they're _not_ a serious error -- but they are an error nonetheless, and i want _all_ errors to be corrected. so i am _deeply_ appreciative to you for bringing them to my attention... now, how about the _other_ errors you found... :+)

> Since Bowerbird had often taunted Jon Noring publicly
> about how long it was taking him to finish his version
> of the same book, I emailed Bowerbird, Jon, and David Rothman
> about mine. In the ensuing discussion, Bowerbird criticized
> my version because I had retained similar hyphenation inconsistencies
> that were in the original paper book, e.g. "grain-sack" vs. "grainsack"
> and "oil-cloth" vs. "oilcloth."

i'm quite sure i didn't _criticize_ you for retaining them, jose. the decision about whether to _retain_ such inconsistencies is one that can go either way... since you consider yourself to be _replicating_ the p-book, your decision would be to retain them. since i consider myself to be _republishing_ the p-book, i fix 'em. different strokes for different folks; that's what makes a horse race.

> Bowerbird told me forcefully

you might have interpreted my posts as being "forceful" -- probably because of the strength of the logic -- but that's your interpretation.

> and at great length

well, i do go on for a while. but you're just jealous, jose, because you're a 2-finger typist who can't type nearly as fast as he thinks... :+) (have you tried voice-recognition apps? i hear they're good now.)

> that I should fix those inconsistencies.

well, no.
the nature of the discussion would have revolved around the general issue of whether it is better to _replicate_ or _republish_, not around the consequent issue of whether to keep inconsistencies.

> He even said that he'd fixed them
> in his own version of "My Ántonia."

right. because that's what a republisher _should_ do...

> So I was surprised to see so many similar inconsistencies
> in this ebook that he submitted to PG, especially since he
> submitted it *after* that lengthy discussion about
> the hyphenated words in Cather's book.

there's absolutely no question those are errors that need to be fixed. thank you for showing them to me. like i said, you da best... :+)

> P.S. A few checks also revealed a number of punctuation errors.
> For example:
>
> "What now?", Zolan asked.
> That should be
> "What now?" Zolan asked.

really? i would say the comma is correct. if it's not, i'll need to make a check for it...

> "Not much choice." Brad replied in a whisper.
> That should be
> "Not much choice," Brad replied in a whisper.
>
> "Don't count on it." Ram replied grimly.
> That should be
> "Don't count on it," Ram replied grimly.

i'll have to check the context, but i would agree those look wrong. thing is, i don't see how i can automate a check for that. do you? i didn't proofread this book. meyer said other people had done that. i only subjected it to my automated tests. now, i suppose that i could locate every occurrence of "period-quotemark-space-name-replied". but that would fail on occurrences of "responded" instead of "replied", or "said" or "snorted" or "taunted" or any of a number of similar terms. i'd also guess that test will turn up too large a number of false alarms. so it doesn't seem to me to be a _practical_ test to include in my tool. however, if someone suggests a better way for me to phrase the test -- and feel free to use regex if it makes it possible for you to do it -- by all means, please show off your cleverness and share it with me...
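One way to take up that invitation without enumerating verbs: flag a period immediately before a closing straight quote when it is followed by a capitalized word and then a lowercase word. That shape catches "replied", "said", "snorted" and the rest alike, while leaving '?" Zolan asked' alone. It is only a suspect-finder, not a fix: legitimate narrative such as '"Stop." Brad turned away.' will trigger it too, so a human still has to judge each hit. A sketch, not gutcheck's actual behavior.

```python
import re

# Suspect pattern: period + closing quote, then a Capitalized word and a
# lowercase word, as in '"Not much choice." Brad replied'.  Question marks
# and exclamation points before the quote are deliberately not flagged.
SUSPECT_TAG = re.compile(r'\."\s+[A-Z][a-z]+\s+[a-z]+')

def suspect_dialog_tags(text):
    """Return the suspect spans for a human to review."""
    return SUSPECT_TAG.findall(text)
```

The corrected comma form ('"Not much choice," Brad replied') passes clean, so running the check before and after a round of fixes shows exactly which suspects remain.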
i'm curious, does gutcheck find this type of error?

-bowerbird

p.s. also, just for the record, and so everyone is absolutely clear here, i didn't _force_ any changes on meyer. didn't even make them for him, for the most part. i'd just send him a list of "stuff that i would change", and he'd either make the changes or not, depending on his own mind... even if something was "wrong" grammatically, if he _wanted_ it that way, that's the way it stayed. when you have a living author, you have _zero_ difficulty determining "the intent of the author", so i gave him free rein. i do not think he'd be stubborn about fixing the errors reported above, so i'm not offering that here as an _excuse_, because it does not apply, but i felt the need to say it to set the record straight... and jose, thanks!

**************
It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080303/fa89c54d/attachment.htm

From prosfilaes at gmail.com Mon Mar 3 16:06:36 2008
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 3 Mar 2008 19:06:36 -0500
Subject: [gutvol-d] The Old Fashioned Way...
In-Reply-To: <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de>
Message-ID: <6d99d1fd0803031606n14fdf485g768e2eaa18f96d7@mail.gmail.com>

On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. wrote:

> Just followed this thread and I ask how ignorant can people
> get??

If you define ignorant as disagreeing with you, very. But I think that's an overly parochial definition.

> If somebody wants to contribute let them. If they want to do them by
> hand
> then all the more power to them.

Each book in Project Gutenberg reflects on the quality of the whole.
Not only that, Novel by Joe Shmoe getting posted will stop most other people from working on it at all, which means that a poor-quality edition will stop a high-quality edition from being posted. From my perspective, that's motivation to encourage people to submit only high-quality copies to PG.

> To that is proof enough that single persons can be proficient
> enough.

"I know a person who is perfect at this" is hardly proof; it's barely even an argument. We've all seen the opposite; DP is proofing the motion picture copyright filings, and is finding that the original typing has left several errors a page. To achieve the results that DP is achieving, most companies have two typists independently type out the text.

> Please do not
> bang on those who are willing to good OLD FASHIONED HANDY WORK !!

Hard work frequently isn't a substitute for using the right tools and right knowledge. The man who picks up a hammer one day and starts building houses for people may be altruistic, but without the right knowledge, he's also endangering lives.

> Well Joshua, Kevin ask if there is any reason he can offer his work
> to PG. He did not ask if it is what DP wants.

DP doesn't want anything; the people who work with DP do. And most of them want PG to be the greatest it can be. Perhaps we assume, naïvely perhaps, that other people would share that goal.

> I personally do not like DP,

Didn't you just say

> Please do not
> bang on those who are willing to good OLD FASHIONED HANDY WORK !!

?

From Bowerbird at aol.com Tue Mar 4 11:41:51 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 4 Mar 2008 14:41:51 EST
Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people
Message-ID:

robert said:

> how will anyone be able to check his electronic text
> against the original without scans?

the same way we check the other books without scans, by finding a copy of the p-book or finding a scan-set...
by the way, any progress on the process of uploading the scans from d.p. to p.g.? c'mon folks, get that done. if you can't do it yourselves, i'll be happy to do it for you, working from the p.g. side, if michael and greg approve.

***

keith said:

> If somebody wants to contribute let them.
> If they want to do them by hand
> then all the more power to them.

i think the general message that he _could_ contribute got through to kevin. (but perhaps kevin could tell us?) yes, the impression that he had to present a scan of the titlepage and verso was misleading, but robert also was fairly quick to correct himself and give kevin an option...

people were mostly concerned about the quality issue. and yeah, that's kind of a red herring, because there are people (like your girlfriend) who can do excellent quality, and we have no idea about the nature of kevin's skills here. (although if he's willing to do it, he's likely not a bad typist.) still, it's probably good to sensitize people to that issue...

however... a recommendation to use the gutcheck tools is not good. first and foremost in this specific case, they are geared to the mistakes typical to the process of o.c.r., _not_ typing... typists don't make he/be errors, or confuse o/c, m/rn, etc. a wordprocessor's spellchecker fixes _typing_ errors fine...

furthermore, and perhaps this issue is really _foremost_, those tools are exceedingly difficult for people to install. some of them also require the installation of _libraries_... continuing in this vein, those tools are _not_ easy to use. as script-based tools that generate lists of potential errors, they are out-of-touch in a world now centered on the g.u.i. now, obviously, if you installed these tools years ago, and you've been using them for years, you will not be sensitive to these concerns. but the _average_ person -- instructed to use them before submitting a book -- might well give up on even _doing_ that book instead...
besides, whitewashers will run those tools on the book that kevin submits anyway. it's not like it goes undone... so, instead, i offered my tools to kevin. they require no installation. they're easy to operate, with a g.u.i. on 'em. moreover, they do the job, as well as those other tools... you can bet that if they _didn't_, people would already have pointed out to me errors in the book i submitted. after all, how long does it take to run checks on a book? they ran them, and found that my submission was clean. and if you don't believe me, i dare you to run 'em yourself. (and, once again, my thanks to jose for reporting errors...)

***

conspicuous by its absence, in contrast, was any offer from the d.p. people to say "we will check your book for you..." after all, they've already installed those checker-programs, and become experienced with using them. they didn't step up and offer to help this new volunteer do what _he_ wants to do, which is to have fun typing in a book manually. instead, they advised him to do what _they_ want him to do, which is to become a part of their group. if he did join d.p., they'd stick him in the p1 round, proofing, and then maybe, after enough time on-site and enough p1 pages, he could _apply_ to take a _test_ which -- if he passed it -- would let him "graduate" up to p2 proofing. whoop-dee-doo...

this insensitivity, to the actual question which was being asked, is probably what made you angry, keith, in my humble opinion. the person who re-contacted me on friday, who i had advised to check out d.p., quit them because they told him he could _not_ process his own book, since he didn't have enough experience, at least if i have understood him properly. it's kinda sad, isn't it?

***

i took issue with the suggestions to join d.p. because learning their workflow does more harm than good... that person who had followed my (former) advice to "join d.p. and observe how they do things over there" cut off runheads-and-pagenumbers when he did o.c.r., as per d.p. policy, which i've informed him is a bad idea. pagenumbers help you know where you're at in a book, and the runheads can be deleted later, automatically... he'd saved the file as a text-file, as per d.p. policy, so i had to tell him to do the o.c.r. again and save as .rtf. his book is full of styling. why throw all of that away? he rejoined end-of-line hyphenates, as per d.p. policy, so i told him to switch that up when he re-did the o.c.r.

d.p. policy just plain _sucks_, on the full range of issues. i've already described their policy on ellipses, which is as wrong as a policy could be on that particular subject. they have proofers _changing_ what was in the p-book, just to implement their judgment-call-required policy. they "clothe" end-line em-dashes, meaning they bring up a word from the following line, which is totally ridiculous, since it creates a super-long "word" (i.e., it joins 2 words), which _exacerbates_ line-wrapping problems. so stupid. i have recently discussed here their _filenaming_ silliness, and will be continuing with that discussion momentarily... d.p. pseudo-markup is a put-it-in-take-it-out exercise. and i won't even begin to discuss the t.e.i. foibles again...

at one choicepoint after another, d.p. takes the wrong fork. wrong. wrong. wrong. consistently... sometimes it seems as if they are actually _trying_ to handicap their volunteers... (i don't think so. but i _have_ heard the view espoused that it would be a good thing to try and slow down the p1 round, so as to "lessen the backlog it creates for the rest of the site", which is patently ridiculous. but still, i think it's incompetence accounting for awful decisions there, the same incompetence that makes it impossible for the leaders to craft a conspiracy.)

further, in addition to the _policy_ problems, another level of incompetence rears its ugly head at _implementation_time_...
for years now, d.p. tolerated some truly _awful_ page-images (badly-done, crooked, inconsistently placed on a canvas, etc.). although this appears to have improved recently -- because they are using scan-sets from the big scanning projects?, or maybe because i ragged on them so much about this -- there are still plenty of bad scans remaining in the system... another thing i have harped on ever since i encountered d.p. was their _negligence_ in performing any post-o.c.r. clean-up. o.c.r. errors often happen in repeated fashion through a book. it's not unusual at all to find _the_very_same_scanno_ occurring time after time after time. these problems can be quickly and easily corrected with one global change across the document. as just one example, old books often had "spacey" punctuation; that is, there was a space inserted before a comma or a period. it makes no sense to have humans delete these spaces manually, when they can be removed across an entire book automatically... in this regard, too, thanks to my constant haranguing on this, they've gotten better. but their performance still sucks badly. for instance, although they usually tend to remove the space in front of commas and periods, when i examined their "test" of "perpetual p1" over this last week, i discovered that they had _not_ closed up the space in front of the ellipses in that book... so proofers had to do that manually. to my mind, that shows that "the powers that be" (as they call themselves) who _run_ the place don't have the intelligence to insist that people who _prepare_ the texts show some consideration and respect for the time and energy of the people who are doing the proofing. the response -- when i've made this point in the past -- is that "the content providers are volunteers, and we can't force them to do what they don't want to do". pardon me? that's garbage. why do you allow _some_ volunteers to place an unfair burden on the shoulders of _other_ volunteers? 
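the kind of one-pass global cleanup being argued for here is genuinely a few lines of code. a sketch only -- whether to close the space before ellipses is a house-style call, and a real tool would want to review each change:

```python
import re

def despace_punctuation(text):
    """close up the space old typesetting put before punctuation,
    across an entire book in one pass."""
    text = re.sub(r' +([,;:!?])', r'\1', text)  # "word ," -> "word,"
    text = re.sub(r' +(\.\.\.)', r'\1', text)   # "word ..." -> "word..."
    return text

def fix_scanno(text, wrong, right):
    """one global change for a scanno that repeats through a book."""
    return text.replace(wrong, right)

print(despace_punctuation("Wait , he said ; then ..."))
# -> Wait, he said; then...
```

run once by whoever prepares the text, this removes thousands of manual corrections from the proofers' plates.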
if a job can be done _efficiently_ and _automatically_ and _simply_ by one volunteer, why would you instead insist that the job be done by _many_ volunteers who can only do it _inefficiently_ and _manually_ and _with_relatively_much_more_difficulty_? i'm _positive_ that if we gave the content providers the choice between making one global change or literally _thousands_ of manual changes, they would choose the global change. how does the equation change when _other_ people are doing the work? it doesn't...

and -- believe it or not -- it gets even worse. in that same "perpetual p1" experiment, the pre-processor had accidentally changed all 1,137 of the em-dashes to en-dashes. did they go back and fix that disastrous mistake? they did not. they just sent the badly disfigured text out for the proofers to fix. that's totally inexcusable. and sadly, it is _not_ that uncommon... all kinds of incompetence are routinely dumped on proofers to fix.

d.p. simply does not respect the time and energy of its volunteers. it's a good thing those volunteers don't realize the extent of this, or they would leave in droves... as it is, i think the _intelligent_ proofers are leaving as individuals, quietly, without making any fuss. after all, how long would _you_ continue to close up those ellipses, or correct those em-dashes, on an individual basis, before you became bored outta your skull? or how long before you spoke up to say, "um, there's a better way"?

***

finally, the d.p. forums are a _fascinating_ example of groupthink... time after time, the correct answer to a problem goes unrecognized, and even occurrences of _that_ are rare, since it seldom surfaces at all. there is a _huge_ propensity to make everything far too complicated, and a strange unwillingness to even _experiment_ with simple solutions. they've convinced themselves digitization is a _difficult_undertaking_, and do not seem to want to be presented the reality that it is _not_...
if you want some evidence, you need look no further than this page:

> http://www.pgdp.net/w/index.php?title=Confidence_in_Page_analysis

that wikipage is accompanied by an 11-page thread in the forums:

> http://www.pgdp.net/phpBB2/viewtopic.php?p=431333#431333

they're trying to come up with a way to determine if a page is "done". they've got a lot of gobbledygook there, but not many solid results. or take a look at this 4-page thread on the simple matter of filenames:

> http://www.pgdp.net/phpBB2/viewtopic.php?t=32038

after 60 messages on this topic (not to mention _many_ threads where this topic was discussed previously), they're now thoroughly confused, and are actually heading down a path going in the wrong direction... but to see how _really_ convoluted things can get, do a forum search on "wordcheck", and check out the huge threads that were generated. they ended up with a flawed-but-acceptable checker out of that mess, but one that still doesn't respect the time and energy of the proofers... nonetheless, it's better than the on-site spellchecker used previously, so awful it didn't even have the capacity to add words to its dictionary, which meant that for many unique words in a book (like the _names_ of characters), the proofers had to see _every_occurrence_ be flagged.

***

consider all of this -- and i mean _all_ of it -- and it's easy to see why i believe that d.p. doesn't respect its volunteers sufficiently... and that's why i can no longer in good faith send people over there. in fact, i recommend that p.g. take the banner off the web-portal that recommends that visitors go over to distributed proofreaders, until d.p. cleans up its act...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080304/24165b50/attachment-0001.htm From ebooks at ibiblio.org Tue Mar 4 13:05:54 2008 From: ebooks at ibiblio.org (Jose Menendez) Date: Tue, 04 Mar 2008 16:05:54 -0500 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: References: Message-ID: <47CDB9B2.7010404@ibiblio.org> Bowerbird wrote: > oh gee, lookee here, _jose_menendez_ has made an appearance! > great to see you jose! even though i know you're here to razz me. If I'd wanted to "razz" you, I would have replied to one or more of your recent posts about file-naming. :) You see, now and then, I like to check on what sites link to my ebooks. Some time back, I saw that there were links from the MobileRead Forums, specifically from this post you made in a thread entitled "What 'Cleaning Up' Do Project Gutenberg Texts Need." http://www.mobileread.com/forums/showpost.php?p=112962&postcount=86 In it you linked to my Einstein, Geronimo, and Cather digital reprints. (Oddly enough, you didn't link to my reprint of Mabie's "Books and Culture," but that's irrelevant.) Here's a brief excerpt from your MobileRead post: > here's another digital reprint, this time geronimo's life story: >> http://www.ibiblio.org/ebooks/Geronimo/GerStory.pdf > compare any .pdf page with its scan by using this template: >> http://z-m-l.com/go/geron/geronp001.jpg > (as before, replace "001" with the page-number you want.) > by the way, google's scan-set from this book is the _worst_ > job of scanning a book that i have ever seen from them... > it's worth downloading just for its humor as a bad example. Your comments about the quality of Google's scan-set surprised me, because the page images I had looked at were pretty good. So I followed the link you'd given to your website: http://z-m-l.com/go/geron/geronp001.jpg Much to my surprise, I saw an image of a half-title page, which is definitely not page 1 of the book. Hmmm... 
Next I tried to look at page 145 with this URL: http://z-m-l.com/go/geron/geronp145.jpg The scan for page 119 came up instead. Uh oh! So then I tried the URL that should have shown page 119: http://z-m-l.com/go/geron/geronp119.jpg Page 99 came up in its place. Oops! So I tried this URL for page 99: http://z-m-l.com/go/geron/geronp099.jpg Page 83 came up instead. I finally did find the scan for page 145, using this URL: http://z-m-l.com/go/geron/geronp173.jpg It's a good thing you have "a tool that does this file renaming _automatically_"; otherwise, those scans might have had the wrong file names. ;) By the way, the scans on your website do look bad, but here are links to the same page scans in Google Book Search, and they look considerably better: http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA83 http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA99 http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA119 http://books.google.com/books?id=EM6nHWWQ3TIC&pg=PA145 Jose Menendez From Bowerbird at aol.com Tue Mar 4 13:48:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 4 Mar 2008 16:48:32 EST Subject: [gutvol-d] The Old Fashioned Way... Message-ID: i told you jose wouldn't tell me about the other errors he found. so i'll just have to see if i can lure it out of him some other time. jose said: > If I'd wanted to "razz" you, I would have replied to > one or more of your recent posts about file-naming. :) there will be more of them coming soon, so you will have additional chances to jump in on this matter if you wish... :+) but... the version of geronimo that's up on my site right now is _not_ a finished version -- precisely because the book was so badly-done that some pages are totally missing... and, if i remember correctly, other pages are duplicated, sometimes several times. like i said, worst google book i have seen yet, which is quite an accomplishment, really. i think the guy who did it must've been drunk as a skunk. 
that's what makes this book "badly-done". yes, the quality of the images that _are_ there is suitable... but what good does that do if the scan-set is incomplete? anyway, i keep checking back to see if they've redone it. or if the o.c.a. has done it. or if _anyone_ has done it... > I followed the link you'd given to your website: > http://z-m-l.com/go/geron/geronp001.jpg > Much to my surprise, I saw an image of a half-title page, > which is definitely not page 1 of the book. Hmmm... > Next I tried to look at page 145 with this URL: > http://z-m-l.com/go/geron/geronp145.jpg > The scan for page 119 came up instead. then the files are obviously using the filenames based on pagenumbers associated with them from the google .pdf, which don't account for unnumbered plates in that book... so i must've uploaded the images before i renamed them. i can correct them pretty easily, with my file-renaming tool. that's what happens with badly-named files, occasionally, is that they get put into a production stream erroneously. that's why you should give 'em the right names right away. > It's a good thing you have > "a tool that does this file renaming _automatically_"; > otherwise, those scans might have had the wrong file names. ;) yes, it _is_ a good thing i have such a tool. it's even better when i remember to use it. :+) oh, wait, you're trying to _imply_ that i don't even _have_ such a tool, aren't you? what, do you think that i would rename all these files _manually_? maybe with 2 fingers? no, let me assure you that i do indeed have such a tool. in fact, over the years, i've written many different versions of it, including the one that i started just last week, for d.p. you might remember that i offered to write one for them... they didn't accept the offer, but i wrote it anyway (big deal) and i'll be releasing it regardless. but more on that _later_. > By the way, the scans on your website do look bad i probably pulled them out of the .pdf at a low resolution. 
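the renaming fix described above -- shifting a run of scan filenames by a fixed offset so the numbers match the printed pages -- can be sketched like this. the "geronp" prefix and 3-digit zero-padded numbering are guesses at the scheme; the actual offsets caused by the unnumbered plates aren't given in this thread, so the demo values below are made up:

```python
import os, tempfile

def shift_names(directory, prefix, start, count, offset):
    """rename prefixNNN.jpg files so their numbers reflect the printed
    page, shifting scans [start, start+count) by offset. renames go
    through temporary names first, so an overlapping shift (new name ==
    some other file's old name) can't clobber a file."""
    moves = []
    for n in range(start, start + count):
        old = os.path.join(directory, f"{prefix}{n:03d}.jpg")
        new = os.path.join(directory, f"{prefix}{n + offset:03d}.jpg")
        if os.path.exists(old):
            moves.append((old, new))
    for old, new in moves:          # phase 1: move everything aside
        os.rename(old, old + ".tmp")
    for old, new in moves:          # phase 2: settle into final names
        os.rename(old + ".tmp", new)
    return moves

# demo on throwaway files
demo = tempfile.mkdtemp()
for n in (1, 2, 3):
    open(os.path.join(demo, f"geronp{n:03d}.jpg"), "w").close()
shift_names(demo, "geronp", start=1, count=3, offset=2)
print(sorted(os.listdir(demo)))
# -> ['geronp003.jpg', 'geronp004.jpg', 'geronp005.jpg']
```

a real run would need one such shift per plate-induced gap, which is why the numbers should be checked against the scans before uploading.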
my prechecking showed me that i couldn't finish the book -- i was hoping to repurpose your clean text in z.m.l. -- because it was incomplete, so i did a quickie on the scans, just so i could point people to your excellent geronimo .pdf. evidently, my "quickie" was a bit _too_ quick, if the filenames were incorrect. but nobody reported that error. until _you_. like i always say, jose, you're the best. :+)

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080304/41e899a5/attachment.htm

From ajhaines at shaw.ca Tue Mar 4 14:33:29 2008
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 04 Mar 2008 14:33:29 -0800
Subject: [gutvol-d] The Old Fashioned Way...
References:
Message-ID: <001c01c87e47$c52197b0$6401a8c0@ahainesp2400>

Several more errors, all found with one of the previously maligned Gutcheck/Jeebies/Gutspell trio:

"fromtheir" - a simple spellcheck would also have found this one.

There's a Chairman variously named Stabar, Straber, and (twice) Staber. The first two were in the same paragraph! A spellcheck would have flagged these, too.

"caroomed" - a spellcheck would have found this one, but in a case like this, where I can't decide if it's the author's intent, or a typo/spello, I leave the word alone, and add a short transcriber's note with what I think is correct, e.g. "...caroomed [Transcriber's note: caromed?] ..."

This line, cited earlier in this thread as incorrect, is definitely incorrect in its context:

> "What now?", Zolan asked.

However, given that there are too many ways in which the question/quote/comma sequence *is* correct, there's probably no way anything short of a full-blown grammar/syntax/context checker could declare a given case of the sequence correct or incorrect. Ditto for exclamation/quote/comma.
And even then, *I* wouldn't take such a utility's word for it. (Many years ago (well before Windows), I ran my autoexec.bat file through a grammar checker, and was told it was readable, but dry. Maybe such checkers are better now, but the grain-of-salt principle still applies.) My take on this submission? Given the number of errors/inconsistencies found with assorted utilities (I've lost count, but at least 6-8 items, I think, so far), I can only assume there are others, possibly findable only with a proper proof-reading. If this had been my submission, 6-8 errors is 6-8 too many, and I would not have submitted as it stands. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Monday, March 03, 2008 12:24 PM Subject: Re: [gutvol-d] The Old Fashioned Way... oh gee, lookee here, _jose_menendez_ has made an appearance! great to see you jose! even though i know you're here to razz me. *** jose said: > Note the date of your BP post, June 30, 2005. > Now, if we look at the PG ebook, we see "Release Date: April 25, 2006." > That's nearly *ten* full months after your BP post. > It's a good thing you don't use those inefficient DP workflows > you're always criticizing. ;) notice the smiley there, folks. that means jose is "just kidding". but he has a good point. just exactly why _did_ it take so long? well, the answer to that is pretty simple. meyer was still making _changes_ to his book. like many authors, he kept rewriting it... however, _unlike_ most editors, i didn't impose a deadline on him. so that accounted for a good chunk of that time. at some point, though, he did tire of the rewriting, and "finished". of course, by that time, i had other things on my plate, so it took me a little while to get back to it. and then we did copy-editing. and then we did more copy-editing. and then we did even more. if you've ever copy-edited a "raw" book, you know it takes time... 
and then, when _that_ was done, i fully intended to demonstrate the .pdf and .html conversion possibilities, but still had to program them. i'm not one of those disciplined programmers, who can make myself code "on-demand". i have to wait for "the inspiration". and it wasn't all that forthcoming. so finally meyer wrote me, after a heart-attack, saying "i don't know if i'm long for this world; can we post my book?", so i did. thank goodness, as far as i know, he's still alive and kicking... and _that's_ why it took 10 months. actually, i would have guessed that he waited at least that long just for me to do the programming, so if that was the _total_ time, then i'm a little bit surprised...

> Well, back in late January of 2006, you asked me to check it,

because you're one of the best at finding errors, jose, and i know it. so i figured that if you couldn't find an error, then _nobody_ could...

> but I turned you down.

yeah. you never were one to do me a favor, were you? ;+)

> A quick check of a simple word frequency list
> was enough to find a mistake. For instance,

notice that "for instance", folks. translated into jose, that means "here's one of the errors i found." the _unspoken_ part of that is that he has found _more_, he just ain't gonna tell you, not yet...

> the ebook contains two occurrences of "accello-net"
> (whatever that may be) and one "accello-nets," but there's
> also one "accelo-nets." There's an "l" missing in that one.

i don't even know what an "accello-net" is. _or_ an "accelo-net". or the difference between them. or if there is a difference. :+)

> The word frequency list also revealed a number of hyphenation
> inconsistencies, for example, "interregional" vs. "inter-regional,"
> "mine-layer" vs. "minelayers," "multicolored" vs. "multi-colored," etc.

another good catch! as i said, meyer did a lot of rewriting on this. so it's obvious that i needed to do the hyphenation checks again...
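the "simple word frequency list" check described above can be sketched in a few lines. note this only catches exact hyphenated/solid pairs; inflected pairs like "mine-layer" vs. "minelayers" would need stemming on top of it:

```python
import re
from collections import Counter

def hyphenation_inconsistencies(text):
    """report words that appear both hyphenated and closed up,
    e.g. "grain-sack" vs. "grainsack", with their frequencies."""
    # hyphenated tokens are tried first, so "grain-sack" stays whole
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)+|[A-Za-z]+", text.lower())
    freq = Counter(words)
    report = []
    for w in freq:
        if "-" in w:
            solid = w.replace("-", "")
            if solid in freq:
                report.append((w, freq[w], solid, freq[solid]))
    return report

sample = ("The grain-sack fell. A grainsack lay there. "
          "The oil-cloth and the oilcloth. A mine-layer and some minelayers.")
print(hyphenation_inconsistencies(sample))
```

which pair is the "right" form is then a judgment call for the transcriber, per the replicate-versus-republish argument above.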
but it's not like i didn't do them a half-dozen times before that... or maybe my hyphenation checks just weren't too good back then. who knows, maybe they're not even too good _now_. or maybe so.

[snip]

_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

From Bowerbird at aol.com Tue Mar 4 15:41:54 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 4 Mar 2008 18:41:54 EST
Subject: [gutvol-d] The Old Fashioned Way...
Message-ID:

al said:

> all found with one of the previously
> maligned Gutcheck/Jeebies/Gutspell trio:

ok, now let's not perceive "maligning" where none was done. i've already explained that my advice is based on the fact that these tools are difficult for average users to install, and to use. is there anyone who takes issue with that? because i'd be happy to point them to the forums over at d.p., where experienced users need to assist less-experienced ones, and these threads clearly indicate that these tools are not easy.
and i'm guessing they would be _quite_ formidable to users who have no digitizing experience _at_all_. (whether kevin is among those people or not, we have no real way of knowing.) besides, i'd already offered to help kevin clean up his work -- both by giving him my tools and by doing a check _myself_ -- so it's not as if i had left him out in the cold to freeze to death. why didn't anyone here offer to run "the trio" for him? > If this had been my submission, 6-8 errors is 6-8 too many, > and I would not have submitted as it stands. gee, al, you're hard-core! :+) and what does this say about the "planet strappers" test over at distributed proofreaders? the p1-p2-p3 process left 6 errors, and it looks like the 3 iterations of p1 will leave about 10 errors. this in a book that's smaller (388k) than the one i did (480k)... i'm just finishing up my post where i've written up the results, and i gave those proofers a pat on the back for a job well done. it's far more important to clean up books as errors are reported than to try to make them perfect from the outset, in my opinion. *** as for the errors in meyer's book, i'll get them cleaned very soon, and resubmit. it will give me a chance to include .html and .pdf... and _thank_you_ very much for taking a look at my work! if i can return the favor on a book of yours, let me know... (or should i instead perceive that you are "maligning" me?) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080304/135f8436/attachment-0001.htm From ajhaines at shaw.ca Tue Mar 4 16:43:57 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Tue, 04 Mar 2008 16:43:57 -0800 Subject: [gutvol-d] The Old Fashioned Way...
References: Message-ID: <000d01c87e59$ff6e0a90$6401a8c0@ahainesp2400> Granted, Gutcheck/etc take some command line know-how to get working, but it's well worth the effort, even if some hand-holding is required. I don't consider myself to be particularly hardcore, but there's no excuse for not finding and fixing such errors as were found in this particular submission. My personal standard is to submit an e-book with fewer errors in it than the original. Obviously, I'm biased, so whether that standard has been met or not I leave to others to judge. From Bowerbird at aol.com Tue Mar 4 16:49:59 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 4 Mar 2008 19:49:59 EST Subject: [gutvol-d] The Old Fashioned Way... Message-ID: about those geronimo files... i've discovered what my error was... i _should_ have pointed to _this_ directory: > http://z-m-l.com/go/gerst/ as you'll see, those files have been up since last november, and they are named _wisely_; the filenames relate to p-book pagenumbers, and unnumbered illustration plates stand out.
i discovered these _after_ i had uploaded a _new_ set of corrections to the names, now shown here: > http://z-m-l.com/go/geron/ i renamed the folder with the badly-named files: > http://z-m-l.com/go/geronbad/ oh, and josé, as for this page: > http://z-m-l.com/go/gerst/gerstp001.jpg i will routinely shuffle forward-matter pages to get the pagenumbers in sequence, as will many publishers when republishing a book... -bowerbird p.s. the google .pdf was missing some pages: > http://z-m-l.com/go/gerst/gerstp207.jpg > http://z-m-l.com/go/gerst/gerstp206.jpg > http://z-m-l.com/go/gerst/gerstp205.jpg > http://z-m-l.com/go/gerst/gerstp204.jpg > http://z-m-l.com/go/gerst/gerstp203.jpg > http://z-m-l.com/go/gerst/gerstp202.jpg > http://z-m-l.com/go/gerst/gerstp201.jpg > http://z-m-l.com/go/gerst/gerstp187.jpg > http://z-m-l.com/go/gerst/gerstp186.jpg > http://z-m-l.com/go/gerst/gerstp179.jpg > http://z-m-l.com/go/gerst/gerstp178.jpg > http://z-m-l.com/go/gerst/gerstf002.jpg From Bowerbird at aol.com Tue Mar 4 16:57:49 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 4 Mar 2008 19:57:49 EST Subject: [gutvol-d] The Old Fashioned Way... Message-ID: al said: > Granted, Gutcheck/etc take some command line know-how to get working, > but it's well worth the effort, even if some hand-holding is required. they might be good ways to find flaws in a digitization, i'd agree with that. but tools that work just as well, yet do _not_ require a complex installation, and which are easier to use (e.g., because they have a user-friendly g.u.i.) are -- in my opinion -- going to be superior, especially for the newbies...
> I don't consider myself to be particularly hardcore i want all books to be perfect. but even in criticizing d.p., i have said that a book which had _50_ errors in it was not particularly badly done. > but there's no excuse for not finding and fixing such errors > as were found in this particular submission. no excuses are offered, al. i'm just gonna go fix them... > My personal standard is to submit an e-book > with fewer errors in it than the original. this _was_ "the original", al. straight from the author's wordprocessor. and i can guarantee it was a _lot_ cleaner thanks to my helping him... -bowerbird From schultzk at uni-trier.de Wed Mar 5 02:02:32 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 5 Mar 2008 11:02:32 +0100 Subject: [gutvol-d] The Old Fashioned Way... In-Reply-To: <6d99d1fd0803031606n14fdf485g768e2eaa18f96d7@mail.gmail.com> References: <6d99d1fd0803020754u6c066fe0g839d8573b44a8872@mail.gmail.com> <54D4585A-D932-4B86-BCB4-63FF3D502BEB@uni-trier.de> <6d99d1fd0803031606n14fdf485g768e2eaa18f96d7@mail.gmail.com> Message-ID: <2BCDC783-216C-4464-8242-304B3079F724@uni-trier.de> Hi David, On 04.03.2008 at 01:06, David Starner wrote: > On Mon, Mar 3, 2008 at 3:10 AM, Schultz Keith J. trier.de> wrote: >> Just followed this thread and I ask how ignorant can people >> get?? > > If you define ignorant as disagreeing with you, very. But I think > that's an overly parochial definition. No. > >> If somebody wants to contribute let them. If they want to >> do them by >> hand >> then all the more power to them. > > Each book in Project Gutenberg reflects on the quality of the whole.
> Not only that, Novel by Joe Shmoe getting posted will stop most other > people from working on it at all, which means that a poor-quality > edition will stop a high-quality edition from being posted. From my > perspective, that's motivation to encourage people to submit only > high-quality copies to PG. Here you show what I mean by ignorance: work done by hand (aka type-in) is considered per se flawed and of poor quality. That is true ignorance. I say give Kevin a chance. YOU and EVERYBODY ELSE do not know whether Kevin actually produces high quality work. > >> To that is proof enough that single persons can be proficient >> enough. > > "I know a person who is perfect at this" is hardly proof; it's barely > even an argument. We've all seen the opposite; DP is proofing the > motion picture copyright filings, and is finding that the original > typing has left several errors a page. To achieve the results that DP > is achieving, most companies have two typists independently type out > the text. Like I said above, ignorance is presuming an outcome when it cannot be determined. It is proof that it is possible to produce high quality texts without a scanner. It is not proof that everybody can. >> Please do not >> bang on those who are willing to good OLD FASHIONED HANDY >> WORK !! > > Hard work frequently isn't a substitute for using the right tools and > right knowledge. The man who picks up a hammer one day and starts > building houses for people may be altruistic, but without the right > knowledge, he's also endangering lives. You never know! Some of the world's best artists never had a formal education in art!! This is more true of many writers of the past. Yes, most who do pick up a hammer do not know what they are doing. Also, I would trust most architects to actually build my house!! > >> Well Joshua, Kevin ask if there is any reason he can offer his work >> to PG. He did not ask if it is what DP wants. > > DP doesn't want anything; the people who work with DP do.
And most of > them want PG to be the greatest it can be. Perhaps we assume, naïvely > perhaps, that other people would share that goal. It was the contributors to DP that gave ignorant advice. > >> I personally do not like DP, > > Didn't you just say >> Please do not >> bang on those who are willing to good OLD FASHIONED HANDY >> WORK !! DP is not OLD Fashioned Handy Work. Yet, I thank you for proving my point on ignorance. You call it short-sightedness. regards Keith. From schultzk at uni-trier.de Wed Mar 5 02:09:14 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 5 Mar 2008 11:09:14 +0100 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: References: Message-ID: Hi Bowerbird, Thanx for elucidating more on my point. On 04.03.2008 at 20:41, Bowerbird at aol.com wrote: > robert said: > > how will anyone be able to check his electronic text > > against the original without scans? > > the same way we check the other books without scans, > by finding a copy of the p-book or finding a scan-set... > > by the way, any progress on the process of uploading > the scans from d.p. to p.g.? c'mon folks, get that done. > if you can't do it yourselves, i'll be happy to do it for you, > working from the p.g. side, if michael and greg approve. > > *** > > keith said: > > If somebody wants to contribute let them. > > If they want to do them by hand > > then all the more power to them. > > i think the general message that he _could_ contribute > got through to kevin. (but perhaps kevin could tell us?) > > yes, the impression that he had to present a scan of the > titlepage and verso was misleading, but robert also was > fairly quick to correct himself and give kevin an option... > > people were mostly concerned about the quality issue.
> (although if he's willing to do it, he's likely not a bad typist.) > > still, it's probably good to sensitize people to that issue... From kionon at animemusicvideos.org Wed Mar 5 04:09:15 2008 From: kionon at animemusicvideos.org (Kionon) Date: Wed, 5 Mar 2008 21:09:15 +0900 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: References: Message-ID: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> I got the message no one was going to stop me from contributing. On the other hand the list did not make a good first impression. From grythumn at gmail.com Wed Mar 5 05:18:03 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 5 Mar 2008 08:18:03 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> Message-ID: <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> Shrug. There are several resident trolls on the list, whom the list moderators refuse to censor; nothing most of us can do about it except block their email and try to respond to other messages in a reasonable way. What did you end up deciding about your book? You have several options open to you, if the.. varied.. responses have not made you decide not to bother. PG doesn't actually REQUIRE much, aside from some sort of proof that the book is in the public domain. I will say, from experience, that I would recommend working on something short and simple first (I didn't[1], and had to redo a lot of work before I got it right, even sending the book through DP.) and finding someone to help you with it; you never did say what the title of the book is or where (approximately) you are located. R C [1] The first book I scanned for PG (Although not the first one posted :) ) is A Rudimentary Treatise on Clocks, Watches, and Bells[2], http://www.gutenberg.org/etext/17576 [2] Or, as I like to call it, the Evil Clock Book.
On Wed, Mar 5, 2008 at 7:09 AM, Kionon wrote: > I got the message no one was going to stop me from contributing. > > On the other hand the list did not make a good first impression. From kionon at animemusicvideos.org Wed Mar 5 06:21:29 2008 From: kionon at animemusicvideos.org (Kionon) Date: Wed, 5 Mar 2008 23:21:29 +0900 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> Message-ID: <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> On 3/5/08, Robert Cicconetti wrote: > Shrug. There are several resident trolls on the list, whom the list > moderators refuse to censor; nothing most of us can do about it except block > their email and try to respond to other messages in a reasonable way. I was waiting until the squabbling ended, but since I was directly asked... > What did you end up deciding about your book? You have several options open > to you, if the.. varied.. responses have not made you decide not to bother. > PG doesn't actually REQUIRE much, aside from some sort of proof that the > book is in the public domain. Still have not made up my mind. > I will say, from experience, that I would recommend working on something > short and simple first (I didn't[1], and had to redo a lot of work before I > got it right, even sending the book through DP.) and finding someone to help > you with it; you never did say what the title of the book is or where > (approximately) you are located. I guarantee there are scans of what I wanted to do. I would be very surprised if there were not. I was planning to do some of the work by Virginia Woolf not already listed on PG.
As for my location, I'm roughly 40 minutes outside of Seoul, South Korea. You'd think I could buy a scanner at any of the dozens of electronics stores around my suburb, but so far that theory has proved false. I could get one from the center of Seoul, most notably Yongsan Electronics Market. However, I have no vehicle and that means two hours or more on the subway... carrying a scanner... Eugh. From steven at desjardins.org Wed Mar 5 09:08:18 2008 From: steven at desjardins.org (Steven desJardins) Date: Wed, 5 Mar 2008 12:08:18 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> Message-ID: <41fd8970803050908i69ad820crb47a3b26c74759a7@mail.gmail.com> On Wed, Mar 5, 2008 at 9:21 AM, Kionon wrote: > I guarantee there are scans of what I wanted to do. I would be very > surprised if there were not. I was planning to do some of the work by > Virginia Woolf not already listed on PG. Most of Virginia Woolf's work is not on PG because it's still under copyright in the United States. Much of her work is available from Project Gutenberg Australia. It's possible the works you're interested in have already been done. From Bowerbird at aol.com Wed Mar 5 09:27:55 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 12:27:55 EST Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people Message-ID: robert said: > Shrug. There are several resident trolls on the list interesting interpretation, robert. i'll put my posts against yours in any test of utility... and i'm confident the future will validate my point of view. now... at any rate, as i said, kevin, i'm willing to double-check your book, and scans will make the job extremely easy. 
so i encourage you to make up your mind to go for it... there's no deeper way to interact with a book than typing it. if a book is meaningful to you, keying it in will be satisfying. -bowerbird From creeva at gmail.com Wed Mar 5 11:09:13 2008 From: creeva at gmail.com (Brent Gueth) Date: Wed, 5 Mar 2008 14:09:13 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> References: <8893d7a30803050409q60201b05n15ebd55da6a0c7d7@mail.gmail.com> <15cfa2a50803050518r68ae820aib0d28c403a162de6@mail.gmail.com> <8893d7a30803050621l4ae58510t38c7508358129ea8@mail.gmail.com> Message-ID: <2510ddab0803051109y6a89eee7k6cad5f066253dbbb@mail.gmail.com> The mailing list is something that takes time to be a bit comfortable with. I joined in 2003 and only recently have been a bit more vocal. One thing I can say, the squabbling will never end. That being said, ignore the squabbles and move on with the people that do respond positively - don't ignore a difference of opinion - you just don't have to acknowledge it. From creeva at gmail.com Wed Mar 5 11:13:04 2008 From: creeva at gmail.com (Brent Gueth) Date: Wed, 5 Mar 2008 14:13:04 -0500 Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people In-Reply-To: References: Message-ID: <2510ddab0803051113hd6a3974n1b96081ae0389048@mail.gmail.com> Touchy that he assumes you - have a guilt complex? Actually I mostly agree with you bowerbird, I just hide in the foxholes more. I give opinions when I think they may make a difference - otherwise I keep my mouth shut and just read along.
I know I hold no "weight" with you guys, and that's fine - if I can contribute one little thing that is enough for me to stay reading and keep the involvement that I do. From Bowerbird at aol.com Wed Mar 5 13:14:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 16:14:56 EST Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: here are the results of the "perpetual p1" test at distributed proofreaders... the experiment was geared to see if running an e-text through p1 repeatedly would produce a text that was as clean as that from the regular d.p. workflow, which consists of a p1 round, followed by a p2 round (with "better" proofers), and then a p3 round (with the "best" proofers, as tested and certified by d.p.). the first thing to note is that the proofers did an excellent job on this book...
they caught numerous errors in the original p-book, not just the o.c.r. errors. in general, they should be congratulated on their fine job of proofing here... the results clearly show that repeated p1 produces text as clean as p1-p2-p3, and calls into question whether the "better" proofers are _really_ better at all... specifically, my analysis of the results shows 274 errors to begin with... this 274 number does _not_ include the changes that proofers had to make in order to repair the 1,137 em-dashes in this book, which were accidentally changed to en-dashes by inappropriate handling by the content preparer... neither does it include corrections of the 504 ellipses throughout the book, which had to be "closed up" and/or changed (unnecessarily) to 4 dots, since the first of those two tasks could have been attained with one global change, and the second is totally uncalled for. finally, it does not include 715 end-line-hyphenates which proofers had to rejoin, under d.p. policy, which is unnecessary, since the machine can do it; nor does it include 74 changes to "clothe" em-dashes, as per d.p. policy. some of those numbers might be off slightly, but the overall thrust is clear; compared to the 274 _real_ errors in this book which _needed_ to be fixed, there were over _two_thousand_ unnecessary changes that had to be made, according to d.p. policy. roughly 8 unnecessary changes for every real one. this is why the d.p. workflow is so inefficient, and disrespectful of proofers... one other note: since the proofers did such a good job of finding errors in the original p-book, i've included all of those in this results write-up... it's worth reminding ourselves, though, that this is "outside of the scope" of what we consider the job of the proofers to actually be, so _reward_ them for going the extra mile, and don't dwell on what they "missed"... not that they missed all that much, mind you. so let's take a good look... *** so, how did p1 do the first time around?
p1 removed 205 -- 75% -- of the 274 errors. laudable performance... *** so how did the normal workflow go after this kick-off by p1? p2 found 55 of the remaining 73 errors, a rate of 75%... again, laudable. p3 found 9 of the remaining 18 errors, a "measly" 50%... not so laudable. luckily, half of the 9 errors that p3 missed were auto-detectable... *** and how did the "perpetual p1" proofings go, in comparison? iteration#2 -- the second pass of the text through the p1 experiment -- found 55 of the remaining 73 errors, _exactly_ matching the p2 results... i2 found 40 of the same errors p2 had found, and 15 that p2 had missed. (likewise, p2 found 15 errors i2 had missed.) thus, i2's accuracy was 75%. iteration#3 -- the third cycling of the text through the p1 experiment -- is finishing as i post this, but they _almost_ matched p3 _exactly_ as well; the i3 people found 8 of the 18 errors, just 1 less than the p3 proofers... but while we're noting that the i3 proofers missed some errors, to be sure, the bright spot was that i3 also _found_ 3 new errors, which is surprising, since the "marines" of the d.p. proofers -- the p3 crew -- had missed 'em. so... what's remarkable here is that the p2 and i2 figures matched _exactly_, and the p3 and i3 figures were also _almost_identical_... it's kind of freaky. thus, again, no evidence that the p1 proofers are "inferior" in any way at all. *** curiously, a good percentage of the errors that were missed by the proofers would've been _easily_ detected by any respectable post-o.c.r. clean-up tool. which means they should've been eliminated before _any_ proofing was done. (as just one example here, there was an improper period located right in the middle of a sentence, a period which was not followed by a capitalized word. that's one of the most simple, and most predictable, tests that you can make.)
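[editor's note: the "improper period" test described in the parenthesis above can be sketched in a few lines. this is an illustrative guess at such a check, not taken from any actual post-o.c.r. clean-up tool; the function name and the crude abbreviation guard are assumptions:]

```python
import re

# Sketch of the simple test described above: a period followed by a
# lowercase word usually signals a stray period in mid-sentence.
# Requiring two lowercase letters before the period is a crude,
# illustrative guard against abbreviations like "Mr." or "e.g.".
STRAY_PERIOD = re.compile(r'[a-z]{2,}\.\s+[a-z]')

def find_stray_periods(text):
    """Return character offsets of suspect mid-sentence periods."""
    return [m.start() for m in STRAY_PERIOD.finditer(text)]

# find_stray_periods("He stopped. suddenly the light failed.") flags one spot;
# find_stray_periods("He stopped. Suddenly the light failed.") flags none.
```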
and proofers might well have caught even more of the mistakes, except they were probably fatigued by all of the unnecessary changes they had to make. distributed proofreaders needs to tighten up its post-o.c.r. pre-processing. again, compared to the 274 o.c.r. errors requiring proofer action, there were over _two_thousand_ totally unnecessary changes requiring proofer action... in my opinion, that's shameful. extreme streamlining is called for, quickly! *** in sum, the text coming from 3 rounds of p1 was not significantly different from the text that was produced by the p1-p2-p3 "normal" workflow at d.p. both versions of the text had approximately 5-10 errors remaining within... for a 150-page book like this one, that is quite an acceptable rate of errors. and by doing _5_ rounds, even those 5-10 remaining errors were detected, although -- in my opinion -- it's not worth the extra work to get that level. proofers routinely spent 2-5 minutes on a page, which is a _lot_ of time... of course, even after these _5_ rounds, there might well be additional errors. indeed, i spotted 2, just by accident, in the course of conducting this review. moreover, it appears a few of the "corrections" of original p-book "errors" might have been just a touch over-zealous. (if you're curious, the "errors" on page 86 look intentional in retrospect.) that can happen sometimes... at any rate, though, proofers have done an outstanding job on this book. and the p1 proofers proved that they can keep pace with the p3 marines... -bowerbird p.s. i'll have materials documenting this analysis on my site very soon... ************** It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080305/fdfcb1bf/attachment.htm From Bowerbird at aol.com Wed Mar 5 14:06:27 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 17:06:27 EST Subject: [gutvol-d] why i cannot in good faith recommend d.p. to people Message-ID: robert said: > Touchy that he assumes you - have a guilt complex? um, there's no "assumption" there at all, robert... these guys have been calling me a "troll" for years now. yet i put up meaty post after meaty post after meaty post, with zero response from them. except the ad hominem... if they had any logic to offer, they would. but they don't. so they throw in the occasional insult. best they can do... it doesn't bother me. the future will validate my input. and wonder why they had their heads up their butts... > I know I hold no "weight" with you guys _everyone_ holds weight in the marketplace of truth, robert... just make sure the tuning fork of truth hums when you speak. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080305/c6d5951d/attachment.htm From Bowerbird at aol.com Wed Mar 5 15:42:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 5 Mar 2008 18:42:01 EST Subject: [gutvol-d] it's comical Message-ID: it's downright _comical_ how twisted things get over at d.p. really, it's like a car full of clowns. very entertaining... :+) take a look at the filenaming thread -- already up to 4 pages: > http://www.pgdp.net/phpBB2/viewtopic.php?p=430126#430126 you might well remember that, last week or so, i offered to write a program they could use to name their files correctly. you might have also noticed that nobody accepted that offer.
thus, in that light, it's quite amusing to contrast how _quickly_ that _several_ d.p. people jumped in on a thread a while back to make the (bogus) claim that i am "unwilling to help them"... here was yet another offer to help. and again, it was ignored... just in case you're unfamiliar with it, at this point in the dance, i usually revoke my offer, and put my tool back up on the shelf. this time will be a bit different, though, for the simple reason that people who want to digitize books individually will need this tool, to save them grief from bad filenames assigned by other people... so i'm making this app generally available... *** this is a simple tool that you run against a folder full of image-files. it's built _specifically_ and _solely_ to rename the files in a scan-set. it renames sequentially-named files... for example, 001.png to 388.png might be renamed f001.png through f012.png for the forward-matter, then p001.png through p376.png for body-text. if there are unnumbered illustration plates in the middle of the book, you can specify them, and the renaming will take them into account... there's a screenshot of the interface here: > http://z-m-l.com/misc/ocr-renamer01.png you will notice that this screenshot was taken when i was renaming those "geronimo" files that jose posted a message about yesterday. you can observe that there were _unnumbered_ illustration plates after pages 18, 22, and 30, around which the renaming was done... as the screenshot shows, you can step through and _view_ files, so as to do a visible confirmation on the accuracy of the naming. this step-through capability helps ensure a scan-set is complete. and it's often extremely helpful -- even necessary -- to be able to view the files so that you can confirm what the filename should be. but very few file-renaming utilities give you the ability to do that... 
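[ed. note: the core of the renaming scheme described above -- sequential scans split into front-matter and body-text names -- is simple to express. here is a hypothetical Python sketch of the name mapping only, using the f001/p001 convention from the example; the unnumbered-plate handling of the real tool is omitted:]

```python
def plan_renames(total, front_count, ext="png"):
    # Map sequentially numbered scans (001.png .. NNN.png) to
    # fNNN.png for the front matter, then pNNN.png for the body text,
    # as in the example above. Unnumbered illustration plates, which
    # the real tool also handles, are left out of this sketch.
    plan = {}
    for i in range(1, total + 1):
        old = "%03d.%s" % (i, ext)
        if i <= front_count:
            plan[old] = "f%03d.%s" % (i, ext)
        else:
            plan[old] = "p%03d.%s" % (i - front_count, ext)
    return plan

# 388 scans with 12 pages of front matter, matching the example above:
renames = plan_renames(388, 12)
print(renames["001.png"], renames["013.png"], renames["388.png"])  # → f001.png p001.png p376.png
```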
you can use the cursor keys to step through the images, or click the left side of the image to go back, the right side to go forward. *** as a little bonus, i put in a contextual menu which allows you to make annotations on all of the pages, to be stored in a text-file; these annotations include things like "chapter heading", "greek", "equations", "italics", and so on, info that you might wanna collect about each page, a "log" to make sure pages are handled properly. (it doesn't actually _store_ that info yet; just a little feature teaser.) a screenshot showing this contextual (right-click) menu is here: > http://z-m-l.com/misc/ocr-renamer02.png the contextual menu is located off to the right... and you can see the annotations in column 11 of the listbox... i've annotated the pages very thoroughly here, because it's easy. just right-click, then select the item from the contextual menu... you can even add items to the menu on-the-fly, when necessary. in this case, i added everything below "winter", from "contents" on. this is what allowed me to create the "quotable quote" menu item, complete with the actual quote itself. (a gimmick for this demo.) if you want to check on any of those pages, use this template: > http://z-m-l.com/go/gerst/gerstp001.jpg this second screenshot also gives you a better view concerning the _previous_ filename and the _new_ name that will be given. for example, geronf002.jpg will here be renamed gerstf002.jpg. (the first screenshot has similar info, just not so nicely arranged.) *** and in case you hadn't realized, the tool will rename the *.txt file associated with each image file, so all the names will stay in sync. *** oh yeah, if you ask the tool nicely, it will not just rename the files, but also generate a batch file that you could run on the same files located on a server, to generate the same set of new names there. 
because, let's be honest, it's just silly to go through the hassle of downloading and re-uploading files just to _rename_ them. silly... *** of course, you can also create a script like my app in the first place, and then just run it on the server to begin with. but my assumption is that somebody will have all of the files on their machine, such as the post-processor or the original content provider, so the files can be given the correct set of names before they're uploaded originally. *** i'll eventually put apps for different platforms on my website, but if anyone wants it now, just backchannel me and request a copy... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080305/8addabb4/attachment-0001.htm From hyphen at hyphenologist.co.uk Thu Mar 6 01:22:30 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Thu, 6 Mar 2008 09:22:30 -0000 Subject: [gutvol-d] The Old Fashioned Way and other things. In-Reply-To: <006901c87edd$831c47c0$660fa8c0@atlanticbb.net> References: <8893d7a30803011035v4e2ae794x6d58cf9ece9050f6@mail.gmail.com> <1e8e65080803011056q3bb05252p6168a13abb7ef713@mail.gmail.com> <8893d7a30803011108i5f916a1wd3da5df8cb430d35@mail.gmail.com><15cfa2a50803011113r646e3bbch793873cc6495701c@mail.gmail.com> <001801c87c38$7b822a40$72867ec0$@co.uk> <006901c87edd$831c47c0$660fa8c0@atlanticbb.net> Message-ID: <001801c87f6b$9befee80$d3cfcb80$@co.uk> No problem, I can send the Finereader CDROM to the USA quite cheaply, just let me know your address. It is Finereader *sprint*; I am not sure what version that means. The 19th century books which I have are Yorkshire Dialect works. I have been ill for some time and hope to get back to PG work. Yes, I can borrow British Library http://www.bl.uk/ books from local libraries very cheaply.
But the fines are horrendous. The BL is now expected to make a profit, so the web site has changed and most things are charged for at an exorbitant rate. Use your academic identity when dealing with the BL. At least I can borrow anything which is at *Boston Spa* http://www.bl.uk/services/reading/bspareadingroom.html , but I can not borrow those which are at the British Library in London http://www.bl.uk/services/reading/rrhome.shtm. This is too far away, and I do *not* like London, and they are snooty about what you can do. I have not found how to determine from the catalogue what is at Boston Spa. I can drive to the Boston Spa reading room, but you have to use their copying machines, at the horrendous cost of 20 pence per photocopied A4 sheet, so going there is not worth the effort for PG work. The UK is on copyright of life plus 70, so if there is any *one* book which is out of copyright in the UK that you particularly want, let me know and I will borrow it and scan it in for you. I would not wish to get in the bad books of local libraries. If that works we can try another one. The above facilities are available to *anyone* with a local public library card in the UK, but this facility is not well known, so I have copied this to gutvol-d. Dave Fawthrop From: Norm Wolcott [mailto:nwolcott2ster at gmail.com] Sent: 05 March 2008 15:06 To: Dave Fawthrop Subject: Re: [gutvol-d] The Old Fashioned Way and other things. If you haven't found a home for your ABBYY Finereader yet, I would be glad to pay the postage for it here in the US. I have a UK bank account and can mail you a cheque for the cost. I am still lurching along on Omnipage Pro, and have almost given up on OCR since ABBYY came along and I realized I was wasting my time. On another topic, did I read in one of your earlier posts that books could be borrowed through a local library from some storehouse of books maintained in England (BM?), and that you got several 19th cent books from there?
I hope to be in the UK this spring for some Verne research, and would like to investigate this. What is the wait time, etc.? The BM does not allow taking photos etc. of books in their reading rooms, and I believe you have to use a book every day or back it goes to Northampton or whatever. Norm Wolcott -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080306/bb95e3a4/attachment.htm From Bowerbird at aol.com Fri Mar 7 10:55:24 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 7 Mar 2008 13:55:24 EST Subject: [gutvol-d] perpetual comical Message-ID: it's a three ring circus over at d.p.! clowns all over the place! :+) when i first saw that the roundless experiment over there was called "perpetual p1", i gave a little laugh, because it implied that they would just continue recycling a book through p1 over and over and over, forever. yeah, _that's_ the best way to solve your backlog problem! :+) but now it's getting scary, because i think they're really gonna do that! you might remember that i've written up the "final results" of this test... drew my conclusions and put it to bed, paperwork to be delivered soon. well, evidently, they're not quite done with it yet... they just sent it back for iteration#4. oh lordy lordy. and they've already placed iteration#5 on the docket... if we were to correct the errors that can be located _automatically_, there would be about 5 errors left after i3... or let's round it to 6... now, on their _last_ pass through p1, iteration#3, the proofers found _half_ of the remaining errors. so we can project iteration#4 will find _3_ out of the 6, leaving 3. and then iteration#5 would find 1 or 2... then iteration#6 might find 1. or maybe not. that's just a crap shoot. note that a pass through p1 burns about 8 hours in proofing time... let me tell you, the cost of finding that last error is gonna be a doozy! i hope it's worth it! but wait.
it gets even better. because i'm talking about the _real_ errors. you know, _mistakes_. but that's not all that these proofers are changing, no sir, not at all... remember those ellipses? the ones that _could_ -- and _should_ -- have been corrected in about 10 seconds, with one global change? yep, they're still plaguing this text. so let's look at the 15 pages that had "spacey ellipses" after iteration#3: pages 1, 6, 16, 29, 49, 50, 78, 80, 88, 94, 100, 103, 115, 118, and 136. now that's 15 pages right there that are going to show "diffs" in i4 -- presuming, of course, that they are actually located, and corrected... the diffs are meaningless, of course, but nonetheless there will be diffs. and if you thought _that_ was bad, well, it gets even worse... because it seems that those very same ellipses give the p1 proofers all kinds of ways to change 'em, then change 'em to something else, then change them back again. it ends up you can do this _forever_. (and "forever" and "perpetual" are kissing cousins, don't you know?) so, in i4 so far -- with some 25 pages in -- we have cases where the proofers have _missed_ ellipses that needed to be closed up... and a case where one proofer closed up _both_ sides of an ellipse. and lots of cases where a (correct) closed-up ellipse was changed to an (incorrect) spacey ellipse. not to mention several cases where a 3-dot ellipse was changed to a 4-dot one, and vice versa. sheesh! this is madness. sheer stupidity. i apologize profusely because i evidently haven't _stressed_ the fact that a roundless system needs to have some _user-training_ so this ugly circularity won't happen. but i thought that was _obvious_. does nobody watch the process? there are other meaningless changes being made too. one of them is an old standby over at d.p. -- the blank line at the top of a page... oh, and end-line-hyphenates. it's amusing to watch 'em over time. for instance, there was a case of the end-line hyphenate of grand- father. 
the first proofer came along and changed it to grand-father, simply bringing up the trailing word. the next made it grand-*father, which is d.p. code for "hey post-processor, take a close look at this". the next proofer changed it to "grandfather", which might be correct, i'm not sure, i'd have to look it up in the dictionary, which, by the way, we could've had the computer do automatically, right at the _outset_, so the joining would have been correct before any proofer ever saw it, and avoided all of this mess, and thus not wasted _any_ proofer time. because ultimately we have to go to the dictionary to decide anyway, that and look at other cases in the same book, which page-at-a-time proofers can't do. so why are proofers involved in this decision at all? again, this is supreme stupidity. stupidity piled on top of stupidity, with a little bit of incompetence thrown in. how long will it go on? well hey, this is distributed proofreaders. so it might be _perpetual_... -bowerbird p.s. so far, i4 has encountered _one_ real error. they missed it... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080307/a048a6a9/attachment.htm From Bowerbird at aol.com Fri Mar 7 14:45:01 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 7 Mar 2008 17:45:01 EST Subject: [gutvol-d] the comedy is contagious Message-ID: so, on the afternoon of march 5th, i post a message that includes screenshots from my file-renaming tool. > http://z-m-l.com/misc/ocr-renamer01.png > http://z-m-l.com/misc/ocr-renamer02.png you know the tool... the one i offered to write for d.p., but they ignored the offer, but i'm releasing it anyway?
don't you know, but less than 24 hours later, we have: > http://www.pgdp.org/~dkretz/c/images_index.php?projectid=projectID466eb97ee3ca7 and suddenly the thread over at d.p. has a burst of clarity: > http://www.pgdp.net/phpBB2/viewtopic.php?p=433169#433169 not total clarity, mind you, far from it, but nonetheless, a significant leap from "muddled" to "on the right track". sometimes the comedy is contagious... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080307/a4537b1d/attachment.htm From Bowerbird at aol.com Fri Mar 7 16:55:05 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 7 Mar 2008 19:55:05 EST Subject: [gutvol-d] data and equations and hype heals all wounds Message-ID: this is the data that i used to prepare my "final write-up" for the "perpetual p1" experiment at distributed proofreaders... side-by-side data showing the "planet strappers" changes: > http://z-m-l.com/misc/strappers73-p1p2p3.html > http://z-m-l.com/misc/strappers73-i1i2i3.html there are a couple glitches in it, but you'll get the picture... and the picture is clear. after p1 cleared up _thousands_ of errors -- yes, _literally_ thousands -- the next round (whether re-done by p1 proofers, or by the p2 proofers) located 55 of the remaining 73 errors, for roughly 75%... the round after that, whether by p1 proofers again or p3, found about _half_ of the remaining 18 errors -- 50%... the second error exposed to iteration#4 so far was _caught_ -- yay! -- meaning they've caught 1 out of 2, or, um, 50%...
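[ed. note: the overlap figures reported in this thread admit a classic sanity check. if two independent proofings find A and B errors with C caught by both, a standard capture-recapture estimate (often attributed to Pólya) puts the total at roughly A*B/C. a hedged sketch, using the p2/i2 numbers from the earlier write-up (55 and 55 errors found, 40 in common):]

```python
def estimated_total_errors(a, b, common):
    # Capture-recapture estimate for proofreading: two independent
    # proofings find a and b errors, with `common` found by both;
    # the total error count is estimated as a*b/common.
    return a * b / common

# p2 and i2 each found 55 errors, 40 of them in common:
print(estimated_total_errors(55, 55, 40))  # → 75.625, close to the 73 actually remaining
```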
*** by the way, you might remember that i said that there was some formula -- when you had two independent proofings -- that would project the remaining number of expected errors, based on the (a) number of errors the independent proofings caught in common, and (b) the number unique to each of 'em. as i said, this formula was buried in some d.p. forum thread... that equation has now been dug up, independently. it's here: > http://mathworld.wolfram.com/ProofreadingMistakes.html *** so now, what should you do if you're being criticized for not respecting the time and energy of your volunteers? well, it's obvious, isn't it? you should put some p.r. out to tell your volunteers all the things, some of which they might not be aware of, that you've been hard at work at, to "make things better". because hype heals all wounds... > http://www.pgdp.net/phpBB2/viewtopic.php?t=32255 -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080307/93c05d25/attachment.htm From brett at dimetrodon.demon.co.uk Tue Mar 11 04:55:02 2008 From: brett at dimetrodon.demon.co.uk (Brett Paul Dunbar) Date: Tue, 11 Mar 2008 11:55:02 +0000 Subject: [gutvol-d] california international antiquarian book fair In-Reply-To: <41fd8970802191753x16b30534h70963259879e42ec@mail.gmail.com> References: <1913024193.267081203465434613.JavaMail.mail@webmail02> <41fd8970802191753x16b30534h70963259879e42ec@mail.gmail.com> Message-ID: Steven desJardins writes >On Feb 19, 2008 6:57 PM, Joshua Hutchinson wrote: >> I think you can easily make the argument that this old manuscript WAS >>published, though not mass produced. >> >> It was created by someone and sold to someone else (or perhaps >>created as a work for hire, etc).
>> >> That rule you refer to is meant to cover things like a manuscript of >>text unpublished by the author and hidden away in an attic then found >>years later when his great-granddaughter decided to clean out the old >>family junk pile. Or maybe a scientist's lab journal that was never >>meant for public consumption, but after she became famous was >>published posthumously. IMHO, of course. > >That's a reasonable argument, but the dictionaries I consulted agree >that to "publish" something is to make it generally available to the >public. I would want to see a dictionary or legal citation before >being convinced that the sale of a unique, unpublished manuscript can >constitute publication. In English law at any rate, the offence of "Publishing a Libel" can include showing a defamatory letter to your secretary even if no one else has seen it. That is one example of "publishing" being used for a document with an extremely limited circulation. In the context of defamation, publishing means voluntarily showing the document to any other person. -- Great Internet Mersenne Prime Search http://www.mersenne.org/prime.htm Livejournal http://brett-dunbar.livejournal.com/ Brett Paul Dunbar To email me, use reply-to address From prosfilaes at gmail.com Tue Mar 11 16:12:02 2008 From: prosfilaes at gmail.com (David Starner) Date: Tue, 11 Mar 2008 19:12:02 -0400 Subject: [gutvol-d] california international antiquarian book fair In-Reply-To: References: <1913024193.267081203465434613.JavaMail.mail@webmail02> <41fd8970802191753x16b30534h70963259879e42ec@mail.gmail.com> Message-ID: <6d99d1fd0803111612i13d5f779vedcc780fa84c41eb@mail.gmail.com> From , the court decision that put A Course in Miracles in the public domain. The showing of a work to a select group of people for a limited purpose (such as to seek commentary or criticism) does not constitute "publication" within the meaning of the copyright law, and is legally insufficient to place the work into the public domain. E.g., Acad.
of Motion Picture Arts and Sciences v. Creative House Promotions, Inc., 944 F.2d 1446 (9th Cir. 1991). In particular, the creator of a work has the right to show it to a limited class of people without jeopardizing the common law copyright, and, under such circumstances, the publication will be deemed "limited." Id. at 1451; Proctor & Gamble Co. v. Colgate-Palmolive Co., No. 96 Civ. 9123, 1998 WL 788802, at *38 (S.D.N.Y. 1998). Such a limited publication will be found where the publication was (1) to a definitely select group, (2) for a limited purpose, and (3) without the right of diffusion, reproduction, distribution or sale. White v. Kimmell, 193 F.2d 744, 746-47 (9th Cir. 1952); Continental Casualty Co. v. Beardsley, 253 F.2d 702, 706-07 (2d Cir. 1958), cert. denied, 358 U.S. 816 (1958); Proctor & Gamble Co., 1998 WL 788802 at *38. [...] "A general publication 'occurs when by the consent of the copyright owner, the original or tangible copies of a work are sold, leased, loaned, given away or otherwise made available to the general public, or when an authorized offer is made to dispose of the work in any such manner even if a sale or other such disposition does not in fact occur.'" Penguin Books U.S.A., 2000 WL 1028634, at *16 (citing Proctor and Gamble Co., 1998 WL 788802, at *38 (S.D.N.Y. 1998); Nimmer § 4.04 at 4-20 (3d ed. 1997)). A distribution of a work to one person constitutes a publication. Kakizaki v. Riedel, 811 F. Supp. 129, 131 (S.D.N.Y. 1992); Burke v. Nat'l Broad. Co., Inc., 598 F.2d 688, 691 (1st Cir. 1979). [...] Specifically, to satisfy that a distribution qualifies as a limited publication, the plaintiffs must sustain their burden of proof to put forth evidence that the publication was (1) to a definitely select group, (2) for a limited purpose, and (3) without the right of diffusion, reproduction, distribution or sale. [...]
A select group cannot be created by an author's "subjective 'test of cordiality.'" Thus, when works are given or sold to persons deemed "worthy" a select group is not created and the publication is not limited. When plaintiffs sell or give the Work to "congenial strangers" the Court is "unable to see in this picture any definitely selected individuals or any limited, ascertained group or class to whom the communication was restricted." Schatt v. Curtis Mgmt. Group, Inc., 764 F. Supp. 902, 911 (S.D.N.Y. 1991) (quoting White, 193 F.2d at 747). That's what publication means in a copyright sense in the US, at least according to this judge. From richfield at telkomsa.net Wed Mar 12 00:53:41 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 12 Mar 2008 09:53:41 +0200 Subject: [gutvol-d] Gothic or Gothic? Message-ID: <47D78C05.5000303@telkomsa.net> I have a Canon scanner that came with an Omniscan subset, and I run them under Windows 2K. For the most part I find both of them satisfactory and sufficient, in fact, downright gratifying. I may want to convert to Linux some time, but I am too busy to sharpen my axe, so that must wait. Problem: I have a couple of books with a lot of the "old" (pre WWII mostly) German style of "Gothic" script. In particular, having holes in my head, I would love to scan in Kluge's Etymological German Dictionary as soon as I get a breather, and I might be able to get a sound copy of "Mein Kampf" as well. Unfortunately, as usual I need something free. Does anyone have any constructive suggestions, preferably for something that I can bolt onto what I have? No hurry, it won't happen this month, but if I know that I have something that works well enough to rely on, then I can scan in or photograph material against the time that I can afford to process it. BTW, just as a matter of curiosity, what is the copyright situation with Hitler's works?
I know that it has lapsed in Australia and presumably Canada, but it should nominally be in copyright in the US. Is it regarded as such, and if so, is it an academic question, or would it be enforced, and if so, by whom? FTM, who enforces copyright? Is it done automatically by any authority, or must some materially interested party sue or threaten or lay criminal charges? Thanks for your attention, Jon From steven at desjardins.org Wed Mar 12 01:44:14 2008 From: steven at desjardins.org (Steven desJardins) Date: Wed, 12 Mar 2008 04:44:14 -0400 Subject: [gutvol-d] Gothic or Gothic? In-Reply-To: <47D78C05.5000303@telkomsa.net> References: <47D78C05.5000303@telkomsa.net> Message-ID: <41fd8970803120144j1f9609c6o785714ba22188b7d@mail.gmail.com> On Wed, Mar 12, 2008 at 3:53 AM, Jon Richfield wrote: > BTW, just as a matter of curiosity, what is the copyright situation with > Hitler's works? I know that it has lapsed in Australia and presumably > Canada, but it should nominally be in copyright in the US. Is it > regarded as such, and if so, is it an academic question, or would it be > enforced, and if so , by whom? According to Wikipedia, "The U.S. government seized the copyright during the Second World War as part of the Trading with the Enemy Act and in 1979, Houghton Mifflin, the U.S. publisher of the book, bought the rights from the government." From grythumn at gmail.com Wed Mar 12 05:24:27 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 12 Mar 2008 08:24:27 -0400 Subject: [gutvol-d] Gothic or Gothic? In-Reply-To: <47D78C05.5000303@telkomsa.net> References: <47D78C05.5000303@telkomsa.net> Message-ID: <15cfa2a50803120524q5195b753t577b9d7bbdae855b@mail.gmail.com> On Wed, Mar 12, 2008 at 3:53 AM, Jon Richfield wrote: > Problem: I have a couple of books with a lot of the "old" (pre WWII > mostly) German style of "Gothic" script.
In particular, having holes in > my head, I would love to scan in Kluge's Etymological German Dictionary > as soon as I get a breather, and I might be able to get a sound copy of > "Mein Kampf" as well. Unfortunately, as usual I need something free. > Does anyone have any constructive suggestions, preferably for something > that I can bolt onto what I have? > Fraktur fonts are difficult to OCR well; I have not tried in a while, but I understand older versions of OCR software actually do better (for Finereader, it was v5 or v6; can't recall) as they make fewer assumptions about the typeface. There has also been some work done on the open-source OCR engine Tesseract by piggy, a member of DP; I have not used it myself so I cannot comment on how well it works as yet. I can say that I spent many hours trying to train FR7 to understand Fraktur and other blackletter fonts, and got absolutely nowhere. R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080312/3e340146/attachment.htm From piggy at netronome.com Wed Mar 12 06:09:30 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 12 Mar 2008 09:09:30 -0400 Subject: [gutvol-d] Gothic or Gothic? In-Reply-To: <47D78C05.5000303@telkomsa.net> References: <47D78C05.5000303@telkomsa.net> Message-ID: <47D7D60A.4070409@netronome.com> Jon Richfield wrote: > ... > Problem: I have a couple of books with a lot of the "old" (pre WWII > mostly) German style of "Gothic" script. In particular, having holes in > my head, I would love to scan in Kluge's Etymological German Dictionary > as soon as I get a breather, and I might be able to get a sound copy of > "Mein Kampf" as well. Unfortunately, as usual I need something free. > Does anyone have any constructive suggestions, preferably for something > that I can bolt onto what I have? > The OCR package tesseract now has usable fraktur support. You want to use the deu-f language package. 
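[ed. note: a typical command-line invocation for the Fraktur package mentioned above might look like the following. this is a sketch, not a tested recipe: it assumes tesseract is installed with the deu-f traineddata in its tessdata directory, and the filename page-042.png is only an example; exact flags can vary by tesseract version.]

```shell
# OCR one scanned page with the Fraktur-trained German data (deu-f);
# the recognized text is written to page-042.txt.
tesseract page-042.png page-042 -l deu-f
```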
If you find pages that don't OCR well, send them to me and I'll fix the tesseract training to work better with them. From piggy at netronome.com Thu Mar 13 12:43:43 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 13 Mar 2008 15:43:43 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: Message-ID: <47D983EF.5060002@netronome.com> Great writeup! I really appreciate the detailed data analysis. Could I trouble you to add a section to http://www.pgdp.net/wiki/Confidence_in_Page_analysis ? If that's difficult, do you mind if I include an edited form of this message? I'm also very interested in your detailed list of which errors are in which categories. I'm starting to look at finer-grained page difference metrics than wdiff alterations. Do you have a tool that makes the classifications or did you do it by hand? Bowerbird at aol.com wrote: > here are the results of the "perpetual p1" test at distributed > proofreaders... > > the experiment was geared to see if running an e-text through p1 > repeatedly > would produce a text that was as clean as that from the regular d.p. > workflow, > which consists of a p1 round, followed by a p2 round (with "better" > proofers), > and then a p3 round (with the "best" proofers, as tested and certified > by d.p.). > > the first thing to note is that the proofers did an excellent job on > this book... > they caught numerous errors in the original p-book, not just the > o.c.r. errors. > in general, they should be congratulated on their fine job of proofing > here... > > the results clearly show that repeated p1 produces text as clean as > p1-p2-p3, > and calls into question whether the "better" proofers are _really_ > better at all... > > specifically, my analysis of the results shows 274 error to begin with... 
> > this 274 number does _not_ include the changes that proofers had to make > in order to repair the 1,137 em-dashes in this book, which were > accidentally > changed to en-dashes by inappropriate handling by the content preparer... > > neither does it include corrections of the 504 ellipses throughout the > book, > which had to be "closed up" and/or changed (unnecessarily) to 4 dots, > since > the first of those two tasks could have been attained with one global > change, > and the second is totally uncalled for. > > finally, it does not include 715 end-line-hyphenates which proofers had to > rejoin, under d.p. policy, which is unnecessary, since the machine can > do it; > nor does it include 74 changes to "clothe" em-dashes, as per d.p. policy. > > some of those numbers might be off slightly, but the overall thrust is > clear; > compared to the 274 _real_ errors in this book which _needed_ to be fixed, > there were over _two_thousand_ unnecessary changes that had to be made, > according to d.p. policy. roughly 8 unnecessary changes for every > real one. > this is why the d.p. workflow is so inefficient, and disrespectful of > proofers... > > one other note: since the proofers did such a good job of finding errors > in the original p-book, i've included all of those in this results > write-up... > it's worth reminding ourselves, though, that this is "outside of the > scope" > of what we consider the job of the proofers to actually be, so _reward_ > them for going the extra mile, and don't dwell on what they "missed"... > > not that they missed all that much, mind you. so let's take a good > look... > > *** > > so, how did p1 do the first time around? > > p1 removed 205 -- 75% -- of the 274 errors. laudable performance... > > *** > > so how did the normal workflow go after this kick-off by p1? > > p2 found 55 of the remaining 73 errors, a rate of 75%... again, laudable. > > p3 found 9 of the remaining 18 errors, a "measly" 50%... not so laudable. 
> > luckily, half of the 9 errors that p3 missed were auto-detectable... > > *** > > and how did the "perpetual p1" proofings go, in comparison? > > iteration#2 -- the second pass of the text through the p1 experiment -- > found 55 of the remaining 73 errors, _exactly_ matching the p2 results... > i2 found 40 of the same errors p2 had found, and 15 that p2 had missed. > (likewise, p2 found 15 errors i2 had missed.) thus, i2's accuracy was > 75%. > > iteration#3 -- the third cycling of the text through the p1 experiment -- > is finishing as i post this, but they _almost_ matched p3 _exactly_ as > well; > the i3 people found 8 of the 18 errors, just 1 less than the p3 > proofers... > > but while we're noting that the i3 proofers missed some errors, to be > sure, > the bright spot was that i3 also _found_ 3 new errors, which is > surprising, > since the "marines" of the d.p. proofers -- the p3 crew -- had missed 'em. > > so... what's remarkable here is that the p2 and i2 figures matched > _exactly_, > and the p3 and i3 figures were also _almost_identical_... it's kind > of freaky. > > thus, again, no evidence that the p1 proofers are "inferior" in any > way at all. > > *** > > curiously, a good percentage of the errors that were missed by the > proofers > would've been _easily_ detected by any respectable post-o.c.r. > clean-up tool. > which means they should've been eliminated before _any_ proofing was done. > > (as just one example here, there was an improper period located right > in the > middle of a sentence, a period which was not followed by a capitalized > word. > that's one of the most simple, and most predictable, tests that you > can make.) > > and proofers might well have caught even more of the mistakes, except they > were probably fatigued by all of the unnecessary changes they had to make. > > distributed proofreaders needs to tighten up its post-o.c.r. > pre-processing. > > again, compared to the 274 o.c.r. 
errors requiring proofer action, > there were > over _two_thousand_ totally unnecessary changes requiring proofer > action... > in my opinion, that's shameful. extreme streamlining is called for, > quickly! > > *** > > in sum, the text coming from 3 rounds of p1 was not significantly > different > from the text that was produced by the p1-p2-p3 "normal" workflow at d.p. > both versions of the text had approximately 5-10 errors remaining > within... > for a 150-page book like this one, that is quite an acceptable rate of > errors. > > and by doing _5_ rounds, even those 5-10 remaining errors were detected, > although -- in my opinion -- it's not worth the extra work to get that > level. > proofers routinely spent 2-5 minutes on a page, which is a _lot_ of > time... > > of course, even after these _5_ rounds, there might well be additional > errors. > indeed, i spotted 2, just by accident, in the course of conducting > this review. > moreover, it appears a few of the "corrections" of original p-book > "errors" > might have been just a touch over-zealous. (if you're curious, the > "errors" > on page 86 look intentional in retrospect.) that can happen sometimes... > > at any rate, though, proofers have done an outstanding job on this book. > and the p1 proofers proved that they can keep pace with the p3 marines... > > -bowerbird > > p.s. i'll have materials documenting this analysis on my site very > soon... > > > > ************** > It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. > (http://money.aol.com/tax?NCID=aolprf00030000000001) > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From piggy at netronome.com Fri Mar 14 20:03:14 2008 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Fri, 14 Mar 2008 23:03:14 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: Message-ID: <47DB3C72.8020804@netronome.com> I've put the numerical content of this posting into the CiP wiki page: http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Detailed_analysis_of_PP1.2C_I1-I3 Bowerbird at aol.com wrote: > here are the results of the "perpetual p1" test at distributed > proofreaders... From hyphen at hyphenologist.co.uk Sun Mar 16 20:08:16 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 03:08:16 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47D983EF.5060002@netronome.com> References: <47D983EF.5060002@netronome.com> Message-ID: <000301c887dc$284710c0$78d53240$@co.uk> I am an ex-engineer, where attempts at perfection are treated with derision. *Everything* is produced to a standard *good enough to do the job*; *everything* has a tolerance attached to it: something, say, 12 inches long may have a tolerance of 1/10,000 of an inch or indeed 1/8 of an inch, depending on its proposed use. Anyone who produced a drawing asking for perfection (*dead flat and real smooth*, as the saying went) never heard the last of it. Has anyone done a cost/benefit analysis on second and third rounds of proofing? Is a book with 18 errors any easier to read than one with 75 errors? There are most certainly errors in the same ballpark as 18 to 75 in the original printed texts of the books I read or make into e-text. Why try for perfection when the paper original is far from perfect? Would any ordinary reader notice that level of errors? Different editions of old hand-composed books have different errors. I personally just ignore errors in paper or e-books. Before anyone mentions academics who must have everything perfect: what proportion of our readers are academics or terminal pedants? 
Why pander to the whims of an unrepresentative sample of readers? Is not 75 errors per book good enough for a Science Fiction novel? Dave Fawthrop -----Original Message----- Bowerbird at aol.com wrote: > here are the results of the "perpetual p1" test at distributed > proofreaders... _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d From klofstrom at gmail.com Sun Mar 16 20:31:25 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Sun, 16 Mar 2008 17:31:25 -1000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <1e8e65080803162031t598a25eewb380cfb3c340cf7d@mail.gmail.com> On Sun, Mar 16, 2008 at 5:08 PM, Dave Fawthrop wrote: > I am an ex-engineer, where attempts at perfection are treated with derision. > What proportion of our readers are academics or terminal pedants? Why pander to the whims of an unrepresentative sample of readers? Dave, I'm a terminal pedant. 
A lot of us at DP are. Furthermore, the books we're processing are, many of them, destined only to be used by academics and pedants. Few people read Rosa Nouchette Carey for fun; if they do, they're either pedants like me or academics working on a book or paper. Scholars want and need accuracy in texts. Over the centuries, many person-hours have been devoted to making sure that editions are the best possible. Academics spot what seem to be errors in texts, propose emendations, and then argue about the emendations. Applying engineering standards to text is like applying lit crit to engineering. It's a category mistake. Sure, you may not care if your 1930s SF has errors, or if the abominable typesetting in the original has been emended, but an academic writing a history of SF wants good texts. Since we pedants and academics gravitate towards DP, and want to see our work USED, we are working towards academic standards. If you want to start a rival book digitization project whose motto is "Good enough, I guess," go right ahead :) Readers will download the versions that they like, as long as the versions are clearly labeled. -- Zora, a pedant From steven at desjardins.org Sun Mar 16 21:22:47 2008 From: steven at desjardins.org (Steven desJardins) Date: Mon, 17 Mar 2008 00:22:47 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: > Why try for perfection when the paper original is far from perfect? > Would any ordinary reader notice that level of errors? > Different editions of old hand composed books have different errors. > I personally just ignore errors in paper or e-books. I try for perfection because _somebody_ should. 
I produce what may become the definitive e-text version of any particular work, read potentially by tens or hundreds of thousands of people per year. To suggest that it's too much trouble to carefully examine the book four times for defects seems preposterous. Even if it makes only a small difference to each of those readers, a small benefit multiplied by ten thousand justifies a great deal of care. > Is not 75 errors per book good enough for a Science Fiction novel? Not if I'm responsible for it, it isn't. From Morasch at aol.com Sun Mar 16 23:00:03 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Mon, 17 Mar 2008 02:00:03 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: dave said: > I am an ex-engineer, where attempts at perfection are treated with derision. > *Everything* is produced to a standard *good enough to do the job*, > *everything* has a tolerance attached to it good point, dave. except it's already been covered, at least by me. lots of times. i've mentioned -- probably dozens of times by now -- that _my_ standard for moving text to the general public for a "continuous proofreading" stage is 1-error-in-every-10-pages. the general public will help us from then on. i've also mentioned that my tools generally produce much higher accuracy, up to something around the rate of 1-error-in-every-100-pages, meaning we can certainly start from a high point before we involve the general public. but even more important than the level of accuracy we attain, i believe, is the attitude and the infrastructure that we provide to correct any mistakes. 
if we make it clear to the general public that these e-books _belong_ to them, and that they are responsible for _reporting_ any errors they find in the text, and then we make it _easy_ for them to actually check text against the scans, and we make it _easy_ for them to _report_ errors, and then we make it _easy_ for them to see that all the error-reports are acted upon extremely _quickly_, then we'll have greased the skids for the e-texts to move toward perfection... currently, p.g. isn't good at doing _any_ of those things. not a one. sadly. maybe 10 years ago, that was ok. but in the wake of wikipedia, it's _not_ ok. wikipedia has shown people how _collective_responsibility_ works with text. *** steve said: > read potentially by tens or hundreds of thousands of people per year. that cuts both ways. if an e-text really has that many readers, and they feel a sense of _ownership_ of the e-text, then _they_ will help us find the errors. the problem is, we're not imbuing them with that sense of ownership. and that failure has _lots_ of implications, ranging far beyond errors... > To suggest that it's too much trouble to carefully examine the book > four times for defects seems preposterous. in one sense, yes. but in another sense, which is equally valid, all of the time and energy _unnecessarily_ spent on one book is time and energy that _could_ have been spent digitizing another. so the _correct_ answer is to spend the _proper_ time and energy on each book. but that's a rather more difficult thing to calculate. i don't think it's _impossible_ to compute it, not by any means at all. but i do know for sure your "four times" answer is not the right way; for some pages, sure. but for all of them?, no way jose... > Even if it makes only a small difference to each of those readers, > a small benefit multiplied by ten thousand justifies a great deal of care. you're badly overestimating your own importance in the overall equation... 
the proofing/digitization process is _not_ completed when a book is posted. it has only _begun_. and your inability to see that is what causes the problem. you are _not_ the final line. errors can be corrected long after your input... (and it's a good thing you're not the final line, because you're not as good as you think you are, not if you're the average distributed proofreader person...) *** dave said: > Has anyone done a cost/benefit analysis > on second and third rounds of proofing? o.c.r. alone -- when well-done -- can get many pages _completely_ correct. if spell-check reveals zero _flags_ (where "flags" refers to any words that are (1) not in the spell-check dictionary and (2) are low-frequency in the book), then my sense is that that page could be passed through without a check... for pages with a couple flags, i'd suggest those flags be scrutinized closely... on pages with many flags, thorough proofing of the entire page is called for. any and all changes made to a page should be verified by a second person... as long as a page has _any_ errors on it, you should assume there are more, meaning specifically (over and above verification) it should be checked again. once a page has been "certified" as "clean" without any changes made to it, i'd consider it to be "clean enough". if you want a higher degree of certainty, you could require a second "certification" without any changes being made. anything over and above that would have little claim on a cost-benefit basis. decisions made on the basis of "rounds" are fatally flawed from the outset, since some of the pages will always be easy, and some will always be hard. > Is not 75 errors per book good enough for a Science Fiction novel? for a 750-page book, maybe. but even then, i'd think we can do better. -bowerbird p.s. i see in my spam folder that zora (klofstrom) has weighed in on this thread. i would guess she's making some "d.p. has high standards of accuracy" comment. 
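The flag-based triage rule described in the post above (pass a page with zero spell-check flags, scrutinize just the flags when there are only a couple, and proof the whole page when there are many, where a "flag" is a word that is both absent from the dictionary and low-frequency in the book) is mechanical enough to prototype in a few lines. The sketch below is illustrative only, not an actual DP or bowerbird tool; the tokenizer, thresholds, and routing labels are all assumptions:

```python
import re
from collections import Counter

def tokenize(text):
    """Split a page into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def triage_pages(pages, dictionary, rare_threshold=2):
    """Route each page by its count of "flags": words that are
    (1) not in the spell-check dictionary and
    (2) low-frequency across the whole book."""
    book_freq = Counter(w for page in pages for w in tokenize(page))
    routes = []
    for page in pages:
        flags = sorted(w for w in set(tokenize(page))
                       if w not in dictionary and book_freq[w] <= rare_threshold)
        if not flags:
            route = "pass"        # zero flags: page may need no human check
        elif len(flags) <= 2:
            route = "spot-check"  # a couple of flags: scrutinize just those
        else:
            route = "full-proof"  # many flags: proof the entire page
        routes.append((route, flags))
    return routes

pages = ["The quick brown fox.", "Tbe quick hrown fox junped high."]
dictionary = {"the", "quick", "brown", "fox", "high"}
for route, flags in triage_pages(pages, dictionary):
    print(route, flags)
```

On real OCR output the dictionary would be a full wordlist and the thresholds would need tuning; the point is only that the zero-flag/few-flags/many-flags routing can be computed before any proofer sees the page.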
meanwhile, i examined one of her submissions and found _dozens_ of errors in it, all of which i documented on this list years ago; yet they still haven't been corrected. so let us remember that some of the people bellowing the loudest about the "quality" of their efforts -- in an effort to try to tell you what to do -- are doing slipshod work... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/00e05a98/attachment-0001.htm From Bowerbird at aol.com Sun Mar 16 23:03:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 02:03:03 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: sending again, from the proper account. sorry to confuse you... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/456b577e/attachment.htm From hyphen at hyphenologist.co.uk Mon Mar 17 00:39:39 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 07:39:39 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> Message-ID: <000f01c88802$145228e0$3cf67aa0$@co.uk> Then you are admittedly a terminal pedant and spend IMO too much time and effort on proofreading. Dave Fawthrop On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: > Why try for perfection when the paper original is far from perfect? > Would any ordinary reader notice that level of errors? > Different editions of old hand composed books have different errors. > I personally just ignore errors in paper or e-books. I try for perfection because _somebody_ should. 
I produce what may become the definitive e-text version of any particular work, read potentially by tens or hundreds of thousands of people per year. To suggest that it's too much trouble to carefully examine the book four times for defects seems preposterous. Even if it makes only a small difference to each of those readers, a small benefit multiplied by ten thousand justifies a great deal of care. > Is not 75 errors per book good enough for a Science Fiction novel? Not if I'm responsible for it, it isn't. From steven at desjardins.org Mon Mar 17 00:50:14 2008 From: steven at desjardins.org (Steven desJardins) Date: Mon, 17 Mar 2008 03:50:14 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000f01c88802$145228e0$3cf67aa0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <41fd8970803162122s65497ddcl915be8a889fc9353@mail.gmail.com> <000f01c88802$145228e0$3cf67aa0$@co.uk> Message-ID: <41fd8970803170050y7c9fbb58l1ad0b0d9314daa7e@mail.gmail.com> On Mon, Mar 17, 2008 at 3:39 AM, Dave Fawthrop wrote: > Then you are admittedly a terminal pedant and spend IMO too much > time and effort on proofreading. I don't think you know what the word "admittedly" means, since I've made no claims of pedantry, only of aspirations towards accuracy. (But I could be wrong; maybe you don't know what a "pedant" is.) In my opinion, the time and effort I spend on proofreading is well spent. Since it's my time and my effort, I think my opinion counts somewhat more highly than yours. From ralf at ark.in-berlin.de Mon Mar 17 01:32:47 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 17 Mar 2008 09:32:47 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: Message-ID: <20080317083247.GA5920@ark.in-berlin.de> Morasch wrote > wikipedia has shown people how _collective_responsibility_ works with text. What nonsense. 
Wikisource texts have more errors than even pre-10k PG eTexts. That's also because the interface for working with BOTH text and page image is so bad. Additionally, they have only two rounds of proofreading, compared to SEVEN at DP. ralf From bzg at altern.org Mon Mar 17 01:44:31 2008 From: bzg at altern.org (Bastien Guerry) Date: Mon, 17 Mar 2008 08:44:31 +0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1e8e65080803162031t598a25eewb380cfb3c340cf7d@mail.gmail.com> (Karen Lofstrom's message of "Sun, 16 Mar 2008 17:31:25 -1000") References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <1e8e65080803162031t598a25eewb380cfb3c340cf7d@mail.gmail.com> Message-ID: <87eja9n3io.fsf@bzg.ath.cx> "Karen Lofstrom" writes: > Applying engineering standards to text is like applying lit crit to > engineering. It's a category mistake. No. You are confusing "text" with the process DP goes through when producing text. Applying engineering standards to this process sounds perfectly reasonable to me. (Ryle would be scared at how his concept of "category mistake" is now so mainstream that people are abusing it. That was for the pedantic note.) As far as I know, thinking in terms of "good enough" doesn't prevent anyone from trying to improve a system -- yes, even stupid engineers want to improve machines! And the will to improve something always calls for a direction. So "good enough" qualifies what DP does today, and stating this doesn't prevent people from trying to improve it. 
-- Bastien From traverso at posso.dm.unipi.it Mon Mar 17 02:09:31 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Mon, 17 Mar 2008 10:09:31 +0100 (CET) Subject: [gutvol-d] ***SPAM*** Re: a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> (hyphen@hyphenologist.co.uk) References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <20080317090931.14A8D93B61@posso.dm.unipi.it> >>>>> "Dave" == Dave Fawthrop writes: Dave> Is not 75 errors per book good enough for a Science Fiction Dave> novel? Only if the novel is 750 pages; if it is 150 pages, I will not go on reading for pleasure after page 10; I would rather read something else. Too many errors make reading annoying. In any case I believe that a further proofreading that can bring the errors down to 15, or one every 50 pages, is well spent. I agree that many old books are much worse than an error every 10 pages, and also agree that, for easy reading, some kinds of harder-to-spot transcription errors do not matter much. And agree that a cost/benefit analysis could be beneficial, but PG has to make its own analysis, and every volunteer can make his own too. I believe that if the whitewashers suspect that the error ratio exceeds one error every two pages the submission is resent to the contributor without much thinking. Carlo From julio.reis at tintazul.com.pt Mon Mar 17 03:36:40 2008 From: julio.reis at tintazul.com.pt (Júlio Reis) Date: Mon, 17 Mar 2008 10:36:40 +0000 Subject: [gutvol-d] gutvol-d Digest, Vol 44, Issue 19 In-Reply-To: References: Message-ID: <1205750200.7554.53.camel@abetarda.mshome.net> > Is not 75 errors per book good enough for a Science Fiction novel? Of course it is, fellow engineer! Because your SF novel is 1,500 pages long, right? So, one mistake every 20 pages, that's very good. Unless you *can* find out those 75 errors, in which case... go and get them.
Producing accurate texts is great, as long as it doesn't get to the point of faithfully reproducing the errors found in the original. It's all right to correct those and leave transcriber's notes to that effect. That said... the amount of effort put in *must* be measured and feel "decent" in proportion to the benefit derived. How many days would it take to find those 75 errors? Some here would say no one's rushing you, but that's beside the point IMHO. You *need* to finish stuff to get on with the next project. So, accurate yes. Quick, if possible. From walter.van.holst at xs4all.nl Mon Mar 17 05:36:11 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 17 Mar 2008 13:36:11 +0100 Subject: [gutvol-d] OCR OS Xquestion Message-ID: <47DE65BB.1040402@xs4all.nl> L.S., I will be asking the same question on the DP-fora, what OCR software would one recommend on Mac OS X? Is IRIS any good? Regards, Walter From joshua at hutchinson.net Mon Mar 17 05:51:49 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Mon, 17 Mar 2008 12:51:49 +0000 (GMT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: <1751743406.11431205758310116.JavaMail.mail@webmail02> Oh hell no. You did NOT just label PP and PPV at DP as proofreading rounds. Believe me, PP and PPV are not and have never been meant to be true proofreading rounds (and the reason we left the old 2 round system was to get away from the necessity of proofreading in PP and PPV). I now return you to the age old argument over quality vs quantity. Josh On Mar 17, 2008, ralf at ark.in-berlin.de wrote: Morasch wrote > wikipedia has shown people how _collective_responsibility_ works with text. What nonsense. Wikisource texts have more errors than even pre-10k PG eTexts. That's also because the interface for working with BOTH text and page image is so bad. Additionally, they have only two rounds of proofreading, compared to SEVEN at DP. 
ralf _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d From piggy at netronome.com Mon Mar 17 06:57:27 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 09:57:27 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <47DE78C7.60700@netronome.com> Dave Fawthrop wrote: > ... > Anyone who produced a drawing asking for perfection, the saying was > *dead flat and real smooth*, never heard the last of it. > > Has anyone done a cost/benefit analysis on second and third rounds of > proofing?... > That is a significant part of what I am trying to do. Right now I am attempting to build the apparatus to estimate the cost part. The primary unit of cost is human lifetime, i.e. proofer-time. To see my current thinking, I would encourage folks to read http://www.pgdp.net/wiki/Confidence_in_Page_analysis#The_Ferguson-Hardwick_Algorithm . The benefit side is a little harder. I think I have a handle on calculating effectiveness--the number and possible kinds of errors we remove. But actually calculating a cost to undiscovered misprints is much more difficult. We need to be able to compare the cost of undiscovered misprints with the cost of doing the work. Can anybody think of a way of quantifying the cost of undiscovered misprints in terms of human lifetime? Clearly there is a significant difference between Rosa Nouchette Carey and Raymond Zinke Gallun. I think Mr. Gallun himself would have agreed. Where can we get data to differentiate these two cases? To those seeking perfection: I'm sorry, but the tools to confirm perfection don't exist. Our current understanding of the universe limits us to deciding how close to perfection we probably are. 
What is the best way to spend our lives? I can't claim to offer a general solution, but I'm trying my darndest to offer a quantitative recommendation on how to spend that portion of your life you see fit to dedicate to proofing books at PGDP. From piggy at netronome.com Mon Mar 17 07:04:08 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 10:04:08 -0400 Subject: [gutvol-d] gutvol-d Digest, Vol 44, Issue 19 In-Reply-To: <1205750200.7554.53.camel@abetarda.mshome.net> References: <1205750200.7554.53.camel@abetarda.mshome.net> Message-ID: <47DE7A58.5080601@netronome.com> Júlio Reis wrote: >> Is not 75 errors per book good enough for a Science Fiction novel? >> > ... > That said... the amount of effort put in *must* be measured and feel > "decent" in proportion to the benefit derived. How many days would it > take to find those 75 errors? Some here would say no one's rushing you, > but that's beside the point IMHO. You *need* to finish stuff to get on > with the next project. > > So, accurate yes. Quick, if possible. > In the US, we have until 2019 to catch up. Congress did us the "favor" of freezing the Public Domain. From hyphen at hyphenologist.co.uk Mon Mar 17 08:56:34 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 15:56:34 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DE78C7.60700@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> Message-ID: <000901c88847$7a8066f0$6f8134d0$@co.uk> La Monte H.P. Yarroll wrote Dave Fawthrop wrote: >> ... >> Anyone who produced a drawing asking for perfection, the saying was >> *dead flat and real smooth*, never heard the last of it. >> >> Has anyone done a cost/benefit analysis on second and third rounds of >> proofing?... >> >That is a significant part of what I am trying to do.
>Right now I am attempting to build the apparatus to estimate the cost >part. The primary unit of cost is human lifetime, i.e. proofer-time. In the UK we have a national minimum wage which is now slightly more than GBP 5 per hour. Being retired, I expect someone somewhere to get 5 GBP for every hour which I spend doing voluntary work. Dave Fawthrop From hyphen at hyphenologist.co.uk Mon Mar 17 09:00:04 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Mon, 17 Mar 2008 16:00:04 -0000 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <47DE65BB.1040402@xs4all.nl> References: <47DE65BB.1040402@xs4all.nl> Message-ID: <000a01c88847$f7c445f0$e74cd1d0$@co.uk> Walter van Holst wrote >L.S., >I will be asking the same question on the DP-fora, what OCR software >would one recommend on Mac OS X? Is IRIS any good? I ditched IRIS on a PC in favour of Abbyy FineReader, which is IME much better. Dave Fawthrop From walter.van.holst at xs4all.nl Mon Mar 17 09:24:34 2008 From: walter.van.holst at xs4all.nl (Walter van Holst) Date: Mon, 17 Mar 2008 17:24:34 +0100 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <000a01c88847$f7c445f0$e74cd1d0$@co.uk> References: <47DE65BB.1040402@xs4all.nl> <000a01c88847$f7c445f0$e74cd1d0$@co.uk> Message-ID: <47DE9B42.2070504@xs4all.nl> Dave Fawthrop wrote: >> I will be asking the same question on the DP-fora, what OCR software >> would one recommend on Mac OS X? Is IRIS any good? > > I ditched IRIS on a PC in favour of Abbyy finereader which is IME much > better. There is no Abbyy FineReader edition available for OS X anymore, therefore my question. Regards, Walter From piggy at netronome.com Mon Mar 17 10:02:43 2008 From: piggy at netronome.com (La Monte H.P.
Yarroll) Date: Mon, 17 Mar 2008 13:02:43 -0400 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <47DE9B42.2070504@xs4all.nl> References: <47DE65BB.1040402@xs4all.nl> <000a01c88847$f7c445f0$e74cd1d0$@co.uk> <47DE9B42.2070504@xs4all.nl> Message-ID: <47DEA433.4000204@netronome.com> Walter van Holst wrote: > Dave Fawthrop wrote: > > >>> I will be asking the same question on the DP-fora, what OCR software >>> would one recommend on Mac OS X? Is IRIS any good? >>> >> I ditched IRIS on a PC in favour of Abbyy finereader which is IME much >> better. >> > > There is no Abby Finereader edition available for OS X anymore, > therefore my question. > > Regards, > > Walter > Has anybody tried tesseract OCR on Mac OS X? It's not an officially supported platform, but Wikipedia claims folks have used it successfully. From Bowerbird at aol.com Mon Mar 17 10:10:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 13:10:04 EDT Subject: [gutvol-d] OCR OS Xquestion Message-ID: iris is a waste of money _and_ time. -bowerbird ************** It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/0f8de921/attachment.htm From steven at desjardins.org Mon Mar 17 10:12:13 2008 From: steven at desjardins.org (Steven desJardins) Date: Mon, 17 Mar 2008 13:12:13 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000901c88847$7a8066f0$6f8134d0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> Message-ID: <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> On Mon, Mar 17, 2008 at 11:56 AM, Dave Fawthrop wrote: > In the UK we have a national minimum wage which is now slightly more > than GBP 5 per hour. > > Being retired, I expect someone somewhere to get 5 GBP for every hour > Which I spend doing voluntary work. So if a basic proofreading job (P1) is worth, say, 10p. to the average reader, and takes 10 hours of volunteer time, you would say that it's justified if the resulting e-book has 500 readers. And if a better proofreading job (P2) were worth only an additional 1p., and took 15 hours, the e-book would need 7500 readers for you to consider it justified. And if a thoroughly nitpicky proofreading job (P3 and PP) were worth only 0.1p. and took 30 additional hours, the e-book would need 150,000 readers. I think a good proofreading job is worth more than that, not necessarily to the median reader (who, like you, may be indifferent), but to a minority who care enough about quality, and who place a high enough value on e-books, to radically boost the mean. But even using a very low valuation, I don't see how you can justify leaving 75 errors in a book; are you really suggesting that it's worth less than 1p. per reader to put the novel at least through P2?
(I don't agree with your metric, by the way--I think reducing every activity to economic value is tunnel-visioned and unsophisticated--but it seems more to undermine your position than to support it.) From piggy at netronome.com Mon Mar 17 10:18:39 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 13:18:39 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000901c88847$7a8066f0$6f8134d0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> Message-ID: <47DEA7EF.5000706@netronome.com> Dave Fawthrop wrote: > La Monte H.P. Yarroll wrote > > Dave Fawthrop wrote: > >>> ... >>> Anyone who produced a drawing asking for perfection, the saying was >>> *dead flat and real smooth*, never heard the last of it. >>> >>> Has anyone done a cost/benefit analysis on second and third rounds of >>> proofing?... >>> >>> > > >> That is a significant part of what I am trying to do. >> > > >> Right now I am attempting to build the apparatus to estimate the cost >> part. The primary unit of cost is human lifetime, i.e. proofer-time. >> > > In the UK we have a national minimum wage which is now slightly more > than GBP 5 per hour. > > Being retired, I expect someone somewhere to get 5 GBP for every hour > Which I spend doing voluntary work. > > Dave Fawthrop > "Cost functions" in statistics are rarely denominated in currency. In this case, there is no money changing hands, so there is little value in converting time to currency. We want "cost functions" so that we can answer the question "Is the expected result of another round of proofreading worth the effort it will take?" 
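desJardins' break-even figures a few messages above can be checked mechanically. The sketch below (the pence-per-reader valuations, hour counts, and GBP 5/hour wage are the hypothetical figures quoted in the thread, not measured costs) reproduces his 500 / 7,500 / 150,000 reader thresholds:

```python
# Break-even readership per proofreading round, using the thread's
# hypothetical figures: volunteer time priced at GBP 5/hour and a
# guessed per-reader value for each round's marginal improvement.
rounds = [
    # (round name, value to one reader in pence, volunteer hours)
    ("P1",      10.0, 10),
    ("P2",       1.0, 15),
    ("P3 + PP",  0.1, 30),
]

WAGE_PENCE_PER_HOUR = 500  # GBP 5/hour expressed in pence

for name, pence_per_reader, hours in rounds:
    cost_pence = hours * WAGE_PENCE_PER_HOUR
    readers_needed = round(cost_pence / pence_per_reader)
    print(f"{name}: {readers_needed} readers to break even")
# Prints 500 for P1, 7500 for P2, and 150000 for P3 + PP,
# matching the figures in desJardins' message.
```

The point of the exercise is only that the thresholds grow very fast as the marginal value of a round shrinks; the per-reader valuations themselves remain the contested part of the argument.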
From Bowerbird at aol.com Mon Mar 17 10:51:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 13:51:39 EDT Subject: [gutvol-d] 75 errors in a book -- putting things back into perspective Message-ID: ok, let's put things back into perspective, ok? i'm happy to see the d.p. people are getting all righteous about dave's suggestion that 75 errors in a book might be acceptable. now, how about showing some similar outrage over the fact that -- in the "perpetual p1" book -- your proofers had to find and fix _one_thousand_one_hundred_and_thirty_seven_ (1,137) em-dash errors _introduced_ into the document by an incompetent person? where is the big huff that proofers had to do _over_seven_hundred_ (700+) unnecessary "corrections" to rejoin end-of-line hyphenates, which could have been done by the computer instead, in seconds? and why is there no complaining about the fact that the proofers had to "clothe" 75 end-of-line em-dashes, for no good reason? and what about the 500+ ellipses that required "corrections" too? those could have been handled easily by automated routines too. these are the _real_ problems with the workflow at d.p.! _8_ unnecessary fixes required for every _1_ o.c.r. error! if you're going to consider the "cost" of doing a round of proofing, then separate out the cost of these _needless_ changes beforehand. once you've done that, you'll see the cost of _correcting_the_o.c.r._ is -- in a relative sense -- a very small cost indeed... the solution is clear -- distributed proofreaders, fix your workflow! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/4a420954/attachment.htm From marcello at perathoner.de Mon Mar 17 10:57:41 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Mon, 17 Mar 2008 18:57:41 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DE78C7.60700@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> Message-ID: <47DEB115.604@perathoner.de> La Monte H.P. Yarroll wrote: > What is the best way to spend our lives? I can't claim to offer a > general solution, but I'm trying my darndest to offer a quantitative > recommendation on how to spend that portion of your life you see fit to > dedicate to proofing books at PGDP. Has anybody yet come up with the revolutionary idea that people might proofread books because they have fun? -- Marcello Perathoner webmaster at gutenberg.org From ralf at ark.in-berlin.de Mon Mar 17 11:11:55 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 17 Mar 2008 19:11:55 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1751743406.11431205758310116.JavaMail.mail@webmail02> References: <1751743406.11431205758310116.JavaMail.mail@webmail02> Message-ID: <20080317181155.GB7041@ark.in-berlin.de> > Believe me, PP and PPV are not and have never been meant to be true proofreading rounds (and the reason we left the old 2 round system was to get away from the necessity of proofreading in PP and PPV). Josh, I know why you say this. Because people should concentrate on the specialty of each round. But you know quite well that you as PP are responsible for anything that gets through. Also, you can't help reading while working on it, and you can't help correcting when something hits your eyes. 
ralf From Bowerbird at aol.com Mon Mar 17 12:17:38 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 17 Mar 2008 15:17:38 EDT Subject: [gutvol-d] "perpetual" documentation of the errors found post-p1 Message-ID: here are the latest reports on the "perpetual p1" experiment: > http://z-m-l.com/misc/strappers77-p1p2p3.html > http://z-m-l.com/misc/strappers77-i1i2i3i4.html p1 corrected about 222 o.c.r. errors -- plus thousands of other "errors". p1 missed 77 errors discovered by later proofers, as documented below. p2 fixed 55 errors; there were 15 unique ones and 40 in common with i2. likewise, i2 fixed 55 errors -- 15 unique ones and 40 in common with p2. taken together, then, p2 and i2 found 70 of the 77 errors which they faced. p3 found 10 of the 22 errors with which it was faced, finding 3 unique ones. i3 found 8 of the 22 errors with which it was faced, also finding 3 unique ones. p3 and i3 left 5 undetectable o.c.r. errors each; they were completely different. taken together, p3 and i3 found 6 unique errors, and left only 1 undiscovered... i4 found 7 of the 14 errors with which it was faced, finding the last unique one. (since that last error was a misspelled word, spellcheck also would've caught it.) yes, that's right, 6 rounds of proofers missed a word that didn't pass spellcheck! of the 7 errors missed by i4, only one of them was an undetectable o.c.r. error, and the p2 proofers had caught that one in their round... a complete list of the ~300 o.c.r. errors will be compiled and posted soon, and a clean version of the text posted. after that, i'll probably put this baby to bed... -bowerbird p.s. as for meaningless changes, i4 missed 12 of the 15 pages with spacey ellipses: 1 no 6 no 16 yes! 29 no 49 yes! 50 yes! 78 no 80 no 88 no 94 no 100 no 103 no 115 no 118 no 136 no
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080317/3ba7eae7/attachment.htm From prosfilaes at gmail.com Mon Mar 17 15:34:45 2008 From: prosfilaes at gmail.com (David Starner) Date: Mon, 17 Mar 2008 18:34:45 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000301c887dc$284710c0$78d53240$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> Message-ID: <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: > I am an ex-engineer, where attempts at perfection are treated with derision. Really? As far as I can tell, Boeing works pretty hard to make sure that _no_ planes fall out of the sky. Nuclear power plants are designed so that _none_ of them blow up. A perfect etext is possible and could be achieved. > Is a book with 18 errors any easier to read than one with 75 errors? Of course. The fewer times I get yanked out of the story by a typo, the better. It's unacceptable for me to have to deal with a typo that obscures what the original said. > Before anyone mentions academics who must have everything perfect. > What proportion of our readers are academics or terminal pedants? > Why pander to the whims an unrepresentative sample of readers? I don't think readers who object to typos are all that unrepresentative a sample of readers. Furthermore, especially in a volunteer organization, it's a reasonable goal to turn out material that will make the producers happy, even if the audience would settle for something lesser. From piggy at netronome.com Mon Mar 17 20:01:43 2008 From: piggy at netronome.com (La Monte H.P.
Yarroll) Date: Mon, 17 Mar 2008 23:01:43 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DEB115.604@perathoner.de> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <47DEB115.604@perathoner.de> Message-ID: <47DF3097.2000501@netronome.com> Marcello Perathoner wrote: > La Monte H.P. Yarroll wrote: > > >> What is the best way to spend our lives? I can't claim to offer a >> general solution, but I'm trying my darndest to offer a quantitative >> recommendation on how to spend that portion of your life you see fit to >> dedicate to proofing books at PGDP. >> > > Has anybody yet come up with the revolutionary idea that people might > proofread books because they have fun? > Absolutely! If we could denominate the time spent proofing in units of fun and escaped misprints in units of un-fun, we'd have a viable pair of cost functions that would help us decide when to stop proofreading a particular book. I for one get a small kick out of finding an obscure misprint, but a much larger kick from seeing a book I worked on posted to PG. If I could quantify these "kicks" I'd also have a good start on the cost functions I'm looking for. From piggy at netronome.com Mon Mar 17 20:12:17 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 23:12:17 -0400 Subject: [gutvol-d] "perpetual" documentation of the errors found post-p1 In-Reply-To: References: Message-ID: <47DF3311.3080902@netronome.com> Bowerbird at aol.com wrote: > ,,, > p.s. as for meaningless changes, i4 missed 12 of the 15 pages with > spacey ellipses: > 1 no > 6 no > 16 yes! > 29 no > 49 yes! > 50 yes! > 78 no > 80 no > 88 no > 94 no > 100 no > 103 no > 115 no > 118 no > 136 no Don't hold it against them--after about page 80 I added the following note to the project guidelines: Please ignore ellipses. Leave them as they currently stand. They will be handled in PP. 
Your analysis suggests that people are reading the guidelines. From piggy at netronome.com Mon Mar 17 20:41:33 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 23:41:33 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DF3097.2000501@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <47DEB115.604@perathoner.de> <47DF3097.2000501@netronome.com> Message-ID: <47DF39ED.4050103@netronome.com> La Monte H.P. Yarroll wrote: > Marcello Perathoner wrote: > >> La Monte H.P. Yarroll wrote: >> >> >> >>> What is the best way to spend our lives? I can't claim to offer a >>> general solution, but I'm trying my darndest to offer a quantitative >>> recommendation on how to spend that portion of your life you see fit to >>> dedicate to proofing books at PGDP. >>> >>> >> Has anybody yet come up with the revolutionary idea that people might >> proofread books because they have fun? >> >> > > Absolutely! If we could denominate the time spent proofing in units of > fun and escaped misprints in units of un-fun, we'd have a viable pair of > cost functions that would help us decide when to stop proofreading a > particular book. > Oops. That's not quite right. Surely it's more fun to proof different books than to proof the same book over and over. How much less fun is it to proof the same book over one more time? When that fun drops below the sum total of unfun we would get from the likely number of remaining errors, then we are done. Different people enjoy specific kinds of books. This suggests that a fun metric would be proofer and book specific. Could we ask people how much fun they had proofing a particular page? Could we get a complementary rating from people who find errors in posted PG texts? Another possibility is to equate attention with fun. 
If people stop paying attention to a book, we could presume that they no longer find it fun. The amount of daily attention a book gets could be compared to the mean for all projects. Diminishing attention could be treated as diminishing fun, i.e. rising cost. This could conceivably let us ignore the cost of missed misprints completely. I really like the idea of trying to optimize PGDP for fun. More suggestions are solicited! From piggy at netronome.com Mon Mar 17 20:54:26 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Mon, 17 Mar 2008 23:54:26 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> Message-ID: <47DF3CF2.6040103@netronome.com> Steven desJardins wrote: > ... > I think a good proofreading job is worth more than that, not > necessarily to the median reader (who, like you, may be indifferent), > but to a minority who does care highly enough about quality, and who > place a high enough value on e-books, to radically boost the mean. But > even using a very low valuation, I don't see how you can justify > leaving 75 errors in a book; are you really suggesting that it's worth > less than 1p. per reader to put the novel at least through P2? > ... Thanks, that reminds me of a fallacy I've been meaning to point out to folks. The number of final errors in a book is not predominantly a function of the last round it finishes. The initial number of errors in the book is very important. There are a handful of books which come out of P1 with phenomenally low error rates. There are also a handful of books which finish P3 with error rates comparable to really poor OCR.
Much more important than finishing a certain number of rounds is to actually predict the likely number of remaining errors in a specific text (which we can do with moderate reliability) and then decide which kind of round to subject it to. Examine these pictures to visualize the problem: http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Changes_III From Bowerbird at aol.com Mon Mar 17 22:18:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 01:18:52 EDT Subject: [gutvol-d] the o.c.r. errors in the "perpetual p1" experiment Message-ID: here are the o.c.r. errors in the "perpetual p1" experiment: > http://z-m-l.com/misc/strapper269errors.html 269 lines were changed from the o.c.r. to the "final" version. note that these are o.c.r. errors _only_. there's no computation of em-dashes, clothing of em-dashes, rejoined end-of-line hyphenates, ellipses, or asterisks (notes). in other words, assume capable handling by the content provider, and an intelligent workflow... if we assume about 10% of these lines contain more than 1 error, then we've got about 300 separate errors here. counting the 77 errors that were found after that initial p1 round, that means p1 fixed around 225 of the 300, giving an accuracy rate of 75%. the p2/i2 rounds, which each caught 55 of the remaining 77 errors, thus had an accuracy rate of 70%. most of the rounds after that had an accuracy rate around 50%... the i2/i3/i4 "iterations" did _not_ have the benefit of wordcheck -- the "good" and "bad" word-lists were not maintained for them -- which makes their ability to keep pace with p2/p3 more remarkable. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/a95eba28/attachment-0001.htm From hyphen at hyphenologist.co.uk Tue Mar 18 00:22:52 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Tue, 18 Mar 2008 07:22:52 -0000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> Message-ID: <000001c888c8$e345ac50$a9d104f0$@co.uk> David Starner wrote On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop wrote: >> I am an ex-engineer, where attempts at perfection are treated with derision. >Really? As far as I can tell, Boeing works pretty hard to make sure >that _no_ planes fall out of the sky. As an example, aircraft engines fail on occasion, which is why passenger aircraft have at least two engines; when one fails, the plane will still get down safely without that engine. Landing with a dead engine is far from perfection. I can only remember one case where two engines failed at the same time. http://images.cnn.com/2008/WORLD/europe/01/18/heathrow.incident/index.html We do not yet know why. >Nuclear power plants are >designed so that _none_ of them blow up. Three Mile Island, Chernobyl, and Windscale, for example.
Dave Fawthrop From traverso at posso.dm.unipi.it Tue Mar 18 00:31:36 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Tue, 18 Mar 2008 08:31:36 +0100 (CET) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000001c888c8$e345ac50$a9d104f0$@co.uk> (hyphen@hyphenologist.co.uk) References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> <000001c888c8$e345ac50$a9d104f0$@co.uk> Message-ID: <20080318073136.3AC5E93B61@posso.dm.unipi.it> >>>>> "Dave" == Dave Fawthrop writes: Dave> David Starner wrote Dave> On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop Dave> wrote: >>> I am an ex-engineer, where attempts at perfection are treated >>> with Dave> derision. >> Really? As far as I can tell, Boeing works pretty hard to make >> sure that _no_ planes fall out of the sky. Dave> As an example Aircraft Engines fail occasion, which is why Dave> passenger aircraft have at least two engines, when one fails Dave> the planes will still get down safely without that engine. Dave> Landing with a dead engine is far from perfection. And for the same reason there are several rounds of proofreading. Proposing to post books with 75 errors in 150 pages is like proposing to have aircraft that crash when an engine fails. It can be done, but nobody is willing to use them. Carlo From Bowerbird at aol.com Tue Mar 18 01:51:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 04:51:29 EDT Subject: [gutvol-d] fully informed of the d.p. follies Message-ID: carlo said: > Proposing to post books with 75 errors in 150 pages is like > proposing to have aircrafts that crash when an engine fails. > It can be done, but nobody is willing to use them. notice how willingly the d.p. people jump on the false issues, and contrast that with their silence when it comes to the _real_ ones...
so far i've done them the favor of making my points on this list, one that is hidden to google's robots behind a subscriber wall... but starting this spring -- which begins this friday, i believe -- i'll be reposting my messages on a blog that google will crawl, so the world finally becomes fully informed of the d.p. follies... just to remind them, off the dome, this is what needs to be done: 1. ensure you have decent scans, and name them intelligently. 2. use a decent o.c.r. program, and ensure quality results. 3. do not tolerate bad text handling by content providers. 4. do a decent post-o.c.r. cleanup, before _any_ proofing. 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes). 6. change the ridiculous ellipse policy to something sensible. 7. stop doing small-cap markup with no semantic meaning. 8. i forget what 8 was for. 9. retain pagenumber information, in an unobtrusive manner. 10. format the ascii version using light markup, for auto-html. -bowerbird ************** It's Tax Time! Get tips, forms, and advice on AOL Money & Finance. (http://money.aol.com/tax?NCID=aolprf00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/2292eebd/attachment.htm From schultzk at uni-trier.de Tue Mar 18 03:25:04 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 18 Mar 2008 11:25:04 +0100 Subject: [gutvol-d] OCR OS Xquestion In-Reply-To: <47DE65BB.1040402@xs4all.nl> References: <47DE65BB.1040402@xs4all.nl> Message-ID: <841F5114-875F-41D3-B3D1-D7FA85748475@uni-trier.de> Hi Walter, I use OmniPage. not perfect, but does the job for me. Keith. Am 17.03.2008 um 13:36 schrieb Walter van Holst: > L.S., > > I will be asking the same question on the DP-fora, what OCR software > would one recommend on Mac OS X? Is IRIS any good? 
> > Regards, > > Walter > From prosfilaes at gmail.com Tue Mar 18 04:41:07 2008 From: prosfilaes at gmail.com (David Starner) Date: Tue, 18 Mar 2008 07:41:07 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <000001c888c8$e345ac50$a9d104f0$@co.uk> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <6d99d1fd0803171534o5e421447l47be9ec8c31f7a6@mail.gmail.com> <000001c888c8$e345ac50$a9d104f0$@co.uk> Message-ID: <6d99d1fd0803180441n22eab75ew7bc12678e97af3d6@mail.gmail.com> On Tue, Mar 18, 2008 at 3:22 AM, Dave Fawthrop wrote: > David Starner wrote > > On Sun, Mar 16, 2008 at 11:08 PM, Dave Fawthrop > wrote: > >> I am an ex-engineer, where attempts at perfection are treated with > derision. > > >Really? As far as I can tell, Boeing works pretty hard to make sure > >that _no_ planes fall out of the sky. > > As an example, aircraft engines fail occasionally, which is why passenger > aircraft > have at least two engines; when one fails the plane will still get down > safely > without that engine. Landing with a dead engine is far from perfection. The whole thing about engineering standards is that you specify clearly what you're trying to achieve. If the goal is to have no planes fall out of the sky, which would be considered underspecified, then landing with a dead engine is meeting that goal. > I can only remember one case where two engines failed at the same time. The Gimli Glider is another example. The goal, however, is just that: a goal. No matter what your goals are, and how hard you try to meet them, sometimes you'll fail; you'll get a wrench that doesn't meet your 1/8" specification and somehow slipped past the checks. That doesn't mean you give up the goals. > Three Mile Island, Chernobyl, Windscale for example. Two of which didn't blow up. Only SL-1 and Chernobyl have accidentally blown up.
(For research reasons, a couple early reactors were sent critical at the end of the lifespan, just to see what would happen.) From piggy at netronome.com Tue Mar 18 06:22:32 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 18 Mar 2008 09:22:32 -0400 Subject: [gutvol-d] the o.c.r. errors in the "perpetual p1" experiment In-Reply-To: References: Message-ID: <47DFC218.8000409@netronome.com> Bowerbird at aol.com wrote: > here are the o.c.r. errors in the "perpetual p1" experiment: > > http://z-m-l.com/misc/strapper269errors.html > > 269 lines were changed from the o.c.r. to the "final" version. May I have permission to copy your detailed list into the PGDP wiki? I'd like a single archival location for all the data related to this experiment. I will be using your list of "real" changes to test an automated difference tool to see if it can approximate the accuracy of your manual analysis. You mentioned earlier a handful of defects found in the original text. I'm very interested in seeing those too. Again, I really appreciate the energy you are putting into this experiment. From Bowerbird at aol.com Tue Mar 18 11:03:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 14:03:17 EDT Subject: [gutvol-d] the o.c.r. errors in the "perpetual p1" experiment Message-ID: piggy said: > May I have permission to copy your detailed list into the PGDP wiki? again, facts is facts. use my stuff... you never need to ask permission... heck, do you think marcello asked my "permission" when he assembled his "fan site" for me? and he didn't even maintain the proper context... > I will be using your list of "real" changes > to test an automated difference tool > to see if it can approximate > the accuracy of your manual analysis. um, my analysis isn't "manual" by a long shot. just back from vacation, so i haven't written it up fully, but i will get around to doing that soon... the bottom line is pretty simple, though. if d.p. 
would simply ditch all the unnecessary changes you have the proofers make, it will be dirt-simple for you to get analyses like the ones i've provided here. > You mentioned earlier a handful of defects found in the original text. > I'm very interested in seeing those too. i can't place that. i'd need a more solid reminder of what i said. > Again, I really appreciate the energy you are putting into this experiment. real-world data is fun, in general. this project is kind of a drag, because it puts a magnifying glass on d.p. dysfunctionality, and it would sure be nice if _sometime_ i could talk about what people are doing _right_ instead of doing _wrong_, plus this project has the additional burden of me knowing that juliet will almost certainly fail to act appropriately on the conclusions, but again, i love to explore real-world data, even if the results are obvious. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/6dc40491/attachment.htm From marcello at perathoner.de Tue Mar 18 12:25:13 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue, 18 Mar 2008 20:25:13 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47DF3CF2.6040103@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> Message-ID: <47E01719.9000307@perathoner.de> La Monte H.P.
Yarroll wrote: > Much more important than finishing a certain number of rounds is to > actually predict the likely number of remaining errors in a specific > text (which we can do with moderate reliability) and then decide which > kind of round to subject it to. Why would the "likely number of remaining errors" be a better estimator for which round to send the text to, than the number of errors found in the last round? The set of errors in a text is recursively enumerable, meaning there is no way to know if you already found them all. Meaning also, you cannot verify your predictions. You will know if your predictions were too low, but never if they were too high. My advice is to just stick to the number of errors found. Do a thing like this: let z << y << x - the text goes to round 1 if there is no preceding round - the text is done if the preceding round finds less than z errors - the text goes to round 3 if the preceding round finds more than z and less than y errors - the text goes to round 2 if the preceding round finds more than y and less than x errors - else the text goes to round 1 -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Tue Mar 18 13:07:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 16:07:36 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: ok, i dug out what i'd written up on doing my analysis. i write quick-and-dirty tools to do _most_ of my work, because that offers me the flexibility i typically need... i haven't yet tried to program an application that can handle an arbitrary o.c.r. output file dropped upon it. i've done enough hacking to know it _can_ be written, and fully expect to write it within the next year or two, but it'll take serious testing to make it robust enough, and then further refinement to make it user-friendly... alas, such a tool would realistically be fine-tuned for o.c.a.
output, and until they clean up their o.c.r. act -- they have quotemark, em-dash, and pagebreak issues -- there is no sense in refining things prematurely... so it's easier for me right now to just hack what i need. *** piggy said: > Do you have a tool that makes the classifications > or did you do it by hand? a little bit of both. actually, a whole _lot_ of both. for my money, line-based stuff is the only way to go. first of all, lines are a fairly good count on actual diffs, since the most common line will only contain 1 error... second, a line gives sufficient context to grok the error. third, and perhaps most important of all, lines are easy. at least lines are _usually_ quite easy to handle, _except_ when it comes to d.p. content. the d.p. workflow calls for unnecessary and extensive reworking of the line-endings -- rejoining end-line hyphenates, clothing hyphens, etc. -- so massaging d.p. content _back_ to the p-book linebreaks is the most painful and labor-intensive part of the process... that's why i've recommended before -- and do so again -- that you _not_ have proofers do those unnecessary changes. i have written routines to help with the massaging, but i also end up doing lots of it -- more than i would like -- manually. however, once the linebreaks are normalized between files -- or if they had never been subjected to such distortion -- it's a simple matter to pull out the lines that have differences: just store the lines of each file in arrays, and compare them... treatment of differing lines can also isolate their difference, for useful categorization (e.g., incorrect letter, joined words, improper casing, incorrect punctuation, and various others), and even automated corrections. (this strategy is best used when you resolve two separate digitizations, or you compare two rounds of parallel proofing.) 
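The paired-line approach described here can be sketched in a few lines of Python. A minimal illustration (the category names, toy lines, and helper names are all illustrative, not bowerbird's actual tooling), assuming both files already share the same linebreaks:

```python
# Sketch of the line-based comparison described above: store each
# file's lines in arrays, pull out the pairs that differ, and put a
# rough category on each difference. The categories are illustrative
# guesses at the "incorrect letter / joined words / improper casing /
# incorrect punctuation" buckets, not the real tool.
import string

def differing_pairs(lines_a, lines_b):
    """Return (index, line_a, line_b) for every pair of lines that differs."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(lines_a, lines_b)) if a != b]

def classify(a, b):
    """Crude difference category for one pair of differing lines."""
    strip = lambda s: s.translate(str.maketrans("", "", string.punctuation))
    if strip(a) == strip(b):
        return "punctuation"
    if a.lower() == b.lower():
        return "casing"
    wa, wb = a.split(), b.split()
    if len(wa) != len(wb):
        return "joined-or-split words"
    changed = [(x, y) for x, y in zip(wa, wb) if x != y]
    if len(changed) == 1:
        return "word"  # e.g. an incorrect letter in one word
    return "other"

ocr   = ["It was a dark and", "stormy night. the rain", "fell in tonents."]
final = ["It was a dark and", "stormy night. The rain", "fell in torrents."]

for i, a, b in differing_pairs(ocr, final):
    print(i, classify(a, b), repr(a), "->", repr(b))
```

The same differing-pairs list can then drive the spellcheck- and punctuation-based auto-corrections the message goes on to describe.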
for instance, if the words of the two lines are all identical, except for one pair of them, and one exception-word is found in a spellcheck dictionary while the other is not, you'd auto-change the one that's not. or, if there's a comma-period difference, and the following word is uncapitalized, you'd change the period to a comma. i've used this line-based method of presentation for years, as seen in "revolutionary o.c.r. proofing" on the d.p. forum: > http://www.pgdp.net/phpBB2/viewtopic.php?t=24008 some other examples are here: > http://z-m-l.com/go/oneoo/webone.html > http://z-m-l.com/go/oneoo/weball.html i've also used this paired-line format as the input-format for machine-executed corrections, wherein that paired-line file then becomes the _change-log_ that reflects the corrections, but we're probably getting a little too far afield with _that_... it _is_ important to understand, however, that what i'm doing carves a very large swath in terms of what _needs_ to be done across the complete range of the electronic-library workflow, and isn't just geared to the performance of this solitary task... there is a certain kind of bad myopia over at d.p. that when an e-text gets posted to p.g., it's "finished". but from _my_ perspective, that represents the _beginning_ of its lifespan. my modus operandi is to think in terms of a whole library... -bowerbird p.s. it's not necessary to use my name over on the d.p. wiki, piggy. it's a generous, civil gesture on your part, to be sure, but some of "the powers that be" (as they call themselves), like juliet and donovan, will reject _anything_ from me out of hand. so there's really no need to put yourself under that handicap... i certainly don't need -- won't even claim -- any of the "credit" if distributed proofreaders suddenly does straighten out its act.
(http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/53411c9e/attachment.htm From jeroen.mailinglist at bohol.ph Tue Mar 18 13:11:34 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Tue, 18 Mar 2008 21:11:34 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <20080317181155.GB7041@ark.in-berlin.de> References: <1751743406.11431205758310116.JavaMail.mail@webmail02> <20080317181155.GB7041@ark.in-berlin.de> Message-ID: <47E021F6.40208@bohol.ph> When I do PP, I have a kind of overview of word-statistics and background information that not all Proofers will have access to. Because of that, I often catch things that I can never blame the proofers for not catching, such as inconsistencies in the original. These do get fixed (with transcriber notes) before I post a file to PG. In almost every book, I also catch a few things that the proofers ought to catch, but I hardly complain. I would let through more mistakes myself... Over 300 books PP-ed. Jeroen. Ralf Stephan wrote: >> Believe me, PP and PPV are not and have never been meant to be true proofreading rounds (and the reason we left the old 2 round system was to get away from the necessity of proofreading in PP and PPV). >> > > Josh, I know why you say this. Because people should concentrate > on the specialty of each round. But you know quite well that you > as PP are responsible for anything that gets through. Also, you > can't help reading while working on it, and you can't help correcting > when something hits your eyes. > -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/c27ec68b/attachment.htm From Bowerbird at aol.com Tue Mar 18 13:31:20 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 16:31:20 EDT Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: jeroen said: > When I do PP, I have a kind of overview of word-statistics and > background information that not all Proofers will have access to. i don't see a reason those "statistics and information" should be kept from the proofers. for instance, i've tested (and enjoyed!) a page display where every word is _colorized_ independently... the higher the frequency of the word, the lighter it became, so very common words like "and" and "the" were practically white. words with just one occurrence in the book were _pure_black_. low-frequency words which weren't in the dictionary were red. inconsistent hyphenation, spelling, and so on were turned blue. won't work fully for color-blind people, but they are kinda rare. as i said, i enjoyed this interface, a lot, and i felt it was effective... i will definitely be incorporating some aspect of it in future work. -bowerbird p.s. i've also found it workable to have an interface where you can right-click on a word and show a contextual menu that lists all the lines in the rest of the book that contain that word. _very_ useful... ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/50aa711f/attachment.htm From piggy at netronome.com Tue Mar 18 13:50:24 2008 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Tue, 18 Mar 2008 16:50:24 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E01719.9000307@perathoner.de> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> Message-ID: <47E02B10.7090203@netronome.com> Marcello Perathoner wrote: > La Monte H.P. Yarroll wrote: > > >> Much more important than finishing a certain number of rounds is to >> actually predict the likely number of remaining errors in a specific >> text (which we can do with moderate reliability) and then decide which >> kind of round to subject it to. >> > > Why would the "likely number of remaining errors" be a better estimator > for which round to send the text to, than the number of errors found in > the last round? > Someone reading a text does not care how many errors were found in the last round of proofreading. They care about the number remaining. Actually, the number of errors found in the last round appears to be a pretty good predictor for the number of remaining errors, so the distinction is not terribly critical. The relationship is not linear, but it has a high correlation.
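Marcello's threshold rule from earlier in the thread routes on exactly this last-round error count. A sketch (the concrete threshold values are placeholders; the thread never quantifies z, y, and x):

```python
# Marcello's round-routing rule, sketched as a function. The threshold
# values are invented placeholders satisfying z << y << x; the thread
# leaves them unquantified.
Z, Y, X = 2, 10, 50

def next_round(errors_found_last_round=None):
    """Decide where a text goes next from the last round's error count.
    None means the text has not been through any round yet; 0 means done."""
    e = errors_found_last_round
    if e is None:
        return 1   # no preceding round: start at round 1
    if e < Z:
        return 0   # done
    if e < Y:
        return 3   # only a few errors left: send to the careful round
    if e < X:
        return 2
    return 1       # still very noisy: back to round 1

print(next_round(None), next_round(1), next_round(5), next_round(20), next_round(99))
```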
Guiness does not need to taste every bottle of brew to have a high confidence that they are keeping their quality standards. > My advise is to just stick to the number of errors found. Do a thing > like this: let z << y << x > Would you care to quantify these thresholds? > - the text goes to round 1 if there is no preceding round > > - the text is done if the preceding round finds > less than z errors > > - the text goes to round 3 if the preceding round finds > more than z and less than y errors > > - the text goes to round 2 if the preceding round finds > more than y and less than x errors > > - else the text goes to round 1 > This algorithm only works if all three resources are completely fungable, or equivalently, they are perfectly balanced. In practice, deciding to put a page (or a whole project) through P3 reduces the amount of proofer-time available for P1. The result is bottlenecking, a problem we are seeing now. I highly recommend reading http://www.pgdp.net/wiki/Confidence_in_Page_analysis#The_Ferguson-Hardwick_Algorithm . The core problem is to devise the two cost functions C_k (cost of a round), and c_k (cost of a missed misprint). If C_k exceeds E[c_k] (the cost of expected errors left before applying another round of proofing) then you are done. In our case, we have a family of C_k functions, one for each kind of round. It is also import to understand p, the probability of finding a particular error, and lambda, the rate at which particular errors occur in the text. Note that neither p nor lambda are constant for our data. Some errors are more common than others (different lambdas), and some are harder to find than others (different ps). 
From joshua at hutchinson.net Tue Mar 18 13:54:53 2008 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Tue, 18 Mar 2008 20:54:53 +0000 (GMT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment Message-ID: <1253528362.2961205873693940.JavaMail.mail@webmail02> I'm not involved in this project, but my outsider's understanding is that they are trying to do just that. The values for X, Y, and Z are what they are trying to find the optimal values for. Hey, nothing like a bunch of geeks on the Internet obsessing over numbers, right? :) (No offense intended ... I consider "geek" a badge of honor) Josh On Mar 18, 2008, marcello at perathoner.de wrote: My advice is to just stick to the number of errors found. Do a thing like this: let z << y << x - the text goes to round 1 if there is no preceding round - the text is done if the preceding round finds less than z errors - the text goes to round 3 if the preceding round finds more than z and less than y errors - the text goes to round 2 if the preceding round finds more than y and less than x errors - else the text goes to round 1 From creeva at gmail.com Tue Mar 18 14:06:09 2008 From: creeva at gmail.com (Brent Gueth) Date: Tue, 18 Mar 2008 17:06:09 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1253528362.2961205873693940.JavaMail.mail@webmail02> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> Message-ID: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> You know, on this thread I completely agree with a lot of the ideas here and LOVE the idea of the color-based word-frequency proofreading system. That being said, one thing I noticed being mentioned was correction even if the source material included an error. If you're doing a study on classic sci-fi and how it appeared printed in issue X of generic magazine - shouldn't any errors included in the original printing be maintained verbatim?
If you're fixing possible spelling mistakes at first, what about the trend later to fix grammar mistakes, so essentially PG is going to become the editor to fix things that the original may have honestly put in there intentionally. I have heard that some publishers or authors put in an occasional mistake on purpose to verify if anyone else copied their work. Granted this is more likely to happen in a public domain anthology, but at the same time something may come across as intentional. What about when Twain writes in a dialect - granted we know what the words should be, but at the same time any proofreader would see this as a misspelling. I'm sure that the proofreaders are doing the best they can - but in the end are we looking to end all errors - or are we looking to make sure that the finished text is 100% accurate to the source text? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/e0f88fee/attachment.htm From klofstrom at gmail.com Tue Mar 18 15:06:47 2008 From: klofstrom at gmail.com (Karen Lofstrom) Date: Tue, 18 Mar 2008 12:06:47 -1000 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: <1e8e65080803181506q11f4fd8eiad7d41d2e0efc29c@mail.gmail.com> On Tue, Mar 18, 2008 at 11:06 AM, Brent Gueth wrote: > I'm sure that the proofreaders are doing the best they can - but in the end are we looking to end all errors - or are we looking to make sure that the finished text is 100% accurate to the source text? The purpose of the transcriber's notes is to alert readers to changes made in the text, correcting typos in the original.
I don't think that DP has always been as careful as it is now to note corrections, but I believe that the current state of affairs honors both reading ease (not being pulled up short by an obvious typo) and accuracy to the original. I believe that this is one virtue of TEI, ne? It has a protocol for noting original and emendation IN the flow of the text, so that presumably you could write viewing software that would display both. -- Karen Lofstrom From Bowerbird at aol.com Tue Mar 18 15:15:47 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 18 Mar 2008 18:15:47 EDT Subject: [gutvol-d] let's make it simple, ok? Message-ID: let's make it simple, ok? the discussion is -- or should be -- how to make a roundless system work. specifically, the question is "when should we consider a page to be _done_?" the answer is really very simple. if the last person found and fixed an error, they changed the page, and thus their change needs to be verified, so the page is not done. as long as the next proofer finds an error, keep sending the page out; even if it's already done 27 rounds, you must keep sending it for more. (this is why you want your workflow not to allow meaningless changes.) when there's a round with no error found (i.e., the page is unchanged), you can figure (for simplicity) there's a 60/40 chance the page is perfect. 60/40 certainly isn't good enough odds, however, so do the page again. if the next person finds no error either, then odds are 84/16 it's perfect. (this assumes that this round, like the last one, gets _60%_ of the errors.) if 84/16 is good enough for you, fine. if not, send it through again, and -- if it comes out clean again -- the odds will then be 90/10 it's perfect. if that's not good enough either, do it again. clean again? now it's 96/4. your assumption throughout is that your proofers catch 60% of the errors. you can take my word that is a safe assumption. but you don't need to. 
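The compounding arithmetic in this message is 1-(1-p)^k: with the assumed 60% catch rate, the chance a page is perfect is 60% after one clean pass and 84% after two (the 90/10 and 96/4 figures quoted for the next passes are rougher approximations of the exact 93.6% and 97.4%). A sketch, with p = 0.6 taken as a working assumption rather than a measured figure:

```python
# Confidence after k consecutive "clean" (unchanged) proofing passes,
# assuming each pass would catch a remaining error with probability p.
# p = 0.6 is the message's working assumption, not a measured value.
def p_perfect(k, p=0.6):
    """Probability the page is error-free after k clean passes."""
    return 1 - (1 - p) ** k

for k in range(1, 5):
    print(k, round(p_perfect(k), 4))
```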
because over time, the _data_ tells you if your assumption is warranted... sometimes -- even after a "clean" judgment by one or more proofers -- the next proofer will find an error. oops! in fact, if a page has gotten two "clean" judgments, at 60% accuracy, odds are 84/16 it has an error. and -- at 60% accuracy -- the odds are that the next proofer will find it. so you just pay attention to the _results_ you actually obtain. if you find such pages -- 2 people said it was clean, but person #3 found an error -- happen 16% of the time, your proofers _do_ have an accuracy rate of 60%. if such pages happen _less_ than 16% of the time, their accuracy is higher. if such pages happen _more_ than 16% of the time, their accuracy is lower. with thousands of proofers doing thousands of pages, it won't take long (not long at all) to get a very good assessment of your proofer accuracy. and knowledge of that figure tells how many "clean" rounds are needed to get to _whatever_ level of accuracy you decide that you want to attain... in sum, you don't need a college statistics professor to solve this problem. you don't even need college-level statistics... you really don't... honestly... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080318/c0e1ddc6/attachment.htm From prosfilaes at gmail.com Tue Mar 18 15:30:42 2008 From: prosfilaes at gmail.com (David Starner) Date: Tue, 18 Mar 2008 18:30:42 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: <6d99d1fd0803181530ndebf17cg3a3809bb158dccef@mail.gmail.com> On Tue, Mar 18, 2008 at 5:06 PM, Brent Gueth wrote: > One thing I noticed being mentioned was correction even if the source > material included an error. If you're doing a study on classic sci-fi and > how it appeared printed in issue X of generic magazine - shouldn't any > errors included in the original printing be maintained verbatim? If you're doing a study on how sci-fi was printed in the original issues, there's nothing like having the original issues at hand. Our posts certainly won't cut it, after copyrighted material has been removed and the whole thing reformatted into HTML. There's no way to make an ebook that's perfect for everyone's goals. > If you're fixing possible spelling mistakes at first, what about the trend > later to fix grammar mistakes, That's a slippery slope argument. We can choose how liberal the changes we make in the text are. > What about when Twain writes in a dialect - granted we know what the words > should be, but at the same time any proofreader would see this as a > misspelling. Well, no, because the proofreaders have brains. Carrying errors around has costs too. Every mistake we keep can get people pointing it out to errata. Original mistakes can be as distracting to readers as new mistakes.
Nothing says we shouldn't carry information about corrections along, but I think for a text whose primary use will be as a reading text, we should make the obvious corrections. If you want the exact pedantic original, then look at the scans we should make available. From sly at victoria.tc.ca Tue Mar 18 18:25:40 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Tue, 18 Mar 2008 18:25:40 -0700 (PDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <1e8e65080803181506q11f4fd8eiad7d41d2e0efc29c@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> <1e8e65080803181506q11f4fd8eiad7d41d2e0efc29c@mail.gmail.com> Message-ID: On Tue, 18 Mar 2008, Karen Lofstrom wrote: > I believe that this is one virtue of TEI, ne? It has a protocol for > noting original and emendation IN the flow of the text, so that > presumably you could write viewing software that would display both. > TEI is flexible, and has multiple courses you could take, depending on your desired outcome. For a more general approach, you could just add something like a DP transcriber's note, in the appropriate place in the TEI header. Otherwise, you could use the <sic> element, which leaves an error in place while suggesting a correction in an attribute. Or the <corr> element, which presents a corrected version and records the original in an attribute.
In http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html See Section 11.3 Altered, Corrected, and Erroneous Texts Andrew From greg at durendal.org Tue Mar 18 18:47:07 2008 From: greg at durendal.org (Greg Weeks) Date: Tue, 18 Mar 2008 21:47:07 -0400 (EDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: On Tue, 18 Mar 2008, Brent Gueth wrote: > One thing I noticed being mentioned was correction even if the source > material included an error. If your doing a study on classic sci-fi and > how it appeared printed in issue X of generic magazine - shouldn't any > errors included in the original printing be maintained verbatim? That's what the page scans are for. Anyone that cares about that level of detail isn't going to trust the proofing job no matter what. If we provide page scans to go with the proofed text they have the best of both. -- Greg Weeks http://durendal.org:8080/greg/ From piggy at netronome.com Tue Mar 18 22:04:50 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 19 Mar 2008 01:04:50 -0400 Subject: [gutvol-d] let's make it simple, ok? In-Reply-To: References: Message-ID: <47E09EF2.3060605@netronome.com> Bowerbird at aol.com wrote: > let's make it simple, ok? > > the discussion is -- or should be -- how to make a roundless system work. > specifically, the question is "when should we consider a page to be > _done_?" > > the answer is really very simple. > > if the last person found and fixed an error, they changed the page, > and thus their change needs to be verified, so the page is not done. > > as long as the next proofer finds an error, keep sending the page out; > even if it's already done 27 rounds, you must keep sending it for more. 
> (this is why you want your workflow not to allow meaningless changes.) > > when there's a round with no error found (i.e., the page is unchanged), > you can figure (for simplicity) there's a 60/40 chance the page is > perfect. > > 60/40 certainly isn't good enough odds, however, so do the page again. > > if the next person finds no error either, then odds are 84/16 it's > perfect. > (this assumes that this round, like the last one, gets _60%_ of the > errors.) > > if 84/16 is good enough for you, fine. if not, send it through again, and > -- if it comes out clean again -- the odds will then be 90/10 it's > perfect. > > if that's not good enough either, do it again. clean again? now it's > 96/4. If you have a good way to get a solid consensus on what that probability should be, I would like to hear your suggestions. In a way, it's equivalent to my request for suggestions for a missed misprint cost function. One of my near-term goals is to provide a model which at least allows people to understand the time consequences of picking a specific threshold. Picking a simple threshold also neglects the relative importance of different kinds of errors. I think for most books, garbled words are a much more serious problem than period-comma confusion. It could be reasonable to say that we're happy with a 99% certainty on the removal of all garbled words and only a 50% certainty of the removal of all period-comma confusions. > > your assumption throughout is that your proofers catch 60% of the errors. > > you can take my word that is a safe assumption. but you don't need to. > > because over time, the _data_ tells you if your assumption is warranted... We already have a very large dataset with which to test the assumption. I would agree that 60%-75% is about right for the most common kinds of errors. But the rate is not constant. It falls steadily and very fast. 
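The odds quoted above follow from a simple geometric model: if each pass independently catches a remaining error with probability 0.6, the chance an error survives n consecutive clean passes is 0.4^n. A quick sketch of that arithmetic (the 60% catch rate is bowerbird's working assumption, not a measured constant, and as noted above the real rate is not constant across rounds):

```python
def chance_error_survives(catch_rate: float, clean_rounds: int) -> float:
    """Probability an error is still present after `clean_rounds`
    consecutive passes found nothing, assuming each pass independently
    catches a remaining error with probability `catch_rate`."""
    return (1.0 - catch_rate) ** clean_rounds

# With the assumed 60% catch rate:
for n in range(1, 5):
    miss = chance_error_survives(0.6, n)
    print(f"{n} clean round(s): {100 * (1 - miss):.0f}/{100 * miss:.0f}")
```

One clean round gives 60/40 and two give 84/16, matching the figures quoted; the exact model gives roughly 94/6 and 97/3 for three and four clean rounds, so the 90/10 and 96/4 figures above are looser roundings of the same idea.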
I can't say for certain yet, but I think the most difficult class of error has a discovery rate below 15% in a single pass. > > sometimes -- even after a "clean" judgment by one or more proofers -- > the next proofer will find an error. oops! in fact, if a page has gotten > two "clean" judgments, at 60% accuracy, odds are 84/16 it has an error. > and -- at 60% accuracy -- the odds are that the next proofer will find it. > > so you just pay attention to the _results_ you actually obtain. if > you find > such pages -- 2 people said it was clean, but person #3 found an error -- > happen 16% of the time, your proofers _do_ have an accuracy rate of 60%. > > if such pages happen _less_ than 16% of the time, their accuracy is > higher. > if such pages happen _more_ than 16% of the time, their accuracy is lower. > > with thousands of proofers doing thousands of pages, it won't take long > (not long at all) to get a very good assessment of your proofer accuracy. > and knowledge of that figure tells how many "clean" rounds are needed > to get to _whatever_ level of accuracy you decide that you want to > attain... You are neglecting error injection rate. Proofers don't just remove errors, they add them too. The error injection rate puts a lower bound on the accuracy we can achieve through serial proofing. If needed, parallel voting rounds can be used to compensate for the error injection rate. If there are defects which have detection rates down near the error injection floor, it may not be possible to remove them with any level of confidence at all. This is why I'm interested in a difference metric which can ignore "silly changes". It looks very likely that the noise floor (error injection rate) for "real errors" is substantially lower than the noise floor caused by "silly changes".
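The injection-rate point can be put in the same toy-model terms: if each serial pass removes a fraction d of the errors present but injects an average of i new ones, the expected error count converges to the floor i/d instead of zero. A sketch under invented parameters (the detection and injection rates here are illustrative, not DP measurements):

```python
def errors_after_rounds(start: float, detect: float, inject: float, rounds: int) -> float:
    """Expected errors per page after serial proofing rounds, where each
    round removes a fraction `detect` of the current errors and injects
    `inject` new errors on average. The fixed point is inject / detect."""
    errors = start
    for _ in range(rounds):
        errors = errors * (1.0 - detect) + inject
    return errors

# 10 errors/page, 60% detection, 0.05 errors injected per pass:
# after many rounds the count settles near 0.05 / 0.6, not zero.
print(errors_after_rounds(10.0, 0.6, 0.05, 20))
```

Parallel voting rounds attack this floor because independently injected errors rarely coincide across proofers.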
We can definitely make things a lot better without resorting to really high-power statistics. Gaa! Does anybody remember where I posted my detailed analysis of the "shoe plot"? Oh well, the point is that based on a simple graphical analysis I was able to make a strong recommendation that any project with more than 0.1 wa/w (roughly 1 change every 10 words) should repeat P1. I have to admit that the full generality of the problem fascinates me. I am trying hard to balance my interest in closed-form solutions with concrete suggestions which people can act on immediately. > > -bowerbird > From hart at pglaf.org Wed Mar 19 00:20:35 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 19 Mar 2008 00:20:35 -0700 (PDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> References: <1253528362.2961205873693940.JavaMail.mail@webmail02> <2510ddab0803181406p681c49f5nf78508e4a4304264@mail.gmail.com> Message-ID: On Tue, 18 Mar 2008, Brent Gueth wrote: > You know on this thread I completely agree with a lot of the ideas > here and LOVE the idea of the color-based frequency word > proofreading system. > > That being said. > > One thing I noticed being mentioned was correction even if the > source material included an error. If you're doing a study on > classic sci-fi and how it appeared printed in issue X of generic > magazine - shouldn't any errors included in the original printing > be maintained verbatim? > > If you're fixing possible spelling mistakes at first, what about the > trend later to fix grammar mistakes, so essentially PG is going to > become the editor to fix things that the original may have honestly > put in there intentionally. I have heard that some publishers or > authors put in an occasional mistake on purpose to verify if > anyone else copied their work.
Granted this is more likely to > happen in a public domain anthology, but at the same time > something may come across as intentional. > > What about when Twain writes in a dialect - granted we know what > the words should be, but at the same time any proofreader would > see this as a misspelling. > > > I'm sure that the proofreaders are doing the best they can - but > in the end are we looking to end all errors - or are we looking to > make sure that the finished text is 100% accurate to the source > text? > Neither. If you want the latter, just use the raw scans or a Xerox. If you want the latter as full text eBooks, just do accurate OCR. If you want to correct obvious errors, that's just fine; most of our readers, including myself, would appreciate it. If you want the former, all I can say is "bon voyage." Thanks!!! Michael S. Hart Founder Project Gutenberg Recommended Books: Dandelion Wine, by Ray Bradbury: For The Right Brain Atlas Shrugged, by Ayn Rand: For The Left Brain [or both] Diamond Age, by Neal Stephenson: To Understand The Internet The Phantom Tollbooth, by Norton Juster: Lesson of Life. . . From marcello at perathoner.de Wed Mar 19 00:21:04 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 19 Mar 2008 08:21:04 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E02B10.7090203@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: <47E0BEE0.1040303@perathoner.de> La Monte H.P. Yarroll wrote: > Marcello Perathoner wrote: >> La Monte H.P.
Yarroll wrote: >> >> >>> Much more important than finishing a certain number of rounds is to >>> actually predict the likely number of remaining errors in a specific >>> text (which we can do with moderate reliability) and then decide which >>> kind of round to subject it to. >>> >> Why would the "likely number of remaining errors" be a better estimator >> for which round to send the text to, than the number of errors found in >> the last round? >> > > Someone reading a text does not care how many errors were found in the > last round of proofreading. They care about the number remaining. > > Actually, the number of errors found in the last round appears to be a > pretty good predictor for the number of remaining errors, so the > distinction is not terribly critical. The relationship is not linear, > but it has a high correlation. That's what I was saying. If the two values highly correlate, why go to the extra trouble to calculate the second value? >> The set of errors in a text is recursively enumerable, meaning there is >> no way to know if you already found them all. >> > > But if we know the probability distributions of the errors, we can > estimate the likely number remaining, which is really what readers care > about. Thinko. The reader doesn't care about "the errors remaining". She cares about how many errors she "finds" while reading the text. Which probably is a lot less than what an experienced proofreader will find. > It's a little strong to say that we can't verify our predictions. We can > observe over a large number of experiments how closely the number of > errors we find matches what we expect given our model(s). You wanted to predict the "likely number of errors remaining". Which number you cannot verify. > Guinness does not need to taste every bottle of brew to have a high > confidence that they are keeping their quality standards. Why can a potato chip maker make a chip that costs $0.01 while a computer manufacturer's chip must cost $1000?
(BTW is there a searchable Dilbert database anywhere?) -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Wed Mar 19 00:24:59 2008 From: hart at pglaf.org (Michael Hart) Date: Wed, 19 Mar 2008 00:24:59 -0700 (PDT) Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E02B10.7090203@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: On Tue, 18 Mar 2008, La Monte H.P. Yarroll wrote: > Marcello Perathoner wrote: >> La Monte H.P. Yarroll wrote: >> >> >>> Much more important than finishing a certain number of rounds is >>> to actually predict the likely number of remaining errors in a >>> specific text (which we can do with moderate reliability) and >>> then decide which kind of round to subject it to. >>> >> >> Why would the "likely number of remaining errors" be a better >> estimator for which round to send the text to, than the number of >> errors found in the last round? >> > > Someone reading a text does not care how many errors were found in > the last round of proofreading. They care about the number > remaining. False. Anyone seriously commenting on the possible correction of remaining errors will want to know how much effort it took to get there. Not to do so would be something like trying to plan the rest of a trip without knowing how many miles you had already travelled, in how much time, taking how much gas, etc., etc. Planning ahead is more than just pointing over the horizon. Thanks!!! Michael S. 
Hart Founder Project Gutenberg From Bowerbird at aol.com Wed Mar 19 01:53:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 04:53:42 EDT Subject: [gutvol-d] let's make it simple, ok? Message-ID: i said: > > if 84/16 is good enough for you, fine. > > if not, send it through again, and > > -- if it comes out clean again -- > > the odds will then be 90/10 it's perfect. piggy said: > If you have a good way to get > a solid consensus on > what that probability should be, > I would like to hear your suggestions. well, i've already told people -- repeatedly -- what i think the percentage should be... i'm willing to accept 1 error on every 10 pages -- as the starting point for release to the public so they can help proof the e-texts they "own" -- so that means i'm willing to accept a 90/10 rate. that, by definition, means a 10% chance of an error, which informs us there'll be 10 errors in 100 pages, thus yielding my 1-error-on-every-10-pages figure. lucky for us, though, i know we don't have to _settle_ for such a paltry figure. i know that, with good tools, we can boost our accuracy up to around the 99% level, which means 2 or 3 (2.5) errors on a 250-page book, or -- for a 150-pager like your test, just 1 or 2 (1.5) errors. and i can almost guarantee those errors are fairly trivial. i can say with certainty they won't be misspelled words, so -- except for the nightmare scenario of _missing_text_ -- the flaws will probably be related to _punctuation_errors_, and i haven't found a one of those yet that altered the plot, and it's not because i haven't been looking, because i have... (stealth scannos are spotted very quickly by real readers, so they are not nearly as frightening as they might seem to be.) but... still... all of this is rather meaningless... isn't it?... because you didn't ask me what _i_ think, did you? no, you asked how you could get a _consensus_ on _what_ the probability _should_ be. "should be".
well, heck, the best way to get that kind of consensus would be to run a poll and see what your people think. and then keep running the poll until everyone agrees... sorry, just kidding with that last part... :+) but seriously, run a poll... make people be specific and pick a number. you'll see that the difference between the "quality" and the "quantity" people exists mainly because nobody has bothered to quantify the argument... moreover, in the long run, everyone _will_ agree on it... when everyone sees that -- because of our good tools -- we can get a 99% accuracy level, _and_ obtain that quality with relatively little effort, people will be more than happy to put out the degree of effort needed to utilize those tools. quality will be known to be high, and quantity will begin to fly. really. even the most die-hard quality folks have to accept 2-3 errors in a book as acceptable. because if they don't, they're gonna be slitting their wrists any day now, so we won't have to worry about what they think for very long... but seriously, if you want the consensus opinion, run a poll. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/44ea6fe5/attachment.htm From Bowerbird at aol.com Wed Mar 19 02:06:27 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 05:06:27 EDT Subject: [gutvol-d] let's make it simple, ok? Message-ID: piggy said: > You are neglecting error injection rate. i'm not "neglecting it". i'm _ignoring_ it. because i haven't wanted to have to tell you that you invented a really stupid concept there. > Proofers don't just remove errors, they add them too.
when you have a proofer who is _adding_ errors in -- and you will, so detect them as _beginners_ -- you need to take them aside and give them a lesson. instruct them exactly what the workflow expects of them -- which yes, means the policy needs to be unequivocal -- and then give them a pat on the back and send them back. they're glad you set them straight, and inject no more errors. simple. > If there are defects > which have detection rates > down near the error injection floor, > it may not be possible to remove them > with any level of confidence at all. you're just confusing yourself now. keep it simple... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/883b3c07/attachment.htm From schultzk at uni-trier.de Wed Mar 19 02:43:26 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 19 Mar 2008 10:43:26 +0100 Subject: [gutvol-d] let's make it simple, ok? In-Reply-To: References: Message-ID: <8C4FB3D0-BC45-41A5-8530-DAFF9C56D5C5@uni-trier.de> Hi Everybody, I have not been following this thread(s) closely, but I am wondering: If it might not be better to have proofing teams!!?? That is a group of proofers that work together. The advantages are the team members get to know the others' strengths and weaknesses. They catch each other's errors. They know that one is good at this, the other at that. One is likely to do that wrong. The team is then able to delegate tasks, clean up if necessary after each other. Thus reducing the error injection (though I believe that it is most likely negligible in most cases) and increasing confidence. A team can discuss things and find solutions by themselves. Also, the team develops its own hierarchy of proofers from proficient to inexperienced.
At first this may seem complicated, but it is actually quite simple. The size of such teams is a good question. The organisation can be left mostly to the anarchy of the net/groups. just my thoughts. Will be out for Easter, but I'll try to catch up next week. regards and happy easter eggs. keith. On 19.03.2008 at 10:06, Bowerbird at aol.com wrote: > piggy said: > > You are neglecting error injection rate. > > i'm not "neglecting it". i'm _ignoring_ it. > > because i haven't wanted to have to tell you > that you invented a really stupid concept there. > > > > Proofers don't just remove errors, they add them too. > > when you have a proofer who is _adding_ errors in > -- and you will, so detect them as _beginners_ -- > you need to take them aside and give them a lesson. > > instruct them exactly what the workflow expects of them > -- which yes, means the policy needs to be unequivocal -- > and then give them a pat on the back and send them back. > > they're glad you set them straight, and inject no more errors. > > simple. > > > > If there are defects > > which have detection rates > > down near the error injection floor, > > it may not be possible to remove them > > with any level of confidence at all. > > you're just confusing yourself now. keep it simple... > > -bowerbird > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL Home. > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15? > ncid=aolhom00030000000001) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/a35afc41/attachment-0001.htm From Bowerbird at aol.com Wed Mar 19 03:57:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 06:57:35 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 01 Message-ID: for a whole bunch of reasons that i will tell you later, i'm going to largely ignore the "parallel proofing" tests also announced on the "confidence in page" wiki-page. i know parallel proofing works. that fact is established. but since i've run through some of the current text anyway, and it's illustrative, let me quickly share that with you, ok? the list appended here shows the 376 lines that _differed_ on _parallel_ p1 proofings of "paul and the printing press". (say that several times really fast for a plosive experience.) most of these lists that you get from me _usually_ have 1) the old, "wrong" line listed on the top, and 2) the new, "corrected" line listed at the bottom. but _this_ time, _this_ list is different, because _either_ the _top_ line, or the _bottom_, or _both_, could be wrong. (but since they differ, we know that _one_ of them is wrong. i haven't put my "cheater" lines here, to show you _exactly_ where the lines differ, so you will have to figure that out for yourself, but i can assure you that they _do_ differ...) *** one thing remains the same, however, and that is that _many_ of the differences are due to exceedingly stupid d.p. policies... remember, there are 376 lines differing between the proofings. (for perspective, there are 6,575 non-blank lines in this book.) but 100+ of those differences are due to end-of-line hyphenates... (89 such differences occur on pages 138-147 alone, most likely due to a beginning proofer who didn't know he was supposed to rejoin.) 
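Since these are two parallel P1 proofings of the same pages, there is a standard way to estimate how many errors neither proofer caught: the Lincoln-Petersen capture-recapture estimate, which treats each proofer as an independent "capture" pass over the same error population. A sketch with invented counts (the 376 differing lines have not been resolved into these categories, so the numbers below are purely illustrative):

```python
def lincoln_petersen(found_a: int, found_b: int, found_both: int) -> float:
    """Capture-recapture estimate of the total number of errors, given
    how many errors proofers A and B each found and how many both found.
    N is estimated as (found_a * found_b) / found_both."""
    if found_both == 0:
        raise ValueError("need at least one error found by both proofers")
    return found_a * found_b / found_both

# Hypothetical counts: A found 300 errors, B found 280, 240 in common.
total = lincoln_petersen(300, 280, 240)
missed_by_both = total - (300 + 280 - 240)
print(f"estimated total: {total:.0f}, missed by both: {missed_by_both:.0f}")
```

This is exactly why the caught-by-both and caught-by-neither counts matter: the overlap between the two proofings is what lets you estimate the errors that are still invisible.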
and then there are also many differences on things that _could_ (and _should_) have been fixed _automatically_, by the machine, before the text ever went in front of volunteer human proofers... (i eliminated some of these -- the spacey contractions -- merely because i didn't want to be distracted by such meaningless stuff.) when we disregard all those cases, we'll be left with precious few honest-to-goodness differences between the parallel proofings. *** even now, i don't think there's much to say about the differences. in general, the incorrect versions look very much like the typical kind of errors that people get from o.c.r. -- case differences and incorrect letters and punctuation problems and other crap like dat. what happened here is that one proofer found and fixed the error, and the other proofer missed it. knowledge of the number of lines like these -- caught by one proofer _or_ the other -- is interesting. (and resolving those differences is what gives you great accuracy.) but you also need to know the number of errors caught by _both_, and the number of errors caught by _neither_, for the full picture... you'll have to wait for that data, though... -bowerbird > the list appended here shows the 376 lines that _differed_ > on _parallel_ p1 proofings of "paul and the printing press"... > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea > http://www.pgdp.net/c/project.php?id=projectID45ca5d5645cfb p#001 p#002 p#003 p#004 p#005 p#006 p#007 001 01 Copyright, 1920, 001 01 Copyright, 1920 p#008 002 03 The arch-enchantcr's wand! -- Itself a nothing -- 002 03 The arch-enchanter's wand! -- Itself a nothing -- 003 05 To paralyze the Caesars -- and to strike 003 05 To paralyze the Caesars -- and to stike p#009 p#010 p#011 p#012 004 04 II THE CLASS MEETING AND WHAT FOLLOWED IT I3 004 04 II THE CLASS MEETING AND WHAT FOLLOWED IT 13 005 07 V PAUL GIVES THANKS FOR HIS BLESSINGS...50 005 07 V PAUL GIVES THANKS FOR HIS BLESSINGS... 
50 006 16 XIV PAUL MAKES A PILGRIMAGE TO THE CITY...162 006 16 XIV PAUL MAKES A PILGRIMAGE TO THE CITY... 162 p#013 p#014 007 06 More than one digniiied resident of the town 007 06 More than one dignified resident of the town p#015 p#016 008 05 IT was the vision of a monthly paper for the 008 05 It was the vision of a monthly paper for the p#017 p#018 009 09 expensive piece of property, my son," he replied. 009 09 expensive piece of property, my son," he relied. 010 18 "The better way to go at such an undertaking," 010 18 The better way to go at such an undertaking," p#019 p#020 p#021 p#022 011 01 "Say, Cart, what do you think of '2O starting 011 01 "Say, Cart, what do you think of '2O Starting 012 13 "Why, to print our life histories and obituaries 012 13 "Why to print our life histories and obituaries p#023 013 02 scheme? what about that?" 013 02 scheme? What about that?" 014 25 asserted at length. "But the ducats -- where 014 25 asserted at length. " But the ducats -- where 015 28 "I suppose we couldn't buy a press second-hand 015 28 "I suppose we couldn't buy a press secondhand p#024 016 11 some one to print the paper for us." 016 11 someone to print the paper for us." 017 20 them it was their money or their life -- death 017 20 tell them it was their money or their life -- death 018 28 Melville declared. "You ean't expect to boom 018 28 Melville declared. "You can't expect to boom p#025 019 03 the Fire-eater! Have a copy of the Jabberwock! 019 03 the Fire-eater! Have a copy of the Jabbermock! 020 06 get anywhere. why not call it The March 020 06 get anywhere. Why not call it The March 021 14 Hare it is! We'll begin getting subscriptions 021 14 Hare it is! We"ll begin getting subscriptions 022 24 money you and you can't get any one to print 022 24 money you find you can't get any one to print p#026 023 01 "The March Hare" he repeated wlth enthusnasm. 023 01 "The March Hare!" he repeated wlth enthusiasm. p#027 p#028 024 22 Kipper. 
We'll see what we can do toward 024 22 Kipper. we'll see what we can do toward p#029 025 04 Melville regarded his friend With undisguised 025 04 Melville regarded his friend with undisguised 026 08 March Hare! I can hear the shekels ehinking 026 08 March Hare! I can hear the shekels chinking 027 16 at it. What is one-fifty for such a team of wisdom 027 16 at it. What is one-fifty for such a ream of wisdom 028 17 as we're going to get for out money?" 028 17 as we're going to get for our money?" p#030 029 10 is not one of you who does not to make 029 10 is not one of you who does not want to make p#031 p#032 p#033 030 22 "Great Seott, Paul, but you have got a wily 030 22 "Great Scott, Paul, but you have got a wily 031 29 back a step or two. "I couldn't, Kip. Don't 031 29 back a step or two. " I couldn't, Kip. Don't p#034 032 13 any one else in a minute. But Father's so -- well 032 13 any one else in a minute. But Father's so -- well, p#035 033 04 and swept out of the oflice before your mouth 033 04 and swept out of the office before your mouth 034 21 Paul. At least I can make my try and convince 034 21 Paul. "At least I can make my try and convince 035 26 "I shan"t allow myself to expect much. Even 035 26 "I shan't allow myself to expect much. Even p#036 036 02 half of Melvil1e's opinion. 036 02 half of Melville's opinion. 037 06 reputation of being shrewd, close-fisted, and 037 06 of being shrewd, close-fisted, and 038 09 carryng a grudge to any length for the sheer 038 09 carrying a grudge to any length for the sheer 039 19 Birmingham's most widely circulated daily. 039 19 Burmingham's most widely circulated daily. 040 27 "So you're Paul Cameron. I've had dealings 040 27 So you're Paul Cameron. I've had dealings p#037 041 05 father,"suggested the great man, after he had 041 05 father," suggested the great man, after he had 042 10 Wouldn't like to print the March Hare, a new 042 10 wouldn't like to print the March Hare, a new 043 21 `March Hare." 043 21 March Hare." 
044 30 "why, indeed!" 044 30 "Why, indeed!" p#038 045 20 "I had two last night-myself and another 045 20 "I had two last night -- myself and another 046 25 Mr. Carter, the shadow of a Smile on 046 25 Mr. Carter, the shadow of a smile on p#039 p#040 p#041 047 14 "But my father-" burst out Paul, then 047 14 "But my father -- " burst out Paul, then 048 17 calmly. " We differ in politics and we've 048 17 calmly. "We differ in politics and we've 049 19 take my paper-wouldn't do it for love or 049 19 take my paper -- wouldn't do it for love or 050 26 "And with regard to the advertising I mentioned, 050 26 "And with regard to the advertising I mentioned," p#042 051 05 "As for Judge Damon-well, if you ean't 051 05 "As for Judge Damon -- well, if you can't 052 09 law and the best man I know to handle the Subject. 052 09 law and the best man I know to handle the subject. 053 14 staff," ventured Paul. ` 053 14 staff," ventured Paul. p#043 054 08 the Echo?" 054 08 the Echo?"' p#044 055 11 gggfg newspaper was such a difficult and expensive 055 11 newspaper was such a difficult and expensive p#045 056 30 "Oh, it's not that," said Paul quickly. "We 056 30 "Oh, it's not that," said Paul quickly. " We p#046 057 10 "People didn't always use to have paper, 057 10 People didn't always use to have paper, 058 11 my Son" 058 11 my son." 059 19 many kings, bishops, and persons of rank could 059 19 kings, bishops, and persons of rank could 060 30 in them however-material such as the Norse 060 30 in them however -- material such as the Norse p#047 061 01 Sagas and the Odes of Horace-were handed 061 01 Sagas and the Odes of Horace -- were handed p#048 062 12 loss," declared his father good-humored1y. 062 12 loss," declared his father good-humoredly. p#049 063 02 was first no great demand for them. Learning 063 02 was at first no great demand for them. 
Learning 064 20 the patient workers were so glad when their 064 20 the patient Workers were so glad when their 065 25 "'This book was illuminated, bound, and 065 25 "This book was illuminated, bound, and 066 30 "'Thanks be to God, Hallelujah!' 066 30 "Thanks be to God, Hallelujah!' p#050 067 23 a copy of this adjuratjion to what them host 067 23 a copy of this adjuration to what thou hast 068 25 "Thus, you see, was the copyist forced to 068 25 "Thus, you see, was the eopyist forced to 069 30 manuscripts, and many a one is marred by misspelling 069 30 manuscripts, and many a one is marred by mis-spelling p#051 070 17 and were sold to dignitaries of the Church or to 070 17 and were sold to of the Church or to p#052 071 11 The great objection to this method was that several 071 11 the great objection to this method was that several 072 15 entirely inappropriate to it." 072 15 was entirely inappropriate to it." 073 26 on with the project? You seem bothered." 073 26 with the project? You seem bothered." p#053 074 20 "Yes." 074 20 ??line missing here...?? 075 28 Paul waited an instant, then added dryly: "In 075 28 Paul waited an instant, then added dryly: " In p#054 076 29 that he knew could never be fullfilled and sent 076 29 that he knew could never be fulfilled and sent p#055 077 15 shoulder. "I'l1 do it! I declare if I won't. 077 15 shoulder. "I'll do it! I declare if I won't. 078 16 I ll send in my subscription to the Echo to-morrow. 078 16 I'll send in my subscription to the Echo to-morrow. 079 29 "Mr. Carter said Judge Damon was an expert 079 29 "Mr. Carter said Judge Damon was an ex- 080 30 on international law," explained Paul. 080 30 pert on international law," explained Paul. p#056 081 19 Again courage shone in Pau1's eyes. 081 19 Again courage shone in Paul's eyes. p#057 p#058 082 03 Mr. Cameron was as good as his word. 082 03 MR. CAMERON was as good as his word. 
083 24 moment, -- litt1e more, in fact, than a boy like 083 24 moment, -- little more, in fact, than a boy like p#059 084 17 Cameron. "Call them up this minute and nail 084 17 Cameron." Call them up this minute and nail 085 28 "O.K.!" he said. "I talked with one of the 085 28 "O. K.!" he said. "I talked with one of the p#060 p#061 086 17 Caesar did in Gaul, what Cyrus and the Silician 086 17 C?sar did in Gaul, what Cyrus and the Silician 087 22 and by and by the geometries, Roman his- 087 22 and by and by the geometries, Roman histories, 088 23 tories, and the peregrinations of Cyrus were 088 23 and the peregrinations of Cyrus were p#062 p#063 089 11 the judge mischievously. "If you boys propose 089 11 the judge mischievously. "It you boys propose p#064 p#065 090 18 to what methods you resorted to win these concessions 090 18 to what methods you resorted to win these con- 091 19 from these stern-purposed gentlemen. 091 19 cessions from these stern-purposed gentlemen. 092 23 "The judge, for example -- I can't imagine 092 23 "The judge, for example-I can't imagine p#066 p#067 093 20 New York and was, I fancy, glad to find someone 093 20 New York and was, I fancy, glad to find some 094 21 who was interested and would appreciate 094 21 one who was interested and would appreciate p#068 095 10 "Yes, and not only were the first manuscripts 095 10 "Yes, and not only were the first manuseripts 096 20 the common people `for whom they were not 096 20 the common people 'for whom they were not p#069 p#070 p#071 p#072 097 30 them would fill a room." 097 30 them would fill a room.:" p#073 p#074 p#075 098 30 at liberty to send contributions back with 098 30 ways at liberty to send contributions back with p#076 099 18 does n't like, regardless of who wrote it." 099 18 doesn't like, regardless of who wrote it." p#077 p#078 100 04 amid great excitement, excitement that 100 04 amid great excitement, -- excitement that p#079 101 05 was quite an eye opener! 
A paper for general 101 05 Was quite an eye opener! A paper for general 102 07 Burmingham. There was actually something 102 07 Burmingham. There Was actually something 103 24 else in the paper. Some thought more 103 24 else in the paper. Sorne thought more 104 27 others were for choking oft the girls' artieles on 104 27 others were for choking off the girls' articles on p#080 105 06 body of workers finally stood shoulder to shoulder, 105 06 body of workers hnally stood shoulder to shoulder, 106 28 with a pride in his especial r?le on the team, and 106 28 with a pride in his especial role on the team, and p#081 107 03 manager; the alumn?, now scattered in 107 03 manager; the alumnae, now scattered in 108 10 Into Paul's editorial sanctum articles from 108 10 Into Pau1's editorial sanctum articles from 109 28 Mrs. Wi1bur's garden. 1920 would see 109 28 Mrs. Wilbur's garden. 1920 would see p#082 110 11 one passed through the school corridors, and 110 11 one passed through the school corridors, and ` 111 31 the March Hate appeared, each marked by a 111 31 the March Hare appeared, each marked by a p#083 p#084 112 15 it. I've always envied those ehaps who whispered 112 15 it. I've always envied those chaps who whispered 113 26 impulse is a very selfish one," said his father. 113 26 impulse is a very seliish one," said his father. 
p#085 114 30 pioneer printers' initial eiiorts were turned in 114 30 pioneer printers' initial efforts were turned in p#086 115 20 be produced -- the first crude attempt at papermaking -- and 115 20 be produced -- the first crude attempt at paper-making -- and 116 28 ones were painted on tablets ot ivory, or engraved 116 28 ones were painted on tablets of ivory, or engraved p#087 117 09 altar cloths -- the brst primitive printing 117 09 altar cloths -- the first primitive printing p#088 p#089 118 03 and Diamonds for the more prosperous 118 03 and Diamonds for the more prosperous ` 119 27 stained-glass windows and mosaies in the 119 27 stained-glass windows and mosaics in the p#090 120 02 There were hieroglyphies in Egypt; 'speaking 120 02 There were hieroglyphics in Egypt; 'speaking 121 06 simple outline, by means of woodeuts, the religious 121 06 simple outline, by means of woodcuts, the religious 122 11 was one of the later and most skilful woodcut 122 11 was one of the later and most skilful Woodcut 123 13 woodcut was to art -- simple, direct, appealing." 123 13 woodcut was to art -- simple, direct, appealing" 124 17 public that desired to read -- which this one did 124 17 is public that desired to read -- which this one did p#091 125 10 a "cover contest", the prize oitered being a 125 10 a "cover contest", the prize offered being a 126 24 forward the f?te, more than one dignifted resident 126 24 forward the f?te, more than one dignified resident p#092 127 01 More than one dignified resident of the town struggled 127 01 More than one dignified resident of town struggled 128 02 into an incongruous garment. 128 02 into an incongruous garment. Page 74. p#093 p#094 129 02 the white Queen, the Red Queen, the Duchess, 129 02 the White Queen, the Red Queen, the Duchess, 130 03 Father william, and the Aged Man. Judge 130 03 Father William, and the Aged Man. 
Judge 131 06 the Carpenter, and Paul'ss mother, who was 131 06 the Carpenter, and Paul's mother, who was 132 12 the last moment as the Doormouse. 132 12 the last moment as the Dormouse. 133 17 democratie fashion. The frolic had in it a 133 17 democratic fashion. The frolic had in it a 134 21 in years!" ejaculated the postmaster. "Seems 134 21 in years!" ejaculated the postmaster. " Seems 135 22 it like we've all got better acquainted with our 135 22 like we've all got better acquainted with our p#095 136 04 their diiterenees by talking together about their 136 04 their differences by talking together about their 137 30 one evening, "that the printing press was in 137 30 one evening, "that the printing press was invented 138 31 vented by Lawrence Coster (or Lorenz Koster) 138 31 by Lawrence Coster (or Lorenz Koster) p#096 139 20 John a native of Strasburg, who 139 20 John Gutenburg,a native of Strasburg, who p#097 p#098 140 14 he had done the inventor had it all to creat 140 14 he had done the inventor had it all to create 141 19 "How soon did he resmake his metal 141 19 "How soon did he re-make his metal p#099 p#100 142 02 dispute the Archbishop'S Bible was produced 142 02 dispute the Archbishop's Bible was produced 143 11 precisely like the king;s and the Archbishop's. 143 11 precisely like the king's and the Archbishop's. 144 27 "I suppose he went told!" put in Paul 144 27 "I suppose he went and told!" put in Paul p#101 145 22 meantime william Caxton, an English mer 145 22 meantime William Caxton, an English merchant, 146 23 chant, traveled to Holland to buy cloth, and 146 23 traveled to Holland to buy cloth, and 147 28 from Iwestminster Abbey. The first English 147 28 from Westminster Abbey. The first English p#102 148 17 only because of an established precedent, out 148 17 only because of an established precedent, but p#103 149 10 Mandevi1le's Travels, Sidney's 'Arcadia', 149 10 Mandeville's Travels, Sidney's 'Arcadia', p#104 150 03 type. 
But Gutenburg was the tirst to combine 150 03 type. But Gutenburg was the first to combine 151 05 purposes. In other words, he was the Brst 151 05 purposes. In other words, he was the first p#105 152 11 a volume in itself. Many Scholars and many 152 11 a volume in itself. Many scholars and many p#106 p#107 p#108 p#109 153 20 "But there are short cuts," argued Mr. Cameron. 153 20 "But there are short outs," argued Mr. Cameron. p#110 154 17 his father answered. "Nothinig walks with 154 17 his father answered. "Nothing walks with 155 28 generous eitizens, have opened their doors to 155 28 generous citizens, have opened their doors to p#111 156 25 or enamel. As time went on and the religious 156 25 or enamel. As time Went on and the religious p#112 p#113 157 18 print nfty copies of a volume as several hundred. 157 18 print fifty copies of a volume as several hundred. p#114 158 02 at all. They get a scenario or r?sum? of the 158 02 at all. They get a scenario or resume of the p#115 159 31 it are rectifed. After this it is again corrected 159 31 it are rectified. After this it is again corrected p#116 160 25 technicality as the filling out of a short-line." 160 25 technicality as the filling out of a short line." p#117 161 19 cultured nation. By no means. What I mean 161 19 cultured nation. By no means. what I mean 162 20 is that our public school systerh offers education 162 20 is that our public school system offers education 163 30 citizens can read and write, and vast is 163 30 citizens can read and write, and vast p#118 p#119 164 15 are always seamps in every calling, the best 164 15 are always scamps in every calling, the best p#120 165 13 "Typewriters come at all prices," his father 165 13 "Typewriters Come at all prices," his father p#121 p#122 p#123 p#124 p#125 p#126 p#127 p#128 p#129 p#130 166 07 When the accounts were found to be short, 166 07 When the acounts were found to be short, p#131 167 20 bills as it went along; then its editors. 
would 167 20 bills as it went along; then its editors would 168 30 What was to be done? 168 30 what was to be done? p#132 169 18 He broke oft speechlessly. 169 18 He broke off speechlessly. 170 26 I can't understand it. We haven't branched 170 26 I can't understand it. we haven't branched p#133 171 08 for a farm down East. And how the fresh-men 171 08 for a farm down East. And how the freshmen 172 14 "I, for one, say we don't tell anybody," Mel- 172 14 "I, for one, say we don't tell anybody," Melville 173 15 ville burst out. "I've some pride and I draw 173 15 burst out. "I've some pride and I draw 174 28 "We? 174 28 "We?" p#134 175 07 "Yep" 175 07 "Yep." 176 09 "Could you manage it -- fifty dollars?" 176 09 "Could you manage it-fifty dollars?" 177 17 "I don't care about being joshed, either," dedared 177 17 "I don't care about being joshed, either," declared 178 19 "Something's fussing you. What is it?" 178 19 "Something's fussing you. what is it?" p#135 179 12 Bond" was converted into cash; Paul's typewriter 179 12 Bond" was converted into cash; Paul'S typewriter p#136 p#137 180 01 well. In fact, it was not long before these de- 180 01 well. In fact, it was not long before these departments 181 02 partments were merged into a sort of forum 181 02 were merged into a sort of forum 182 07 Arthur Presby Carter sat quietly in his oiiiee 182 07 Arthur Presby Carter sat quietly in his office 183 17 confess that a seventeenyear-old boy had 183 17 confess that a seventeen-year-old boy had p#138 184 09 publication had been born that was undermin- 184 09 publication had been born that was undermining 185 10 ing his prestige and putting to naught his creeds 185 10 his prestige and putting to naught his creeds 186 21 was a shrewd business man. He had, he con- 186 21 was a shrewd business man. 
He had, he confessed 187 22 fessed to himself, been trapped into printing 187 22 to himself, been trapped into printing p#139 188 02 which he had never suspected the existence, -- 188 02 which he had never suspected the existence, -- an 189 03 an intelligence, an open-mindedness, a search- 189 03 intelligence, an open-mindedness, a searching 190 04 ing after truth. Hitherto the subscribers to 190 04 after truth. Hitherto the subscribers to 191 11 through every page -- that beating of hearts -- 191 11 through every page -- that beating of hearts -- fathers, 192 12 fathers, mothers, girls, boys speaking with 192 12 mothers, girls, boys speaking with 193 16 blood that glowed so warmly and sympatheti- 193 16 blood that glowed so warmly and sympathetically 194 17 cally through the dead mediums of paper and 194 17 through the dead mediums of paper and 195 27 characteristic honesty that had he cared to ob- 195 27 characteristic honesty that had he cared to obtain 196 28 tain from them this free expression of opinion 196 28 from them this free expression of opinion 197 29 and learn the reactions their minds were con- 197 29 and learn the reactions their minds were constantly 198 30 stantly reflecting, he would have been at a loss 198 30 reflecting, he would have been at a loss p#140 199 02 mere boy, a boy the age of his own son, the elu- 199 02 mere boy, a boy the age of his own son, the elusive 200 03 sive result had been accomplished! 200 03 result had been accomplished! 201 13 It was this " echoing idea" that was new to 201 13 It was this "echoing idea" that was new to 202 21 appeal, the elder man faced the real psychologi- 202 21 appeal, the elder man faced the real psychological 203 22 cal secret of the junior paper's success: it lis- 203 22 secret of the junior paper's success: it listened 204 23 tened and did not talk; it was a dialogue instead 204 23 and did not talk; it was a dialogue instead 205 24 of a monologue,-an exact reversal of his policy. 
205 24 of a monologue, -- an exact reversal of his policy. 206 25 Moreover, this dialogue, contrary to his pre- 206 25 Moreover, this dialogue, contrary to his previous 207 26 vious beliefs, presented amazingly interesting 207 26 beliefs, presented amazingly interesting 208 29 America, -- what its government, its statesman- 208 29 America, -- what its government, its statesmanship, 209 30 ship, its ideals should be. The Past was rich 209 30 its ideals should be. The Past was rich p#141 210 02 faith, courage. Youth, the citizen of to-mor- 210 02 faith, courage. Youth, the citizen of to-morrow, 211 03 row, had a thousand theories for righting the 211 03 had a thousand theories for righting the 212 12 stimulate but to silence discussion and it prob- 212 12 stimulate but to silence discussion and it probably 213 13 ably did so, descending upon its audience with a 213 13 did so, descending upon its audience with a 214 18 not to lift up his voice in its presence and de- 214 18 not to lift up his voice in its presence and demand 215 19 mand a hearing. 215 19 a hearing. 216 20 Such a novel and rare product was worth per- 216 20 Such a novel and rare product was worth perpetuating. 217 21 petuating. From a money standpoint alone the 217 21 From a money standpoint alone the 218 22 paper might become in time a paying invest- 218 22 paper might become in time a paying investment. 219 23 ment. It was, of course, a bit crude at present; 219 23 It was, of course, a bit crude at present; 220 28 enterprise at the end of the year and take it in- 220 28 enterprise at the end of the year and take it into 221 29 to his own hands? Might it not be nursed into 221 29 his own hands? 
Might it not be nursed into p#142 222 01 He would improve it -- that would go with- 222 01 He would improve it -- that would go without 223 02 out saying -- touch it up and polish it; doubt- 223 02 saying -- touch it up and polish it; doubtless 224 03 less he would think best to revise some of its 224 03 he would think best to revise some of its 225 06 could not continue to perpetuate such an ab- 225 06 could not continue to perpetuate such an absurdity 226 07 surdity as that title. Perhaps he would christen 226 07 as that title. Perhaps he would christen 227 09 The notion of purchasing the amateur prod- 227 09 The notion of purchasing the amateur product 228 10 uct appealed to his sense of humor. The more 228 10 appealed to his sense of humor. The more 229 14 Yes, he would get out the few remaining is- 229 14 Yes, he would get out the few remaining issues 230 15 sues of the March Hare under its present name 230 15 of the March Hare under its present name 231 25 himself in the solitude and silence of his edi- 231 25 himself in the solitude and silence of his editorial 232 26 torial sanctum. And after he had disposed of 232 26 sanctum. And after he had disposed of 233 29 deliberation to purchase also certain oil prop- 233 29 deliberation to purchase also certain oil properties 234 30 erties in Pennsylvania. For Mr. Arthur 234 30 in Pennsylvania. For Mr. Arthur p#143 235 03 and buying March Hare or oil wells was all 235 03 and buying March Hares or oil wells was all p#144 236 04 and thus reflected on his many business ven- 236 04 and thus reflected on his many business ventures 237 05 tures Paul Cameron was also sitting in his ed- 237 05 Paul Cameron was also sitting in his editorial 238 06 itorial domain thinking intently. 238 06 domain thinking intently. 239 08 treasury bothered him more than he was will- 239 08 treasury bothered him more than he was willing 240 09 ing to admit. It was, of course, quite possible 240 09 to admit. 
It was, of course, quite possible 241 10 for him to repair the error -- for he was con- 241 10 for him to repair the error -- for he was convinced 242 11 vinced an error in the March Hare's bookkeep- 242 11 an error in the March Hare's bookkeeping 243 12 ing had caused the shortage. A bill of a hun- 243 12 had caused the shortage. A bill of a hundred 244 13 dred dollars must have been paid and not re- 244 13 dollars must have been paid and not recorded. 245 14 corded. Melville Carter had never had actual 245 14 Melville Carter had never had actual 246 22 was no easy task. It was a thankless job, any- 246 22 was no easy task. It was a thankless job, anywy 247 23 way -- the least interesting of any of the posi- 247 23 -- the least interesting of any of the positions 248 24 tions on the paper, and one that entailed more 248 24 on the paper, and one that entailed more p#145 249 11 mistake of one figure in adding and subtract- 249 11 mistake of one figure in adding and subtracting 250 12 ing columns. There did not, it was true, seem 250 12 columns. There did not, it was true, seem 251 30 were that a boy of seventeen was unable to an- 251 30 were that a boy of seventeen was unable to answer! 252 31 swer! If he were to ask his father how to sell 252 31 If he were to ask his father how to sell p#146 253 01 the bond, it might arouse suspicion, to ask any- 253 01 the bond, it might arouse suspicion, to ask anybody 254 02 body else might do so too. People would won- 254 02 else might do so too. People would wonder 255 03 der why he, Paul Cameron, was selling a Lib- 255 03 why he, Paul Cameron, was selling a Liberty 256 04 erty Bond he had bought only a short time be- 256 04 Bond he had bought only a short time before. 257 05 fore. Burmingham was a gossipy little town. 257 05 Burmingham was a gossipy little town. 258 24 thought he realized that Mr. Stacy was an in- 258 24 thought he realized that Mr. 
Stacy was an intimate 259 25 timate friend of his father's and might mention 259 25 friend of his father's and might mention 260 26 the incident. Therefore he at length dis- 260 26 the incident. Therefore he at length dismissed 261 27 missed the possibility of selling his bond and 261 27 the possibility of selling his bond and p#147 262 01 Echo offices that day with copy for the next is- 262 01 Echo offices that day with copy for the next issue 263 02 sue of his paper and was still rebelliously wa- 263 02 of his paper and was still rebelliously wavering 264 03 vering over the loss of his typewriter when the 264 03 over the loss of his typewriter when the 265 13 with Mr. Carter, toward whom he still main- 265 13 with Mr. Carter, toward whom he still maintained 266 14 tained no small degree of awe; usually the af- 266 14 no small degree of awe; usually the affairs 267 15 fairs relative to the school paper were trans- 267 15 relative to the school paper were transacted 268 16 acted either through the business manager of 268 16 either through the business manager of 269 18 But to-day Mr. Carter was suddenly all ami- 269 18 But to-day Mr. Carter was suddenly all amiability. 270 19 ability. He escorted Paul into his sanctum, 270 19 He escorted Paul into his sanctum, 271 23 "How is your paper coming on, Paul?' he 271 23 "How is your paper coming on, Paul?," he 272 27 "Austin, our manager, tells me your circu- 272 27 "Austin, our manager, tells me your circulation 273 28 lation is increasing." 273 28 is increasing." p#148 p#149 p#150 274 12 "B -- u -- t -- " stammered Paul and then 274 12 "B -- u -- t-" stammered Paul and then p#151 p#152 p#153 275 03 "I -- I -- " faltered Paul. 275 03 "I -- I-" faltered Paul. 276 30 "I don't quite -- " 276 30 "I don't quite-" p#154 p#155 277 09 it." 277 09 it." t 278 14 "y -- e -- s." 278 14 "Y -- e -- s." 279 18 "Oh-ho! So you're in a scrape, eh?" 279 18 "Oh -- ho! So you're in a scrape, eh?" p#156 280 02 Paul. Page 137. 280 02 Paul. 
Page 13T. p#157 p#158 281 09 prefer. A loan with a bond for security is 281 09 prefer, A loan with a bond for security is 282 12 "But -- " 282 12 :But -- " p#159 p#160 283 20 Paul fingered the bill nervously. Fifty dollars! 283 20 Paul lingered the bill nervously. Fifty dollars! p#161 p#162 p#163 p#164 284 02 money and government notes are fine examples 284 02 money and government notes are line examples p#165 285 23 of the authorities but it does a 285 23 of the Washington authorities but it does a 286 31 quantities of paper," answered his father. 286 31 quantities of paper," answered his father; p#166 287 01 "Directories, telephone books, cireulars, and 287 01 "Directories, telephone books, circulars, and 288 17 t them in color; dry goods houses send out photographs 288 17 them in color; dry goods houses send out photographs 289 21 there are commercial nrms whose mail-order 289 21 there are commercial firms whose mail-order 290 28 little expense this means of advertising is be- 290 28 little expense this means of advertising is becoming 291 29 coming more and more popular. Many charities 291 29 more and more popular. Many charities p#167 292 12 do little else," smiled his father. " Nevertheless, 292 12 do little else," smiled his father. "Nevertheless, p#168 p#169 293 17 Mr. Cameron waited a second. A Wild impulse 293 17 Mr. Cameron waited a second. 
A wild impulse p#170 p#171 p#172 294 14 gig had won the election, it is true, but it had been 294 14 had won the election, it is true, but it had been p#173 295 13 school, and all the web of circumstances in 295 13 school, and all the Web of circumstances in p#174 p#175 p#176 296 25 press rooms for striking off proof when the 296 25 press rooms for striking oil proof when the p#177 297 27 a press was built up which is so intricate and 297 27 a press Was built up Which is so intricate and p#178 p#179 298 12 visit to a big newspaper office Saturday evening 298 12 visit to a big newspaper offfice Saturday evening 299 15 "That Would be great!" 299 15 "That would be great!" p#180 p#181 300 06 you must remember that it was especially difficult 300 06 you must remember that it was especially diffcult p#182 301 05 "So, son," concluded Mr. Wright, "you've 301 05 "So, son," concluded Mr. wright, "you've 302 15 approve of the fifty-dollar bill which at that 302 15 approve of the fity-dollar bill which at that p#183 p#184 303 07 about 303 07 about. p#185 p#186 304 01 their days. " I'm going to take you upstairs 304 01 their days." 305 02 first," Mr. Hawley said briskly. "We may 305 02 Mr. Hawley said briskly. "We may 306 08 frankly. ` 306 08 frankly. p#187 307 21 This cast is then fitted upon the rollers 307 21 This east is then fitted upon the rollers p#188 308 11 have the main idea and when I see the thing in 308 11 have the main idea and When I see the thing in p#189 309 04 surface." 309 04 surface.' 310 17 The style or design of letter is called the `face', 310 17 The style or design of letter is called the 'face', 311 30 find what they want. I should think --" 311 30 find what they want. 
I should think -- " p#190 312 19 a small space allowed it; X, too, is not much in 312 19 a small space allowed it; N, too, is not much in p#191 p#192 313 19 large metal sections that fit on the two halves of 313 19 large metal sections that lit on the two halves of p#193 314 13 type constantly becorne very expert in detecting 314 13 type constantly become very expert in detecting 315 30 process and know how the first printing 315 30 process and know how the brst printing p#194 p#195 316 16 gteat amount of time and thought that goes 316 16 great amount of time and thought that goes p#196 317 18 of each shelf classified and marked." 317 18 of each shelf classined and marked." p#197 318 27 and see some of ours at nrst hand." 318 27 and see some of ours at first hand." p#198 319 11 however, the Boston Post ventured an innovation 319 11 however, the Boston Post ventured an innovation by 320 12 by arranging its presses one over the other, 320 12 arranging its presses one over the other, 321 17 " If floor space can be economized it must be 321 17 "If floor space can be economized it must be p#199 p#200 322 01 They had now reached the lowest floor and 322 01 They had now reached the lowest Hoor and 323 06 a high above his head. 323 06 high above his head. 324 31 duty it was to load it on to a truck, carry it up- 324 31 it duty it was to load it on to a truck, carry it up- p#201 325 15 periodicals," Mr. Hawley managed to shout 325 15 periodicals, "Mr. Hawley managed to shout 326 25 during the war," stammered Paul. 326 25 during the war," Stammered Paul. p#202 327 19 publishers." 327 19 publishers." I p#203 328 02 the cardboard. The thickness of these semicylindrical 328 02 the cardboard. The thickness of these semi-cylindrical 329 09 cast, the half sections of stereotype were put 329 09 cast, the sections of stereotype were put p#204 330 01 little chap over there by the fire hangs our 330 01 little chap over there by the bre hangs our 331 16 Sidewalk. 331 16 sidewalk. 
332 27 we ought to pay more for our newspapers." 332 27 we ought to pay more for our newspapers.' p#205 333 14 fine articles from parents and distant 333 14 fine articles from patents and distant p#206 334 02 bid good-by to the familiar halls of the school, 334 02 bid good-by to the familiar balls of the school, 335 21 clouded Pau1's brow. He still had intact Mr. 335 21 clouded Paul's brow. He still had intact Mr. p#207 336 05 own, was far from being the same thing as returning 336 05 own, was far from being the same thing as returning it. 337 06 it. It was strange that it should be so 337 06 It was strange that it should be so 338 20 easily to be cleared from Pau1's path. 338 20 easily to be cleared from Paul's path. 339 28 his classmates to earn it, -- for earn it he must, 339 28 his classmates to earn it, -- -for earn it he must, p#208 p#209 340 28 "Because -- well -- it would be so yellow," 340 28 "Because -- well-it Would be so yellow," 341 30 thing is yours -- why -- ," he broke off help 341 30 thing is yours -- why -- ," he broke off help- p#210 342 26 he wanted to sell them. Father said so. Besides, 342 26 he wanted to sell them. Father said so. Be 343 27 what's to become of 1921 if you sell out 343 27 sides, what's to become of 1921 if you sell out p#211 p#212 344 05 "But -- to sell it out for cash, as it stands -- you 344 05 "But -- to sell it out for cash, as it stands -- 345 06 mean that?" 345 06 you mean that?" 346 09 "Yes" 346 09 "Yes." p#213 347 22 he heard himself saying, "I'd call it a beastly 347 22 he heard himself saying, " I'd call it a beastly p#214 348 06 Hare to 1921 with out blessing?" asked Paul, 348 06 Hare to 1921 with our blessing?" asked Paul, 349 22 "Nothing! Cut it out, that's all." 349 22 "Nothing! 'Cut it out, that's all." p#215 350 24 "Yes, I'm corning right now," returned Paul 350 24 "Yes, I'm coming right now," returned Paul 351 31 with the boy? 351 31 with the boy?' p#216 352 20 fancy the corning interview with Mr. Carter. 
352 20 fancy the coming interview with Mr. Carter. p#217 353 06 only too fast. 4 353 06 only too fast. 354 16 come, something within him had leaped into being, -- something 354 16 come, something within him had leaped into being, 355 17 that had automatically prevented 355 17 -- something that had automatically prevented p#218 356 28 other side of it and all retreat would be out off. 356 28 other side of it and all retreat would be cut off. 357 30 only that he dreaded...The knob turned 357 30 only that he dreaded... The knob turned p#219 358 17 sharply. " I'm sorry to hear that. What was 358 17 sharply. "I'm sorry to hear that. What was p#220 p#221 p#222 359 15 Mr. Carter -- "you were just right, son. The 359 15 Mr. Carter -- " you were just right, son. The p#223 360 03 "Why, sir, I can't-" 360 03 "Why, sir, I can't -- " p#224 361 11 "How are you, old man," Paul called jubilantly. 361 11 "How are you, old man,' Paul called jubilantly. 362 21 "He was great-corking!" 362 21 "He was great -- corking!" p#225 363 07 Donald broke into a laugh. t 363 07 Donald broke into a laugh. p#226 364 30 loyally refusing to peach on his chums. That 364 30 loyally refusing to peach on his churns. That p#227 365 24 "They say there always has to be a first time. 365 24 "They say there always has to be a fist time. p#228 p#229 366 27 wretchedly. "That's what's got me fussed. 366 27 wretchedly. "That's what'S got me fussed. p#230 p#231 367 18 Paul. "But it's all right now. The 367 18 Paul. " But it's all right now. The 368 19 accounts are O. K.; I shall get my money back; 368 19 accounts are O.K.; I shall get my money back; 369 27 him," cried Donald. " He's a trump! As for 369 27 him," cried Donald. "He's a trump! As for p#232 370 15 that money. It's caused too much worry already." 370 15 that money. It's caused too much Worry already." p#233 371 06 delivered was clicked off on Mr. Carter's typewriter 371 06 delivered was clicked offon Mr. 
Carter's typewriter p#234 372 17 "And I on yours, Mr. Carter. Melville is a 372 17 "And I oh yours, Mr. Carter. Melville is a 373 30 school end the community a service, Carter, by 373 30 school and the community a service, Carter, by p#235 p#236 374 25 Cameron." 374 25 Cameron.' p#237 p#238 375 03 With a sigh glad yet regretful, Paul Surrendered 375 03 With a sigh glad yet regretful, Paul surrendered 376 12 familiar classroorns. And the comrades of 376 12 familiar classrooms. And the comrades of p#239 p#240 p#241 p#242 p#243 p#244 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/517cbf0a/attachment-0001.htm From julio.reis at tintazul.com.pt Wed Mar 19 04:08:38 2008 From: julio.reis at tintazul.com.pt (=?ISO-8859-1?Q?J=FAlio?= Reis) Date: Wed, 19 Mar 2008 11:08:38 +0000 Subject: [gutvol-d] gutvol-d Digest, Vol 44, Issue 24 In-Reply-To: References: Message-ID: <1205924918.27032.49.camel@abetarda.mshome.net> > just to remind them, off the dome, this is what needs to be done: > 1. ensure you have decent scans, and name them intelligently. > 2. use a decent o.c.r. program, and ensure quality results. > 3. do not tolerate bad text handling by content providers. > 4. do a decent post-o.c.r. cleanup, before _any_ proofing. > 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes). > 6. change the ridiculous ellipse policy to something sensible. > 7. stop doing small-cap markup with no semantic meaning. > 8. i forget what 8 was for. > 9. retain pagenumber information, in an unobtrusive manner. > 10. format the ascii version using light markup, for auto-html. > > -bowerbird Hey, I like bowerbird's item number 8 :) Lighten up people, just a joke. Please go easy on the ranting.
DP or not DP, we all want the best for the public domain. Right? Let's shake hands or kiss, now. bowerbird's issues number 1, 2, 3, and 4 also get a "certainly" from me. 6 is language-dependent so I'm staying out of that issue. English is not my language; I ask my kitty to type all my English ^___^ I'm not commenting on the other items, because I'm trying to be positive here; consensus-building and the like. But these first four items -- I'm with you all the way. Actually (I can't resist another joke), bowerbird's number 8 was "type everything in lowercase" lol. hey, never mind man, i like your writing style. i can always tell when it's your post even without looking at the from field. Júlio. From piggy at netronome.com Wed Mar 19 05:03:50 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 19 Mar 2008 08:03:50 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: <47E10126.8010300@netronome.com> Michael Hart wrote: > On Tue, 18 Mar 2008, La Monte H.P. Yarroll wrote: > > >> Marcello Perathoner wrote: >> >>> La Monte H.P. Yarroll wrote: >>> >>> >>> >>>> Much more important than finishing a certain number of rounds is >>>> to actually predict the likely number of remaining errors in a >>>> specific text (which we can do with moderate reliability) and >>>> then decide which kind of round to subject it to. >>>> >>>> >>> Why would the "likely number of remaining errors" be a better >>> estimator for which round to send the text to, than the number of >>> errors found in the last round?
>>> >>> >> Someone reading a text does not care how many errors were found in >> the last round of proofreading. They care about the number >> remaining. >> > > > False. > > Anyone seriously commenting on the possible correction of remaining > errors will want to know how much effort it took to get there. > > Not to do so would be something like trying to plan the rest of a > trip without knowing how many miles you had already travelled, in > how much time, taking how much gas, etc., etc. > > Planning ahead is more than just pointing over the horizon. > Very good point. I was only thinking about the casual reader. My wife suggests that we could include accuracy-related metrics in a proofing note for each book. It's not necessarily useful for everyone, but it would be nice to not lose the information. > Thanks!!! > > Michael S. Hart > Founder > Project Gutenberg > From Bowerbird at aol.com Wed Mar 19 09:03:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 12:03:42 EDT Subject: [gutvol-d] so, what have we learned from "perpetual p1"? Message-ID: wrote this yesterday. still appropriate today. *** now that the results are in for "perpetual p1", let's ask ourselves some questions about the "confidence in page" work which generated it. *** so, what did we learn from "perpetual p1"? not much, really. p1 will remove a large mass of the errors, and subsequent rounds will whittle away at the rest, until finally there are only very few remaining... but, um, well... who didn't know this already? proofing is an easy job. any person who is motivated to do it well _can_ do it fairly well, as long as they understand a book's content. (can't proof greek too well if you don't know it, or equations if you haven't taken math classes. but under most circumstances, proofing is easy.) *** will proofers cycle corrections back-and-forth? if your workflow allows them to do so, maybe, especially if you don't train them adequately... 
but the best course will be to fix the workflow. *** and how is the workflow over at d.p.? it sucks. badly. it imposes meaningless work on the proofers, and doesn't facilitate the work they need to do, so the efficiency of the operation is very weak... (and it's _not_ ok to waste people's time just because they've voluntarily given it.) *** so, did we accomplish our mission? um, nope... not as far as i can see... the original charter was to determine a "confidence in page" measure that would tell if a page needed to be proofed again, to use in implementing a roundless system. but somewhere, the mission was abandoned. now there's just a big bunch of gobbledygook on the "confidence in page" wiki-page at d.p.: > http://www.pgdp.net/wiki/Confidence_in_Page_analysis ironically, the _best_ logic answering the question is stuff that was put on the wiki early. by piggy: > If it covers ALL pages, then we can conclude that > each round finds about half as many pages > with errors as the previous round. This is the sort of > stable epsilon process I've been expecting. I THINK > this translates into "Each round finds about > half the remaining defects." > Each round of a page having zero changes merely > increases our confidence that it is defect-free > (by a factor of about 2). note that if you're cutting the number of errors in half, that your accuracy rate is 67%, which is just about what i've been figuring all along as the "average" accuracy... *** so, was "perpetual p1" worth the time spent on it? no. still, i spent the time anyway, to document it... but juliet wants to believe that proofing is _difficult_ -- the message she spreads all around cyberspace -- so she's going to ignore the conclusion that it's not... that p1 proofers can do well doesn't fit her viewpoint. and there's such a huge investment in the "p2" and "p3" hierarchy over at d.p. 
they probably cannot dismantle it, not without making themselves look really really stupid, so they ain't about to do that any time in the near future. this whole "confidence in page" wild goose chase is just the equivalent of the "busywork" required of p1 proofers -- the wizard of oz sending dorothy and her companions off to collect the broom of the wicked witch of the west, as a way to make them _go_away_ and waste their time... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/9568aae8/attachment.htm From marcello at perathoner.de Wed Mar 19 11:36:35 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 19 Mar 2008 19:36:35 +0100 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E02B10.7090203@netronome.com> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> Message-ID: <47E15D33.7080901@perathoner.de> In a discussion about proofreading La Monte H.P. Yarroll wrote: > Guiness does not need to taste every bottle of brew to have a high > confidence that they are keeping their quality standards. Learn how to produce an ebook in one second: http://dribibu.xs4all.nl/dilbert19950628.html BTW: its "Guinness". -- Marcello Perathoner webmaster at gutenberg.org From piggy at netronome.com Wed Mar 19 12:06:17 2008 From: piggy at netronome.com (La Monte H.P. 
Yarroll) Date: Wed, 19 Mar 2008 15:06:17 -0400 Subject: [gutvol-d] a write-up of the final results on the "perpetual p1" experiment In-Reply-To: <47E15D33.7080901@perathoner.de> References: <47D983EF.5060002@netronome.com> <000301c887dc$284710c0$78d53240$@co.uk> <47DE78C7.60700@netronome.com> <000901c88847$7a8066f0$6f8134d0$@co.uk> <41fd8970803171012j397d8197y3b252b57df62b121@mail.gmail.com> <47DF3CF2.6040103@netronome.com> <47E01719.9000307@perathoner.de> <47E02B10.7090203@netronome.com> <47E15D33.7080901@perathoner.de> Message-ID: <47E16429.1080806@netronome.com> Marcello Perathoner wrote: > In a discussion about proofreading La Monte H.P. Yarroll wrote: > > >> Guiness does not need to taste every bottle of brew to have a high >> confidence that they are keeping their quality standards. >> > > Learn how to produce an ebook in one second: > > http://dribibu.xs4all.nl/dilbert19950628.html > > > BTW: its "Guinness". > > Um, that was a deliberately inserted misprint to keep the proofreaders happy. Yep. OK, my reference to Guinness was a bit obscure. That brewing company had a trade secret method for testing properties of their product with very small samples. The trade secret was eventually lost when "A. Student" published the details. That trade secret method is a statistical test now called "Student's T". From Bowerbird at aol.com Wed Mar 19 13:43:00 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 16:43:00 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 02 Message-ID: ok, let's cut right to the chase on this parallel test... this book did the normal d.p. p1/p2/p3 workflow... then p1 was repeated, from the original o.c.r. output. now, as i reported earlier, the two parallel versions of p1 had 376 differences between them. i resolved all those, by doing a quick visual check (without referring to scans, so you can assume i made a few mistakes in there, sorry). 
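(an aside: two _independent_ proofings of the same o.c.r. also let you estimate how many errors _neither_ round caught, via the classic capture-recapture formula: if round A finds nA errors, round B finds nB, and m of those are the very same error, the lincoln-petersen estimate of the total is nA*nB/m. a minimal sketch -- the counts below are made-up illustrations, _not_ measured from this book:)

```python
# lincoln-petersen ("capture-recapture") estimate of total errors,
# given two independent proofings of the same o.c.r. text.
def lincoln_petersen(n_a, n_b, m):
    """n_a, n_b: errors found by proofings A and B; m: errors found by both."""
    if m == 0:
        raise ValueError("no overlap between rounds: estimate is unbounded")
    total = n_a * n_b / m                # estimated errors in the o.c.r.
    remaining = total - (n_a + n_b - m)  # errors that *neither* round caught
    return total, remaining

# illustrative counts only -- not data from "paul and the printing press"
total, remaining = lincoln_petersen(n_a=300, n_b=280, m=240)
print(round(total), round(remaining))  # 350 10
```

(the same formula is what an overlap count of "essentially identical changes" between p1 and p1p would feed.)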
i then compared the _second_ parallel p1 (resolved) to the _p3_ version resulting from the normal workflow, which we assume to be the most accurate text we have at this point... i've appended the mere 87 differences between the versions.

and a glance at them reveals there are some cases where the p1p (p1 parallel) version seems to be the correct one, and _not_ the p3n (p3 normal) version, which is humorous. there are also cases of meaningless linebreak differences, words that wouldn't pass spellcheck, and errors that could be easily found by using even a rudimentary clean-up tool. after we reconcile all of that, we're probably looking at about <44 errors that were left after two resolved parallel proofings.

so what can we conclude already from this experiment? _2_ parallel p1 proofings have given us results that are quite similar to _3_ rounds -- p1/p2/p3 -- in the normal workflow. ponder that one...

i have not pored over all this data to verify the accuracy, and i don't really intend to do so, because the results are clear enough to me already. p1 proofers do darn good...

-bowerbird

-----------------------------------------------------------------------

more results from the d.p. parallel proofing test:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea
> http://www.pgdp.net/c/project.php?id=projectID45ca5d5645cfb

this list contrasts the p1 parallel proofing (p1p) with the output from the p3 normal workflow (p3n)... again, remember, these are _differences_ only, so the top line _or_ the bottom line, or _both_, could be the incorrect ones. note: use a fixed-point font in order to utilize the "cheater" line...

p1p) Copyright, 1920
p3n) Copyright, 1920,
===) ===============^

p1p) V PAUL GIVES THANKS FOR HIS BLESSINGS...50
p3n) V PAUL GIVES THANKS FOR HIS BLESSINGS... 50
===) ========================================^^^

p1p) XIV PAUL MAKES A PILGRIMAGE TO THE CITY...162
p3n) XIV PAUL MAKES A PILGRIMAGE TO THE CITY... 162
===) ==========================================^^^^

p1p) "Enough to till a good-sized daily, I should
p3n) "Enough to fill a good-sized daily, I should
===) ===========^================================

p1p) "Why, to print our life histories and obituaries
p3n) "Why to print our life histories and obituaries
===) ====^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) money you and you can't get any one to print
p3n) money you find you can't get any one to print
===) ==========^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "The March Hare!" he repeated wlth enthusiasm.
p3n) "The March Hare!" he repeated with enthusiasm.
===) ===============================^==============

p1p) Birmingham's most widely circulated daily.
p3n) Burmingham's most widely circulated daily.
===) =^========================================

p1p) pay too."
p3n) pay, too."
===) ===^^^=^^^

p1p) firm of George L. Kirnball and from Dalrymple
p3n) firm of George L. Kimball and from Dalrymple
===) ====================^^^^=^^^^^^^^^^^^^^^^^^^

p1p) the Echo?"'
p3n) the Echo?"
===) ==========

p1p) "This book was illuminated, bound, and
p3n) "'This book was illuminated, bound, and
===) =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^^

p1p) manuscripts, and many a one is marred by misspelling
p3n) manuscripts, and many a one is marred by mis-spelling
===) ============================================^^^^=^^^^

p1p) Mr. Cameron was as good as his word.
p3n) MR. CAMERON was as good as his word.
===) =^===^^^^^^=========================

p1p) "O.K.!" he said. "I talked with one of the
p3n) "O. K.!" he said. "I talked with one of the
===) ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) Caesar did in Gaul, what Cyrus and the Silician
p3n) Cæsar did in Gaul, what Cyrus and the Silician
===) =^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) New York and was, I fancy, glad to find someone
p3n) New York and was, I fancy, glad to find some
===) ============================================

p1p) who was interested and would appreciate
p3n) one who was interested and would appreciate
===) ^^^==^^=^^^^^^^^^^^^^==^^^^^^^^^^^^^^^^^^^^

p1p) I have already explained, care much for reading;
p3n) have already explained, care much for reading;
===) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^

p1p) ways at liberty to send contributions back with
p3n) at liberty to send contributions back with
===) ^^^^^^^^^^^^^^^^^^=^^=^^^^^=^^^^^^^^^=^^^^

p1p) smoothed away his objectious until, upon a
p3n) smoothed away his objections until, upon a
===) ==========================^===============

p1p) finer and more efiicient. It was, as Paul
p3n) finer and more efficient. It was, as Paul
===) =================^=======================

p1p) manager; the alumnae, now scattered in
p3n) manager; the alumnæ, now scattered in
===) ==================^^^^^^^^^^^=^^^^^^^

p1p) one passed through the school corridors, and
p3n) one passed through the school corridors, and `
===) ============================================^^

p1p) various sources one number after another of `
p3n) various sources one number after another of
===) ===========================================

p1p) like to write up fires and aceidents and wear a
p3n) like to write up fires and accidents and wear a
===) =============================^=================

p1p) a under the ropes."
p3n) under the ropes."
===) ^^^^^^^^^^^^^^^^^

p1p) into an incongruous garment.
p3n) into an incongruous garment. Page 74.
===) ============================^^^^^^^^^

p1p) John Gutenburg,a native of Strasburg, who
p3n) John Gutenburg, a native of Strasburg, who
===) ===============^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) was the principle of it is identical with that
p3n) was, the principle of it is identical with that
===) ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "But there are short cuts," argued Mr. Cameron.
p3n) "But there are short outs," argued Mr. Cameron.
===) =====================^=========================

p1p) at all. They get a scenario or resume of the
p3n) at all. They get a scenario or résumé of the
===) ================================^===^=======

p1p) citizens can read and write, and vast is
p3n) citizens can read and write, and vast
===) =====================================

p1p) author the prey of vultures who
p3n) author was the prey of vultures who
===) =======^^^=^^=^^^^^^^^^^^^^^^^^^^^^

p1p) When the accounts were found to be short,
p3n) When the acounts were found to be short,
===) ===========^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) a patronizing scorn, For a press of the Echo's
p3n) a patronizing scorn. For a press of the Echo's
===) ===================^==========================

p1p) the contrary it naively confessed that it was
p3n) the contrary it naïvely confessed that it was
===) ==================^==========================

p1p) was no easy task. It was a thankless job, anywy
p3n) was no easy task. It was a thankless job, anyway -- the
===) ==============================================^^^^^^^^^

p1p) -- the least interesting of any of the positions
p3n) least interesting of any of the positions
===) ^^^^^^^^^^^^^^^^^^^^^^=^====^^^=^^^^^^^^^

p1p) "How is your paper coming on, Paul?," he
p3n) "How is your paper coming on, Paul?" he
===) ===================================^^^^

p1p) "B -- u -- t-" stammered Paul and then
p3n) "B -- u -- t -- " stammered Paul and then
===) ============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "I -- I-" faltered Paul.
p3n) "I -- I -- " faltered Paul.
===) =======^^^^^^^^^^^^^^^^^^^^

p1p) "I don't quite-"
p3n) "I don't quite -- "
===) ==============^^^^^

p1p) "We'll talk no more about this matter today,"
p3n) "We'll talk no more about this matter to-day,"
===) ========================================^^^^^^

p1p) fifty-dollar bond I have"
p3n) fifty-dollar bond I have."
===) ========================^^

p1p) Mr. Carter winked
p3n) Mr. Carter winked.
===) =================^

p1p) "I see," he said
p3n) "I see," he said.
===) ================^

p1p) the machine's myriad advantages. wasn't it
p3n) the machine's myriad advantages. Wasn't it
===) =================================^========

p1p) March Hare Would branch out and be made
p3n) March Hare would branch out and be made
===) ===========^===========================

p1p) largest industries. we cannot do without
p3n) largest industries. We cannot do without
===) ====================^===================

p1p) gig had won the election, it is true, but it had been
p3n) had won the election, it is true, but it had been
===) ^^^=^^^=^^^=^^=^^^^^^^^^^^^^^^^^^^^^^=^^^^^^=^^^^

p1p) press rooms for striking oil proof when the
p3n) press rooms for striking off proof when the
===) ==========================^^===============

p1p) Paul had had time to become really downhearted,
p3n) Paul had had time to become really down-hearted,
===) =======================================^^^^^^^^^

p1p) their days. " I'm going to take you upstairs
p3n) their days. "I'm going to take you upstairs
===) =============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "I See"
p3n) "I see."
===) ===^==^^

p1p) cardboard, a sort of papier-mache, and by forcing
p3n) cardboard, a sort of papier-maché, and by forcing
===) ================================^================

p1p) "I See."
p3n) "I see."
===) ===^====

p1p) however, the Boston Post ventured an innovation
p3n) however, the Boston Post ventured an innovation by
===) ===============================================^^^

p1p) by arranging its presses one over the other,
p3n) arranging its presses one over the other,
===) ^^^=^^^==^^^^^^^^^^==^^^^^^^^^^^^^^^^^^^^

p1p) duty it was to load it on to a truck, carry it up-
p3n) it duty it was to load it on to a truck, carry it up-
===) ^^^^^^^=^^^^^^=^=^^^^^=^^=^^=^^^^^^^^^^^^^^^^^=^^^^^^

p1p) cast, the half sections of stereotype were put
p3n) cast, the sections of stereotype were put
===) ==========^^^^^^^^^^^^=^^^^^=^^=^^^^==^^^

p1p) and Paul Smiled in return.
p3n) and Paul smiled in return.
===) =========^================

p1p) fine articles from parents and distant
p3n) fine articles from patents and distant
===) =====================^================

p1p) alumnae. Judge Damon had taken to contributing
p3n) alumnæ. Judge Damon had taken to contributing
===) =====^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) and two of Burminghams graduates
p3n) and two of Burmingham's graduates
===) =====================^^^^^^^^^^^^

p1p) own, was far from being the same thing as returning
p3n) own, was far from being the same thing as returning it.
===) ===================================================^^^^

p1p) it. It was strange that it should be so
p3n) It was strange that it should be so
===) ^=^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "Because -- well -- it would be so yellow,"
p3n) "Because -- well -- it would be so darn yellow,"
===) ===================================^^^^^^^^^^^^^

p1p) "What else could we sell it out for, fathead?"
p3n) "What else could we sell it out for, fat-head?"
===) ========================================^^^^^^^

p1p) Deeker, rolling his eyes up to the ceiling with
p3n) Decker, rolling his eyes up to the ceiling with
===) ==^============================================

p1p) with the boy?'
p3n) with the boy?
===) =============

p1p) be confessing that he had failed in his mission,
p3n) be confessing that he had failed in his mission, -- nay,
===) ================================================^^^^^^^^

p1p) -- nay, worse than that, that he had not even
p3n) worse than that, that he had not even
===) ^^^^^^^^^^^^^^=^^^^^^^^^=^^^^^^^=^^^^

p1p) only that he dreaded... The knob turned
p3n) only that he dreaded.... The knob turned
===) =======================^^^^^^^^^^^^^^^^^

p1p) hollowing them out and tilling them up again
p3n) hollowing them out and filling them up again
===) =======================^====================

p1p) wont, in unselhsh fashion, to let every one else
p3n) wont, in unselfish fashion, to let every one else
===) ==============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) the five hundredth-time Don had been caught
p3n) the five hundredth -- time Don had been caught
===) ==================^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) me to deposit some money in the bank for him
p3n) me to deposit some money in the bank for him -- a
===) ============================================^^^^^

p1p) -- a hundred-dollar bill. I put the envelope in
p3n) hundred-dollar bill. I put the envelope in
===) ^^^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^

p1p) In fact," he continued, lapsing into seriousness,"
p3n) In fact," he continued, lapsing into seriousness,
===) =================================================

p1p) the younger generation teaches us
p3n) "the younger generation teaches us
===) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) Carneron was a big enough man to be forgiving.
p3n) Cameron was a big enough man to be forgiving.
===) ==^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

p1p) "An honest blunder is one thing; but premeditated
p3n) "An honest blunder is one thing; but pre-meditated
===) ========================================^^^^^^^^^^

p1p) and joy to the crowning event of l920's
p3n) and joy to the crowning event of 1920's
===) =================================^=====

p1p) course, the far-tamed March Hare. Its advent
p3n) course, the far-famed March Hare. Its advent
===) ================^===========================

p1p) when weary, sleepy, but triumphant, a half
p3n) when weary, sleepy, but triumphant, a half-jubilant,
===) ==========================================^^^^^^^^^^

p1p) jubilant, half-sorrowful lot of girls and boys
p3n) half-sorrowful lot of girls and boys
===) ^^^^^^^^^^^^^^^^=^^^=^^^^^=^^^^^=^^^^

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/10252551/attachment-0001.htm

From piggy at netronome.com Wed Mar 19 14:23:20 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Wed, 19 Mar 2008 17:23:20 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 02 In-Reply-To: References: Message-ID: <47E18448.1000708@netronome.com> Could I trouble you to calculate the number of changes which P1 and P1P made which were essentially identical? I'd like to see how well Polya's formula works. Bowerbird at aol.com wrote: > ok, let's cut right to the chase on this parallel test... > > this book did the normal d.p. p1/p2/p3 workflow... > > then p1 was repeated, from the original o.c.r. output. > > now, as i reported earlier, the two parallel versions of p1 had 376 differences between them.
i resolved all those, > by doing a quick visual check (without referring to scans, > so you can assume i made a few mistakes in there, sorry). ... From Bowerbird at aol.com Wed Mar 19 15:48:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 18:48:03 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 02 Message-ID: piggy said: > Could I trouble you to calculate the number of changes > which P1 and P1P made which were essentially identical? i'd assume you want the number of _meaningful_changes_ they made which were identical, but that takes lots of work, because one has to weed out all the meaningless changes... and, if you'd accept the meaningless changes in the count, the number quite likely runs in the _thousands_ once again, which fractures the assumptions of any statistics you'd use... once d.p. straightens out its policies to eliminate all of the meaningless changes, that data will just fall into your lap... until then, the benefit of computing it won't justify the cost. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/e38a10c3/attachment.htm From Bowerbird at aol.com Wed Mar 19 19:04:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 19 Mar 2008 22:04:15 EDT Subject: [gutvol-d] stopping perpetuity Message-ID: ok, we're done with iteration#5 of the perpetual proof. yay! for iteration#6, please fix the spacey ellipses on these pages: > 1 6 23 24 29 155 157 the last two -- pages 155 and 157 -- had spacey ellipses _introduced_, so let's jump on these pages quickly this time, and get them saved right, so we can pursue no-diff nirvana... but even more importantly, please pay attention to page 33! 
you will please find, on page 33, this line, and correct it as shown:

> around for a couple of weeks. Then he came, into the shop
> around for a couple of weeks. Then he came into the shop

specifically, delete the comma after the words "he came". there is no comma there, folks. never was, never will be. there _is_ an eensy-teensy-weensy little speck on the scan; but it's _so_ small, i'm not sure how o.c.r. saw it as a comma. it's so scrawny it couldn't even be considered as a _period_... let alone have the "tail" that would turn it into a _comma_, but o.c.r. put a comma there, and now it's _our_ job to take it out... this is the very last error in this book! the last one! please fix it! so remember page#33! if you're the first in on this iteration, in fact, click through until you get to page 33 and fix it _now_! then go back to fix the spacey ellipses on 1, 6, 23, 24, and 29.

page#1
> screen! Maybe meteors ... More blips--and
> fragile vehicles. Air puffed out ... and Nelson

page#6
> --the first time ..."

page#23
> the rough stuff to come, when we blast out! ... Hey, Eileen--you

page#24
> So soon ... Pop...."

page#29
> the Asteroid Belt ... Mars? That was the heebie-jeebie planet.

page#155
> " ... Frank, Gimp, Two-and-Two, Paul, Mr. Reynolds,

page#157
> can remember what's Out There ... Serene, bubb, Belt, oh yeah,

on the one on page 155, delete the space on both sides. *** all the other pages are right, so don't mess with them, at all. because if you do, you better be darn sure you've got it right. if you introduce a new error, we _will_ hunt you down, son... so anyway, have a nice day... thank you for your cooperation! book-wide proofing rules! -bowerbird p.s. we saved formatting done by f1/f2 -- thanks, chaps! -- and we will be introducing that in one book-wide operation (no fuss, no muss, no diffs, just spliffs, we flyin' now, matey!), so we're close to finishing this book! keep up the good work! p.p.s.
once we've stopped the perpetual proofing machine, we'll be able to step into the tear in the space-time curtain, and we're perched on the event horizon of the black hole... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080319/29c8d751/attachment.htm From Bowerbird at aol.com Thu Mar 20 00:34:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 03:34:03 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 03 Message-ID: i wrote most of this some time back. yes, it's still applicable. i told you there were reasons i'd tell you later. here they are. *** over on the "confidence in page" wiki-page on the d.p. wiki, in addition to the "perpetual p1" experiment i've discussed, they note the presence of two parallel proofing experiments. so... what to make of this?... first, parallel proofing works. it has an excellent track-record -- coming from the "double-punch" method of keypunchers -- and has been validated in several experiments performed by me and documented extensively, right on the forum boards at d.p. also, as i remarked a while back, the "perpetual p1" experiment gave us additional proof of the value of parallel proofing, since the regular-workflow and the p1-iterations were parallel modes which -- in combination -- found more errors than either alone. whether parallel proofing works _better_ than serial proofing is an open question. i'm not all too sure that it does, and since it involves redundant work -- i.e., having multiple proofers find and fix the same mistakes -- it doesn't appeal to me very much. however, since d.p.
is now wasting the time and energy of its volunteers in so many blatant ways, this bit of redundancy in parallel proofing pales to near insignificance in comparison... so i _might_ be interested in these tests... at the same time, the purpose of the two parallel proofing tests is unclear enough that i cannot say for certain what it might be, so that has made me fairly reluctant to even look at their data... even worse, when i saw the o.c.r., i was appalled and dismayed. all of the blank lines between paragraphs were lost in this o.c.r.! meaning that the _proofers_ had to reinsert them _manually_! in both books! amazing! that's disgusting. an error like that in the execution of the o.c.r. should be fixed by the person who _did_ the o.c.r., not proofers. but this is typical of d.p. workflow. people make bad mistakes -- mistakes which they never should have made, which would be easy for them to fix -- yet the proofers have to clean it up. i'm not saying that it's _difficult_ for the proofers to have to repair something like this. it's just putting a cursor into the right place in the text-field and then pressing the return key, repeating for as many times as necessary on any given page. so it's trivially _easy_ to correct these. but it's also _numbing_ to have to fix literally hundreds and hundreds of such errors. and it's fully _unnecessary_ to have the errors in the first place. someone has just set the options wrong on the o.c.r. interface, which means that it's also quite demeaning. frankly, an insult. and talk about "error injection"! _this_, folks, is error injection! a proofer adding some spaces around an ellipse? aw-psshaw. that's kid-stuff. how much damage can you really do that way? but when that one checkbox in the o.c.r. settings box was wrong, literally hundreds and hundreds of _errors_ were _injected_ into this file. hundreds and hundreds! _that's_ how you "inject errors", my boy. 
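(and note: a blunder like this is trivially machine-detectable _before_ a single page reaches a proofer. a little pre-flight check along these lines -- the threshold and the api are my own assumptions, not anything d.p. actually runs -- would have flagged both books instantly:)

```python
# pre-flight sanity check: flag o.c.r. output whose blank lines
# (the paragraph breaks) were lost to a bad settings checkbox.
def paragraph_breaks_suspicious(ocr_text, min_ratio=0.01):
    """True if the share of blank lines is implausibly low for book text."""
    lines = ocr_text.splitlines()
    if not lines:
        return True
    blanks = sum(1 for line in lines if not line.strip())
    return blanks / len(lines) < min_ratio

good = "para one, line 1\npara one, line 2\n\npara two, line 1\n"
bad = "para one, line 1\npara one, line 2\npara two, line 1\n"
print(paragraph_breaks_suspicious(good), paragraph_breaks_suspicious(bad))  # False True
```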
as i said earlier, we need to determine _who_ did this and take them aside for a little chat where we will kindly instruct them what they did wrong, and have them promise to never do it again. because what happened here is inexcusable. and the error should've been corrected before _any_ of this text went in front of one single proofer. not one page. not one proofer. because this is simply unacceptable... un-ac-cep-ta-ble. totally... it shows extreme disrespect for the time and energy that are being _donated_ to the cause of _the_digitization_ of the _public_domain_. but wait... because the problem gets even worse... not only was the paragraphing lost in this o.c.r., due to carelessness, but also the o.c.r. itself shows page after page of _systematic_defects_. for the first parallel proofing experiment, paul and the printing press, there are _44_ pages that demonstrate clipping of one side of the text: > 16 18 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 > 52 56 72 76 78 80 82 84 86 88 90 96 98 100 166 170 > 172 174 178 180 188 190 192 194 196 198 200 the other parallel proofing book christopher and the clockmakers, is even worse, with some _56_ pages with text that is badly clipped: > 20 36 38 44 88 90 92 98 126 128 130 132 134 136 140 142 > 144 146 148 150 152 154 156 158 160 162 164 166 168 > 170 172 174 176 178 180 188 194 200 202 204 206 208 212 > 214 216 218 220 222 224 226 228 230 234 236 240 246 i've posted text from the "paul" pages, to show how terrible they are: > http://z-m-l.com/misc/paul_bad_pages.html in addition, i've appended one page of this text to this post for you... proofers had to first erase the junk, then type in the text from the scan, for dozens and dozens of lines, on dozens and dozens of pages. wow. i don't know about you, but _i_ would be downright embarrassed to even _show_ that o.c.r. to another person, let alone ask them to fix it. 
honestly, i don't know if it's so bad because the _scans_ were clipped, or whether the "scanning zones" were incorrectly set in the o.c.r. app. but whichever it was, _someone_ should have fixed that problem first, instead of just shrugging the shoulders and passing through bull crap for someone further down the line to clean up. this is just _disgusting_. and let me say it a third time, so it really sinks in. this is _disgusting_... so -- just like "planet strappers" -- where an incompetent human made a bad mistake by changing all of the em-dashes into en-dashes instead, leaving the poor proofers to _manually_ change 1,137 back, one at a time, here too (in two books!) an incompetent content provider has caused grief. whoever that "someone" might be, they should be ashamed of themselves. and it's not like i _picked_ these examples, due to some "agenda" of mine. all this research was conceived and conducted by d.p. people themselves, who seemingly have become immune to their incompetence, and consider themselves to be "justified" when they get angry at people who point it out. and it's not some "fluke" that these books were badly flawed either. the truth is, i've examined lots and lots of d.p. books at the various stages, and the vast majority of them are flawed in significant and pervasive ways, and these flaws dump tons and tons of unnecessary work on volunteers... the incompetence shown over there is -- only one word for it -- stunning. -bowerbird p.s. here's one page of the flawed text from "paul and the printing press". this book appears to have junk in the margin, rather than pages clipped, but the result is the same -- unnecessary work for the proofers to correct. -> p#036 ft r. So you've come to explore the repairing de- |artment, have you? The informality of the greeting was delightful (ho Christopher, and immediately his heart went out |gm the old Scotchman. |~(|V " I guess so, yes," smiled he. " I didn't know I |{was going to though. It just happened." 
V:'| " It's not a bad happen, perhaps. Make your- |jself at home, laddie. Here's a stool." | ~" I'd rather stand and watch you." |V` " But I sha'n't let you. It makes me nervous to V |{have somebody hanging over my shoulder and |jmaybe jogging my elbow. If you're to stay you |(must sit," was the brusque but not unkindly x |fanswer. (g41;t Somewhat crestfallen the boy slipped to the |(gotool and for a few moments remained immovable, |'Watching the workman's busy fingers. How care- xgsfully they moved--with what fascinating deftness |and rapidity! ~.^J*| " I see you are not one to keep hitching and |Jtwiddling around," the clockmaker presently re- t |arked, with a twinkle. " We shall get on | = ously together. I detest nervous people." A gig| " Are you fixing the clock Mr. Bailey was ask- |;-~|g about?" Christopher ventured. |sum;" Not just now, sonny. I am finishing up a | job. I shall go back to her in a minute, `;|Tjtowever. You can't just tinker her at will as you | common clocks. She has to be dreamed over." V it| " Dreamed over!" repeated Christopher, not a ` `ijl|attle puzzled. | ' "Aye, dreamed over! Well-nigh prayed over -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/7be9f41e/attachment-0001.htm From richfield at telkomsa.net Wed Mar 19 13:15:13 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Wed, 19 Mar 2008 22:15:13 +0200 Subject: [gutvol-d] Gothic or Gothic? Thanks folks! Message-ID: <47E17451.9080401@telkomsa.net> I wrote asking for advice on scanning Fraktur. (Absent-mindedly claiming to have Omniscan, which afaik is the one universal and error-free scanning software package; I meant of course Omnipage, which is not.
(Not bad actually, but...)) Steven desJardins wrote: >I (Jon) had asked: >> BTW, just as a matter of curiosity, what is the copyright situation with >> Hitler's works? I know that it has lapsed in Australia and presumably >> Canada, but it should nominally be in copyright in the US. Is it >> regarded as such, and if so, is it an academic question, or would it be >> enforced, and if so , by whom? Steve replied: According to Wikipedia, "The U.S. government seized the copyright during the Second World War as part of the Trading with the Enemy Act and in 1979, Houghton Mifflin, the U.S. publisher of the book, bought the rights from the government. " < Thanks Steve. I cannot help wondering whether HM made any profit on the deal; I seldom see a copy. ================ Robert Cicconetti replied: >Fraktur fonts are difficult to OCR well; I have not tried in a while, but I understand older versions of OCR software actually do better (for Finereader, it was v5 or v6; can't recall) as they make fewer assumptions about the typeface. There has also been some work done on the open-source OCR engine Tesseract by piggy, a member of DP; I have not used it myself so I cannot comment on how well it works as yet. I can say that I spent many hours trying to train FR7 to understand Fraktur and other blackletter fonts, and got absolutely nowhere.< Thanks Robert, That sounds like an admonition not to be in too big a hurry. Fortunately I have more on my fork at the moment than I can manage! =================== La Monte H.P. Yarroll wrote: > The OCR package tesseract now has usable fraktur support. You want to use the deu-f language package. If you find pages that don't OCR well, send them to me and I'll fix the tesseract training to work better with them.< Much thanks sir. I am archiving this note of course, and if we both survive long enough, I'll take you up on your helpfulness! ================== To all: I much appreciate your responses. 
Go well and thank you, Jon From piggy at netronome.com Thu Mar 20 06:47:20 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 09:47:20 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 02 In-Reply-To: References: Message-ID: <47E26AE8.9040404@netronome.com> Bowerbird at aol.com wrote: > piggy said: > > Could I trouble you to calculate the number of changes > > which P1 and P1P made which were essentially identical? > > i'd assume you want the number of _meaningful_changes_ > they made which were identical, but that takes lots of work, > because one has to weed out all the meaningless changes... You are correct in surmising that I'm interested in your "real errors" metric. > > and, if you'd accept the meaningless changes in the count, > the number quite likely runs in the _thousands_ once again, > which fractures the assumptions of any statistics you'd use... I think the wdiff alterations metric (changed + inserted + deleted as reported by wdiff -s) would be interesting and potentially useful. Presumably the ratio of "meaningless changes" to "real errors" is fairly consistent between the parallel rounds. > > once d.p. straightens out its policies to eliminate all of the > meaningless changes, that data will just fall into your lap... My focus is in devising an automated metric which can ignore most of the "meaningless changes". I think you will agree that automation is much easier to deploy than social engineering. That doesn't mean it isn't worth doing the social engineering, but technological changes have substantially lower inertia than social changes. > > until then, the benefit of computing it won't justify the cost. Have you played with ocrdiff? I think it has a really good chance of approximating your metric well enough to be usable. Even if you use a different starting point, I am very interested in a fully automated tool that can approximate your metric. 
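The wdiff alterations metric piggy describes (changed + inserted + deleted, as reported by wdiff -s) is easy to approximate in a few lines. Here is a minimal sketch using Python's difflib in place of wdiff itself; the sample page strings are hypothetical, and difflib's matching will not agree with wdiff word-for-word, but the count it produces is the same kind of number:

```python
# Sketch: approximate a wdiff-style alterations count
# (changed + inserted + deleted words) between two versions
# of a page. difflib stands in for wdiff here; the sample
# strings below are hypothetical.
import difflib

def alterations(old_text, new_text):
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words,
                                      autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # a changed run counts once per word on the longer side
            total += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            total += i2 - i1
        elif tag == "insert":
            total += j2 - j1
    return total

ocr_page = "So you 're Paul Cameron. I 've had deal-"
p1_page = "\"So you're Paul Cameron. I've had dealings"
print(alterations(ocr_page, p1_page))
```

Counting words rather than lines keeps the number comparable across rounds, though it still lumps meaningless changes in with real errors, which is exactly the ratio question raised above.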
If you feel that you have invested as much into this project as you find necessary, I'll understand. I certainly wish to thank you for your contributions to date. > > -bowerbird > From piggy at netronome.com Thu Mar 20 06:58:08 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 09:58:08 -0400 Subject: [gutvol-d] stopping perpetuity In-Reply-To: References: Message-ID: <47E26D70.5090604@netronome.com> Bowerbird at aol.com wrote: > ok, we're done with iteration#5 of the perpetual proof. yay! > Wow! That was really fast. Hmm... A quick analysis shows that about half the work was done by a single relatively new proofer. This newby went from about 40 pages to over 130 working on this project. They were also spending something like 10-15 seconds per page rather than the 2-5 minutes per page that everybody else was applying. I think this may skew the results of this round a little. I'll PM them to thank them for their enthusiasm but to request closer attention next round. From piggy at netronome.com Thu Mar 20 07:09:39 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 10:09:39 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: References: Message-ID: <47E27023.1020208@netronome.com> Bowerbird at aol.com wrote: > i wrote most of this some time back. yes, it's still applicable. > > i told you there were reasons i'd tell you later. here they are. > ... > p.s. here's one page of the flawed text from "paul and the printing > press". > this book appears to have junk in the margin, rather than pages clipped, > but the result is the same -- unnecessary work for the proofers to > correct. > > -> p#036 > ft r. So you've come to explore the repairing de- > |artment, have you? > The informality of the greeting was delightful > (ho Christopher, and immediately his heart went out > |gm the old Scotchman. > |~(|V " I guess so, yes," smiled he. " I didn't know I > |{was going to though. 
It just happened." > V:'| " It's not a bad happen, perhaps. Make your- > |jself at home, laddie. Here's a stool." > | ~" I'd rather stand and watch you." > |V` " But I sha'n't let you. It makes me nervous to > V |{have somebody hanging over my shoulder and > |jmaybe jogging my elbow. If you're to stay you > |(must sit," was the brusque but not unkindly > x |fanswer. > (g41;t Somewhat crestfallen the boy slipped to the > |(gotool and for a few moments remained immovable, > |'Watching the workman's busy fingers. How care- > xgsfully they moved--with what fascinating deftness > |and rapidity! > ~.^J*| " I see you are not one to keep hitching and > |Jtwiddling around," the clockmaker presently re- > t |arked, with a twinkle. " We shall get on > | = ously together. I detest nervous people." > A gig| " Are you fixing the clock Mr. Bailey was ask- > |;-~|g about?" Christopher ventured. > |sum;" Not just now, sonny. I am finishing up a > | job. I shall go back to her in a minute, > `;|Tjtowever. You can't just tinker her at will as you > | common clocks. She has to be dreamed over." > V it| " Dreamed over!" repeated Christopher, not a > ` `ijl|attle puzzled. > | ' "Aye, dreamed over! Well-nigh prayed over If you would care to implement a gutter noise removal algorithm for tesseract, I would certainly be happy to see the contribution. Most libraries I borrow books from are not willing to let me cut their books up so that I can get perfectly flat scans. Your skills at alienating your advocates continue to impress me. From piggy at netronome.com Thu Mar 20 07:14:06 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 10:14:06 -0400 Subject: [gutvol-d] Gothic or Gothic? Thanks folks! In-Reply-To: <47E17451.9080401@telkomsa.net> References: <47E17451.9080401@telkomsa.net> Message-ID: <47E2712E.5030609@netronome.com> Jon Richfield wrote: > I wrote asking for advice on scanning Fraktur. 
(Absent-mindedly > claiming to have Omniscan, which afaik is the one universal and > error-free scanning software package; I meant of course Omnipage, which > is not. (Not bad actually, but...)) > ... I'm actively interested in improving tesseract OCR's fraktur performance. Could you point me at a few of your pages? I'll let you know how well we're doing with the limited fraktur training we have so far. From rolsch at verizon.net Thu Mar 20 08:17:29 2008 From: rolsch at verizon.net (Roland Schlenker) Date: Thu, 20 Mar 2008 10:17:29 -0500 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <47E27023.1020208@netronome.com> References: <47E27023.1020208@netronome.com> Message-ID: <200803201117.29396.rolsch@verizon.net> On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > If you would care to implement a gutter noise removal algorithm for > tesseract, I would certainly be happy to see the contribution. Most > libraries I borrow books from are not willing to let me cut their books > up so that I can get perfectly flat scans. If you use the program "unpaper" before OCR'ing. Most of the gutter noise is removed. Roland Schlenker From Bowerbird at aol.com Thu Mar 20 11:17:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 14:17:42 EDT Subject: [gutvol-d] stopping perpetuity Message-ID: piggy said: > I think this may skew the results of this round a little. you look very very closely and let me know if it does... :+) -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/9a5a8494/attachment.htm From nwolcott2ster at gmail.com Thu Mar 20 11:38:23 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Thu, 20 Mar 2008 13:38:23 -0500 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> Message-ID: <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> Is unpaper available for us windows users? nwolcott2 at post.harvard.edu ----- Original Message ----- From: "Roland Schlenker" To: "Project Gutenberg Volunteer Discussion" Sent: Thursday, March 20, 2008 10:17 AM Subject: Re: [gutvol-d] parallel -- paul and the printing press -- 03 > On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > > > If you would care to implement a gutter noise removal algorithm for > > tesseract, I would certainly be happy to see the contribution. Most > > libraries I borrow books from are not willing to let me cut their books > > up so that I can get perfectly flat scans. > > If you use the program "unpaper" before OCR'ing. Most of the gutter noise is > removed. > > Roland Schlenker > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From grythumn at gmail.com Thu Mar 20 11:46:46 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu, 20 Mar 2008 14:46:46 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> Message-ID: <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> It's easy enough to build binaries (I use cygwin). I don't know if anyone has packaged them or built a gui around it. I do recommend turning down the defaults a bit.. 
I think it was tuned for processing scanned photocopies, and it is rather overaggressive on my scans. Bob On Thu, Mar 20, 2008 at 2:38 PM, Norm Wolcott wrote: > Is unpaper available for us windows users? > > nwolcott2 at post.harvard.edu > ----- Original Message ----- > From: "Roland Schlenker" > To: "Project Gutenberg Volunteer Discussion" > Sent: Thursday, March 20, 2008 10:17 AM > Subject: Re: [gutvol-d] parallel -- paul and the printing press -- 03 > > > > On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > > > > > If you would care to implement a gutter noise removal algorithm for > > > tesseract, I would certainly be happy to see the contribution. Most > > > libraries I borrow books from are not willing to let me cut their > books > > > up so that I can get perfectly flat scans. > > > > If you use the program "unpaper" before OCR'ing. Most of the gutter > noise > is > > removed. > > > > Roland Schlenker > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/d3c9fdd9/attachment-0001.htm From Bowerbird at aol.com Thu Mar 20 11:50:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 14:50:11 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 03 Message-ID: piggy said: > If you would care to implement a gutter noise removal algorithm > for tesseract, I would certainly be happy to see the contribution. oh please no. you're using _tesseract_ to o.c.r. books? why in the world would you use a _beta_ o.c.r. program? because it didn't cost you anything? it's costing your volunteer proofers _lots_ of time and energy, time and energy they donate, in good faith, to a good cause... when you treat these people like guinea piggies, they will leave, and never come back again. is that really what you want to do? even if it's what you want to do, do you have the _right_ to do it? the d.p. 
retention rate sucks, in spite of the cultish "friendliness". maybe some people might wonder why, but i'm not one of them. so, what to do? there are plenty of d.p. volunteers who have abbyy finereader -- the acknowledged leader in accuracy -- so have _them_ do the scanning if that's what it takes to get decent o.c.r. output... besides, the loss of paragraphing, which was a fault noticeable in 10 seconds, is not something tesseract always causes, is it? so that was an _operator_ mistake, error injection at its finest. (and if tesseract _does_ always lose the paragraphing, then that's even _more_ reason why nobody should use it at d.p.) > Most libraries I borrow books from are > not willing to let me cut their books up > so that I can get perfectly flat scans. oh please. just because you're off chasing after the broom of the wicked witch of the west doesn't mean you can bring mr. strawman into the argument. you don't need "perfectly" flat scans. you just need scans that don't have a ton of noise. if the scans were bad for these books, then find better scans! or do consultation with the d.p. image-manipulation experts. or find another d.p. volunteer who can _create_ better scans... but do _not_ just accept the bad scans and dump awful o.c.r. on the proofers, and expect them to essentially do a type-in. because that is _disgusting_. and hey, i'm really sorry if that hurts your feelings, but it's the truth, and you need to know it. and lots and lots of _other_ d.p. content providers need to be confronted with the truth of the incompetence of their efforts. > You skills at alienating your advocates continue to impress me. with advocates like this, who needs detractors? i have _truth_ on my side. and _common_sense_. and tons and tons of data i can display any time... and more lurkers who will step out from the shadows in support of me if i ask them than you might expect... which is _not_ to say that i would reject any "advocate". 
but if you really wanna be an advocate of mine, then you'd better understand that point #1 on my plan is to get good scans, and point #2 is to do quality o.c.r. there are 8 points after that, but start with #1 and #2. -bowerbird > just to remind them, off the dome, this is what needs to be done: > 1. ensure you have decent scans, and name them intelligently. > 2. use a decent o.c.r. program, and ensure quality results. > 3. do not tolerate bad text handling by content providers. > 4. do a decent post-o.c.r. cleanup, before _any_ proofing. > 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes). > 6. change the ridiculous ellipse policy to something sensible. > 7. stop doing small-cap markup with no semantic meaning. > 8. i forget what 8 was for. > 9. retain pagenumber information, in an unobtrusive manner. > 10. format the ascii version using light markup, for auto-html. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/228da4af/attachment.htm From Bowerbird at aol.com Thu Mar 20 13:25:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 16:25:34 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 02 Message-ID: first of all, i made a big mistake yesterday... i put on a webpage the o.c.r. from the bad pages of "paul and the printing press". or so i thought. in actuality, i pulled pages from the wrong book -- the "christopher and the clockmakers" book which is the second "parallel" experiment text... so... if you took a look at that o.c.r. and thought "i dunno, it doesn't look all _that_ bad to me...", then you might want to go take another gander: > 
http://z-m-l.com/misc/paul_bad_pages.html also, as i intended to do in my previous message, i have appended one page of o.c.r. -- page 36 -- from the "paul and the printing press" book, and i've now also included the text as it was corrected, so you get the full flavor of how bad the o.c.r. is... (because of _human_error_ while doing the o.c.r.) to repeat, i'd be embarrassed to show this to people, let alone actually shovel it to volunteers to _correct_. oh, and remember, because this was research on _parallel_proofing_, p1 proofers were subjected to this onerous o.c.r. _twice_, which is a real travesty. *** piggy said: > You are correct in surmising that > I'm interested in your "real errors" metric. as i told you, weed out the meaningless stuff, and whatever you have left are "the real errors". i didn't do this for "paul and the printing press". the differences i showed were just _differences_, i didn't present them as "errors". however, when you see the paired lines, it's usually fairly easy to see which of those two is the one that is in error. (for the record, i was able to identify the bad line successfully on 77 of the 87 line-pairs i'd listed. and this file, with its cut-off lines, was really hard.) > I think the wdiff alterations metric > (changed + inserted + deleted as reported by wdiff -s) > would be interesting and potentially useful. yeah, right. on the garbage o.c.r. you have in the parallel tests, you aren't gonna find _any_ statistic that's "useful". garbage-in-garbage-out. it's a law you can't break. > Presumably the ratio of "meaningless changes" > to "real errors" is fairly consistent > between the parallel rounds. get real. you cannot even reliably _count_ the number of meaningless changes when you have garbage o.c.r. i've appended the o.c.r. of one page, and the p3 output. ask ten people to count the "meaningful" changes and you'll get ten different answers. 
and sure, you _could_ settle onto one metric, and use just that, but you won't get any predictive power out of it, not in the big picture. and when you throw in the d.p. policy meaninglessness, things like rejoining hyphenates and clothing em-dashes and that rot, you're just piling on more ridiculousness... if you get anything out of that mess, god blessed you... but strip away that nonsense, and things are crystal clear. with good scans and good o.c.r., pages have -- at most -- a half-dozen errors, and the p1 proofers get most of them, and subsequent rounds whittle 'em away until there are 0. real errors get found, and they get fixed. and that's it... it happens on page after page, day after day, over at d.p. > My focus is in devising an automated metric > which can ignore most of the "meaningless changes". good luck with that. > I think you will agree that automation is > much easier to deploy than social engineering. well, if -- by "social engineering" -- you mean "convincing" the-powers-that-be (as they humorously call themselves) over at d.p. to change their evil ways, well then maybe just maybe you will find it easier to develop an automated metric. but you'll be a flying piggy by that time, and will most likely find it more fulfilling to be flitting among the clouds instead. > Have you played with ocrdiff? don't know what it is, and probably don't much care. since i don't allow any meaningless noise to get inside my data in the first place, i have no need to tune it out. good data just falls into my lap. yes, it really is that simple. > I think it has a really good chance of > approximating your metric well enough to be usable. yeah, well, you let me know when that happens. > Even if you use a different starting point, I am very interested > in a fully automated tool that can approximate your metric. i'm curious why you keep calling it a "metric"... as if it were some kind of stand-in variable for o.c.r. errors. it's not. i simply locate the o.c.r. 
errors, and i count them... it's not hard to locate the o.c.r. errors. it's ridiculously easy... you just take the text as it's been proofed as close to perfection as you can get it, and then you compare it to the original o.c.r., and the places where the lines differ might be the o.c.r. errors... (they might also be places where the transcriber made a change.) note, as i've remarked before, if you go look at the page-images, you'll often find that the o.c.r. didn't really make an "error" per se, but (for example) recognized a speck as a period, or a comma, or made some other recognition decision that's fully understandable. oh, it makes _actual_errors_ too -- sometimes inexplicable ones -- but for the most part, it's usually easy to see why it did what it did... *** and certainly -- in the case of the page i have appended below -- you can understand why i'm reluctant to call those "o.c.r. errors"... no siree, those "errors" are due to what we programmers label as p.e.b.k.a.c. -- a.k.a., "problem exists between keyboard and chair". -bowerbird p.s. here's the o.c.r. from page 36 of "paul and the printing press", followed by the text as it was proofed by the p1-p2-p3 workflow... -> p#036 In spite of Paul's optimism he was more than of Melvil1e's opinion. g g_: Mr. Carter was well known throughout ingharn as a stern, austere man whom = le feared rather than loved. He had the of being shrewd, closefsted, and at a bargain,-a person of few friends g-gig? many enemies. I-Ie was a great lighter, t; ng a grudge to any length for the sheer Rk ure of gratifying it. Therefore many a re mature and courageous promoter than ii Cameron had shrunk from approaching with a business proposition. Even Paul did not at all relish the mission be if ore him; he was, however, too manly to shirk ^Rl Hence that evening, directly after dinner, fjmade his way to the mansion of Mr. Arthur ` by Carter, the wealthy owner of the Echo, jirmingham's most widely circulated daily. 
fggortunately or unfortunately-Paul was in which-the capitalist was at home at leisure; and with beating heart the boy T;' ushered into the presence of this illustrious eman. Carter greeted him politely but with no ixdiality. So you 're Paul Cameron. I 've had deal- Jl;T t with your father," he remarked dryly. $@1 A = t can I do for you?" Iii courage ebbed. The question was j= = and direct, demanding a reply of similar In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the reputation of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. "So you're Paul Cameron. I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/4e58af3f/attachment.htm From donovan at abs.net Thu Mar 20 15:27:44 2008 From: donovan at abs.net (D Garcia) Date: Thu, 20 Mar 2008 18:27:44 -0400 Subject: [gutvol-d] Unpaper for DOS/Windows (WAS: Re: parallel -- paul and the printing press -- 03) In-Reply-To: <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> References: <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> Message-ID: <200803201827.44543.donovan@abs.net> On Thursday 20 March 2008 14:38, Norm Wolcott wrote: > Is unpaper available for us windows users? Back in 2006 I built a standalone DOS executable of it for the Windows folks. zip file here: http://www.pgdp.org/~donovan/unpaper-0.2.zip That is probably not up to date with the current source, but if you have the lcc-win32 or other compiler (not cygwin, you can get into dll-hell going that route) you can always rebuild from the current source. From gbnewby at pglaf.org Thu Mar 20 18:25:13 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Thu, 20 Mar 2008 18:25:13 -0700 Subject: [gutvol-d] Moderation/censorship Message-ID: <20080321012513.GB22705@mail.pglaf.org> I received a request to moderate or otherwise quiet Bowerbird on gutvol-d. This request was based on an opinion that he has been behaving poorly. Unfortunately I somehow deleted the message, so am not certain who it was from. Therefore, I'm responding here: The answer is: no, I will not turn on moderation or remove list members at this time. This topic has been hashed over several times in the past, so I'm not going to try to retype the history... maybe some other people would like to. The bottom line is that IF you want a moderated list, put together a *team* of moderators and we'll make a moderated list. I'm personally unwilling to take on that responsibility. (The Project Wombat lists are good examples of multiple lists with different levels of moderation.) 
I'm happy to set up additional mailing lists. The moderated list could, in one scenario, consist mostly of filtered postings from gutvol-d. It's up to the moderator team. I do insist that any list, moderated or not, be open to any and all subscribers. But in a moderated list, the moderators decide which messages go to the list, and which ones do not. People who do not like the moderation policy or practice are, of course, welcome to start their own list. Because I have not been following the threads closely, I am not going to offer an opinion on appropriate versus inappropriate behavior. It seems there has been plenty of rough talk, though, which I simply delete. As always, I encourage people to make maximum use of their email systems to block individuals, threads, subjects, etc. which they would rather not see. -- Greg From piggy at netronome.com Thu Mar 20 18:26:59 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Thu, 20 Mar 2008 21:26:59 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> Message-ID: <47E30EE3.2040208@netronome.com> Robert Cicconetti wrote: > It's easy enough to build binaries (I use cygwin). I don't know if > anyone has packaged them or built a gui around it. > > I do recommend turning down the defaults a bit.. I think it was tuned > for processing scanned photocopies, and it is rather overaggressive on > my scans. I have been limiting its use to very narrow cases because of how aggressive it is. What settings do you find best for 8-bit grayscale documents scanned from original books? I have not found settings I have been happy with. > > Bob > > On Thu, Mar 20, 2008 at 2:38 PM, Norm Wolcott > wrote: > > Is unpaper available for us windows users? 
> > nwolcott2 at post.harvard.edu > ----- Original Message ----- > From: "Roland Schlenker" > > To: "Project Gutenberg Volunteer Discussion" > > > Sent: Thursday, March 20, 2008 10:17 AM > Subject: Re: [gutvol-d] parallel -- paul and the printing press -- 03 > > > > On Thursday 20 March 2008 10:09:39 am La Monte H.P. Yarroll wrote: > > > > > > If you would care to implement a gutter noise removal > algorithm for > > > tesseract, I would certainly be happy to see the contribution. > Most > > > libraries I borrow books from are not willing to let me cut > their books > > > up so that I can get perfectly flat scans. > > > > If you use the program "unpaper" before OCR'ing. Most of the > gutter noise > is > > removed. > > > > Roland Schlenker > > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Thu Mar 20 18:51:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 20 Mar 2008 21:51:14 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 05 Message-ID: a "metric"? you want a "metric"? i can give you a "metric". one that's so easy to compute a computer can even do it. i'm not gonna process the numbers for the whole book, but i think it's fine to look at a single page for education. so let's look at page 36 from "paul and the printing press". there are 31 lines on this page. thirty-one. 31. four versions of the page's text -- o.c.r., p1, p2, and p3 -- are presented below, for your enjoyment and edification... refer to the versions frequently, as need be, as you follow along with this brief analysis of the life of this little page... after o.c.r., the page went through the normal p1-p2-p3, so i have used these labels herein: ocr, p1n, p2n, and p3n. *** o.c.r. got 1 line right... out of 31 lines... pathetic... 
for the record: > In spite of Paul's optimism he was more than *** p1n fixed 28 of the 30 error-ridden lines correctly, except 2: ocr> of being shrewd, closefsted, and p1n> of being shrewd, close-fisted, and <- fixed scanno, but... p2n> reputation of being shrewd, close-fisted, and <-bingo p3n> reputation of being shrewd, close-fisted, and in this first case, p1n fixed the scanno that was there, but also missed the fact that a (big) word had been completely cut off from the text. missing words can be difficult to spot, which is why you want to make sure o.c.r. doesn't cut any off. also, missing words are extremely hard to catch automatically. p2 came along and fixed this error. thank you. the second line where p1 missed an error was this: ocr> So you 're Paul Cameron. I 've had deal- p1n> So you're Paul Cameron. I've had dealings <- rejoined, but p2n> So you're Paul Cameron. I've had dealings <- missed one... p3n> "So you're Paul Cameron. I've had dealings <-bingo in this second case, p1n rejoined the end-line-hyphenate fine, but missed that an opening quotemark was absent at the start of the line. likewise, p2 missed it too. p3 fixed this error, so it was a persistent one. however, this kind of error is _easy_ to spot via automated analysis -- a simple routine detects unbalanced quotemarks in a paragraph -- so we don't have to take it seriously. so even the two errors that "slipped by" could well have been avoided. good scanning would have prevented that word from being chopped, and a good clean-up tool would have _alerted_ us to the quotemark... all in all, a _fantastic_ job by p1. but boy wasn't that o.c.r. awful? my goodness! awful! p1 had to fix 30 out of 31 lines! that's 96.77% _bad_. phew! there's your metric, the percentage of lines changed in p1. you told everyone you wanted a metric. there's your metric. *** p2n corrected 1 of the 2 errors with which it was faced. well, it could have done better. and it could have done worse. 
not much more that you can say after that. *** p3n corrected the 1 error with which it was faced... (if there are more errors on the page -- it's possible -- then all of the normal p1-p2-p3 rounds missed them.) *** ok, so i'm just gonna throw out a round guesstimate and say that p1 made 98 corrections... and we know p2 and p3 made 1 each... so p1 had an accuracy-rate of 98%, p2 had a rate of 50%, and p3 -- by definition -- had an accuracy-rate of 100%. how 'bout those p1 proofers? really something, aren't they? that's the trend you always see. p1 is big, then whittle away. and like i said, they do it day in and day out, on page after page. they rock... *** ok, so we did that little exercise on one page. but you can do it on _all_ of the pages if you want to, because it's really very easy. just go to the project page for this book: > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea make sure you've chosen "detail level 4", at the top of the page, and then go down lower to follow the _progress_ of each page... click on the "diff" between o.c.r. and p1 to see the changes made. then click the "diff" between p1 and p2 to see the changes made. and finally click the "diff" between p2 and p3 to see the changes. a "no diff" between two spots means that no changes were made. and you'll see it time after time. p1 is big, p2 and p3 whittle away. -bowerbird ============================================ the text for page 36 from "paul and the printing press" experiment for page 36. (actually file#36, which is d.p. lingo for, um, page 19.) the text is given from o.c.r., after p1n, after p2n, and then after p3n... ============================================ -> p#036 -- ocr In spite of Paul's optimism he was more than of Melvil1e's opinion. g g_: Mr. Carter was well known throughout ingharn as a stern, austere man whom = le feared rather than loved. He had the of being shrewd, closefsted, and at a bargain,-a person of few friends g-gig? many enemies. 
I-Ie was a great lighter, t; ng a grudge to any length for the sheer Rk ure of gratifying it. Therefore many a re mature and courageous promoter than ii Cameron had shrunk from approaching with a business proposition. Even Paul did not at all relish the mission be if ore him; he was, however, too manly to shirk ^Rl Hence that evening, directly after dinner, fjmade his way to the mansion of Mr. Arthur ` by Carter, the wealthy owner of the Echo, jirmingham's most widely circulated daily. fggortunately or unfortunately-Paul was in which-the capitalist was at home at leisure; and with beating heart the boy T;' ushered into the presence of this illustrious eman. Carter greeted him politely but with no ixdiality. So you 're Paul Cameron. I 've had deal- Jl;T t with your father," he remarked dryly. $@1 A = t can I do for you?" Iii courage ebbed. The question was j= = and direct, demanding a reply of similar -> p#036 -- p1n In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. So you're Paul Cameron. 
I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar -> p#036 -- p2n In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the reputation of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. So you're Paul Cameron. I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar -> p#036 -- p3n In spite of Paul's optimism he was more than half of Melville's opinion. Mr. Carter was well known throughout Burmingham as a stern, austere man whom people feared rather than loved. He had the reputation of being shrewd, close-fisted, and sharp at a bargain,--a person of few friends and many enemies. He was a great fighter, carrying a grudge to any length for the sheer pleasure of gratifying it. Therefore many a more mature and courageous promoter than Paul Cameron had shrunk from approaching him with a business proposition. 
Even Paul did not at all relish the mission before him; he was, however, too manly to shirk it. Hence that evening, directly after dinner, he made his way to the mansion of Mr. Arthur Presby Carter, the wealthy owner of the Echo, Burmingham's most widely circulated daily. Fortunately or unfortunately--Paul was uncertain which--the capitalist was at home and at leisure; and with beating heart the boy was ushered into the presence of this illustrious gentleman. Mr. Carter greeted him politely but with no cordiality. "So you're Paul Cameron. I've had dealings with your father," he remarked dryly. "What can I do for you?" Paul's courage ebbed. The question was crisp and direct, demanding a reply of similar ================================================ wondering what happened with the parallel p1 proofing on page 36? well, the parallel proofing didn't do _quite_ as good as the normal p1; they missed _3_ errors, compared to the normal p1 missing just _2_... the _good_ news, however, is that the second parallel proofing _caught_ the 2 errors missed by the _first_ parallel proofing, so -- taken together -- they achieved perfection. in 2 p1 rounds! as opposed to 3 normal rounds. and they had no useful workcheck, either, with "good" and "bad" word-lists. i tell you, those p1 proofers _rock_... for the record, here's the 3 errors missed by the p1p proofers: p1p> half of Melvil1e's opinion. p1p> carryng a grudge to any length for the sheer p1p> Birmingham's most widely circulated daily. oh, and just so you know, spellcheck would catch all 3 of those errors. neat. ================================================ ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/41170224/attachment.htm From grythumn at gmail.com Thu Mar 20 18:53:52 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Thu, 20 Mar 2008 21:53:52 -0400 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <47E30EE3.2040208@netronome.com> References: <47E27023.1020208@netronome.com> <200803201117.29396.rolsch@verizon.net> <001701c88ab9$af97c080$660fa8c0@atlanticbb.net> <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> <47E30EE3.2040208@netronome.com> Message-ID: <15cfa2a50803201853u2fe4eb9eh1aec8f7601ed3287@mail.gmail.com> On Thu, Mar 20, 2008 at 9:26 PM, La Monte H.P. Yarroll wrote: > Robert Cicconetti wrote: > > It's easy enough to build binaries (I use cygwin). I don't know if > > anyone has packaged them or built a gui around it. > > > > I do recommend turning down the defaults a bit.. I think it was tuned > > for processing scanned photocopies, and it is rather overaggressive on > > my scans. > > I have been limiting its use to very narrow cases because of how > aggressive it is. > > What settings do you find best for 8-bit grayscale documents scanned > from original books? I have not found settings I have been happy with. I don't really use it to clean scans from original books; Abbyy FR's adaptive thresholding works well for almost all of my text pages. (It sucks on illos, of course) I have used unpaper to split 2-up pages from original scans, and occasionally use it to clean up microfilm scans (~600 DPI b/w). Blackfilter set to about 0.98, intensity 10 (Depends on the book), plus deskewing and page splitting, is something I've used in the past for microfilm scans. I think I've used it on some other projects, but I can't find any notes on the settings. unpaper's deskewing, at least the version I'm using, is slow and definitely a memory hog. I generally turn off qpixel. 
R C -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080320/2cbeeb1e/attachment-0001.htm From rolsch at verizon.net Thu Mar 20 18:56:48 2008 From: rolsch at verizon.net (Roland Schlenker) Date: Thu, 20 Mar 2008 20:56:48 -0500 Subject: [gutvol-d] parallel -- paul and the printing press -- 03 In-Reply-To: <47E30EE3.2040208@netronome.com> References: <15cfa2a50803201146k11e09a9aj1255f633d415d909@mail.gmail.com> <47E30EE3.2040208@netronome.com> Message-ID: <200803202156.49137.rolsch@verizon.net> On Thursday 20 March 2008 9:26:59 pm La Monte H.P. Yarroll wrote: > Robert Cicconetti wrote: > > I do recommend turning down the defaults a bit.. I think it was tuned > > for processing scanned photocopies, and it is rather overaggressive on > > my scans. > > I have been limiting its use to very narrow cases because of how > aggressive it is. > > What settings do you find best for 8-bit grayscale documents scanned > from original books? I have not found settings I have been happy with. > > > Bob For original books that I have scanned myself, I input the scans directly into FineReader. However, for the scan-sets from Early Canadiana Online that I have recently OCR'ed for DP-C, the scans were of very, very poor quality. I used aggressive unpaper options, "-bt 0.85 -ni 8 -li 0.03", which at times removed areas of text. Those scans, missing areas of text, were reprocessed without using unpaper by simply removing the unnecessary edge areas.
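For reference, those options assemble into a command line along these lines (a sketch only; the helper name is mine, and the flag values are just the ones quoted above -- check `unpaper --help` before reusing them):

```python
def unpaper_cmd(src, dst, opts=("-bt", "0.85", "-ni", "8", "-li", "0.03")):
    """Assemble an aggressive unpaper invocation like the one described.

    The option values are the ones quoted above; tune them per book,
    since settings this aggressive can eat areas of text.
    """
    return ["unpaper", *opts, src, dst]

# run it with: subprocess.run(unpaper_cmd("in.pgm", "out.pgm"), check=True)
```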
Roland Schlenker From donovan at abs.net Thu Mar 20 20:49:14 2008 From: donovan at abs.net (D Garcia) Date: Thu, 20 Mar 2008 23:49:14 -0400 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <20080321012513.GB22705@mail.pglaf.org> References: <20080321012513.GB22705@mail.pglaf.org> Message-ID: <200803202349.15039.donovan@abs.net> On Thursday 20 March 2008 21:25, Greg Newby wrote: > I received a request to moderate or otherwise quiet > Bowerbird on gutvol-d. This request was based on an > opinion that he has been behaving poorly. > The answer is: no, I will not turn on moderation or > remove list members at this time. In contrast, over on DP where bowerbird created several sockpuppet accounts to bypass his unprecented ban there, he recently used one to intentionally sabotage the experiment in continuous proofing which he has been making such noise about here. That account and the others he has been using have also been disabled according to Juliet Sutherland's instruction to maintain the ban, especially given the escalation of his behavior there from trolling to actual sabotage. (Or if you prefer, his stooping to that behavior, since online, he's a bird.) David From marcello at perathoner.de Fri Mar 21 00:11:48 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri, 21 Mar 2008 08:11:48 +0100 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <200803202349.15039.donovan@abs.net> References: <20080321012513.GB22705@mail.pglaf.org> <200803202349.15039.donovan@abs.net> Message-ID: <47E35FB4.2070901@perathoner.de> D Garcia wrote: > In contrast, over on DP where bowerbird created several sockpuppet accounts to > bypass his unprecented ban there, he recently used one to intentionally > sabotage the experiment in continuous proofing which he has been making such > noise about here. That clearly shows the level of confidence he has in his own theories. He had to go and skew the stats. Tzzzz. 
-- Marcello Perathoner webmaster at gutenberg.org From ralf at ark.in-berlin.de Fri Mar 21 01:32:05 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Fri, 21 Mar 2008 09:32:05 +0100 Subject: [gutvol-d] tesseract and ligatures Message-ID: <20080321083205.GC18003@ark.in-berlin.de> BTW, since we're just at it, there's another nontriviality involved with some scans that are out there. Some scans you can get have sort of shadows, just like antialiasing which is practically impossible to remove. This leads to characters sticking together like ligatures do. I thought OK why not, then let's just train tesseract for those character groups, it will only take a bit more effort... Result was, tesseract is not able to train ligatures, i.e. groups of characters, at all! It's hard wired to single characters. It *appears* at first to be able to train pairs of characters but this is an illusion because if it's a pair of ASCII chars tesseract won't barf because it thinks it's one UTF-8 char. I wonder if they even thought of multibyte UTF-8? Summary for me: Tesseract is unusable without ligature support. This is a major bug. Regards, ralf From Bowerbird at aol.com Fri Mar 21 02:13:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 05:13:57 EDT Subject: [gutvol-d] Moderation/censorship Message-ID: donovan/david said: > In contrast, over on DP where bowerbird > created several sockpuppet accounts to > bypass his unprecented ban there, um... he probably meant "unprecedented"... but... sockpuppet accounts? to bypass a ban? excuse me? i was explicitly _not_ prohibited from _proofing_ when i was "banned" -- i was only restricted from _posting_ in the forums... go back and look it up, if you must... > he recently used one to intentionally > sabotage > the experiment in continuous proofing untrue. and a low blow to boot. i didn't "sabotage" the experiment. i was doing the one thing i was still _allowed_ to do at distributed proofreaders, i.e., proof... 
and i did a darn good job on every page i did. look at my "diffs", and you'll see that i spotted and corrected the errors on pages i proofed... and then look at my "no diff" pages, and you'll see i _correctly_ passed through correct pages. show me one error that i failed to catch. show me one "error" which i "injected"... no sir, as far as i know, and i would _love_it_ if someone pointed out a mistake i had made, because i _learn_ from my _mistakes_, i _do_, but as far as i know, i made _no_ mistakes on the 128+ pages which i proofed... not a one... and to imply otherwise is to tell one big fat lie. > That account and the others he has been using > have also been disabled according to Juliet Sutherland's > instruction to maintain the ban, especially given the > escalation of his behavior there from trolling > to actual sabotage. there was no "trolling". and there's been no "sabotage". these loaded words are because donovan/david simply cannot tolerate the truth, _especially_ when it is backed with data and data and more data and more data still... -bowerbird p.s. but i knew he'd shut me out when i revealed my data, because that's what power freaks do when you reveal them, they exercise their power. of course, if i'd still cared one bit whether i can access d.p. or not, i wouldn't have cut the cord. i knew exactly what i was doing, and exactly what he would do. power freaks have buttons that are so easy to predict, it's pitiful. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/c6411936/attachment.htm From piggy at netronome.com Fri Mar 21 04:46:27 2008 From: piggy at netronome.com (La Monte H.P.
Yarroll) Date: Fri, 21 Mar 2008 07:46:27 -0400 Subject: [gutvol-d] tesseract and ligatures In-Reply-To: <20080321083205.GC18003@ark.in-berlin.de> References: <20080321083205.GC18003@ark.in-berlin.de> Message-ID: <47E3A013.2050009@netronome.com> Ralf Stephan wrote: > BTW, since we're just at it, there's another nontriviality > involved with some scans that are out there. > > Some scans you can get have sort of shadows, just like > antialiasing which is practically impossible to remove. > This leads to characters sticking together like ligatures do. > > I thought OK why not, then let's just train tesseract for > those character groups, it will only take a bit more effort... > Result was, tesseract is not able to train ligatures, i.e. > groups of characters, at all! It's hard wired to single characters. > It *appears* at first to be able to train pairs of characters > but this is an illusion because if it's a pair of ASCII chars > tesseract won't barf because it thinks it's one UTF-8 char. > > I wonder if they even thought of multibyte UTF-8? > > Summary for me: Tesseract is unusable without ligature support. > This is a major bug. > > Uh, it handles ligatures just fine. I couldn't do Fraktur without it. It uses UTF-8 internally. For some typefaces I've done exactly what you describe--I've added "ligatures" which are just common printing defects. There's a fellow working on Kannada, and there every glyph is a ligature. If you send me your training pages and the training data you generated, I'd be happy to look through them. > Regards, > ralf > From Bowerbird at aol.com Fri Mar 21 11:25:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 14:25:11 EDT Subject: [gutvol-d] happy spring everyone Message-ID: information at the website of the griffith observatory tells me it has been spring ever since wednesday night, so i'm glad i am officially caught up with the universe...
it also tells me the moon was full last night, which just might help explain the little bit of lunacy that happened. as if we needed an explanation... you might remember that, earlier in the week, i told you that i intended to start taking my posts to a public blog, one that -- unlike this list -- will be crawled by google, and so gain a greater visibility i have eschewed thus far. i said i would start it on _friday_, and lo and behold, on _thursday_night_ donovan and juliet invent an excuse to "justify" a new effort to stop me from _visiting_ d.p. i certainly understand them wanting to cut me off... as long as i was just _talking_ about d.p. inefficiencies, they could try to get you to dismiss me as an old crank. but once i start _quantifying_ those inefficiencies -- 1,137 em-dashes mistakenly turned into en-dashes which then had to be corrected manually by proofers, 504 end-line hyphenates which had to be rejoined _needlessly_, 57 end-line em-dashes to be clothed -- and doing it with results from their very own research, which i can point people to view right on the d.p. site, well then that solid evidence proves i ain't just a crank. as long as i'm abstractly _talking_ about incompetence, it's one thing. but when i _show_ you their actual o.c.r.: > http://z-m-l.com/misc/paul_bad_pages.html so you can see _exactly_ how embarrassingly awful it is, and understand how wrong it is to make proofers fix that, well, it's kind of hard to sweep those facts under the rug... so i certainly understand them wanting to cut me off... they don't want you to see the truth, to see the actual data. so if i'm gonna _reveal_ it, their only option is to cut me off. *** but you nice folks on this list don't need more help from me. you now know how to view d.p. incompetence on your own... 
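(for the curious: rejoining end-line hyphenates, one of the clean-up steps counted above, is mechanical. a rough sketch -- my own code, not d.p.'s -- which leaves "--" em-dashes alone and, being naive, would also rejoin legitimately hyphenated compounds, which is why a word-list or a human still has to look:)

```python
def rejoin_hyphenates(lines):
    """Rejoin words split across line breaks with a trailing hyphen.

    Sketch only: a trailing "--" (an em-dash in this convention) is
    left alone, and every single trailing hyphen is joined, so true
    compounds split at their own hyphen still need review.
    """
    lines = list(lines)
    out = []
    for i, line in enumerate(lines):
        if line.endswith("-") and not line.endswith("--") and i + 1 < len(lines):
            first, _, rest = lines[i + 1].partition(" ")
            line = line[:-1] + first   # pull the word's tail up
            lines[i + 1] = rest        # leave the remainder below
        out.append(line)
    return out
```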
just go to the project page of a few books they're doing -- pick a handful of books, at random, to get a sample -- and step through the progression of pages like i did for file #36 of "paul and the printing press", and you'll see it for yourself. in roughly half of the d.p. books, there is an initial incompetence which is huge, then a heroic p1 job, followed by p2 catching most of the rest, and finally p3 coming in to "finish up" the job, at least more or less... in the other half of the books, there were good scans and good o.c.r., and then p1 has a _much_ easier time of it... still, the curve you obtain when you plot the errors fixed per round is uniformly down-sloping. the only difference is how high it starts at. with "paul and the printing press", there were typically an estimated 50-100 errors per page, with p1 fixing 90%-98%, then p2 getting a couple of them, and p3 doing the stragglers. (and probably missing a few.) with the "cleaner" books, p1 will catch 2-10 errors per page, p2 will catch the remaining 1 or 2, and p3 has nothing left... page after page like this, in book after book, day after day... *** oh yeah, since we're starting off a new season and all of that, it's probably a good time to make an important observation. it should be clear that i think very highly of the p1 proofers... (and it should be very clear now that they deserve our praise, rather than the disdain that they sometimes get over at d.p.) what might not be so clear is how i feel about the p3 proofers. the data i have shown hasn't been very kind to their reputation, and i've pointed out time after time their performance has been not significantly different than the p1 proofers. but let me assure you that i think _very_ highly of the p3 people. first of all, they're volunteers just like everyone else. moreover, they are the volunteers who have stepped up to the plate and said "i will be one of the people who constitutes the final line". 
this means they're taking responsibility for attaining perfection; if there are o.c.r. errors left in a text after they are done with it, they are the ones who will take the blame. that is _admirable_... plus, they get the toughest errors, the ones that have survived two sets of human proofer eyeballs already. the sneaky ones. all by itself, this combination of _increased_responsibility_ and _finding_the_persistent_errors_ is a difficult-enough burden... but furthermore, due to the wacky workflow d.p. has invented, where 100 p3 proofers have to do the same number of pages as _thousands_ of p1 proofers, and do 'em to a higher standard, the p3 proofers are now _exhausted_. they're badly burned out, and they're tired, and that does not make for an efficient proofer. so if the p3 proofers are letting a few errors slip by these days, let me assure them loudly and clearly that "i understand why!" they're _overworked_, and underpaid, and they need a break... a spring break... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/6be38bd7/attachment-0001.htm From richfield at telkomsa.net Fri Mar 21 06:48:12 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Fri, 21 Mar 2008 15:48:12 +0200 Subject: [gutvol-d] Gothic or Gothic? attn: La Monte H.P. Yarroll Message-ID: <47E3BC9C.80006@telkomsa.net> Why, sure. Let me know in what form. I'd be happy to email you a few Omniscanned page images, or if you prefer, digitally photographed JPGs. Let me know whether you have any strong preferences; eg do you just want a sample page, or one page from each letter of the alphabet, or what? I also have a Cassel's German Dictionary from the 1950's. (G-E, E-G) The German text is in Fraktur. 
Do you want a couple of pages of that as well? One thing though; neither book is sacrificable, so you will have to take pot luck with page gutters etc. Unfortunately though, I cannot get down to that till first week in April. I hope that isn't a deal-breaker. Cheers, Jon >>> Subject: Re: [gutvol-d] Gothic or Gothic? Thanks folks! From: "La Monte H.P. Yarroll" Date: Thu, 20 Mar 2008 10:14:06 -0400 To: Project Gutenberg Volunteer Discussion Jon Richfield wrote: > I wrote asking for advice on scanning Fraktur. (Absent-mindedly > claiming to have Omniscan, which afaik is the one universal and > error-free scanning software package; I meant of course Omnipage, > which is not. (Not bad actually, but...)) > ..... I'm actively interested in improving tesseract OCR's fraktur performance. Could you point me at a few of your pages? I'll let you know how well we're doing with the limited fraktur training we have so far. <<< From hyphen at hyphenologist.co.uk Fri Mar 21 11:31:46 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Fri, 21 Mar 2008 18:31:46 -0000 Subject: [gutvol-d] happy spring everyone In-Reply-To: References: Message-ID: <000601c88b81$d2af6050$780e20f0$@co.uk> Bowerbird at aol.com wrote >information at the website of the griffith observatory >tells me it has been spring ever since wednesday night, Complete with snow in the northern of England where I am Brrrrrrrrrrrrrrrrrrrr. Dave F -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/b14788ba/attachment.htm From Bowerbird at aol.com Fri Mar 21 11:49:33 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 14:49:33 EDT Subject: [gutvol-d] happy spring everyone Message-ID: dave said: > Complete with snow in the northern of England california is the place you want to be... ;+) -bowerbird
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/81dcc85e/attachment.htm From Bowerbird at aol.com Fri Mar 21 12:42:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 21 Mar 2008 15:42:58 EDT Subject: [gutvol-d] the google book a.p.i. Message-ID: anybody using the google book a.p.i. yet? > http://booksearch.blogspot.com/2008/03/preview-books-anywhere-with-new-google.html found a way to connect p.g. e-texts to it? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080321/e27cce46/attachment.htm From j.hagerson at comcast.net Fri Mar 21 12:51:37 2008 From: j.hagerson at comcast.net (John Hagerson) Date: Fri, 21 Mar 2008 14:51:37 -0500 Subject: [gutvol-d] PG DVD project needs volunteers Message-ID: <023701c88b8c$f9d3d1b0$1f12fea9@sarek> If you would like to help us duplicate and mail out copies of the Project Gutenberg DVD, we could use your assistance. Please contact me off the list. Thank you. John Hagerson From piggy at netronome.com Fri Mar 21 15:21:12 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Fri, 21 Mar 2008 18:21:12 -0400 Subject: [gutvol-d] Gothic or Gothic? attn: La Monte H.P. Yarroll In-Reply-To: <47E3BC9C.80006@telkomsa.net> References: <47E3BC9C.80006@telkomsa.net> Message-ID: <47E434D8.3010901@netronome.com> Jon Richfield wrote: > Why, sure. Let me know in what form. I'd be happy to email you a few > Omniscanned page images, or if you prefer, digitally photographed > JPGs.
Let me know whether you have any strong preferences; eg do you > just want a sample page, or one page from each letter of the alphabet, > or what? > I would recommend avoiding JPG for anything that has to go through OCR. The lossy characteristics of JPG tend to reduce the effectiveness of OCR a lot. PNG and TIFF are my preferred formats, but any open format will do. A single sample page would be a good start. If the current training does not work well, I'll ask for more. > I also have a Cassel's German Dictionary from the 1950's. (G-E, E-G) The > German text is in Fraktur. Do you want a couple of pages of that as well? > I don't think I can clear that. Thanks for the offer though. > One thing though; neither book is sacrificable, so you will have to take > pot luck with page gutters etc. > Ah, yes. Greyscale scans are much preferred over bilevel, especially if there is going to be gutter noise. > Unfortunately though, I cannot get down to that till first week in > April. I hope that isn't a deal-breaker. > I certainly hope to live much more than another two weeks. :=) > Cheers, > > Jon > > >>> > Subject: > Re: [gutvol-d] Gothic or Gothic? Thanks folks! > From: > "La Monte H.P. Yarroll" > Date: > Thu, 20 Mar 2008 10:14:06 -0400 > > To: > Project Gutenberg Volunteer Discussion > > > Jon Richfield wrote: > >> I wrote asking for advice on scanning Fraktur. (Absent-mindedly >> claiming to have Omniscan, which afaik is the one universal and >> error-free scanning software package; I meant of course Omnipage, >> which is not. (Not bad actually, but...)) >> >> > ..... > > I'm actively interested in improving tesseract OCR's fraktur > performance. Could you point me at a few of your pages? I'll let you > know how well we're doing with the limited fraktur training we have so far. 
> > <<< > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From donovan at abs.net Fri Mar 21 16:52:00 2008 From: donovan at abs.net (D Garcia) Date: Fri, 21 Mar 2008 19:52:00 -0400 Subject: [gutvol-d] Moderation/censorship In-Reply-To: References: Message-ID: <200803211952.00866.donovan@abs.net> On Friday 21 March 2008 05:13, Bowerbird at aol.com wrote: > donovan/david said: > > In contrast, over on DP where bowerbird > > created several sockpuppet accounts to > > bypass his unprecented ban there, > > um... he probably meant "unprecedented"... In fact I do and did. Cat fur in the keyboard, probably. > but... sockpuppet accounts? to bypass a ban? Yes, and my apologies to the DP user whose username actually *is* 'sockpuppet.' Evidence follows, but for the impatient, skip to the last quoted portion and response. > excuse me? i was explicitly _not_ prohibited > from _proofing_ when i was "banned" -- i was > only restricted from _posting_ in the forums... Since bowerbird mentions it, let's review the sum total of his known proofreading activities at DP. It's quite an enlightening view, and very relevant to the discussion. As bowerbird, 32 pages back in the years when DP had only two rounds. As bradjohnson, 3 pages, account not used in 251 days. As haroldjohnson, 4 pages, most recently a single page on March 7, 2008. As ellipsisshellipis, (interesting nick choice), 16 pages on March 7, 2008 (the date the account was created), and the 116 pages of "work" in the experiment project on March 19, 2008. This account was also used to post a poll on the DP forums. (See above where bb clearly states his belief was that he was explicitly banned from posting in the forums.) As sandy claws, no pages, but a Christmas Day 2007 posting (the day the account was created.)
(Again, see above where bb clearly states his belief was that he was
explicitly banned from posting in the forums.)

Patterns, anyone? Out of all the projects available to choose from during
all that time, bowerbird only managed to find *one* that piqued his
interest, and it just so happened to be the one he's been ever so
faithfully posting about here, in much less than flattering terms.
Obviously he understood that he was banned from posting in the DP forums,
and yet he used two freshly-minted accounts to do exactly that.

> > he recently used one to intentionally
> > sabotage
> > the experiment in continuous proofing
>
> untrue. and a low blow to boot.

See above.

> i didn't "sabotage" the experiment.

The people actually running the experiment at DP say differently, used far
stronger language in describing his efforts in that project, and are to me
far more credible as references.

> i was doing the one thing i was still _allowed_
> to do at distributed proofreaders, i.e., proof...

See above for evidence regarding bowerbird's obvious commitment to DP.

> and i did a darn good job on every page i did.

Many of our volunteers with bowerbird's level of experience with DP also
believe the above statement to be true of themselves.

> no sir, as far as i know, and i would _love_it_
> if someone pointed out a mistake i had made,
> because i _learn_ from my _mistakes_, i _do_,
> but as far as i know, i made _no_ mistakes on
> the 128+ pages which i proofed... not a one...

Perhaps bowerbird has chosen to learn from the wrong mistakes.

Let's skip on a bit...

> there was no "trolling". and there's been no "sabotage".
>
> these loaded words are because donovan/davie simply
> cannot tolerate the truth, _especially_ when it is backed
> with data and data and more data and more data still...

I don't believe I've ever seen a more clear-cut example of projection.

> p.s.
but i knew he'd shut me out when i revealed my data, > because that's what power freaks do when you reveal them, > they exercise their power. of course, if i'd still cared one bit > whether i can access d.p. or not, i wouldn't have cut the cord. Except that you *didn't* cut the cord. Instead, you explicitly circumvented the letter *and* the spirit of your ban at DP, got caught (calamity of calamities!) and DP slapped your hands for it. Forgive me if I'm entirely unsympathetic, but you set this up yourself. The admins at DP have long been aware that you have other accounts, and as long as you only used them to read the forums, you received the benefit of the doubt. *Only* when that condition changed was the ban extended to the other accounts, *after* discussion and agreement by the admins. Your only justification in characterizing me as a "power freak" is that I delivered the message. At any rate, we've established that bowerbird doesn't care about DP anymore, and that's great news to a lot of people. I hope this means no more looking for new accounts that he has created, and that the PG volunteers on this list will no longer have to skip past his previously uninterrupted flow of infrequently relevant but always copiously quixotic posts. Have a great Easter! David From Bowerbird at aol.com Sat Mar 22 00:07:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 03:07:09 EDT Subject: [gutvol-d] Moderation/censorship Message-ID: i'll have some nice long replies to donovan/david next week... :+) -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/701aaf2f/attachment.htm From paulmaas at airpost.net Sat Mar 22 07:02:06 2008 From: paulmaas at airpost.net (Paul Maas) Date: Sat, 22 Mar 2008 07:02:06 -0700 Subject: [gutvol-d] Moderation/censorship In-Reply-To: References: Message-ID: <1206194526.21133.1243751209@webmail.messagingengine.com> Mr. Bowerbird, to spare us your bloated email replies, why don't you instead post them in a more generic form to your esteemed blog? That way they can be Google indexed. Fair trade. What is the URL to your blog? On Sat, 22 Mar 2008 03:07:09 EDT, Bowerbird at aol.com said: > > i'll have some nice long replies to donovan/david next week... > :+) > > -bowerbird > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL > Home. > > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - Access your email from home and the web From Bowerbird at aol.com Sat Mar 22 11:15:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 14:15:09 EDT Subject: [gutvol-d] bloated email replies Message-ID: paul said: > to spare us your bloated email replies well, it appears that paul doesn't mind hearing an unfair insult, but balks when then asked to listen to the person clearing their name... learn to use your delete key, paul. because i respond to attacks... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/53360a9a/attachment.htm From Bowerbird at aol.com Sat Mar 22 11:30:46 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 14:30:46 EDT Subject: [gutvol-d] a lot of incompetence Message-ID: folks, there's a lot of incompetence over at d.p. a lot. from a lot of people. and now that i'm pointing to it so everyone can see it, rather than just making "vague" claims that it's there, it's making those incompetent people _very_ nervous. they're used to dumping their crap on the proofers -- unfairly -- and getting it back all nice and shiny. now i'm serving notice that i'm going to reveal them... so they're gonna turn up their attack machines. but i can handle their flak. i still buy my anti-flame foam -- the same kind they use on airport runways -- by the tanker-truckload... as much as possible, i'm going to stick to the _data_. but the idiots are going to try to make it _personal_... so if you don't like turbulence, buckle your seatbelts. -bowerbird p.s. and yes, i will also post the data to a blog, and prevent the idiots from commenting there, so if you _really_ want peace and quiet, just read it there... but if they wanna fight here, i _will_ fight 'em here. ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/c247ff7a/attachment.htm From Bowerbird at aol.com Sat Mar 22 11:41:32 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 22 Mar 2008 14:41:32 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 06 Message-ID: speaking of data... more info on the parallel test of "paul and the printing press"... once again, looking for an easy-to-compute "metric" of quality. 
yesterday we noted the percentage of lines changed on a page to give us a "metric" about the quality of the page before and after... today we get the percentage of _pages_ changed in a _round_ to give us a "metric" we can use to determine quality of the round... (and you're right, such a "round metric" is of absolutely _no_use_ in the promulgation of a "roundless" system, but piggy wants one anyway, so let's try to give piggy what he wants, make him happy.) again, you can follow along by looking at the actual data at d.p.: > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea this parallel proofing was the one in the normal p1-p2-p3 workflow. we're gonna focus right now on the changes made in the p1 round... specifically, we're gonna count the number of pages changed by p1. out of 244 pages, the only "no diff" ones were the 15 blank pages, with 1 exception (#140), which was "no diff" because the proofer missed the 2 errors on it. (he also "forgot" to rejoin hyphenates, _and_ to place blank lines between paragraphs, which indicates to me that this proofer just plain neglected to proof this page.) anyway, for the record, here are the two hard-core errors: > It was this " echoing idea" that was new to <- floating quotemark > of a monologue,-an exact reversal of his policy. <-s/b em-dash oh, and by the way, the second parallel proofing caught the latter of those two errors, but it missed the former... but as you know, floating quotemarks are autodetectable. ok, so out of 229 non-blank pages, _228_ were changed by p1... that's a very high percentage, reflecting how rotten the o.c.r. was. when you have good scans and good o.c.r., 25% to 75% of the pages in a book can be recognized perfectly by the o.c.r. app, especially when it is supplemented by a good clean-up tool... *** while we're here, let's dig a little bit deeper into this data, ok? especially in a way that will give us a _page-quality_ metric. (because remember, that's what this mission was all about.) 
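the "autodetectable" claim about that floating quotemark can be made
concrete. a tiny illustrative check in python (the regex here is only a
sketch of the idea, not an actual d.p. preprocessing rule):

```python
# Flag "floating" quotemarks: a quote character with whitespace on
# both sides, as in the error quoted above. This regex is only an
# illustrative sketch, not an actual DP preprocessing rule.
import re

FLOATING_QUOTE = re.compile(r'\s["\']\s')

def has_floating_quote(line):
    """Return True if the line contains a quote set off by spaces."""
    return bool(FLOATING_QUOTE.search(line))

flagged = has_floating_quote('It was this " echoing idea" that was new to')
clean = has_floating_quote('He said "hello" and left.')
```

the first line gets flagged, the second doesn't -- which is exactly the
kind of mechanical check that could run before pages ever reach a proofer.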
first, toss the 15 blank pages. not hard to proof them.

of the remaining 229 pages, what we have left is this...

> 73 pages with a p2 "no diff" and a p3 "no diff" following it.
> 118 pages where p2 had a "diff", and p3 "no diff" after that.
> 24 pages where p2 had a "diff" and p3 had a "diff" as well.
> 14 pages where p2 had "no diff", but p3 _did_ have a "diff".

ok, so let's do a closer analysis of these one-by-one...

73 pages with a p2 "no diff" and a p3 "no diff" following it.
these are the pages which p1 proofers took to perfection,
often after having made _many_ changes to inferior o.c.r.
considering that some of these pages required _dozens_ of
type-in corrections, this 32% perfection rate is _great_.

118 pages where p2 had a "diff", and p3 "no diff" after that.
these are the 52% of the pages which p2 took to perfection,
usually by catching the occasional errors p1 had missed...
note that after 2 rounds, 84% of the pages were _perfect_.

24 pages where p2 had a "diff" and p3 had a "diff" as well.
these 10% of the pages are ones we presume p3 perfected,
p2 fixed _some_ errors on these pages, but p3 got the rest.
so these pages took the combined efforts of p1, p2 and p3.

14 pages where p2 had "no diff", but p3 _did_ have a "diff".
on these 6% of the pages, p2 was asleep, but p3 covered;
(but, in fairness to p2, half the changes were "ticky-tack".)

***

so once again, we get the pattern i've discussed all along,
the pattern that seems to capture a "common-sense" take,
which is that p1 fixes most of the errors, p2 gets most of
the remaining ones, and p3 comes in and does clean-up.

sure enough, p1 did _awesome_, converting _rotten_ o.c.r.
into great pages, including an amazing 32% perfection rate.

p2 did well, taking another 52% of the pages to perfection
all by themselves, and 10% more in conjunction with p3...

p3 had to clean up the final 6% of the pages -- just 14 --
and on half of those, the changes they made were minor.
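that breakdown's percentages can be recomputed mechanically from
per-page diff flags. a small python sketch (the data layout here is
hypothetical -- it's not how d.p. actually stores its round data):

```python
# Recompute the round-quality percentages above from per-page diff
# flags. The (p2_diff, p3_diff) tuple layout is hypothetical -- it
# is not how DP actually stores its round records.

def round_breakdown(pages):
    """pages: list of (p2_made_change, p3_made_change) per non-blank page."""
    n = len(pages)
    counts = {
        "perfect_after_p1": sum(1 for p2, p3 in pages if not p2 and not p3),
        "perfect_after_p2": sum(1 for p2, p3 in pages if p2 and not p3),
        "p3_needed_too":    sum(1 for p2, p3 in pages if p2 and p3),
        "p2_missed_some":   sum(1 for p2, p3 in pages if not p2 and p3),
    }
    # return (count, percent-of-all-pages) per category
    return {k: (v, round(100.0 * v / n)) for k, v in counts.items()}

# the four groups reported above: 73, 118, 24, and 14 of 229 pages
pages = ([(False, False)] * 73 + [(True, False)] * 118 +
         [(True, True)] * 24 + [(False, True)] * 14)
breakdown = round_breakdown(pages)
```

the (count, percent) pairs come out as (73, 32), (118, 52), (24, 10),
and (14, 6) -- matching the figures quoted above.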
again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080322/3dccd85f/attachment-0001.htm From hart at pglaf.org Sun Mar 23 13:08:56 2008 From: hart at pglaf.org (Michael Hart) Date: Sun, 23 Mar 2008 13:08:56 -0700 (PDT) Subject: [gutvol-d] Unexpected Events Message-ID: I'm doing a survey on events of the last 5-10 years that you did NOT expect. In that perspective, I would also be interested in hearing your predictions for the next 5-10 years of events that YOU think might happen that would NOT be expected by the general population. Thanks!!! Michael S. Hart Founder Project Gutenberg From ricardofdiogo at gmail.com Sun Mar 23 13:53:30 2008 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Sun, 23 Mar 2008 20:53:30 +0000 Subject: [gutvol-d] Unexpected Events In-Reply-To: References: Message-ID: <9c6138c50803231353y3da0e7f3g3059cee1cc14d32b@mail.gmail.com> 2008/3/23, Michael Hart : > > > In that perspective, I would also be interested in > hearing your predictions for the next 5-10 years of > events that YOU think might happen that would NOT > be expected by the general population. 
>

In the next 5-10 years:

* most people in the western world will have an ebook reading device;
* they'll try to create an international cyberpolice for tracking down
everything you download;
* copyright is going to change drastically: a foundation will be created
for managing and paying royalties to the authors, ebooks will be
populated with ads, more and more content will be available for free;

Ricardo

From Bowerbird at aol.com Mon Mar 24 10:15:28 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 24 Mar 2008 13:15:28 EDT
Subject: [gutvol-d] the myth of the elusive error only the expert proofer can catch
Message-ID:

some people have this notion that there are "elusive errors"
that "only an expert proofer" can spot. this belief is a myth.

some people are certainly _better_ proofers than other folks.
a few individuals might even be relatively good enough that
we could reasonably consider 'em to be "experts" at proofing.
(but they seem to have been _born_ with the skill, rather than
having "learned" it, though experience does make it sharper.)

however, the flip side -- the error which is so elusive that _only_
the "expert proofer" can find it -- has no evidence that i can see.

some people have even asserted that _ten_rounds_ of "novice"
proofers might miss one of these "elusive errors", which would
then be spotted in a _single_ pass by _one_ expert proofer...

bull crap. at least, _i_ have never seen that happen.

and, try as hard as i might, i can't even _imagine_ what such an
"elusive error" might be, what it would look like, how it can hide.

and i'm one of the (unlucky?) people who can't not notice typos,
because they jump out and stick me in the eyeball with a pencil.
so i am quite sure that i'm not "blind" to these "elusive" errors...

oh, don't get me wrong. i've seen _plenty_ of errors that have
managed to escape detection from one, two, three, even _four_
rounds of proofing.
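errors slipping past several rounds is roughly what independent
per-round catch rates would predict. a back-of-envelope sketch (the
80% catch rate below is hypothetical, picked only for illustration):

```python
# Back-of-envelope: if each proofing round independently catches a
# fraction `catch_rate` of the remaining errors, a given error
# survives n rounds with probability (1 - catch_rate) ** n.
# The 80% figure used below is hypothetical, for illustration only.

def survival_probability(catch_rate, rounds):
    """Chance one error slips past every one of `rounds` passes."""
    return (1.0 - catch_rate) ** rounds

# with a hypothetical 80% per-round catch rate, about 0.8% of
# errors survive three rounds -- rare, but a matter of chance,
# not of errors being intrinsically "elusive"
p_survive_three = survival_probability(0.80, 3)
```

which is the coin-flip point: a plainly visible comma can survive five
rounds for the same reason ten flips sometimes all come up tails.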
heck, in the "planet strappers" experiment, there was an error that went unfixed by _five_ proofing rounds. what was that error? it was a _comma_, smack dab in the middle of a sentence, big as day, there for anyone and everyone to see... anyone reading the page would know that comma didn't belong. it hardly qualified as something that people would call "elusive"... then why was it missed? for the very same reason that sometimes you will get 10 coin-flips in a row all coming up "tails" -- _chance_. another error that survived for many rounds was one that simple _spellcheck_ would detect. how did it last so long? don't ask me. no, there is _no_ error that is so "elusive" that 100% of "novices" will miss it and 100% of the "expert proofers" will locate it. none. and if you want to maintain that there is, let's put it to the test. of the _self-selected_ lot who proof at distributed proofreaders, and thus get lots of experience, i'd say the "best" proofers catch about 80-95% of the errors, and the "ordinary" ones get 65-85%. so we're basically seeing many shades of gray, not black-or-white. -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080324/6235827f/attachment.htm From ralf at ark.in-berlin.de Sat Mar 22 01:19:41 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Sat, 22 Mar 2008 09:19:41 +0100 Subject: [gutvol-d] tesseract and ligatures In-Reply-To: <20080321083205.GC18003@ark.in-berlin.de> References: <20080321083205.GC18003@ark.in-berlin.de> Message-ID: <20080322081941.GA5299@ark.in-berlin.de> me wrote > Summary for me: Tesseract is unusable without ligature support. > This is a major bug. This applies to the SVN version (154, head) only, as I just found out, so hands off that. 
The official version 2.01 appears to do better, but I'm still testing.

Sorry for shouting first,
ralf

From Bowerbird at aol.com Mon Mar 24 14:36:55 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 24 Mar 2008 17:36:55 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 07
Message-ID:

here's elaboration on the data i presented on saturday,
again on the parallel test of "paul and the printing press".

this view is the best look at this data-set yet...

and it's derived using info that's presently available to d.p.
right on its own "project page" for this book:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea

(which means that any book in their system could be subjected
to this same analysis, at any time you want.)

***

but first, correction of a minor error i made:

in discussion of a metric for the rotten o.c.r.,
i checked the number of o.c.r. pages changed.

i said the normal p1-p2-p3 workflow had had just 1 page where
p1 had a "no diff" from o.c.r., and even then there were 2 errors
on that page, and that the parallel p1 had caught one of them.

in actuality, it was the _parallel_ p1 who had a "no diff" on that
page, as the _normal_ p1 had found and fixed _both_ of those errors.
normal p1 had a "diff" on _all_ non-blank pages.

again, you have some pretty awful o.c.r. when it can't
get even _1_ page perfect out of 200+.

***

ok, now on to a closer view of the data i presented saturday...

in this view, i show the different types of progression through
the rounds, starting with those that would benefit most from
another round of full proofing, or a changes-only verification.

i list the actual pages that fall into each type of "progression"...

by "progression", i mean separation of each page into "types",
where the type reflects the rounds making a change to a page.
a pair of asterisks indicates no change was made in that round,
so -- for instance -- the progression-type of p1-**-p3 means
that p1 changed the page, p2 had a "no diff", and p3 changed it.

the progression-types i found were:
-> p1-p2-p3 -- 22 pages -- (every round made a change)
-> p1-**-p3 -- 14 pages -- (p1 and p3 made a change)
-> **-p2-p3 -- 1 page -- (p2 and p3 made a change)
-> p1-p2-** -- 116 pages -- (p2 made the last change to the page)
-> p1-**-** -- 76 pages -- (p1 made the last change to the page)
-> **-**-** -- 15 pages -- (all these no-change pages were _blank_)

the pages that comprise each type of progression are listed below...

again, you can follow along if you like, by viewing the project page:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea

***

these pages could benefit from another round...

-> progression-type p1-p2-p3
(all 3 rounds made changes -- 22 pages -- could use another _full_
proofing round)
8, 18, 22, 23, 36, 38, 65, 81, 90, 96, 111, 144, 154, 155, 168,
196, 205, 209, 212, 225, 227, 237

-> progression-type p1-**-p3
(p2 "no diff", but p3 "diff" -- 14 pages -- could use a changes-only
verification round)
7, 26, 47, 70, 75, 82, 89, 108, 160, 197, 199, 203, 206, 238

this progression-type is the most troubling. we'd _prefer_ to believe
that -- when a page encounters a "no diff" experience -- it's because
it's clean. but the fact is that these pages were "no diff" in p2, yet
p3 made a change. it might be that we need to have _two_ proofers
verify that a page is clean, but that would mean a significant increase
in the amount of work required. so we need to take a closer look at
what's going on in these cases... (and i do that below...)

***

from here on down, i'd say that none of these pages need more verification...
-> progression-type **-p2-p3 (meaningless changes on a forward-matter page --1 page -- can be ignored) 2 -> progression-type p1-p2-** (no changes after p2 took it to perfection -- 116 pages -- so verified once) 12, 17, 19, 20, 25, 28, 29, 33, 35, 39, 42, 43, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 63, 66, 68, 72, 74, 77, 78, 79, 80, 84, 92, 94, 95, 107, 114, 117, 119, 120, 127, 128, 130, 131, 132, 133, 134, 135, 138, 140, 147, 150, 153, 156, 158, 161, 162, 164, 165, 171, 173, 174, 176, 177, 178, 179, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 198, 200, 201, 202, 204, 207, 208, 210, 211, 213, 214, 215, 216, 217, 219, 220, 221, 222, 223, 224, 226, 228, 229, 230, 231, 232, 233, 234, 235, 236 *** -> progression-type p1-**-** (no changes after p1 took it to perfection -- 76 pages -- so verified _twice _) 5, 6, 10, 14, 16, 21, 24, 30, 31, 32, 34, 37, 40, 41, 44, 61, 62, 64, 67, 69, 71, 73, 76, 83, 85, 86, 87, 88, 91, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 109, 110, 112, 113, 115, 116, 118, 121, 122, 123, 124, 125, 126, 129, 136, 137, 139, 141, 142, 143, 145, 146, 148, 149, 151, 152, 159, 163, 166, 167, 169, 170, 172, 175, 180, 218, 239 no comment necessary here. p1 did an _excellent_ job, transforming some rotten o.c.r. into 76 pages that were _perfect_ according to later proofers... one might argue these pages, "no diff" by p2, could've been _skipped_ by p3, but of course there was the risk they were _actually_ in the p1-**-p3 type... so we'll need to learn what happened with that type before we suggest that. *** -> progression-type **-**-** (no changes at all, meaning o.c.r. got it right, 15 pages, all blank) 1, 3, 4, 9, 11, 13, 15, 27, 93, 157, 240, 241, 242, 243, 244 blank pages were the only pages which tesseract recognized correctly... 
*** so once again, we get the pattern i've discussed all along, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. the only puzzling pages here were those 14 pages where p2 had a "no diff" on the page, but p3 did make a change. so i took a closer look at those pages... *** of these 14 pages in p1-**-p3, 12 were not troublesome: 2=errors that don't have any significance in this analysis; 4=changes that were concerned with end-line hyphenates; 4=errors that could've been detected with pre-processing; 2=correct recognition (might or might not be p-book errors). i've appended the actual text from these changed line-pairs, with my explicit categorization after the line-pairs. this left a mere _2_errors_ that were actual, troubling errors. on file#160: > Paul lingered the bill nervously. Fifty dollars! > Paul fingered the bill nervously. Fifty dollars! and on file#206: > bid good-by to the familiar balls of the school, > bid good-by to the familiar halls of the school, *** this allows us to comment on the suggestion up above that we could've skipped p3 on the p1-**-** pages because the p2 "no diff" acted as a "verification" that the page was clean. this means that -- if we would have skipped p3 on all of the 90 pages where p2 made no change -- that decision would have allowed _2_ errors to pass into this book of 200+ pages. and it's safe to say both would be caught by the general public. so saving the additional round of proofing on those 90 pages would seem -- to me -- to have been a good trade in this case. this is _not_ an argument that a single "no diff" is "good enough" to stop proofing a page... i would think that most people would hold the opinion it takes _2_ "no diff" rounds to be _confident_... but once again, that depends on _how_good_ is "good enough". oh, and by the way... maybe you are a person -- i know some are out there! 
-- who is thinking "2 errors! we can't tolerate 2 errors in a book! 2 many!" get real, buddy... because the normal d.p. workflow?, the one with the p3 "marines"? it left _more_than_ 2 errors in this book, no matter how we count, and i will give you a list of some of their specific errors tomorrow... so if you really want to have books that accurate, you will need to convince the people at d.p. to change over to 4 proofing rounds. or maybe _5_. either way, good luck with _that_... you'll need it... *** one more thing... the parallel p1 proofers _corrected_ both "lingered" and "balls"... that's right, some lowly p1 proofers caught the 2 real errors that both p2 and p3 proofers missed. kinda makes you wonder, eh? so evidently those weren't the mythological "elusive errors"... especially beings the expert proofers missed them... ;+) -bowerbird p.s. here is the listing of the 14 cases, with their analysis, where p2 had a "no diff" on the page, but p3 came and made a change. again, categorization of these cases follows at the very bottom... #7 > Copyright, 1920 > Copyright, 1920, meaningless comma. #26 > "The March Hare!" he repeated wlth enthusiasm. > "The March Hare!" he repeated with enthusiasm. bad word would be caught by spellcheck. #47 > and the Sanscrit Vedas would have been > and the Sanscrit[**typo? Sanskrit] Vedas would have been recognized as it was printed in the p-book; not an o.c.r. error. #70 > I have already explained, care much for reading; > have already explained, care much for reading; the word "i" was doubled from the previous line, so it was _autodetectable_ as a repeated word. #75 > ways at liberty to send contributions back with > at liberty to send contributions back with improper joining of "always" on line above, so wouldn't have happened with a good workflow. #82 > various sources one number after another of ` > various sources one number after another of garbage character should've been eliminated in preprocessing. 
#89 > and Diamonds for the more prosperous ` > and Diamonds for the more prosperous garbage character should've been eliminated in preprocessing. #108 > their own idle pleasure but to financing Gutenburg's > their own idle pleasure but to financing Gutenburg's[**typo? Gutenberg's] recognized as it was printed, consistently, in the p-book; not an o.c.r. error. #160 > Paul lingered the bill nervously. Fifty dollars! > Paul fingered the bill nervously. Fifty dollars! actual error, and a stealth scanno to boot. you can't win 'em all... #197 blank line introduced between paragraphs. outside the scope of this analysis. #199 > "Pretty nearly," returned Mr. Hawley good-naturedly. > "Pretty nearly," returned Mr. Hawley good-*naturedly. stupid asterisk note on a questionable hyphenation. not an error. #203 > the cardboard. The thickness of these semi-cylindrical > the cardboard. The thickness of these semi-*cylindrical stupid asterisk note on a questionable hyphenation. not an error. #206 > bid good-by to the familiar balls of the school, > bid good-by to the familiar halls of the school, actual error, and a stealth scanno to boot. you can't win 'em all... #238 > when weary, sleepy, but triumphant, a half jubilant, > when weary, sleepy, but triumphant, a half-jubilant, improper rejoining of end-line hyphenate... *** here's my categorization of the errors on these 14 pages: errors that don't have any significance in this analysis: #007> meaningless comma on a front-matter page. #197> blank line between paragraphs, not in our scope. changes that were concerned with end-line hyphenates: #075> improper rejoining of end-line hyphenate. #238> improper rejoining of end-line hyphenate. #199> asterisk on de-hyphenation. not o.c.r. error. #203> asterisk on de-hyphenation. not o.c.r. an error. errors that could've been detected with pre-processing: #026> bad word could've been caught by spellcheck. #070> doubled-up word could've been autodetected. 
#082> garbage character could've been autodetected. #089> garbage character could've been autodetected. correct recognition (might or might not be p-book errors): #047> recognized as is in the p-book; not o.c.r. error. #108> recognized as is in the p-book; not o.c.r. error. actual errors that matter: #160> actual error; stealth scanno too; can't win 'em all... #206> actual error; stealth scanno too; can't win 'em all... ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080324/e200579c/attachment.htm From Bowerbird at aol.com Mon Mar 24 23:34:16 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 02:34:16 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 08 Message-ID: ok, we need a quick reminder before this next series of data about the parallel proofing of "paul and the printing press"... in this series, i'll be quantifying the work required to correct the rotten o.c.r., over and above that which _would've_been_ required had a better (i.e., efficient) workflow been observed. it's important to remember that bad scans and bad o.c.r. are _endemic_ over at distributed proofreaders. because of that, it would be shortsighted indeed to blame piggy himself for the awful o.c.r. with which the proofers were faced. of course he was responsible for the bad page-scans which he created, and for the even-more-flawed decision to use _tesseract_, but there are dozens of content providers at d.p. (maybe hundreds) making equally questionable decisions ramifying in poor quality and wasting the time and energy of the well-intended proofers... therefore, the _real_ incompetence is not located at _that_ level, but the level _above_, which allows this bad work to be tolerated. 
somebody should have -- as suggested earlier -- been taking piggy
aside, quietly instructing him that such a poor quality of o.c.r. is
not permitted, informing him how to do the job better, and giving him
a pat on the back and sending him back to work.

since he didn't realize this himself, somebody needed to tell him.

it's just that simple.

***

now, in order to quantify the poor showing of tesseract, i re-did
the o.c.r. on the scans with abbyy, the acknowledged o.c.r. leader.

even though it's _clear_ that the tesseract output is _quite_bad_,
quantifying it will better illustrate how much energy it's wasting...

doing the o.c.r. was easy. of somewhat more -- manual -- work
was synching the two sets of o.c.r. so that we can _compare_ 'em.

so here we have the original o.c.r., from tesseract:
> http://z-m-l.com/go/paulp/paul-tesseract.html

and here we have the new o.c.r., from finereader:
> http://z-m-l.com/go/paulp/paul-abbyy.html

if you load those two pages into two windows using your browser,
you can compare them straight across, now that i've synched them.

this comparison reveals there is no comparison between them...

but, if you want numbers, they _are_ more alike than different,
with roughly 4000 lines in common, and 3000 lines that differ.

but still, even a quick glance reveals the abbyy output is better.

later today or tomorrow, i will give some statistics to back it up,
but it is clearly observable even in an "eyeball" test like this that
abbyy gets a _lot_ more right than tesseract, by a _large_ margin.

equally clear is that _some_ of the pages need to be re-scanned,
because even abbyy was unable to deal with their "gutter noise"...

now last week, there was some discussion of "unpaper", which
might have led some people to believe that the problems with
the "gutter noise" were _unavoidable_, which is _unfortunate_,
because the pattern of affected pages indicates that this was
a _human_ problem. very simply put, insufficient care was taken.
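the like/unlike line counts above are simple to compute once the two
transcriptions are synched line-for-line. an illustrative python sketch
(the two sample strings are stand-ins, not the actual texts linked above):

```python
# Count how many line-synched lines two OCR transcriptions agree on.
# The two sample strings below are stand-ins for illustration; the
# real tesseract/abbyy outputs live at the URLs given above.

def line_agreement(text_a, text_b):
    """Return (matching, differing) line counts for line-synched texts."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # compare position-by-position, since the texts are already synched
    same = sum(1 for a, b in zip(lines_a, lines_b) if a == b)
    total = min(len(lines_a), len(lines_b))
    return same, total - same

ocr_one = 'Paul lingered the bill nervously.\nFifty dollars!\n'
ocr_two = 'Paul fingered the bill nervously.\nFifty dollars!\n'
same, differ = line_agreement(ocr_one, ocr_two)
```

on the two-line sample this yields one matching line and one differing
line; run over the full synched outputs it would give the roughly
4000-in-common / 3000-differing split reported above.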
those scans didn't need to be "cleaned up". they needed to be _redone_. period. when those pages are re-scanned, they will deliver quality o.c.r. (at least if they're treated by abbyy, which is what i recommend.) again, this is the kind of thing for which you take a person aside, quietly inform them that this level of quality is unacceptable, and then give them a pat on the back and send them back to work... and since nobody did that in this case, i have done it here now. *** scans that can be better if you take sufficient care are unacceptable. and o.c.r. performed with a beta o.c.r. program is fully intolerable... and every single content provider at d.p. should know these things. *** once again, this is _not_ a reflection on, or a criticism of, piggy. i'm sure he's very nice, a good father to his children, and so on. and the fact that he offered up these projects of his for analysis means that he's willing to learn, which is a very admirable trait... so even though he's a step above the average volunteer at d.p. -- in the sense that he's taking on these additional missions -- and thus is probably more likely to be one who _takes_people_ aside, rather than being taken aside himself, the clear lesson is that some of the "step-above" volunteers need to up their game. the people at the bottom of the pile seem to be doing excellent. they are taking shit and turning it into shinola -- no small task... if only the people at the _front_ end of the workflow would stop "injecting" so many "errors" into the text before proofers get it... *** and one more thing, while i'm at it... i spent literally _years_ here complaining about d.p. inefficiency. for a very long time, i resisted getting overly specific, because i wanted to give the d.p. people the opportunity to "save face". they squandered this opportunity, using the flexibility i gave them to lash out at me personally, rather than to clean up their own act. 
this indicates how morally bankrupt their position remains today... believe me, i could have offered up book after book after book as examples of the poor workflow. at any time. and i still can do that. as a social scientist, i know the power of data, and i know it well... i held it in reserve because i know its power, but the _inaction_ of the d.p. "leadership" to correct its flaws, coupled with their clumsy attempts to silence my legitimate charges, now gives me no choice. i'm using _these_ books only because d.p. picked them out itself... and now that i've begun, i'm going to _finish_ the job, completely... i know many of you are tired of these books, and fatigued by data, but i'm gonna continue posting until i've completed the analyses, because down the line i will be referring back to this _solid_data_ whenever i repeat claims about awful d.p. workflow inefficiency... and from now on, i won't be giving d.p. any wiggle room at all... they need to fix their workflow, and they need to start that now. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/7b16b324/attachment-0001.htm From schultzk at uni-trier.de Tue Mar 25 02:07:33 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 25 Mar 2008 10:07:33 +0100 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <200803211952.00866.donovan@abs.net> References: <200803211952.00866.donovan@abs.net> Message-ID: <025F131E-C271-4B6D-BBA4-E4858F3829AB@uni-trier.de> Hi David, I have big problems with your accusations. You mentioned that Bowerbird evidently sabotaged the experiment. What I would like to know is in what way? Or is it that due to his work in connection with the experiment did give you the expected results?
If so there are rules for eliminating such anomalies. I do know what I am talking about. I have the feeling that the experiment did go as you expected and have found that do to BB work the results ended up the way they are. If so either: 1) your hypothesis is wrong 2) you can safely remove BB work as an outlier I would love to scrutinize your academic experiment, but I am sure you would not like the result. Anyway, regards Keith J. Schultz Am 22.03.2008 um 00:52 schrieb D Garcia: > On Friday 21 March 2008 05:13, Bowerbird at aol.com wrote: >> [snip, snip] > > Since bowerbird mentions it, let's review the sum total of his known > proofreading activities at DP. It's quite an enlightening view, and > very > relevant to the discussion. > > As bowerbird, 32 pages back in the years when DP had only two rounds. > > As bradjohnson, 3 pages, account not used in 251 days. > > As haroldjohnson, 4 pages, most recently a single page on March 7, > 2008. > > As ellipsisshellipis, (interesting nick choice), 16 pages on March > 7, 2008 > (the date the account was created), and the 116 pages of "work" in the > experiment project on March 19, 2008. This account was also used to > post a > poll on the DP forums. (See above where bb clearly states his > belief was that > he was explicity banned from posting in the forums.) > > As sandy claws, no pages, but a Christmas Day 2007 posting (the day > the > account was created.) (Again, see above where bb clearly states his > belief > was that he was explicity banned from posting in the forums.) > > Patterns, anyone? > > Out of all the projects available to choose from during all that time, > bowerbird only managed to find *one* that piqued his interest, and > it just so > happened to be the one he's been ever so faithfully posting about > here, in > much less than flattering terms. > > Obviously he understood that he was banned from posting in the DP > forums, and > yet he used two freshly-minted accounts to do exactly that. 
> >>> he recently used one to intentionally >>> sabotage >>> the experiment in continuous proofing >> >> untrue. and a low blow to boot. > > See above. > >> i didn't "sabotage" the experiment. > > The people actually running the experiment at DP say differently, > used far > stronger language in describing his efforts in that project, and > are to me > far more credible as references. > >> i was doing the one thing i was still _allowed_ >> to do at distributed proofreaders, i.e., proof... > > See above for evidence regarding bowerbird's obvious commitment to DP. > >> and i did a darn good job on every page i did. > > Many of our volunteers with bowerbird's level of experience with DP > also > believe the above statement to be true of themselves. > >> no sir, as far as i know, and i would _love_it_ >> if someone pointed out a mistake i had made, >> because i _learn_ from my _mistakes_, i _do_, >> but as far as i know, i made _no_ mistakes on >> the 128+ pages which i proofed... not a one... > > Perhaps bowerbird has chosen to learn from the wrong mistakes. > Let's skip on a bit... > >> there was no :trolling". and there's been no "sabotage". >> >> From julio.reis at tintazul.com.pt Tue Mar 25 06:22:46 2008 From: julio.reis at tintazul.com.pt (=?ISO-8859-1?Q?J=FAlio?= Reis) Date: Tue, 25 Mar 2008 13:22:46 +0000 Subject: [gutvol-d] Unexpected Events In-Reply-To: References: Message-ID: <1206451366.12118.128.camel@abetarda.mshome.net> Unexpected Events in 5 to 10 years: the end of the world. On Thursday 23 April 2015, UNESCO World Book and Copyright Day, Canada, the European Union and about 70 other countries repudiated the Berne Convention and changed copyright law to publication+50. 
The so-called Laval Convention (after the city in Québec where the treaty was signed) was met with much controversy, being supported by cultural organisations both grass-roots and otherwise (like UNESCO), and meeting strong opposition particularly from the White House, the Kremlin and the Australian government, and a huge coalition of media industry giants. The following transition applied in the European Union:

On 1 Jan...   PD includes books:
--------------------------------
2016          author died 1945; end of all national "special cases"
2017          published 1946
2018          published 1949
2019          published 1952
2020          published 1955
2021          published 1958
2022          published 1961
2023          published 1964
2024          published 1967
2025          published 1970
2026          published 1973
2027          published 1976

On 1 Jan 2016, all books published until 1965 became public domain in Canada, and in many nations which formerly followed a life+50 copyright law, like Angola, Chile, all North African countries and New Zealand. The end of Crown Copyright was considered one of the most surprising events in Europe. In fact, King Arthur II is rumoured to have been preparing the Laval Convention since his accession to the throne in November 2008. His was also the proposal that a city in Québec be chosen to sign the Convention, a declaration which stirred some conservative sectors of US politics. The United States Ambassador in London, Jon Huntsman, Jr., called it "a cultural provocation right at our doorstep." This declaration, and the following remark that such a treaty was better suited to being signed "in some forsaken Bulgarian village", cost him his seat. Australia also surprised the world two years later by demarcating itself from US copyright policies, which it had been following in the previous years, and on 23 April 2018 aligned itself with Canada/EU. It used the same transition as the EU for the 2019-2027 period.
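The (fictional) EU transition schedule above follows a simple rule: each 1 January from 2017 through 2027 admits three more publication years to the public domain, starting with books published up to 1946. A sketch checking that arithmetic:

```python
# sketch: the EU transition schedule from the (fictional) table above.
# each 1 Jan from 2017 to 2027 admits three more publication years,
# starting with books published up to 1946.
def pd_cutoff(year):
    if not 2017 <= year <= 2027:
        raise ValueError("outside the 2017-2027 transition")
    return 1946 + 3 * (year - 2017)

print(pd_cutoff(2018), pd_cutoff(2027))  # 1949 1976
```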
(The US would change its copyright laws, but not until 2026, so that unexpected event is out of our 5-10 year horizon.) Júlio. From piggy at netronome.com Tue Mar 25 08:08:32 2008 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 25 Mar 2008 11:08:32 -0400 Subject: [gutvol-d] Moderation/censorship In-Reply-To: <025F131E-C271-4B6D-BBA4-E4858F3829AB@uni-trier.de> References: <200803211952.00866.donovan@abs.net> <025F131E-C271-4B6D-BBA4-E4858F3829AB@uni-trier.de> Message-ID: <47E91570.5040107@netronome.com> Schultz Keith J. wrote: > Hi David, > > I have big problems with your accusations. > You mentioned that Bowerbird evidently sabotaged > the experiment. What I would like to know is > in what way? Or is it that due to his work in > connection with the experiment did give you the > expected results? > > If so there are rules for eliminating such > anomalies. I do know what I am talking about. > > I have the feeling that the experiment did go as > you expected and have found that do to BB work > the results ended up the way they are. > > If so either: > 1) your hypothesis is wrong > 2) you can safely remove BB work as an outlier > > I would love to scrutinize your academic experiment, but > I am sure you would not like the result. > Most of the raw data is available directly to the public: http://www.pgdp.net/wiki/Confidence_in_Page_analysis#Perpetual_P1 Proofer identities are the only protected data. I suggest using check-ins of regular period to make short-range determinations of "same proofer". I think I have adequately demonstrated my willingness to accept formal analysis from all interested volunteers. Yes, the anomalies under discussion hardly spell the end of the experiment. I will even go so far as to say that "sabotage" is too strong a term. What irritated me most is that overenthusiastic participation forced me to do data analysis I hoped to postpone. I was obliged to statistically check the claim that the pages in question were edited offline.
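The wa/w figures that follow are a simple ratio (words altered by later rounds over words seen), and the "1/5th the rate of all other proofers combined" comparison is just a quotient of two such ratios. A sketch, with all counts invented for illustration:

```python
# sketch of the wa/w (words-altered per word-seen) defect metric.
# all counts below are invented for illustration only.
def wa_w(words_altered, words_seen):
    return words_altered / words_seen

one_proofer = wa_w(4, 6000)      # the proofer in question
all_others  = wa_w(100, 30000)   # everyone else combined
print(round(one_proofer / all_others, 3))  # 0.2 -> "1/5th the rate"
```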
I congratulate the proofer in question for their steadily improving skill. In I3 they found defects (wa/w) at 1/5th the rate of all other proofers combined. In I4 they found defects at 1/3rd the rate of all other proofers combined. I see no value in participating in the expected thread on the meaning and validity of these statistics. Yes, there are problems with the wa/w metric. We're working to address them. > Anyway, regards > Keith J. Schultz > > > Am 22.03.2008 um 00:52 schrieb D Garcia: > >> On Friday 21 March 2008 05:13, Bowerbird at aol.com wrote: >> > [snip, snip] > > >> Since bowerbird mentions it, let's review the sum total of his known >> proofreading activities at DP. It's quite an enlightening view, and >> very >> relevant to the discussion. >> >> As bowerbird, 32 pages back in the years when DP had only two rounds. >> >> As bradjohnson, 3 pages, account not used in 251 days. >> >> As haroldjohnson, 4 pages, most recently a single page on March 7, >> 2008. >> >> As ellipsisshellipis, (interesting nick choice), 16 pages on March >> 7, 2008 >> (the date the account was created), and the 116 pages of "work" in the >> experiment project on March 19, 2008. This account was also used to >> post a >> poll on the DP forums. (See above where bb clearly states his >> belief was that >> he was explicity banned from posting in the forums.) >> >> As sandy claws, no pages, but a Christmas Day 2007 posting (the day >> the >> account was created.) (Again, see above where bb clearly states his >> belief >> was that he was explicity banned from posting in the forums.) >> >> Patterns, anyone? >> >> Out of all the projects available to choose from during all that time, >> bowerbird only managed to find *one* that piqued his interest, and >> it just so >> happened to be the one he's been ever so faithfully posting about >> here, in >> much less than flattering terms. 
>> >> Obviously he understood that he was banned from posting in the DP >> forums, and >> yet he used two freshly-minted accounts to do exactly that. >> >> >>>> he recently used one to intentionally >>>> sabotage >>>> the experiment in continuous proofing >>>> >>> untrue. and a low blow to boot. >>> >> See above. >> >> >>> i didn't "sabotage" the experiment. >>> >> The people actually running the experiment at DP say differently, >> used far >> stronger language in describing his efforts in that project, and >> are to me >> far more credible as references. >> >> >>> i was doing the one thing i was still _allowed_ >>> to do at distributed proofreaders, i.e., proof... >>> >> See above for evidence regarding bowerbird's obvious commitment to DP. >> >> >>> and i did a darn good job on every page i did. >>> >> Many of our volunteers with bowerbird's level of experience with DP >> also >> believe the above statement to be true of themselves. >> >> >>> no sir, as far as i know, and i would _love_it_ >>> if someone pointed out a mistake i had made, >>> because i _learn_ from my _mistakes_, i _do_, >>> but as far as i know, i made _no_ mistakes on >>> the 128+ pages which i proofed... not a one... >>> >> Perhaps bowerbird has chosen to learn from the wrong mistakes. >> Let's skip on a bit... >> >> >>> there was no :trolling". and there's been no "sabotage". >>> >>> >>> > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From Bowerbird at aol.com Tue Mar 25 10:17:51 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 13:17:51 EDT Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) Message-ID: piggy said: > the anomalies under discussion hardly spell the end of the experiment. yes, they "hardly" do. 
in fact, since i essentially "cycled through" pages on which there were no errors present, i _sped_up_ the experiment, moving the text along (as best as i could) to the state of "perfect", so that "error injection" -- if it's going to happen -- can _begin_... i intentionally avoided any page that had an outstanding error on it -- even though i'd identified all such pages and could've fixed them -- as i was as curious as anyone else to see if the next proofer caught it, cheering on the proofers with sharp eyes, and booing ones who missed the error _again_ (crap, now we will have to do a whole 'nother round)... and, on another level, i was showing you the _answer_ to your question. you were asking what would happen if you recycled a text "perpetually". the answer is that _someone_ -- in this case it was me -- will eventually decide to analyze _the_entire_text_ and identify the outstanding errors, and -- if they are allowed to do so -- go in and fix them all in one shot. even as it was, one of the proofers called you on the ellipse problem, noting that most of the changes being made had devolved to ellipses, which is why you had to say "ok, from now on, you can ignore ellipses". proofers will not put up with the page-by-page straitjacket for long... especially when a single button-click gets the text of the whole book. > I will even go so far as to say that "sabotage" is too strong a term. it's not just "too strong". it's headed in the completely wrong direction. > What irritated me most is that overenthusiastic participation that's quite a euphemism for a dedicated proofer... > forced me to do data analysis I hoped to postpone. evidently you're not reading my posts, because my messages have shown that i know _exactly_and_precisely_ what's been happening with that text. i know who fixed what, when, and where the remaining errors are... my analyses allow me to see things clearly that yours will _never_ reveal. 
> I was obliged to statistically check the claim that > the pages in question were edited offline. only because you aren't keeping up... otherwise, you would have known my work was excellent. besides, doing a comparison of iteration#4 with iteration#5 is a 5-minute operation. if you didn't want to do it, you could've asked me, and i would have given you a complete report on it... except for the "[**intentional]" tags on that one page, all of my changes revolved around the elimination of spacey ellipses... and i was the one who _found_ the third "error" in that paragraph that allowed me to make the judgment that they were _intentional_, so i _deserved_ to make that change. > I see no value in participating in the expected thread > on the meaning and validity of these statistics. the thread that says there is no meaning or value to your statistics? i don't expect it will be a very long thread, as i've just summed it up. -bowerbird p.s. and keith, thanks for defending me. but i can do it myself. with one hand tied behind my back. these guys have no punch. if you're expressing yourself, then fine, by all means, continue... but if you're doing it to "help" me, save your energy for later on. ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/fbf5be9b/attachment.htm From Bowerbird at aol.com Tue Mar 25 10:38:47 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 13:38:47 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 08 Message-ID: here's a web-page with about 1771 clear differences between the output from tesseract and the output from abby finereader on the d.p. parallel experiment on "paul and the printing press". 
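A difference listing like the one described here — pairs of synched lines where the two engines disagree — is the same "5-minute operation" as comparing two proofing iterations: zip the files and keep the mismatches. A sketch, with invented sample lines:

```python
# sketch: build tesseract/abbyy difference pairs from two synched
# line lists, keeping only lines that disagree (sample data invented).
def difference_pairs(tess_lines, abbyy_lines):
    return [(t, a) for t, a in zip(tess_lines, abbyy_lines) if t != a]

tess  = ['"Say, Dad," he hegan', "the press-room door"]
abbyy = ['"Say, Dad," he began', "the press-room door"]
print(difference_pairs(tess, abbyy))
```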
the top line in each pair is from tesseract, and the bottom line is from abbyy finereader. (there are more differences than this, but these are the ones that are _very_sharp...) here you can see again, in detail, the superiority of finereader over tesseract. why waste proofer eyeballs with inferior o.c.r.? also notice that some pages have a large number of cases here, where characters on the left edge of the scan were contaminated by "noise" from pages which were scanned using insufficient care. again, why waste proofer time and energy on poorly-scanned pages? *** also of note: i improved the synch on the two sets of o.c.r. so if you downloaded these files before, please do it again. here we have the original o.c.r., from tesseract: > http://z-m-l.com/go/paulp/paul-tesseract.html and here we have the new o.c.r., from finereader: > http://z-m-l.com/go/paulp/paul-abbyy.html -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/f235783a/attachment-0001.htm From Bowerbird at aol.com Tue Mar 25 12:03:44 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 15:03:44 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 08 Message-ID: oops! forgot to include the u.r.l. for that comparison-page... here's a web-page with about 1771 clear differences between the output from tesseract and the output from abby finereader on the d.p. parallel experiment on "paul and the printing press". > http://z-m-l.com/go/paulp/1771tess_v_abbyy.html the top line in each pair is from tesseract, and the bottom line is from abbyy finereader. (there are more differences than this, but these are the ones that are _very_sharp...) 
here you can see again, in detail, the superiority of finereader over tesseract. why waste proofer eyeballs with inferior o.c.r.? also notice that some pages have a large number of cases here, where characters on the left edge of the scan were contaminated by "noise" from pages which were scanned using insufficient care. again, why waste proofer time and energy on poorly-scanned pages? *** also of note: i improved the synch on the two sets of o.c.r. so if you downloaded these files before, please do it again. here we have the original o.c.r., from tesseract: > http://z-m-l.com/go/paulp/paul-tesseract.html and here we have the new o.c.r., from finereader: > http://z-m-l.com/go/paulp/paul-abbyy.html -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/97088e9b/attachment.htm From jeroen.mailinglist at bohol.ph Tue Mar 25 15:44:30 2008 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Tue, 25 Mar 2008 23:44:30 +0100 Subject: [gutvol-d] Moderation/censorship In-Reply-To: References: Message-ID: <47E9804E.9000408@bohol.ph> Bowerbird at aol.com wrote: > there was no :trolling". and there's been no "sabotage". > However, he was not able to resist changing the word "troll" into "vvaannddaall ttrroollll" on the page where it occurred (page 045.png of projectID47c4c0eeec634). Let us accept that as the final admission by bowerbird of being a troll. Trolls can do considerable damage to a mailing list and discussion forum. They are a nuisance similar to spam, and should be dealt with in a similar way. That is no more censorship than killing countless mentions of Viagra in your inbox, or removing a guy continuously shouting "fire" from the theater. 
Much serious and fruitful discussion is rendered impossible due to the high noise level introduced by a single person who takes a special pleasure in provoking people, and in frustrating things, as that is apparently the best that person can do. I subscribe to this list, like to read most people's valuable opinions, and add mine once in a while -- and will continue to do so, although I normally ignore our house troll. I invite everybody here to ignore our feathered friend from now on, and, if the nuisance gets too much, move over to the pgdp.net forums, where similar discussions are going on, without this curious part of PG culture. We could translate the issue of trolling to "Real Life" situations, which are somewhat indicative of the difficult issues at stake.... In Holland, we currently have a politician (Geert Wilders) running amok, displaying a near endless passion in his attempts to provoke Muslims, and once they are provoked, claiming, see, I told you so! He shows all the common Internet troll traits. He talks about a movie he has made, and makes all kinds of excuses about not showing it to anybody yet. A couple of years ago, a Dutch documentary maker was killed for making a movie that some Muslims considered insulting. Although I strongly believe people should be free to say what they think about Islam (or any other subject), he is purposely pushing things to the limit. His claims are grossly insulting, racist and irrational. However, almost all Muslims in Holland have remained silent, and instead Jewish organizations started to speak out. Somebody actually took 25 of his public statements, just replaced the word Muslim with Jew, printed them on a pamphlet, distributed it publicly, and got himself arrested for spreading hatred against Jews.
Such spreading of hatred against an ethnic or religious group is, unlike in the US, against the law here, although no action was taken against this politician in over a year of repeated and ever-increasing insults. This is of course a set-up action, planned ahead to go through all levels of courts, and one which will give judges a very hard time applying the law... If you convict for spreading hatred against Jews, but not Muslims, you discriminate in the application of justice; if you do not convict, you ignore the law. Jeroen. From Bowerbird at aol.com Tue Mar 25 16:25:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 19:25:04 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 09 Message-ID: it occurs to me that i have been telling you about the pagescans associated with the parallel test of "paul and the printing press", but i've never actually told you directly where you can view them. of course, the images are always available from the project page: > http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea just find the link that says "view images online" and click that. note that this works all the time, for any book in the system... this one-image display works acceptably well for many purposes. *** in addition, however, for this book, i've put the scans on my website, so you can view them there using my system. for instance, this link will take you to the pagespread for page 32. > http://z-m-l.com/go/paulp/paulpp032w.html in that pagespread view, you can click the right page to go ahead, or click the left page to go backward in the book. or you can also use the links spread across the top of the two-page pagespread... (the "-chap-" and "+chap+" buttons can be very useful at times, for some purposes, because they skip from chapter to chapter...) i prefer this pagespread view, as it's more practical in some situations, _and_ it's twice as fast...
:+) *** at any rate, as you step through the book's pagespreads, you'll see that the "gutter noise" problem was _intermittent_, which indicates that it was caused by insufficient care being taken on some pages... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/40a2f08c/attachment.htm From Bowerbird at aol.com Tue Mar 25 16:53:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 25 Mar 2008 19:53:02 EDT Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) Message-ID: jeroen said: > Let us accept that as the final admission by bowerbird of being a troll. it was no such "admission" at all. (but interesting attempt at spin, jeroen.) no, it was a taunt, throwing back the word you people have thrown at me... the initial change was "troll" to "ttrroollll". and then, when piggy called that "vandalism" at one point -- at the time, i don't believe he knew who had done it -- i went in on the next round and added "vvaannddaall" to it. i knew these words would pop up in a spellcheck in post-processing -- if they even managed to get _that_ far -- so there'd be no damage. but yes, with all those caveats, i was sending a message to you, letting you know that it was _bowerbird_ who proofed that page. (just like when i posted that poll over on the d.p. forums, i included a gag response that mentioned "pudding". aha!) and piggy thought "vvaannddllee" was amusing enough that he actually put it on his blog as a word that he had made up. so at least _someone_ has an appropriate sense of humor there... > Trolls can do considerable damage > to a mailing list and discussion forum. 
so can small-minded people who can't deal with the logic, so they tar the other person with false charges, like "troll"... in the old days, we used to call that "ad hominem"... for years, you guys argued with me incessantly, and then had the _gall_ to _blame_me_ because you said i "wanted" a fight; as you put it "who takes a special pleasure in provoking people". you couldn't even take responsibility for your own behavior. and you still can't. so while i'm putting up post after post with hard solid _data_, you counter with this weak-ass whine. i don't provoke _people_, i provoke _thought_... none of you seem to have absolutely _anything_ to contribute in terms of intellectual discussion. it's somewhat _amazing_... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080325/0b604c55/attachment.htm From hyphen at hyphenologist.co.uk Wed Mar 26 01:14:23 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Wed, 26 Mar 2008 08:14:23 -0000 Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) In-Reply-To: References: Message-ID: <001501c88f19$685e1610$391a4230$@co.uk> Bowerbird at aol.com wrote jeroen said: >> Let us accept that as the final admission by bowerbird of being a troll. > it was no such "admission" at all. (but interesting attempt at spin, jeroen.) In my opinion bowerbird is not a troll. The vast majority of his posts are On Topic. Expressing opinions with which others do not agree is not Trolling, it is encouraging informed debate Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/8a13f376/attachment.htm From schultzk at uni-trier.de Wed Mar 26 02:32:57 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 26 Mar 2008 10:32:57 +0100 Subject: [gutvol-d] moderation/censorship (let us celebrate because p.g. has renounced them!) In-Reply-To: References: Message-ID: <397BD29A-CE11-42FE-83E8-7A5F55625AFE@uni-trier.de> Hi Bowerbird, More or less expressing myself. In the works you are "defended"(?). I would say La Monte refuted the claims I had commented on and proved my points. regards Keith Am 25.03.2008 um 18:17 schrieb Bowerbird at aol.com: > [snip, snip] > -bowerbird > > p.s. and keith, thanks for defending me. but i can do it myself. > with one hand tied behind my back. these guys have no punch. > if you're expressing yourself, then fine, by all means, continue... > but if you're doing it to "help" me, save your energy for later on. > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/41d68cf4/attachment-0001.htm From Bowerbird at aol.com Wed Mar 26 03:29:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 26 Mar 2008 06:29:17 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 09 Message-ID: oh-ho, you're gonna like this one... it's a merge of the tesseract and abby output, with the 4000-5000 identical lines in _black_. it doesn't necessarily mean that they're _correct_; usually, but _can_ mean they have identical errors. the lines which show a difference are both listed, with the top line tesseract and the bottom abby... moreover, if the lines differ in some "quasi" way -- basically whitespace or em-dashes right now -- the tesseract line is magenta, the bottom abby blue. and when lines differ in some more substantial way, the top tesseract line is red, the bottom abby blue... color makes them stand out for easier examination. 
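The magenta/red ("quasi" vs substantial) split described above amounts to normalizing away whitespace and em-dash differences before comparing. A rough sketch:

```python
import re

# sketch of the "quasi" vs "substantial" difference split described
# above: differences that disappear when whitespace and em-dash
# spelling are normalized are quasi; anything else is substantial.
def normalize(s):
    s = s.replace("--", "\u2014")          # ascii double-hyphen -> em-dash
    return re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace

def classify(tess_line, abbyy_line):
    if tess_line == abbyy_line:
        return "identical"
    if normalize(tess_line) == normalize(abbyy_line):
        return "quasi"
    return "substantial"

print(classify("stop--now", "stop\u2014now"))                # quasi
print(classify("the pr1nting press", "the printing press"))  # substantial
```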
you can learn a lot -- especially on auto-correction of o.c.r. errors -- by studying these difference-pairs. take my word for it... > http://z-m-l.com/go/paulp/abbyytessmerge01.html -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/36353cc6/attachment.htm From Bowerbird at aol.com Wed Mar 26 15:04:56 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 26 Mar 2008 18:04:56 EDT Subject: [gutvol-d] on the rejoining of end-line-hyphenates Message-ID: third-party dialog is a real pain in the ass... speaking of the hind quarters, that appears to be where big_bill has his head, again, and he's pontificating, again, this time on the "need" to rejoin end-of-line hyphenates... the second-grader lecturing to us as if we are first-graders... let's pretend for a minute bill's correct (even though he's not), and that a "best practices" digitization workflow would indeed rejoin end-of-line hyphenates. (it wouldn't; we're pretending.) even in this situation, it's _stupid_ to have _proofers_ rejoining. no, instead, just have the _computer_ do it. first, human energy is precious, and thus should be conserved. second, humans err. we make mistakes. some say it's the _essence_ of being human. and mistakes on the rejoining then have to get fixed themselves. so have the computer do it. in other words, have your pre-processing tool rejoin hyphenates. it can do a better job anyway, since it can access your dictionary... it also should "clothe" the em-dashes, if you're going to do that. no, that's not a "best practice" in digitization either, knucklehead. but you do it nonetheless. (that "clothe" word has to be one of the _stupidest_ words in the whole jargon of distributed proofreaders.) 
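[editor's note: the dictionary-assisted rejoining described here might look roughly like this. It is an illustrative sketch, not anyone's actual tool; the tiny `WORDLIST` set stands in for a real dictionary, and `rejoin_hyphenates` is a hypothetical helper. When the joined form is not attested, the hyphen is left for a human, which is the conservative behavior the thread keeps circling around.]

```python
# stand-in for a full dictionary; a real tool would load a wordlist file
WORDLIST = {"morning", "background", "together"}

def rejoin_hyphenates(lines, wordlist=WORDLIST):
    """Rejoin end-of-line hyphenates when the dictionary attests the
    joined form; otherwise leave the hyphen in place for a human."""
    out = list(lines)
    for i in range(len(out) - 1):
        if out[i].endswith("-"):
            head = out[i][:-1].rsplit(" ", 1)[-1]   # fragment before the hyphen
            tail_parts = out[i + 1].split(" ", 1)
            tail = tail_parts[0]                     # fragment after the break
            if (head + tail).lower() in wordlist:
                # dictionary says the joined form is a word: pull the
                # tail fragment up and drop the line-break hyphen
                out[i] = out[i][:-1] + tail
                out[i + 1] = tail_parts[1] if len(tail_parts) > 1 else ""
    return out

print(rejoin_hyphenates(["the sun rose that morn-", "ing over the hills"]))
```

[a pair like "well-" / "known" would be left untouched here, since "wellknown" is not in the wordlist; that is deliberate.]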
and have your tool close up spacey ellipses while you're at it. then your humans won't have to do all these _mundane_ tasks. because humans should _never_ have to do those mundane tasks. that's _precisely_ why i refuse to charge humans with an "error" when they "fail" to accomplish your busy-work "requirements"... so -- if you must follow this pretend "best-practice" -- at _least_ have the decency to have the computer do all the routine work... after all, that's what it's good for. understand? or did you even hear, with your head up your butt? -bowerbird p.s. the _real_ "best practices" on end-line hyphenates, exactly like everything, is to give users the option to _choose_ what they want... ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/99e0aef7/attachment.htm From paulmaas at airpost.net Wed Mar 26 16:29:13 2008 From: paulmaas at airpost.net (Paul Maas) Date: Wed, 26 Mar 2008 16:29:13 -0700 Subject: [gutvol-d] on the rejoining of end-line-hyphenates In-Reply-To: References: Message-ID: <1206574153.16077.1244489543@webmail.messagingengine.com> The "head up the ****" is uncalled for. On Wed, 26 Mar 2008 18:04:56 EDT, Bowerbird at aol.com said: > third-party dialog is a real pain in the ass... > > speaking of the hind quarters, that appears to be where > big_bill has his head, again, and he's pontificating, again, > this time on the "need" to rejoin end-of-line hyphenates... > > the second-grader lecturing to us as if we are first-graders... > > let's pretend for a minute bill's correct (even though he's not), > and that a "best practices" digitization workflow would indeed > rejoin end-of-line hyphenates. (it wouldn't; we're pretending.) 
> > even in this situation, it's _stupid_ to have _proofers_ rejoining. > > no, instead, just have the _computer_ do it. first, human energy > is precious, and thus should be conserved. second, humans err. > we make mistakes. some say it's the _essence_ of being human. > and mistakes on the rejoining then have to get fixed themselves. > > so have the computer do it. > > in other words, have your pre-processing tool rejoin hyphenates. > it can do a better job anyway, since it can access your dictionary... > > it also should "clothe" the em-dashes, if you're going to do that. > no, that's not a "best practice" in digitization either, knucklehead. > but you do it nonetheless. (that "clothe" word has to be one of the > _stupidest_ words in the whole jargon of distributed proofreaders.) > > and have your tool close up spacey ellipses while you're at it. > > then your humans won't have to do all these _mundane_ tasks. > > because humans should _never_ have to do those mundane tasks. > > that's _precisely_ why i refuse to charge humans with an "error" > when they "fail" to accomplish your busy-work "requirements"... > > so -- if you must follow this pretend "best-practice" -- at _least_ > have the decency to have the computer do all the routine work... > after all, that's what it's good for. > > understand? or did you even hear, with your head up your butt? > > -bowerbird > > p.s. the _real_ "best practices" on end-line hyphenates, exactly like > everything, is to give users the option to _choose_ what they want... > > > > ************** > Create a Home Theater Like the Pros. Watch the video on AOL > Home. 
> > (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001)

-- Paul Maas paulmaas at airpost.net

-- http://www.fastmail.fm - Accessible with your email software or over the web

From Bowerbird at aol.com Wed Mar 26 17:35:03 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 26 Mar 2008 20:35:03 EDT
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
Message-ID:

paul said: > The "head up the ****" is uncalled for.

i agree! help him pull it out!

-bowerbird

************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080326/853220bb/attachment.htm

From hyphen at hyphenologist.co.uk Wed Mar 26 23:26:53 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Thu, 27 Mar 2008 06:26:53 -0000
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
In-Reply-To: References: Message-ID: <000001c88fd3$8d686680$a8393380$@co.uk>

Bowerbird at aol.com wrote
> speaking of the hind quarters, that appears to be where
> big_bill has his head, again, and he's pontificating, again,
> this time on the "need" to rejoin end-of-line hyphenates...

Just a word on end of line hyphenates, and words with link hyphens.

If a word with a link hyphen appears at the end of a line, the line is often broken after the link hyphen. Because link hyphens disappear in later and/or modern usage, it is often impossible to tell from an old text if a hyphen at the end of a line is a link hyphen or a hyphenated word, and so whether it should be rejoined or not. If in Shakespeare one finds "bed- room", should an etext contain bed-room or bedroom? See below.

This is in practice of no interest to the general reader; it is of interest only to the terminal pedant and academics.
Thus when processing much less important texts, I just do what seems right to me at the time.

Ronald C McIntosh wrote a book on the subject, Computer Hyphenation, which should be on the web but isn't. Here is the relevant bit.

>>> Chapter 10: The hyphened word

Many words entered the language by first being joined together by a link-hyphen, which created a new compound word. With the passing of time most of these words became fully assimilated, eventually (but not always) dropping the hyphen. The following words occurred only once in Shakespeare's works, printed with link-hyphens, and all passed smoothly into the language: tear-ful blood-thirsty bed-room gentle-folks dis-agree out-break tear-stained earth-bound. Others failed to make the grade, such as: temple-haunting (Macbeth) and cloud-kissing (Lucrece). The successful words may have been happy inventions of the moment, fruits of bardic genius, but we might suspect that sometimes they were already established, or had been overheard by the playwright in his favourite hostelry.

When Samuel Pepys was writing his diary, "every body" was two words, still awaiting either a link-hyphen or the moment when two words would suddenly be one. This process is never-ending; in particular the creation of new words is intense in America, where the link-hyphen is speedily dispensed with. Space-suit and moon-walk probably dropped their hyphens in the second edition of the newspapers which reported them.

This poses another kind of problem for the printer, since many people will be uncertain whether a particular link-hyphen is necessary, and be inclined to leave it out. Nobody is likely to revive the hyphens in common words like newspaper and postman, which once were innovatory, but it is more difficult to judge modern words such as antitoxin, coaxial and coexistence.
Except where there is a grammatical reason, or where the meaning could be obscured, writers may consider it safe to leave the hyphen out, perhaps posing a problem for the computer.

Sir John Murray, editor of the massive OED (grandmother of every subsequent English dictionary), gave examples of meaning confirmed by the hyphen:

"a day well remembered" but "a well-remembered day"
"a sea of deep green" but "a deep-green sea"

Fowler's MODERN ENGLISH USAGE offers:

"an infallible wrinkle-remover"
"the ex-Tory Solicitor-General for Scotland" (i.e. the Solicitor-General who formerly was a Tory)
ne'er-do-well; stick-in-the-mud; what's-his-name

This is an open-ended subject since users of English feel free to innovate (so to speak) on the hoof. The link can be useful to avoid ambiguity: re-form (=form again) as against reform (improve), and re-signing a document as against resigning an employment. There are thousands of significant examples, some of them already challenged and proven in courts of law.

In earlier times the hyphen was often pressed into service to solve orthographic needs, producing some strange oddities in the process. In jury records of 1658 the Puritans' elaborate compound names are graphic descriptions of their personalities, e.g. Search-the-scriptures Morton might serve beside Strong-in-the-faith Jenkinson. Their names may have been inspired by colourful long names in the Bible, such as Maher-shalal-hash-baz (Isaiah VII i).

British aristocrats are still fond of their double and triple-barrelled names. The Lady Caroline Jemima (1858-1946) had a five-barrelled one: Temple-Nugent-Chandos-Brydges-Grenville. A modern daring explorer, much in the public eye, is Sir Ranulf Twisleton-Wykeham-Fiennes.

The most prestigious and valuable hyphen in business history was probably the one invented by a Mr Royce when he joined the company which had started out as Rolls & Co.
He suggested to his new partner that they should call the business Royce-Rolls, but in 1906 they eventually chose Rolls-Royce, which became the world's most famous trademark. In 1977 a play was put on in London's West End: Rolls Hyphen Royce. <<< Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/10fabe8c/attachment-0001.htm From Bowerbird at aol.com Thu Mar 27 00:28:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 27 Mar 2008 03:28:58 EDT Subject: [gutvol-d] on the rejoining of end-line-hyphenates Message-ID: i wondered whether -- in responding to paul's prudishness -- i should have asked him whether he wanted to go on-topic and have a thoughtful discussion about end-line hyphenates, or not. i decided it was pretty clear that he didn't. especially because, when we were done with the discussion, it would have become abundantly clear to _everyone_ here just exactly how _far_ up his butt big_bill's head actually is... so i opted for the quick reply instead... but now that _dave_ has brought up the subject... ;+) -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/1eccdb6a/attachment.htm From Bowerbird at aol.com Thu Mar 27 08:43:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 27 Mar 2008 11:43:14 EDT Subject: [gutvol-d] on the rejoining of end-line-hyphenates Message-ID: oh, yeah, since i said i wasn't gonna give d.p. 
any "wiggle room" any more, i should have added this: the very _idea_ that rejoining end-line hyphenates and "clothing" end-line em-dashes could be done by the computer, rather than by human proofers, seems not to have even occurred to d.p. "leaders", let alone been _acted_upon_ by them, which seems to me to be an absolutely astonishing fact... but there it is... -bowerbird p.s. i was gonna say "even donned on d.p.", but googling that made me insecure about it, since "idea donned" got _97_ hits, while "idea dawned" got _9,280_. so either my memory of the word is severely flawed, or _lots_ of people are confused. 100-1 wrong is the biggest imbalance i ever saw! (although an idea "dawning" is a little bit poetic.) either way, i decided to go with a bland "occurred". ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/de4dac20/attachment.htm From vze3rknp at verizon.net Thu Mar 27 09:00:46 2008 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Thu, 27 Mar 2008 12:00:46 -0400 Subject: [gutvol-d] on the rejoining of end-line-hyphenates In-Reply-To: References: Message-ID: <47EBC4AE.6000209@verizon.net> Bowerbird at aol.com wrote: > the very _idea_ that rejoining end-line hyphenates > and "clothing" end-line em-dashes could be done > by the computer, rather than by human proofers, > seems not to have even occurred to d.p. "leaders", > let alone been _acted_upon_ by them, which seems > to me to be an absolutely astonishing fact... As it happens, most content providers at DP have been doing automatic hyphenation correction for years. Towards the end of 2002, Charles Aldarondo wrote a nice little perl script that made clever use of Finereader's dehyphenation capabilities. 
He and I were both using it, as well as several of the other major providers of content. When thundergnat wrote guiprep, one of the prime features that he included in it was a dehyphenation tool. It can be used either in the same way that Aldarondo did originally (comparing versions from Finereader where one had dehyphenation and one didn't) or in a mode where it looks for other, non-hyphenated examples of the word in the book. In either case, when it is sure, it just rejoins the hyphen. When it isn't, it leaves the hyphen in place. It's far from perfect and could use some serious revision, but at least it covers the most obvious cases. guiprep also "clothes" em-dashes automatically, which can lead to some, ah, interesting results when it comes to poetry.

We intentionally don't do dehyphenation on Beginner projects, so that they will learn what it is and how to do it. And perhaps it was not done in the test projects that bowerbird has been writing about. But it is certainly done for the majority of projects and has been for the last five years. If bowerbird had had a little more experience with proofreading at DP, he would certainly have observed this.

bowerbird is occasionally right about things, but in this case, he is totally off base.

JulietS

From Bowerbird at aol.com Thu Mar 27 10:04:18 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 27 Mar 2008 13:04:18 EDT
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
Message-ID:

juliet said: > If bowerbird had had a little more experience with proofreading at DP

oh please, juliet. i can point to literally _hundreds_ of projects -- actually, probably thousands if i still had access over there -- which clearly and obviously have not been auto-dehyphenated. i'd wager that for every auto-dehyphenated file you can point to, i can point to _5_ others which were _not_ auto-dehyphenated... (and if the wager isn't too big, i'd say i'll make the margin 10-1.)
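[editor's note: the in-book-evidence mode Juliet describes (rejoin only when the rest of the same book attests the unhyphenated form, otherwise leave the hyphen) might be sketched like this. This is a guess at the logic, in Python rather than guiprep's Perl, not guiprep's actual implementation; `resolve_eol_hyphen` is a hypothetical name.]

```python
import re

def resolve_eol_hyphen(head, tail, book_text):
    """Decide whether an end-of-line 'head-' / 'tail' pair should be
    rejoined, using the book itself as evidence. Returns the resolved
    word, keeping the hyphen whenever the evidence is ambiguous."""
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", book_text.lower())
    solid = words.count((head + tail).lower())          # e.g. "bedroom"
    hyphened = words.count(f"{head}-{tail}".lower())    # e.g. "bed-room"
    if solid and not hyphened:
        return head + tail        # book consistently writes it solid
    if hyphened and not solid:
        return f"{head}-{tail}"   # book consistently hyphenates it
    return f"{head}-{tail}"       # no evidence or conflicting: leave it

book = "She left the bedroom. The bedroom door was shut."
print(resolve_eol_hyphen("bed", "room", book))
```

[unlike a fixed dictionary, this uses the book's own usage, which is also Gardner Buchanan's rule (1) later in the thread; when the book itself is inconsistent, the hyphen survives for a human to judge.]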
maybe it is true only "beginners" get such files, i wouldn't know. but it makes the least sense of all to have _beginners_ doing a job that is error-prone and accomplished much better by a computer. i guess maybe it's a part of the hazing process? (that's commentary, folks. d.p. has no official hazing process.)

in any case, it's clear that many of your content providers are ignorant of dehyphenation capabilities offered by your tools... even big_bill seems clueless, as evidenced by his statements... moreover, even the content providers who _do_ know about it don't seem to use the capability very much, as far as i can see...

but if it makes you feel better, i'll check this out on your tools, and i'm glad the idea has _occurred_ to you, even if you still don't seem to have _acted_upon_ it as well as you might have. (the caveat about poetry carries no weight with me, because i know a method that lets one avoid that problem, a method which provides additional benefits as well to the proofers...)

-bowerbird

************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15?ncid=aolhom00030000000001)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080327/26d6d9d6/attachment.htm

From gbuchana at teksavvy.com Thu Mar 27 16:06:19 2008
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 27 Mar 2008 19:06:19 -0400
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
In-Reply-To: <000001c88fd3$8d686680$a8393380$@co.uk>
References: <000001c88fd3$8d686680$a8393380$@co.uk>
Message-ID: <47EC286B.3050207@teksavvy.com>

Dave Fawthrop wrote:
> > If a word with a link hyphen appears at the end of a line
> the line is often broken after the link hyphen.
> Because link hyphens disappear in later and/or modern usage,
> it is often impossible to tell from an old text if a hyphen at
> the end of a line is a link hyphen or a hyphenated word,
> and so if it should be rejoined or not.

The rules I follow are:
(1) if there is another occurrence of this word in the same book, do what it does.
(2) if contemporary books or other books by the same author use this word, do what they do.
(3) use the modern convention.

I find more often that the book is not actually self-consistent than that I can't find a way to resolve a hyphen.

============================================================
Gardner Buchanan Ottawa, ON
FreeBSD: Where you want to go. Today.

From prosfilaes at gmail.com Thu Mar 27 17:28:30 2008
From: prosfilaes at gmail.com (David Starner)
Date: Thu, 27 Mar 2008 20:28:30 -0400
Subject: [gutvol-d] on the rejoining of end-line-hyphenates
In-Reply-To: <1206574153.16077.1244489543@webmail.messagingengine.com>
References: <1206574153.16077.1244489543@webmail.messagingengine.com>
Message-ID: <6d99d1fd0803271728h66d464blc5daaf9b659f0b6c@mail.gmail.com>

On Wed, Mar 26, 2008 at 7:29 PM, Paul Maas wrote:
> The "head up the ****" is uncalled for.

Then why did you forward it to all of us who have Bowerbird nice and killfiled?

From Bowerbird at aol.com Thu Mar 27 23:37:44 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 28 Mar 2008 02:37:44 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 10
Message-ID:

wow. i'm already up to 10 posts in this parallel series. (a little word-joke there.) time to take stock...

this is about "paul and the printing press", the book being used for the experiment in _parallel_proofing_ by distributed proofreaders. to remind you, the project page for this book can be found here:
> http://www.pgdp.net/c/project.php?id=projectID45ca8b8ccdbea

and i've also put the scans for this book on my website for viewing:
>
http://z-m-l.com/go/paulp/paulpp032w.html

so what have we learned so far? well, not much, really, especially if you remember that -- at the outset -- we _already_knew_ that parallel proofing has a long glorious track record. so it's not as if we needed any "confirmation" of that from a d.p. experiment.

ironically however -- or maybe not, if you look at it from my perspective -- what we _did_ find is more solid evidence of the incompetence over at d.p.

so what have we learned about _that_, from this experiment?

0. d.p. needs to tighten the quality standards it judges acceptable.
1. one needs to start with good page-scans. re-scan if necessary.
2. one needs to use a _good_ o.c.r. program, like abbyy finereader.

now... where have i heard this before? because it sounds familiar... oh yeah, i remember, on my "10 points that d.p. needs to improve"...

> 1. ensure you have decent scans, and name them intelligently.
> 2. use a decent o.c.r. program, and ensure quality results.
> 3. do not tolerate bad text handling by content providers.
> 4. do a decent post-o.c.r. cleanup, before _any_ proofing.
> 5. retain linebreaks (don't rejoin hyphenates or clothe em-dashes).
> 6. change the ridiculous ellipse policy to something sensible.
> 7. stop doing small-cap markup with no semantic meaning.
> 8. i forget what 8 was for.
> 9. retain pagenumber information, in an unobtrusive manner.
> 10. format the ascii version using light markup, for auto-html.

wow, look at that... we have an exact match on number 1 and number 2. yes sir, this chain was being undermined by some very weak links right there at the very _start_ of this project, right at the _outset_...

***

bad scans caused hundreds and hundreds of unnecessary errors on this book, then bad o.c.r. made hundreds and hundreds more. when i say "bad scans", i mean that they were carelessly done, in a way that left a clear sign of incompetence on many of them, one which would cause problems with even a good o.c.r. app...
and when i say "bad o.c.r.", i mean the o.c.r. was done using _tesseract_, which is a "beta" o.c.r. app that works kinda funky. for instance, it lost all of the blank lines between paragraphs... there were errors on _every_single_page_ in this entire book... and many pages had an error in _almost_every_single_line_...

like "planet strappers", an incompetent content provider caused grief that cost volunteer proofers _lots_ of their time and energy, the time and energy they donate, in good faith, to a good cause...

so p1 was required to make what i'd estimate as _1,750_ changes, with a huge percentage of the changes being totally unnecessary, caused by incompetence that injected errors before any proofing. the content provider could've redone their incompetent work in much less time than was spent by proofers fixing their mistake... when you treat these people like guinea piggies, they will leave, and never come back again. is that really what you want to do?

even so, the p1 proofers were miracle workers, and transformed 76 of the pages to perfection, with no further changes entered... moreover, even on the pages they "failed" to take _all_the_way_ to perfection, they got 'em very close. p2 only had to correct a small percentage of the 6500+ lines in this 200+page book, yet their 122 corrections brought another _116_ pages to perfection. (and, like p1, even the pages that weren't perfect were improved.)

so p1 and p2 combined to take _84%_ of the pages to perfection! (for the record, this suggests overall d.p. accuracy is right at 60%.)

this left p3 to finish _36_ pages, where they had to make changes to a tiny number of lines (39) to get us to (an assumed) perfection.

altogether, p1 had just 161 lines later changed by p2 and/or p3...
these lines are listed for your viewing pleasure on this web-page: > http://z-m-l.com/go/paulp/paul-p1-p3-161changes.html again, considering how rotten the scans were, the fact that p1 took _all_but_161_ of 6500+ lines to _assumed_perfection_ is amazing! (and as minuscule as that number is, it still fails to describe the awesomeness of the performance of p1, as i will discuss later.) so once again, we get the pattern i've discussed all along, the pattern that seems to capture a "common-sense" take, which is that p1 fixes most of the errors, p2 gets most of the remaining ones, and p3 comes in and does clean-up. again, this is the pattern you get on page after page, in book after book, day after day, over in d.p.-land... *** i said "assumed perfection" because we _defined_ the pages as "perfect" after p3, for the expedience of evaluating quality, but even now, though, our suspicion is that _some_ errors remain... indeed, p3 _did_ leave errors, which we can pin-point due to that parallel round of p1 proofing. yes sir, p1 proofers who did the second parallel found errors the p3 "marines" missed! but hey, by now, that shouldn't be surprising to you. of course, p3 had found errors which parallel#2 had missed, so nobody can make a clear claim of superiority based on this data. then again, p1 never claimed that it was _better_ than p3, did it? certainly never for _parallel#2_. why, that would be _heretical_! *** once before, i've mentioned these errors p3 missed. what are they? stick around for the next messages in this series, when i reveal them. *** so, after dealing with the incompetence of the content provider and demonstrating the kick-ass quality shown by the normal p1 round, we're finally able to address the assessment of the parallel proofing. finally... as i said at the outset, parallel proofing works, and we know it works. it's already proven itself over and over again, so who needs a "test"?... 
well, sure enough, it proved it again here, where 2 parallel rounds of p1 proofing produced results as good as the normal p1-p2-p3. the parallel proofers missed a few things the serial proofers found, and vice versa, but their overall performance was chillingly similar...

so, just as with the previous experiment using "planet strappers", the results fail to support a contention that p3 are better proofers. the parallel round of p2 matched up, and it matched up very well.

also extremely spooky was the similarity of the parallel proofings... both of them had to make an estimated 1,750 changes to the text, but when i analyzed their real differences, there were under 100...

and all of this recalls the eerie findings on "planet strappers", where results were so identical that they were positively freaky. i joke when i call d.p. a "cult", but i'm wondering if there _is_ something in the water over there, because this is _strange_.

***

one good question would've been whether it's _cost-effective_ to make two groups of proofers find and fix the exact same errors, which is what parallel proofing forces people to do, unfortunately. regrettably, there was little in the design of this "experiment" which would help us to _answer_ that more-interesting question, however.

having closely examined the sad o.c.r. produced by these bad scans, though, i can safely say that it was _not_ cost-effective on this book... just on its face, we _know_ it's a waste of their time and energy to have proofers correct an estimated 1,750 errors a second time, just so they will catch a half-dozen errors which were missed the first time around. that's a no-brainer.

maybe on a clean book, parallel proofing would be cost-effective. but on a dirty book like this one, it's clear that it's a bad decision...

***

however, since that time and energy was already wasted on this book, let us rejoice in the fact that we have now caught those 6 new errors...
with less than 100 real differences between the two parallel proofings, perhaps less than 50, it was not difficult to do the resolution of them... and now we have a book we can justifiably feel is remarkably clean...

***

so, what do we still need to do to finish the analyses of this book?

***

first, i need to show you those errors that p3 missed, as well as get some feedback on some other possible errors that turned up.

***

i'll also be comparing tesseract's output with o.c.r. from finereader, so i can _quantify_ exactly how many "excess" errors tesseract had. this web-page shows about 1771 clear differences between them:
> http://z-m-l.com/go/paulp/1771tess_v_abbyy.html

here's a very colorful _merge_ of the tesseract and abbyy output, with their identical lines in black, and the differing lines in color:
> http://z-m-l.com/go/paulp/abbyytessmerge02.html

i'll do more work on resolving these 2 sets of o.c.r. but since they are both besieged by problems from the bad pages, it's pointless to try to do much with that resolved data, as it'll always be flawed. but if piggy were to re-scan the bad pages, that would prove useful. i'll also compare the _good_ abbyy o.c.r. with our refined output, so we know exactly how close we could've gotten with good o.c.r.

***

finally, i'm gonna take a look at the 161 lines p1 "failed" to perfect. you might (or might not) have noticed that was _new_ information which i just dropped into this "summary" in a fairly quiet fashion... nonetheless, i think it's an _extremely_important_ fact to process...

p1 made an estimated 1,750 changes, including some that required the removal of garbage characters from the left margin, and then a keying in of totally absent words, yet by the time that p1 was done, p2 and p3 were left with a mere 161 lines (out of 6500+) to correct.

that's a huge drop, from 1,750 changes (p1) to 122 (p2) to 39 (p3).
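[editor's note: taking the change counts reported in this post at face value (they are the poster's estimates, not independently verified), the per-round drop is easy to quantify:]

```python
# Counts reported in the post above: estimated changes per round,
# total lines in the book, and lines p1 left for later rounds.
p1_changes, p2_changes, p3_changes = 1750, 122, 39
total_lines, lines_left_after_p1 = 6500, 161

drop_p1_p2 = 100 * (1 - p2_changes / p1_changes)
drop_p2_p3 = 100 * (1 - p3_changes / p2_changes)
pct_left = 100 * lines_left_after_p1 / total_lines

print(f"p1 -> p2: {drop_p1_p2:.0f}% fewer changes")            # ~93%
print(f"p2 -> p3: {drop_p2_p3:.0f}% fewer changes")            # ~68%
print(f"lines still needing fixes after p1: {pct_left:.1f}%")  # ~2.5%
```

[so on these numbers roughly 97.5% of lines were already final after a single round, which is the fact the rest of the post leans on.]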
_especially_ when you consider the huge p2 and p3 backlogs at d.p., the idea that the p1 people can take a book that close to perfection, only to have it then sit for months or years in a queue, is... i dunno, take your pick of "irritating", or "sad", or "troubling", or "curious", or insert your own word here to describe your reaction to that situation.

and then multiply it by 4 when i tell you that the number of errors might have gone as low as 40, if a simple clean-up tool was used. (don't quote me yet on that number; wait until i prove it to you...)

***

ok, so i'm done taking stock. but maybe you're sitting there with an empty feeling inside, and maybe you don't know exactly why... i can tell you why. it's because this experiment was supposedly geared to answering the question about "confidence in page"... that is, how can we know that a page is "done" being proofed? so what do we have in the form of an answer to that question? well... not much...

that's because -- as some of its own people have pointed out -- d.p. doesn't really _do_ its experiments in the "scientific" mold; you know, where you frame hypotheses and develop a means of collecting data from randomly-assigned conditions that will provide evidence that disconfirms the hypotheses you're testing. d.p. experiments are more like "let's try it and see what happens". which is fine, i guess. everybody doesn't have to be a scientist... but when you're analyzing the data, that can be underwhelming.

nonetheless, i did some elaboration that explains to the d.p. people -- if they are willing to listen, always a dubious assumption here -- exactly how they might go about finding an answer to that question, using the data that is already under their noses on every d.p. project. specifically, i analyzed the "progression types" i found in this book:

-> p1-p2-p3 -- 22 pages -- (every round made a change)
-> p1-**-p3 -- 14 pages -- (p1 and p3 made a change)
-> **-p2-p3 -- 1 page -- (p2 and p3 made a change)
->
p1-p2-** -- 116 pages -- (p2 made the last change to the page) ->?? p1-**-** -- 76 pages -- (p1 made the last change to the page) ->?? **-**-** -- 15 pages -- (all these no-change pages were _blank_) we'd like to believe a "no diff" means the page is perfect, or at least that the probability is very high that it's perfect, so the most puzzling pages were those 14 pages where p2 had a "no diff" on the page, but p3 did make a change. yet on those 14 pages were just _2_ actual, troubling errors. (and the parallel p1 proofers found and fixed both of them.) so i would suggest once a "no diff" round has been obtained, you could reasonably assume that the page is "clean enough". had that been done in this book, you would've saved the work of the additional round of proofing on 90 pages (76 plus 14), at the cost of missing 2 errors. that sounds reasonable to me. this rule-of-thumb would also advise that 22 pages (p1-p2-p3) should be subjected to one more round (at least) for a "no diff". i know if _i_ had to guess which pages might still harbor errors, i would guess those 22 pages, since they haven't been "verified". and, just to be clear, i am _not_ advocating that _one_ "no diff" should be used as the cutoff. i usually suggest _two_ of them. it might be overkill, but i would still use _two_ to start out with, especially since all the rounds would be done by p1 proofers, who are quite plentiful. and using p1 proofers to push books _all_ the way to perfection, not all-but-161-of-6,500-lines, sounds like a very intelligent use of resources, it surely does... -bowerbird ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/1f28166b/attachment-0001.htm
From Bowerbird at aol.com Fri Mar 28 11:49:41 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 28 Mar 2008 14:49:41 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 11
Message-ID:

in this post, on the parallel test of "paul and the printing press",
we learn what a 4th round of proofing, after a normal p1-p2-p3, buys us.

the (ordinary and lowly) p1 proofers who did this round found
6 errors that had not been found previously. now of course they
also missed some of the errors found in the normal workflow,
so it's not as if they were better proofers. they were just _different_.

you'll observe that 2 of these errors -- the second and the sixth --
could have been autodetected. but the others were pretty sneaky,
including a missing comma, a missing word, and 2 stealth scannos.
yet the p3 "marines" missed them, while a parallel p1 caught them.
so much for the notion of "elusive errors only experts can catch"...

bottom-line, though, once again, in yet another d.p. experiment,
the final results are crystal-clear: the hierarchy of proofers over at
distributed proofreaders does _not_ do what it was intended to do.
and the huge backlogs that it _has_ caused could've been avoided,
not to mention all the hard feelings that it has produced in people.

***

here are _6_ errors found by parallel#2, but not p1-p2-p3:

p3> "Why to print our life histories and obituaries
pp> "Why, to print our life histories and obituaries

p3> one passed through the school corridors, and `
pp> one passed through the school corridors, and

p3> "But there are short outs," argued Mr. Cameron.
pp> "But there are short cuts," argued Mr. Cameron.
p3> cast, the sections of stereotype were put
pp> cast, the half sections of stereotype were put

p3> fine articles from patents and distant
pp> fine articles from parents and distant

p3> When the acounts were found to be short,
pp> When the accounts were found to be short,

and here is one more, not found by p1-p2-p3 or pp:

p3> "'Thanks be to God, Hallelujah!'
pp> "'Thanks be to God, Hallelujah!'
me> "'Thanks be to God, Hallelujah!'"

i won't get into a "debate" about whether these are "errors",
but here are some p-book words _i_ felt should be "fixed", even
if _some_ people out there might have left them as is:

> skilful to skillful p#73 and p#92
> marvellous to marvelous p#93 (p-book inconsistency)
> sceptical to skeptical p#130
> smooths to smoothes p#182
> signalled to signaled p#190

if you _do_ consider these as "errors", then p3 left 12 total.
whether you want to call it 6 or 12, it's clear p3 ain't perfect,
so those people who are looking for "perfection" from d.p.
have a problem on their hands, in that even _three_ rounds
of proofing is not delivering it in the present circumstances.

***

in addition to outright errors, we have a lot of _questions_...

here's a non-compound-word that wasn't asterisk-noted:
> manuscripts, and many a one is marred by mis-spelling

i was unsure about these two compound words:
> Paul had had time to become really down-hearted,
> "An honest blunder is one thing; but pre-meditated

here are others (maybe errors, maybe not) i _did_ change:
> scarfpins to scarf-pins p#65
> under-classman to underclassman p#187

i also changed a bunch of "some one" to "someone",
just 'cause it looked better to me. i hope that's right. ;+)

oh yeah, and i fixed "to-day", "to-morrow", "to-night", etc.

i also eliminated those silly-looking characters from words like
alumnae, caesar, naively, papier-mache, resume, role, and so on,
just because it makes the europeans so mad when you do that...
(and sorry, albrecht durer, but i had to do that to your name too.)
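[editor's note: the diacritic-stripping described above ("eliminated those silly-looking characters") can be sketched in a few lines of Python with the standard unicodedata module. this is an illustrative reconstruction, not the poster's actual tool; the function name and examples are the editor's own.]

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Fold accented characters to plain ASCII, e.g. 'resume' with accents -> 'resume'.

    NFKD decomposition splits each accented letter into a base letter
    plus combining marks; dropping the marks leaves the ASCII base.
    Ligatures such as 'æ' have no decomposition, so map them explicitly.
    """
    text = text.replace("æ", "ae").replace("Æ", "Ae")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("naïvely"))        # naively
print(strip_diacritics("papier-mâché"))   # papier-mache
print(strip_diacritics("alumnæ"))         # alumnae
```

whether one *should* do this is exactly the disagreement aired above; the sketch only shows that the transformation itself is mechanical.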
i wasn't sure about this one, so i left it as it was:
> silician queen p#44

and here's another one that has me thoroughly confused:
> elaborate productions of a printing age, ecclesiasties p#92

finally, here's a funny word, to amuse you:
> spondulics p#9

***

so -- all in all -- we've got around _two_dozen_ instances of
questionable items there, which is about par for the course
on a 200+page book, i'd say. so that needs to be taken into
_account_ when we wanna talk about "achieving perfection".

any time you're dealing with language, it ain't cut-and-dried.
there are a lot of decisions in any book that can go either way.
(of course, any d.p. people who've post-processed know this.)

so it's all well and good to talk about "removing all the errors",
but we need to realize that at some point, that devolves into a
never-ending conversation about what _constitutes_ an error.

but, by the way, it's extremely easy for me to tell you _what_
constitutes an error in _my_ book -- if, when you bring it to
my attention, i _change_ it, then you have found "an error"...
if i don't change it, then you have not found an error.

of course, you are free to differ with my opinion. who cares?

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/d5272268/attachment.htm
From Bowerbird at aol.com Fri Mar 28 16:22:26 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 28 Mar 2008 19:22:26 EDT
Subject: [gutvol-d] parallel -- paul and the printing press -- 12
Message-ID:

so, were you surprised to learn that the normal p1 proofers in
the parallel proofing test of "paul and the printing press" took
6,360+ lines to perfection, with only _161_ not perfect?
that's over a 97.5% accuracy-rating on _lines_ (not words or
characters, which are the typical units of measure for that)...

but if you are surprised by that high accuracy, you haven't
been paying attention, because p1 regularly does that well.

in fact, sometimes one gets accuracy that good out of o.c.r.!
i did a full-on analysis you can find in the d.p. forums --
search for "a revolutionary proofing methodology" -- where
i found the o.c.r. from the open content alliance got just
_57_ lines incorrect in a book with 8,000+ lines! (and the
google o.c.r. on a different physical copy of a slightly
different version had all but 300 lines correct.)

face it, because of the bad scans and the use of tesseract,
on a lot of the lines in this book, the p1 proofers acted as
"the human o.c.r. program". and they had great accuracy!

again, p1 made 1,750 changes, p2 just 122, p3 just 39...
and with decent scans and decent o.c.r., the results might
have been p1 with 300 changes, p2 with 30, p3 with 3...

***

and even though a mere 161 imperfect lines out of 6,500+
is an amazingly high rate of quality, closer analysis of those
161 lines suggests that many could have been autodetected,
meaning they should've been fixed during _preprocessing_,
before they were ever even presented to volunteers to proof.

it also suggests that p1's "failure" to take these 161 lines to
a state of perfection can be ignored, since _postprocessing_
would easily find and fix the errors the proofers left behind.

but whichever it was -- preprocessing or postprocessing --
it's clear _many_ of the bad 161 lines could've been caught.

how many? well, by my count, all but _37_ could've been
autodetected. (my earlier guess of _40_ ended up being
pretty accurate.)
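[editor's note: per-round change counts like the p1/p2/p3 figures above can be computed by diffing successive round outputs line by line. a minimal sketch using Python's difflib follows; the function name and sample lines (taken from the error listing in this thread) are the editor's own, not d.p. tooling.]

```python
import difflib

def changed_lines(before: list[str], after: list[str]) -> int:
    """Count lines touched between two rounds of a text, using
    SequenceMatcher opcodes (replace/delete/insert spans)."""
    sm = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changed = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            # count whichever side of the changed span is larger
            changed += max(i2 - i1, j2 - j1)
    return changed

round_n = ["To paralyze the Caesars -- and to stike",
           "the Fire-eater! Have a copy of the Jabbermock!",
           "firm of George L. Kimball and from Dalrymple"]
round_n1 = ["To paralyze the Caesars -- and to strike",
            "the Fire-eater! Have a copy of the Jabberwock!",
            "firm of George L. Kimball and from Dalrymple"]

n = changed_lines(round_n, round_n1)
accuracy = 100.0 * (len(round_n) - n) / len(round_n)
print(n, f"{accuracy:.1f}%")  # 2 33.3%
```

the same count over a whole book's page texts gives exactly the "p1 made 1,750 changes" style of figure quoted above.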
i've appended my categorization of 'em, and also put it here:
> http://z-m-l.com/go/paulp/paul-161-categorize.html

these difficult-to-detect 37 break down like this:
-> stealth scannos, 15
-> missing words, 13
-> punctuation problems, 9

some punctuation problems are easy to autodetect, such as a
sentence-terminating period not followed by a capital letter.
other punctuation problems are almost impossible to detect,
such as the speck-induced phantom comma where a real one
would not be totally inappropriate, according to the content...

missing words are also extremely hard to detect automatically.
stealth scannos, of course, are the prototype of hard-to-detect.

but still, the fact that autodetection could have fixed _many_
of the lines on which p1 "failed", so that a mere _37_ are left
which are not perfect -- out of 6,500+ lines in this book --
indicates unequivocally that p1 is doing some kick-ass work.
these p1 proofers deserve far more acclaim than they receive.

i'm not done yet, but i'll give you a break over the weekend... ;+)

-bowerbird

> http://z-m-l.com/go/paulp/paul-161-categorize.html

for the d.p. parallel-proofing experiment with "paul and the
printing press", analysis showed that after p1, only 161 lines
were later changed by p2 and p3. of these 161 imperfect lines,
124 could easily be autodetected, and 37 could not. this
indicates the p1 proofers could have taken this book even
closer to perfection, with a mere _37_ lines being incorrect
after a combination of p1 and autodetection...
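[editor's note: the "easy to autodetect" classes named above -- a sentence-terminating period not followed by a capital, unbalanced quotemarks, impossible punctuation clusters -- can be sketched as simple pattern tests. a hypothetical illustration (function name and patterns are the editor's, not d.p.'s actual checks); it flags suspects for a human, rather than fixing anything, since abbreviations like "Mr." or "etc." can false-positive.]

```python
import re

def autodetect_suspects(line: str) -> list[str]:
    """Flag a line for the easy-to-autodetect error classes."""
    flags = []
    # unbalanced double quotes, e.g. a dropped opening quote on dialogue
    if line.count('"') % 2 == 1:
        flags.append("unbalanced-quotes")
    # sentence terminator followed by a lowercase letter, e.g. "Kipper. we'll"
    if re.search(r'[.!?] [a-z]', line):
        flags.append("lowercase-after-terminator")
    # impossible punctuation clusters, e.g. '.:' as in 'room.:"'
    if re.search(r'[.!?,;:][,;:]', line):
        flags.append("punctuation-cluster")
    return flags

for line in ["Kipper. we'll see what we can do toward",
             'them would fill a room.:"',
             'The better way to go at such an undertaking,"']:
    print(line, "->", autodetect_suspects(line))
```

run over the 161 imperfect lines, checks of roughly this shape are what would catch the 124 "autodetectable" cases tallied in the categorization.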
------------------------------------------------------------------ of the 161 imperfect lines left by p1, 124 could be autodetected for easy fixing: 029 -- spellcheck -- easy to autodetect -- n=29 019 -- quotemarks, unbalanced or inappropriate -- easy to autodetect -- n=19 026 -- letter-casing -- easy to autodetect -- n=26 010 -- dehyphenation -- should be done automatically -- n=10 005 -- diacritic nonsense -- we don't need no high-bit characters -- n=5 002 -- preprocessing changes that should be standard policy -- n=2 015 -- hyphenation and em-dash -- don't count against proofers -- n=15 018 -- punctuation impossibilities -- can be autodetected -- n=18 ----- 124 lines that could be easily autodetected. of the 161 imperfect lines left by p1, 37 would be difficult to autodetect: 015 -- stealth scannos -- hard to detect -- n=15 013 -- missing/excess words -- hard to detect -- n=13 009 -- punctuation errors that are not impossibilities -- hard to detect -- n=9 ---- 037 lines that could _not_ be easily autodetected. ------------------------------------------------------------------ spellcheck -- easy to autodetect -- n=29 002 -- 008.png -- =p1=> To paralyze the Caesars -- and to stike 002 -- 008.png -- =p3=> To paralyze the Caesars -- and to strike 002 -- 008.png -- diff> ====================================^^^ 009 -- 025.png -- =p1=> the Fire-eater! Have a copy of the Jabbermock! 009 -- 025.png -- =p3=> the Fire-eater! Have a copy of the Jabberwock! 009 -- 025.png -- diff> =========================================^==== 011 -- 026.png -- =p1=> "The March Hare!" he repeated wlth enthusiasm. 011 -- 026.png -- =p3=> "The March Hare!" he repeated with enthusiasm. 011 -- 026.png -- diff> ===============================^============== 017 -- 042.png -- =p1=> firm of George L. Kirnball and from Dalrymple 017 -- 042.png -- =p3=> firm of George L. 
Kimball and from Dalrymple 017 -- 042.png -- diff> ====================^^^^=^^^^^^^^^^^^^^^^^^^^ 025 -- 050.png -- =p1=> "Thus, you see, was the eopyist forced to 025 -- 050.png -- =p3=> "Thus, you see, was the copyist forced to 025 -- 050.png -- diff> ========================^================ 037 -- 068.png -- =p1=> "Yes, and not only were the first manuseripts 037 -- 068.png -- =p3=> "Yes, and not only were the first manuscripts 037 -- 068.png -- diff> =======================================^===== 043 -- 079.png -- =p1=> else in the paper. Sorne thought more 043 -- 079.png -- =p3=> else in the paper. Some thought more 043 -- 079.png -- diff> =====================^^^^^^^^^^^^^^^^ 052 -- 084.png -- =p1=> impulse is a very seliish one," said his father. 052 -- 084.png -- =p3=> impulse is a very selfish one," said his father. 052 -- 084.png -- diff> =====================^========================== 053 -- 089.png -- =p1=> and Diamonds for the more prosperous ` 053 -- 089.png -- =p3=> and Diamonds for the more prosperous 053 -- 089.png -- diff> ====================================^^ 044 -- 080.png -- =p1=> smoothed away his objectious until, upon a 044 -- 080.png -- =p3=> smoothed away his objections until, upon a 044 -- 080.png -- diff> ==========================^=============== 045 -- 080.png -- =p1=> body of workers hnally stood shoulder to shoulder, 045 -- 080.png -- =p3=> body of workers finally stood shoulder to shoulder, 045 -- 080.png -- diff> ================^^^^=^^^^^=^^^^^^^^^^^^^^^^^^^^^^^ 046 -- 080.png -- =p1=> finer and more efiicient. It was, as Paul 046 -- 080.png -- =p3=> finer and more efficient. 
It was, as Paul 046 -- 080.png -- diff> =================^======================= 048 -- 081.png -- =p1=> Into Pau1's editorial sanctum articles from 048 -- 081.png -- =p3=> Into Paul's editorial sanctum articles from 048 -- 081.png -- diff> ========^================================== 049 -- 082.png -- =p1=> various sources one number after another of ` 049 -- 082.png -- =p3=> various sources one number after another of 049 -- 082.png -- diff> ===========================================^^ 050 -- 084.png -- =p1=> like to write up fires and aceidents and wear a 050 -- 084.png -- =p3=> like to write up fires and accidents and wear a 050 -- 084.png -- diff> =============================^================= 084 -- 156.png -- =p1=> Paul. Page 13T. 084 -- 156.png -- =p3=> Paul. Page 137. 084 -- 156.png -- diff> =============^= 096 -- 179.png -- =p1=> visit to a big newspaper offfice Saturday evening 096 -- 179.png -- =p3=> visit to a big newspaper office Saturday evening 096 -- 179.png -- diff> ============================^^^^^^^^^^^^^^^^^^^^^ 097 -- 181.png -- =p1=> you must remember that it was especially diffcult 097 -- 181.png -- =p3=> you must remember that it was especially difficult 097 -- 181.png -- diff> =============================================^^^^ 098 -- 182.png -- =p1=> "So, son," concluded Mr. wright, "you've 098 -- 182.png -- =p3=> "So, son," concluded Mr. Wright, "you've 098 -- 182.png -- diff> =========================^============== 099 -- 182.png -- =p1=> approve of the fity-dollar bill which at that 099 -- 182.png -- =p3=> approve of the fifty-dollar bill which at that 099 -- 182.png -- diff> =================^^^^^^=^^^^^^=^^^^^^^^^^^^^^ 110 -- 193.png -- =p1=> process and know how the brst printing 110 -- 193.png -- =p3=> process and know how the first printing 110 -- 193.png -- diff> =========================^^^^^^^^^^^^^ 112 -- 196.png -- =p1=> of each shelf classined and marked." 112 -- 196.png -- =p3=> of each shelf classified and marked." 
112 -- 196.png -- diff> ====================^^^^^^^^^^^^^^^^ 113 -- 200.png -- =p1=> They had now reached the lowest Hoor and 113 -- 200.png -- =p3=> They had now reached the lowest floor and 113 -- 200.png -- diff> ================================^^=^^^^^ 117 -- 204.png -- =p1=> little chap over there by the bre hangs our 117 -- 204.png -- =p3=> little chap over there by the fire hangs our 117 -- 204.png -- diff> ==============================^^^^^^^^^^^^^ 131 -- 214.png -- =p1=> Deeker, rolling his eyes up to the ceiling with 131 -- 214.png -- =p3=> Decker, rolling his eyes up to the ceiling with 131 -- 214.png -- diff> ==^============================================ 142 -- 226.png -- =p1=> wont, in unselhsh fashion, to let every one else 142 -- 226.png -- =p3=> wont, in unselfish fashion, to let every one else 142 -- 226.png -- diff> ==============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 151 -- 233.png -- =p1=> delivered was clicked offon Mr. Carter's typewriter 151 -- 233.png -- =p3=> delivered was clicked off on Mr. Carter's typewriter 151 -- 233.png -- diff> =========================^^^^^^^^^^^^^^^^^^^^^^^^^^ 155 -- 236.png -- =p1=> Carneron was a big enough man to be forgiving. 155 -- 236.png -- =p3=> Cameron was a big enough man to be forgiving. 155 -- 236.png -- diff> ==^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 158 -- 237.png -- =p1=> and joy to the crowning event of l920's 158 -- 237.png -- =p3=> and joy to the crowning event of 1920's 158 -- 237.png -- diff> =================================^===== quotemarks, unbalanced or inappropriate -- easy to autodetect -- n=19 005 -- 018.png -- =p1=> The better way to go at such an undertaking," 005 -- 018.png -- =p3=> "The better way to go at such an undertaking," 005 -- 018.png -- diff> ^^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 007 -- 023.png -- =p1=> asserted at length. " But the ducats -- where 007 -- 023.png -- =p3=> asserted at length. 
"But the ducats -- where 007 -- 023.png -- diff> =====================^^^^^^^^^^^^^^^^=^^^^^^^ 013 -- 033.png -- =p1=> back a step or two. " I couldn't, Kip. Don't 013 -- 033.png -- =p3=> back a step or two. "I couldn't, Kip. Don't 013 -- 033.png -- diff> =====================^^^^^^^^^^^^^^^^^^^^^^^ 015 -- 036.png -- =p1=> So you're Paul Cameron. I've had dealings 015 -- 036.png -- =p3=> "So you're Paul Cameron. I've had dealings 015 -- 036.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 018 -- 043.png -- =p1=> the Echo?"' 018 -- 043.png -- =p3=> the Echo?" 018 -- 043.png -- diff> ==========^ 019 -- 045.png -- =p1=> "Oh, it's not that," said Paul quickly. " We 019 -- 045.png -- =p3=> "Oh, it's not that," said Paul quickly. "We 019 -- 045.png -- diff> =========================================^^^ 020 -- 046.png -- =p1=> People didn't always use to have paper, 020 -- 046.png -- =p3=> "People didn't always use to have paper, 020 -- 046.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 024 -- 049.png -- =p1=> "Thanks be to God, Hallelujah!' 024 -- 049.png -- =p3=> "'Thanks be to God, Hallelujah!' 024 -- 049.png -- diff> =^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^ 029 -- 053.png -- =p1=> Paul waited an instant, then added dryly: " In 029 -- 053.png -- =p3=> Paul waited an instant, then added dryly: "In 029 -- 053.png -- diff> ===========================================^^^ 058 -- 094.png -- =p1=> in years!" ejaculated the postmaster. " Seems 058 -- 094.png -- =p3=> in years!" ejaculated the postmaster. "Seems 058 -- 094.png -- diff> =======================================^^=^^^ 105 -- 189.png -- =p1=> surface.' 105 -- 189.png -- =p3=> surface." 105 -- 189.png -- diff> ========^ 119 -- 204.png -- =p1=> we ought to pay more for our newspapers.' 119 -- 204.png -- =p3=> we ought to pay more for our newspapers." 
119 -- 204.png -- diff> ========================================^ 130 -- 213.png -- =p1=> he heard himself saying, " I'd call it a beastly 130 -- 213.png -- =p3=> he heard himself saying, "I'd call it a beastly 130 -- 213.png -- diff> ==========================^^^^^^^=^^^^^^^^^^^^^^ 132 -- 214.png -- =p1=> "Nothing! 'Cut it out, that's all." 132 -- 214.png -- =p3=> "Nothing! Cut it out, that's all." 132 -- 214.png -- diff> ==========^^^^^^^^^^^^^^^^^^^^^=^^^ 133 -- 215.png -- =p1=> with the boy?' 133 -- 215.png -- =p3=> with the boy? 133 -- 215.png -- diff> =============^ 139 -- 222.png -- =p1=> Mr. Carter -- " you were just right, son. The 139 -- 222.png -- =p3=> Mr. Carter -- "you were just right, son. The 139 -- 222.png -- diff> ===============^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 140 -- 224.png -- =p1=> "How are you, old man,' Paul called jubilantly. 140 -- 224.png -- =p3=> "How are you, old man," Paul called jubilantly. 140 -- 224.png -- diff> ======================^======================== 149 -- 231.png -- =p1=> Paul. " But it's all right now. The 149 -- 231.png -- =p3=> Paul. "But it's all right now. The 149 -- 231.png -- diff> =======^^^^^^^^^^^=^^^^^^^^^^^^^^^^ 157 -- 236.png -- =p1=> Cameron.' 157 -- 236.png -- =p3=> Cameron." 157 -- 236.png -- diff> ========^ letter-casing -- easy to autodetect -- n=26 006 -- 022.png -- =p1=> "Say, Cart, what do you think of '20 Starting 006 -- 022.png -- =p3=> "Say, Cart, what do you think of '20 starting 006 -- 022.png -- diff> =====================================^======= 012 -- 028.png -- =p1=> Kipper. we'll see what we can do toward 012 -- 028.png -- =p3=> Kipper. 
We'll see what we can do toward 012 -- 028.png -- diff> ========^============================== 022 -- 049.png -- =p1=> the patient Workers were so glad when their 022 -- 049.png -- =p3=> the patient workers were so glad when their 022 -- 049.png -- diff> ============^============================== 027 -- 052.png -- =p1=> the great objection to this method was that several 027 -- 052.png -- =p3=> The great objection to this method was that several 027 -- 052.png -- diff> ^================================================== 041 -- 079.png -- =p1=> Was quite an eye opener! A paper for general 041 -- 079.png -- =p3=> was quite an eye opener! A paper for general 041 -- 079.png -- diff> ^=========================================== 042 -- 079.png -- =p1=> Burmingham. There Was actually something 042 -- 079.png -- =p3=> Burmingham. There was actually something 042 -- 079.png -- diff> ==================^===================== 054 -- 090.png -- =p1=> was one of the later and most skilful Woodcut 054 -- 090.png -- =p3=> was one of the later and most skilful woodcut 054 -- 090.png -- diff> ======================================^====== 066 -- 131.png -- =p1=> what was to be done? 066 -- 131.png -- =p3=> What was to be done? 066 -- 131.png -- diff> ^=================== 067 -- 132.png -- =p1=> I can't understand it. we haven't branched 067 -- 132.png -- =p3=> I can't understand it. We haven't branched 067 -- 132.png -- diff> =======================^================== 061 -- 111.png -- =p1=> or enamel. As time Went on and the religious 061 -- 111.png -- =p3=> or enamel. As time went on and the religious 061 -- 111.png -- diff> ===================^======================== 063 -- 117.png -- =p1=> cultured nation. By no means. what I mean 063 -- 117.png -- =p3=> cultured nation. By no means. 
What I mean 063 -- 117.png -- diff> ==============================^========== 065 -- 120.png -- =p1=> "Typewriters Come at all prices," his father 065 -- 120.png -- =p3=> "Typewriters come at all prices," his father 065 -- 120.png -- diff> =============^============================== 069 -- 134.png -- =p1=> "Something's fussing you. what is it?" 069 -- 134.png -- =p3=> "Something's fussing you. What is it?" 069 -- 134.png -- diff> ==========================^=========== 070 -- 135.png -- =p1=> Bond" was converted into cash; Paul'S typewriter 070 -- 135.png -- =p3=> Bond" was converted into cash; Paul's typewriter 070 -- 135.png -- diff> ====================================^=========== 088 -- 161.png -- =p1=> the machine's myriad advantages. wasn't it 088 -- 161.png -- =p3=> the machine's myriad advantages. Wasn't it 088 -- 161.png -- diff> =================================^======== 089 -- 162.png -- =p1=> March Hare Would branch out and be made 089 -- 162.png -- =p3=> March Hare would branch out and be made 089 -- 162.png -- diff> ===========^=========================== 092 -- 168.png -- =p1=> largest industries. we cannot do without 092 -- 168.png -- =p3=> largest industries. We cannot do without 092 -- 168.png -- diff> ====================^=================== 093 -- 173.png -- =p1=> school, and all the Web of circumstances in 093 -- 173.png -- =p3=> school, and all the web of circumstances in 093 -- 173.png -- diff> ====================^====================== 095 -- 177.png -- =p1=> a press Was built up Which is so intricate and 095 -- 177.png -- =p3=> a press was built up which is so intricate and 095 -- 177.png -- diff> ========^============^======================== 104 -- 188.png -- =p1=> have the main idea and When I see the thing in 104 -- 188.png -- =p3=> have the main idea and when I see the thing in 104 -- 188.png -- diff> =======================^====================== 107 -- 190.png -- =p1=> "I See" 107 -- 190.png -- =p3=> "I see." 
107 -- 190.png -- diff> ===^==^ 111 -- 194.png -- =p1=> "I See." 111 -- 194.png -- =p3=> "I see." 111 -- 194.png -- diff> ===^==== 115 -- 201.png -- =p1=> during the war," Stammered Paul. 115 -- 201.png -- =p3=> during the war," stammered Paul. 115 -- 201.png -- diff> =================^============== 118 -- 204.png -- =p1=> and Paul Smiled in return. 118 -- 204.png -- =p3=> and Paul smiled in return. 118 -- 204.png -- diff> =========^================ 148 -- 229.png -- =p1=> wretchedly. "That's what'S got me fussed. 148 -- 229.png -- =p3=> wretchedly. "That's what's got me fussed. 148 -- 229.png -- diff> =========================^=============== 150 -- 232.png -- =p1=> that money. It's caused too much Worry already." 150 -- 232.png -- =p3=> that money. It's caused too much worry already." 150 -- 232.png -- diff> =================================^============== dehyphenation -- should be done automatically -- easy to detect -- n=10 008 -- 023.png -- =p1=> "I suppose we couldn't buy a press secondhand 008 -- 023.png -- =p3=> "I suppose we couldn't buy a press second-hand 008 -- 023.png -- diff> =========================================^^^^ 034 -- 065.png -- =p1=> to what methods you resorted to win these con- 034 -- 065.png -- =p3=> to what methods you resorted to win these concessions 034 -- 065.png -- diff> =============================================^ 035 -- 065.png -- =p1=> cessions from these stern-purposed gentlemen. 035 -- 065.png -- =p3=> from these stern-purposed gentlemen. 035 -- 065.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 030 -- 055.png -- =p1=> "Mr. Carter said Judge Damon was an ex- 030 -- 055.png -- =p3=> "Mr. Carter said Judge Damon was an expert 030 -- 055.png -- diff> ======================================^ 031 -- 055.png -- =p1=> pert on international law," explained Paul. 031 -- 055.png -- =p3=> on international law," explained Paul. 
031 -- 055.png -- diff> ^^^^^^^^^^=^^==^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 040 -- 075.png -- =p1=> ways at liberty to send contributions back with 040 -- 075.png -- =p3=> at liberty to send contributions back with 040 -- 075.png -- diff> ^^^^^^^^^^^^^^^^^^=^^=^^^^^=^^^^^^^^^=^^^^^^^^^ 125 -- 210.png -- =p1=> he wanted to sell them. Father said so. Be 125 -- 210.png -- =p3=> he wanted to sell them. Father said so. Besides, 125 -- 210.png -- diff> ========================================== 126 -- 210.png -- =p1=> sides, what's to become of 1921 if you sell out 126 -- 210.png -- =p3=> what's to become of 1921 if you sell out 126 -- 210.png -- diff> ^^^^^^=^^^^^^^^^=^^^^^^^^^^^^^^=^^^^^^^^^^^^^^^ 127 -- 212.png -- =p1=> "What else could we sell it out for, fathead?" 127 -- 212.png -- =p3=> "What else could we sell it out for, fat-head?" 127 -- 212.png -- diff> ========================================^^^^^^ 156 -- 236.png -- =p1=> "An honest blunder is one thing; but premeditated 156 -- 236.png -- =p3=> "An honest blunder is one thing; but pre-meditated 156 -- 236.png -- diff> ========================================^^^^^^^^^ diacritic nonsense -- we don't need no high-bit characters -- easy to detect -- n=5 047 -- 081.png -- =p1=> manager; the alumnae, now scattered in 047 -- 081.png -- =p3=> manager; the alumnæ, now scattered in 047 -- 081.png -- diff> ==================^^^^^^^^^^^=^^^^^^^^ 062 -- 114.png -- =p1=> at all. They get a scenario or resume of the 062 -- 114.png -- =p3=> at all. They get a scenario or résumé
of the 062 -- 114.png -- diff> ================================^===^======= 072 -- 140.png -- =p1=> the contrary it naively confessed that it was 072 -- 140.png -- =p3=> the contrary it naïvely confessed that it was 072 -- 140.png -- diff> ==================^========================== 109 -- 192.png -- =p1=> cardboard, a sort of papier-mache, and by forcing 109 -- 192.png -- =p3=> cardboard, a sort of papier-maché, and by forcing 109 -- 192.png -- diff> ================================^================ 120 -- 205.png -- =p1=> alumnae. Judge Damon had taken to contributing 120 -- 205.png -- =p3=> alumnæ. Judge Damon had taken to contributing 120 -- 205.png -- diff> =====^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ preprocessing changes that should be standard policy -- easy to detect -- n=2 079 -- 154.png -- =p1=> "We'll talk no more about this matter today," 079 -- 154.png -- =p3=> "We'll talk no more about this matter to-day," 079 -- 154.png -- diff> ========================================^^^^^ 138 -- 218.png -- =p1=> only that he dreaded... The knob turned 138 -- 218.png -- =p3=> only that he dreaded.... The knob turned 138 -- 218.png -- diff> =======================^^^^^^^^^^^^^^^^ hyphenation and em-dash escapades -- i won't count these against proofers -- easy to detect -- n=15 036 -- 065.png -- =p1=> "The judge, for example-I can't imagine 036 -- 065.png -- =p3=> "The judge, for example -- I can't imagine 036 -- 065.png -- diff> =======================^^^^^^^^^^^^^^^^ 068 -- 134.png -- =p1=> "Could you manage it-fifty dollars?" 068 -- 134.png -- =p3=> "Could you manage it -- fifty dollars?" 068 -- 134.png -- diff> ====================^^^^^^^^^^^^^^^^ 073 -- 144.png -- =p1=> was no easy task. It was a thankless job, anywy 073 -- 144.png -- =p3=> was no easy task.
It was a thankless job, anyway -- the 073 -- 144.png -- diff> ==============================================^ 074 -- 144.png -- =p1=> -- the least interesting of any of the positions 074 -- 144.png -- =p3=> least interesting of any of the positions 074 -- 144.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^=^====^^^=^^^^^^^^^^^^^^^^ 100 -- 185.png -- =p1=> Paul had had time to become really downhearted, 100 -- 185.png -- =p3=> Paul had had time to become really down-hearted, 100 -- 185.png -- diff> =======================================^^^^^^^^ 128 -- 212.png -- =p1=> "But -- to sell it out for cash, as it stands -- 128 -- 212.png -- =p3=> "But -- to sell it out for cash, as it stands -- you 128 -- 212.png -- diff> ================================================ 129 -- 212.png -- =p1=> you mean that?" 129 -- 212.png -- =p3=> mean that?" 129 -- 212.png -- diff> ^^^^^^^^^^^^^^^ 134 -- 217.png -- =p1=> be confessing that he had failed in his mission, 134 -- 217.png -- =p3=> be confessing that he had failed in his mission, -- nay, 134 -- 217.png -- diff> ================================================ 135 -- 217.png -- =p1=> -- nay, worse than that, that he had not even 135 -- 217.png -- =p3=> worse than that, that he had not even 135 -- 217.png -- diff> ^^^^^^^^^^^^^^=^^^^^^^^^=^^^^^^^=^^^^^^^^^^^^ 136 -- 217.png -- =p1=> come, something within him had leaped into being, 136 -- 217.png -- =p3=> come, something within him had leaped into being, -- something 136 -- 217.png -- diff> ================================================= 137 -- 217.png -- =p1=> -- something that had automatically prevented 137 -- 217.png -- =p3=> that had automatically prevented 137 -- 217.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 146 -- 228.png -- =p1=> me to deposit some money in the bank for him 146 -- 228.png -- =p3=> me to deposit some money in the bank for him -- a 146 -- 228.png -- diff> ============================================ 147 -- 228.png -- =p1=> -- a hundred-dollar 
bill. I put the envelope in 147 -- 228.png -- =p3=> hundred-dollar bill. I put the envelope in 147 -- 228.png -- diff> ^^^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^ 160 -- 238.png -- =p1=> when weary, sleepy, but triumphant, a half 160 -- 238.png -- =p3=> when weary, sleepy, but triumphant, a half-jubilant, 160 -- 238.png -- diff> ========================================== 161 -- 238.png -- =p1=> jubilant, half-sorrowful lot of girls and boys 161 -- 238.png -- =p3=> half-sorrowful lot of girls and boys 161 -- 238.png -- diff> ^^^^^^^^^^^^^^^^=^^=^^^^^=^^^^^=^^^^^^^^^^^^^^ punctuation impossibilities -- bad constructions -- easy to detect -- n=18 001 -- 007.png -- =p1=> Copyright, 1920 001 -- 007.png -- =p3=> Copyright, 1920, 001 -- 007.png -- diff> =============== 010 -- 025.png -- =p1=> Hare it is! We"ll begin getting subscriptions 010 -- 025.png -- =p3=> Hare it is! We'll begin getting subscriptions 010 -- 025.png -- diff> ==============^============================== 039 -- 072.png -- =p1=> them would fill a room.:" 039 -- 072.png -- =p3=> them would fill a room." 039 -- 072.png -- diff> =======================^^ 055 -- 090.png -- =p1=> woodcut was to art -- simple, direct, appealing" 055 -- 090.png -- =p3=> woodcut was to art -- simple, direct, appealing." 055 -- 090.png -- diff> ===============================================^ 059 -- 096.png -- =p1=> John Gutenburg,a native of Strasburg, who 059 -- 096.png -- =p3=> John Gutenburg, a native of Strasburg, who 059 -- 096.png -- diff> ===============^^^^^^^^^^^^^^^^^^^^^^^^^^ 071 -- 138.png -- =p1=> a patronizing scorn, For a press of the Echo's 071 -- 138.png -- =p3=> a patronizing scorn. For a press of the Echo's 071 -- 138.png -- diff> ===================^========================== 075 -- 147.png -- =p1=> "How is your paper coming on, Paul?," he 075 -- 147.png -- =p3=> "How is your paper coming on, Paul?" 
he 075 -- 147.png -- diff> ===================================^^^^^ 076 -- 150.png -- =p1=> "B -- u -- t-" stammered Paul and then 076 -- 150.png -- =p3=> "B -- u -- t -- " stammered Paul and then 076 -- 150.png -- diff> ============^^^^^^^^^^^^^^^^^^^^^^^^^^ 077 -- 153.png -- =p1=> "I -- I-" faltered Paul. 077 -- 153.png -- =p3=> "I -- I -- " faltered Paul. 077 -- 153.png -- diff> =======^^^^^^^^^^^^^^^^^ 078 -- 153.png -- =p1=> "I don't quite-" 078 -- 153.png -- =p3=> "I don't quite -- " 078 -- 153.png -- diff> ==============^^ 080 -- 155.png -- =p1=> fifty-dollar bond I have" 080 -- 155.png -- =p3=> fifty-dollar bond I have." 080 -- 155.png -- diff> ========================^ 081 -- 155.png -- =p1=> it." t 081 -- 155.png -- =p3=> it." 081 -- 155.png -- diff> ====^^ 082 -- 155.png -- =p1=> Mr. Carter winked 082 -- 155.png -- =p3=> Mr. Carter winked. 082 -- 155.png -- diff> ================= 083 -- 155.png -- =p1=> "I see," he said 083 -- 155.png -- =p3=> "I see," he said. 083 -- 155.png -- diff> ================ 085 -- 158.png -- =p1=> prefer, A loan with a bond for security is 085 -- 158.png -- =p3=> prefer. A loan with a bond for security is 085 -- 158.png -- diff> ======^=================================== 086 -- 158.png -- =p1=> :But -- " 086 -- 158.png -- =p3=> "But -- " 086 -- 158.png -- diff> ^======== 091 -- 165.png -- =p1=> quantities of paper," answered his father; 091 -- 165.png -- =p3=> quantities of paper," answered his father. 
091 -- 165.png -- diff> =========================================^ 123 -- 207.png -- =p1=> his classmates to earn it, -- -for earn it he must, 123 -- 207.png -- =p3=> his classmates to earn it, -- for earn it he must, 123 -- 207.png -- diff> ==============================^^^^^^^^^^^^^^^^^^^^^ stealth scannos -- hard to detect -- n=15 003 -- 017.png -- =p1=> "Enough to till a good-sized daily, I should 003 -- 017.png -- =p3=> "Enough to fill a good-sized daily, I should 003 -- 017.png -- diff> ===========^================================ 004 -- 018.png -- =p1=> expensive piece of property, my son," he relied. 004 -- 018.png -- =p3=> expensive piece of property, my son," he replied. 004 -- 018.png -- diff> ===========================================^^^^^ 033 -- 063.png -- =p1=> the judge mischievously. "It you boys propose 033 -- 063.png -- =p3=> the judge mischievously. "If you boys propose 033 -- 063.png -- diff> ===========================^================= 087 -- 160.png -- =p1=> Paul lingered the bill nervously. Fifty dollars! 087 -- 160.png -- =p3=> Paul fingered the bill nervously. Fifty dollars! 
087 -- 160.png -- diff> =====^========================================== 090 -- 164.png -- =p1=> money and government notes are line examples 090 -- 164.png -- =p3=> money and government notes are fine examples 090 -- 164.png -- diff> ===============================^============ 094 -- 176.png -- =p1=> press rooms for striking oil proof when the 094 -- 176.png -- =p3=> press rooms for striking off proof when the 094 -- 176.png -- diff> ==========================^^=============== 103 -- 187.png -- =p1=> This east is then fitted upon the rollers 103 -- 187.png -- =p3=> This cast is then fitted upon the rollers 103 -- 187.png -- diff> =====^=================================== 106 -- 190.png -- =p1=> a small space allowed it; N, too, is not much in 106 -- 190.png -- =p3=> a small space allowed it; X, too, is not much in 106 -- 190.png -- diff> ==========================^===================== 108 -- 192.png -- =p1=> large metal sections that lit on the two halves of 108 -- 192.png -- =p3=> large metal sections that fit on the two halves of 108 -- 192.png -- diff> ==========================^======================= 122 -- 206.png -- =p1=> bid good-by to the familiar balls of the school, 122 -- 206.png -- =p3=> bid good-by to the familiar halls of the school, 122 -- 206.png -- diff> ============================^=================== 141 -- 225.png -- =p1=> hollowing them out and tilling them up again 141 -- 225.png -- =p3=> hollowing them out and filling them up again 141 -- 225.png -- diff> =======================^==================== 143 -- 226.png -- =p1=> loyally refusing to peach on his churns. That 143 -- 226.png -- =p3=> loyally refusing to peach on his chums. That 143 -- 226.png -- diff> ====================================^^^^^^^^^ 145 -- 227.png -- =p1=> "They say there always has to be a fist time. 145 -- 227.png -- =p3=> "They say there always has to be a first time. 
145 -- 227.png -- diff> =====================================^^^^^^^^ 152 -- 234.png -- =p1=> "And I oh yours, Mr. Carter. Melville is a 152 -- 234.png -- =p3=> "And I on yours, Mr. Carter. Melville is a 152 -- 234.png -- diff> ========^================================= 159 -- 237.png -- =p1=> course, the far-tamed March Hare. Its advent 159 -- 237.png -- =p3=> course, the far-famed March Hare. Its advent 159 -- 237.png -- diff> ================^=========================== missing/excess words -- hard to detect -- n=13 014 -- 036.png -- =p1=> of being shrewd, close-fisted, and 014 -- 036.png -- =p3=> reputation of being shrewd, close-fisted, and 014 -- 036.png -- diff> ^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^^ 021 -- 046.png -- =p1=> kings, bishops, and persons of rank could 021 -- 046.png -- =p3=> many kings, bishops, and persons of rank could 021 -- 046.png -- diff> ^^=^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^ 026 -- 051.png -- =p1=> and were sold to of the Church or to 026 -- 051.png -- =p3=> and were sold to dignitaries of the Church or to 026 -- 051.png -- diff> =================^^^^^^^^^^^^^^^^^^^ 028 -- 053.png -- =p1=> ??line missing here...?? 028 -- 053.png -- =p3=> "Yes." 028 -- 053.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^ 038 -- 070.png -- =p1=> I have already explained, care much for reading; 038 -- 070.png -- =p3=> have already explained, care much for reading; 038 -- 070.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^ 051 -- 084.png -- =p1=> a under the ropes." 051 -- 084.png -- =p3=> under the ropes." 
051 -- 084.png -- diff> ^^^^^^^^^^^^^^^^^^^ 056 -- 090.png -- =p1=> is public that desired to read -- which this one did 056 -- 090.png -- =p3=> public that desired to read -- which this one did 056 -- 090.png -- diff> ^^^^^^^^^^=^^^^^^^^^^^=^^^^^^^=^^^^=^^=^^^^^^^^^^^^^ 057 -- 092.png -- =p1=> More than one dignified resident of town struggled 057 -- 092.png -- =p3=> More than one dignified resident of the town struggled 057 -- 092.png -- diff> =====================================^^^^^^^^^^^^^ 064 -- 119.png -- =p1=> author the prey of vultures who 064 -- 119.png -- =p3=> author was the prey of vultures who 064 -- 119.png -- diff> =======^^^=^^=^^^^^^^^^^^^^^^^^ 101 -- 186.png -- =p1=> their days." 101 -- 186.png -- =p3=> their days. "I'm going to take you upstairs 101 -- 186.png -- diff> ===========^ 102 -- 186.png -- =p1=> Mr. Hawley said briskly. "We may 102 -- 186.png -- =p3=> first," Mr. Hawley said briskly. "We may 102 -- 186.png -- diff> ^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^ 116 -- 202.png -- =p1=> publishers." I 116 -- 202.png -- =p3=> publishers." 116 -- 202.png -- diff> ============^^ 124 -- 209.png -- =p1=> "Because -- well-it Would be so yellow," 124 -- 209.png -- =p3=> "Because -- well -- it would be so darn yellow," 124 -- 209.png -- diff> ================^^^=^^^^^^^^=^^=^^^^^^^^ punctuation errors that are not impossibilities -- hard to detect -- n=9 016 -- 038.png -- =p1=> pay too." 016 -- 038.png -- =p3=> pay, too." 016 -- 038.png -- diff> ===^^^=^^ 023 -- 049.png -- =p1=> "This book was illuminated, bound, and 023 -- 049.png -- =p3=> "'This book was illuminated, bound, and 023 -- 049.png -- diff> =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^ 032 -- 059.png -- =p1=> Cameron." Call them up this minute and nail 032 -- 059.png -- =p3=> Cameron. 
"Call them up this minute and nail 032 -- 059.png -- diff> ========^^================================= 060 -- 096.png -- =p1=> was the principle of it is identical with that 060 -- 096.png -- =p3=> was, the principle of it is identical with that 060 -- 096.png -- diff> ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 114 -- 201.png -- =p1=> periodicals, "Mr. Hawley managed to shout 114 -- 201.png -- =p3=> periodicals," Mr. Hawley managed to shout 114 -- 201.png -- diff> ============^^=========================== 121 -- 205.png -- =p1=> and two of Burminghams graduates 121 -- 205.png -- =p3=> and two of Burmingham's graduates 121 -- 205.png -- diff> =====================^^^^^^^^^^^ 144 -- 227.png -- =p1=> the five hundredth-time Don had been caught 144 -- 227.png -- =p3=> the five hundredth -- time Don had been caught 144 -- 227.png -- diff> ==================^^^^^^^^^^^^^^^^^^^^^^^^^ 153 -- 235.png -- =p1=> In fact," he continued, lapsing into seriousness," 153 -- 235.png -- =p3=> In fact," he continued, lapsing into seriousness, 153 -- 235.png -- diff> =================================================^ 154 -- 235.png -- =p1=> the younger generation teaches us 154 -- 235.png -- =p3=> "the younger generation teaches us 154 -- 235.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ------------------------------------------------------------------ ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/21d584e5/attachment-0001.htm From paulmaas at airpost.net Fri Mar 28 19:04:48 2008 From: paulmaas at airpost.net (Paul Maas) Date: Fri, 28 Mar 2008 19:04:48 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" Message-ID: <1206756288.4463.1244898735@webmail.messagingengine.com> A break from the bowerbird mass flood: http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - One of many happy users: http://www.fastmail.fm/docs/quotes.html From Bowerbird at aol.com Fri Mar 28 19:40:34 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 28 Mar 2008 22:40:34 EDT Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" Message-ID: paul said: > A break from the bowerbird mass flood: > http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html paul, you really need to engage your brain before posting. shirky is saying _exactly_ the same thing as i'm saying, except my "mass flood" is because i'm providing _data_ that _proves_ what i'm saying, rather than just spouting a title for a blog-entry based on a personal experience. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080328/73283582/attachment.htm From paulmaas at airpost.net Fri Mar 28 20:32:51 2008 From: paulmaas at airpost.net (Paul Maas) Date: Fri, 28 Mar 2008 20:32:51 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: References: Message-ID: <1206761571.18076.1244905483@webmail.messagingengine.com> Shaking my head on this.
What the hell is wrong with you? On Fri, 28 Mar 2008 22:40:34 EDT, Bowerbird at aol.com said: > paul said: > > A break from the bowerbird mass flood: > > > http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html > > paul, you really need to engage your brain before posting. > > shirky is saying _exactly_ the same thing as i'm saying, > except my "mass flood" is because i'm providing _data_ > that _proves_ what i'm saying, rather than just spouting > a title for a blog-entry based on a personal experience. > > -bowerbird -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - Email service worth paying for. Try it for free From paulmaas at airpost.net Fri Mar 28 20:37:35 2008 From: paulmaas at airpost.net (Paul Maas) Date: Fri, 28 Mar 2008 20:37:35 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: References: Message-ID: <1206761855.18679.1244905587@webmail.messagingengine.com> Also, you certainly are providing data, but why not complete your research, then post your summary? All I see is a huge deluge of raw data that's best described as spam. Are you trying to convince us by writing these unbelievably long messages? On Fri, 28 Mar 2008 22:40:34 EDT, Bowerbird at aol.com said: > paul said: > > A break from the bowerbird mass flood: > > > http://www.shirky.com/herecomeseverybody/2008/03/given-enough-eyeballs-all-typo.html > > paul, you really need to engage your brain before posting. > > shirky is saying _exactly_ the same thing as i'm saying, > except my "mass flood" is because i'm providing _data_ > that _proves_ what i'm saying, rather than just spouting > a title for a blog-entry based on a personal experience.
> > -bowerbird -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - Same, same, but different From marcello at perathoner.de Sat Mar 29 00:16:37 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 29 Mar 2008 08:16:37 +0100 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: <1206761855.18679.1244905587@webmail.messagingengine.com> References: <1206761855.18679.1244905587@webmail.messagingengine.com> Message-ID: <47EDECD5.6090606@perathoner.de> Paul Maas wrote: > Also, you certainly are providing data, but why not complete > your research, then post your summary? All I see is a huge > deluge of raw data that's best described as spam. Are you > trying to convince us because you write these unbelievably > long messages? BB is a socially inept troglodyte with way too much time on his hands. *He* thinks he is a genius because he knows how to convert his social security check into money. Everybody else thinks he is a crank. His tendency to write longer and longer nonsense in the hope of enticing somebody into a fight is a direct consequence of everybody else having him killfiled. Just do the same. What's wrong with BB? See: http://www.gnutenberg.de/bowerbird/ -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Sat Mar 29 06:56:56 2008 From: hart at pglaf.org (Michael Hart) Date: Sat, 29 Mar 2008 06:56:56 -0700 (PDT) Subject: [gutvol-d] PG's #25,000 Message-ID: In a couple weeks we will be coming up on eBook #25,000 in our numbering cycle. If anyone has any suggestions for #25,000. . . .
Michael From paulmaas at airpost.net Sat Mar 29 08:26:31 2008 From: paulmaas at airpost.net (Paul Maas) Date: Sat, 29 Mar 2008 08:26:31 -0700 Subject: [gutvol-d] Clay Shirkey: "Given enough eyeballs, all typos are shallow" In-Reply-To: <47EDECD5.6090606@perathoner.de> References: <1206761855.18679.1244905587@webmail.messagingengine.com> <47EDECD5.6090606@perathoner.de> Message-ID: <1206804391.28523.1244951345@webmail.messagingengine.com> Wow, bowerbirdy is definitely a social misfit. Your advice to ignore him is excellent. It'd be cool if this group's listserver allowed one to kill-file at the source. This way the mail is never sent. With this system each subscriber can query the maillist application to see how widely he/she is kill-filed. In the case of bowerbirdy, with such a capability, no doubt 95% of all subscribers would flip the switch on him once told how to do it. He'd probably quit posting here since he'd know no one is listening to him. I also believe this group's archive is not indexed by Google and other search engines, so what he posts here is not even findable. I'm amazed he continues posting here instead of in a Google-indexed blog. This group is almost like a black hole of information exchange. I wonder why the archive is not open so it can be indexed? Maybe the owners of this group are embarrassed by posters like bowerbirdy. "Can't kick them off since that goes against our principles, so let's close the archive to the public so no one can see the garbage being posted here." Makes sense to me. On Sat, 29 Mar 2008 08:16:37 +0100, "Marcello Perathoner" > BB is a social inept troglodyte with way too much time on his hands. > *He* thinks he is a genius because he knows how to convert his social > security check into money. Everybody else thinks he is a crank. > > His tendency of writing longer and longer nonsense in the hope of > enticing somebody into a fight is a direct consequence of everybody else > having him killfiled. > > Just do the same.
> > > Whats wrong with BB? See: > > http://www.gnutenberg.de/bowerbird/ > > > -- > Marcello Perathoner > webmaster at gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -- Paul Maas paulmaas at airpost.net -- http://www.fastmail.fm - A fast, anti-spam email service. From Bowerbird at aol.com Sat Mar 29 11:30:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 29 Mar 2008 14:30:58 EDT Subject: [gutvol-d] the living room of the project gutenberg library Message-ID: one big reason why i post on this listserve is because it's the living room of the project gutenberg library, and i consider that to be a neat place to hang out... michael hart is one of my big heroes... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080329/b78825a7/attachment.htm From ajhaines at shaw.ca Sat Mar 29 15:44:03 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sat, 29 Mar 2008 15:44:03 -0700 Subject: [gutvol-d] the living room of the project gutenberg library References: Message-ID: <001401c891ee$634e29d0$6501a8c0@ahainesp2400> The library itself would be even neater, and you'd be paying a compliment to Michael, if you produced some (more) books. In honour of the upcoming 25000th assigned etext number, how about producing 25 books over the remainder of 2008?
Consider this a challenge--divert some of that time and energy you use in critiquing DP, and produce 25 ebooks, their titles not currently in PG, by yourself, outside of DP, starting from real books of, say, 250 pages or more each (not scansets from Internet Archive, Google Books, or similar), using your tools and techniques, and have them posted in PG, by the end of 2008. ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Saturday, March 29, 2008 11:30 AM Subject: [gutvol-d] the living room of the project gutenberg library one big reason why i post on this listserve is because it's the living room of the project gutenberg library, and i consider that to be a neat place to hang out... michael hart is one of my big heroes... -bowerbird ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080329/7c5e32cc/attachment.htm From nwolcott2ster at gmail.com Sun Mar 30 09:19:10 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Sun, 30 Mar 2008 11:19:10 -0500 Subject: [gutvol-d] Googles denial of service messages Message-ID: <002001c89281$d8776940$660fa8c0@atlanticbb.net> I was poking around on books.google.com yesterday. I was doing some searches, looking at some of the results, saving the about page, and downloading the pdf for the ones I was interested in. Then I got the "alphabet-soup" message from Google saying I was acting like a robot. I was given the option to continue if I could read the 5 alpha characters in the box.
I continued once or twice, then got the alphabet soup message again, this time asking me to identify the letters in two successive boxes. Then a couple of actions later (just using the back key) I got the denial of service message reminding me that bots were violating their terms of service. It is not clear if I am on their permanent s--- list, or on probation for a month or so. In any event it appears that they are monitoring my individual IP address, my router's IP address, and my cable modem's IP address. I tried using another computer on the same router; when I got the alphabet soup message almost immediately, I did not proceed to the denial of service message. It is obvious that since my transgressions were purely random, I was not being tracked for being a bot but for using the site too much. I don't know if being logged into their site hurt or not. I like to add books to "my Library", but maybe being on their list sets you up as a problem. Is there any way I can control or change one or all of these IP addresses (and incidentally monitor what they are) so that I can even the playing field with Google? I saw a page somewhere where you could "assign a permanent IP address". One of my problems is that with the usual setup the IP addresses are automatically assigned. Thus they are always the first available on the list, which sets me up for Google's tracker. Even changing my local IP address gives me only 256 choices, I think, and that will not be much help if they are monitoring the other two. I hope some of you experts out there can offer some suggestions for fighting back with Uncle Google. Of course Google will be monitoring my email here as well, but maybe I should refer to them as "foogle" or maybe you have a better suggestion. nwolcott2 at post.harvard.edu -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080330/dfc1423f/attachment.htm From Bowerbird at aol.com Sun Mar 30 11:30:06 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 30 Mar 2008 14:30:06 EDT Subject: [gutvol-d] Googles denial of service messages Message-ID: norm said: > I hope some of you experts out there > can offer some suggestions to fighting back with Uncle Google. > Of course google will be monitoring my email here as well, > but maybe I should refer to them as "foogle" > or maybe you have a better suggestion. um, i'm not an expert at this stuff, not by any means. all that i.p. gobbledygook confuses me immensely... but i do have a suggestion. starting with a question: why do you consider a need to "fight back"? i'd think this is a simple misunderstanding, not a "fight". instead of asking us what to do, write directly to google. (yeah, i know that's easier said than done, but just try it.) although you characterize your use as innocent, and i do believe you, i'm sure it's also "heavy" use, and looks like it, so it's not all that surprising it might look bot-like to them. but if you explain the situation, maybe they would make an adjustment concerning your i.p. address that allowed you to exercise your typical heavy usage without tripping their wires. in the long run that'd be far better than gaming i.p. addresses. it would also give useful information to us other heavy users... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080330/d6671fd2/attachment.htm From marcello at perathoner.de Sun Mar 30 12:17:18 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun, 30 Mar 2008 21:17:18 +0200 Subject: [gutvol-d] Googles denial of service messages In-Reply-To: <002001c89281$d8776940$660fa8c0@atlanticbb.net> References: <002001c89281$d8776940$660fa8c0@atlanticbb.net> Message-ID: <47EFE73E.3000004@perathoner.de> Norm Wolcott wrote: > I hope some of you experts out there can offer some suggestions to > fighting back with Uncle Google. Of course google will be monitoring > my email here as well, but maybe I should refer to them as "foogle" > or maybe you have a better suggestion. nwolcott2 at post.harvard.edu First, this is no denial of service. Google is a commercial enterprise, they are offering this service at a considerable cost, so they get to make the rules. If you don't like Google, don't use it. Second, the only IP Google can see is your router's / modem's IP. The configuration of your internal LAN is completely irrelevant. Your best bet is to power cycle your router / modem to get a new IP from your provider. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Mon Mar 31 10:44:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 31 Mar 2008 13:44:57 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 14 Message-ID: ok, i checked up on juliet's contention that some projects over at d.p. are auto-dehyphenated during preprocessing, and have found that to be the case. in a random check of a good number, i found that many went into p1 dehyphenated. indeed, although i still found some that had not been, a clear _majority_ had been done. i salute this as one very real step of progress by d.p. this makes big_bill's bellowing even more inexplicable... (and he's continued, even exacerbated, that bellowing.) does he proof? doesn't he know about this development? 
at any rate, the fact also remains that _none_ of the test books which d.p. has been running in these experiments was auto-dehyphenated. neither were any subjected to other preprocessing that should have become "standard" over at d.p. a long time ago, like automatically closing up spacey punctuation, and spacey contractions (like "we 're"). considering that there can be _hundreds_ of such entities, even thousands in a typical book, this lack is unforgivable. evidently the content providers of these books don't know that they should be doing preprocessing on their projects, rather than dumping thousands of _unnecessary_changes_ on the p1 proofers... nonetheless, i do give credit when a positive step is taken, and autodehyphenation is a positive step, so credit given... (and yes, dehyphenation at such an early stage still remains _the_wrong_policy_. but if you're gonna have proofers do it then, you might as well have the computer do it then instead.)

***

back to our analysis of the data in the parallel proofing test. so, were you surprised to learn that the normal p1 proofers took 6,400+ lines to perfection, with only _161_ not perfect? yep, p2 and p3 on this project only changed 161 lines:

> http://z-m-l.com/go/paulp/paul-p1-p3-161changes.html

all of the rest of the lines were evidently perfect after p1. (for a comparison, i estimate p1 fixed about 1,750 lines.)

***

and even though a mere 161 imperfect lines out of 6,500+ is an amazingly high rate of quality, closer analysis of those 161 bad lines suggests most could've been _autodetected_, meaning they should've been fixed during _preprocessing_, before they were ever even presented to volunteers to proof. to see how i categorized the 161 changed lines, look here:

> http://z-m-l.com/go/paulp/paul-161-categorize.html

well, by my count, all but _37_ could've been autodetected. (my earlier guess of _40_ ended up being pretty accurate.)
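the cleanups described above -- dehyphenation, closing up spacey punctuation, and spacey contractions like "we 're" -- can be sketched in a few regex passes. this is only an illustrative sketch (the function name and rules below are an editor's illustration, not d.p.'s actual preprocessing code), and note that naive dehyphenation will wrongly join words the book itself prints with a hyphen (like "to-day"), which is why a real pass would check rejoined words against a wordlist:

```python
import re

def preprocess(text):
    # a sketch of the cleanups discussed above -- NOT d.p.'s actual
    # preprocessing code; the rules here are illustrative only.
    # 1. rejoin words hyphenated across a line break ("thank-\nless");
    #    a real pass would check the joined word against a wordlist,
    #    since books of this era legitimately print words like "to-day".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 2. close up spacey punctuation ("task ." -> "task.")
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    # 3. close up spacey contractions ("we 're" -> "we're")
    text = re.sub(r"(\w) +'(ll|re|ve|d|m|s|t)\b", r"\1'\2", text)
    return text
```

run over the raw o.c.r. before a project goes to p1, a pass like this would remove hundreds of the unnecessary changes mentioned above without touching real content.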
37 errors is still too many for a book with some 200+ pages, even to send the text to the public for "continuous proofing", but it's one heck of a great performance for _p1_ to turn in...

***

and i wasn't done yet... i then categorized the 37 remaining lines, and put it here:

> http://z-m-l.com/go/paulp/paul-37-not-autodetectable.html

i have also appended it to this post, for your convenience... these difficult-to-detect 37 lines break down like this:

-> stealth scannos -- 15
-> missing words -- 13
-> punctuation problems -- 9

stealth scannos, of course, are the prototype of hard-to-detect. and there seemed to be a _lot_ of stealth scannos in this book... but wait. maybe that's because _tesseract_ was used for o.c.r.? it's worth a look... so i did an easy test for that... i just looked to see how many of the 15 stealth scannos which had persisted through p1 were present in the o.c.r. by _abbyy_. wow. 1 of them was caused by a bad scan, but the other _14_
yes, the proofers let these errors through, but if they wouldn't have been present originally -- i.e., if abbyy had been used -- then they wouldn't have been present at the end of p1 either... so let's go to our last category, which is "punctuation errors"... here the story is a little less clear, and a little more convoluted. only 1 error was due to tesseract, and 2 due to the bad scans. 2 were due to a dehyphenation error, which don't count against the proofers. that leaves just _4_ errors that were p1 mistakes... 049.png -- =p1=> "This book was illuminated, bound, and 049.png -- =p3=> "'This book was illuminated, bound, and ***p1 mistake 049.png -- diff> =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^ 059.png -- =p1=> Cameron." Call them up this minute and nail 059.png -- =p3=> Cameron. "Call them up this minute and nail ***p1 mistake 059.png -- diff> ========^^===================== 205.png -- =p1=> and two of Burminghams graduates 205.png -- =p3=> and two of Burmingham's graduates ***p1 mistake 205.png -- diff> =====================^^^^^^^^^^^ 227.png -- =p1=> the five hundredth-time Don had been caught 227.png -- =p3=> the five hundredth -- time Don had been caught ***p1 mistake 227.png -- diff> ==================^^^^^^^^^^^^^^^^^^^^^^^^^ (and i'm not gonna look too closely at those 4, because a second glance now indicates that a couple of them, and maybe all 4, aren't p1 mistakes after all.) the list of 37 -- appended, and at the u.r.l. above -- shows how i categorized each of the 37 lines in terms of their being a result of bad scans, tesseract, etc. for example, the first one, which involves a missing comma after the word "pay", was due to a bad scan, which cut off that word (and the comma that followed)... *** ok, let me summarize, because this conclusion is remarkable and startling. on this project, _if_ we would have had good scans to begin with, which is certainly _not_ an unreasonable thing to ask, and _if_ we would have had the o.c.r. 
done with abbyy, which is again _not_ unreasonable to expect, and _if_ we had done a good job of preprocessing this text (and/or done a good job of cleaning _after_ it had come out of p1), which is _also_not_ an unreasonable expectation that we should have, _then_ the p1 proofers would have taken all but _4_lines_ of the _6,500+_lines_ to _perfection_... p1 -- and just one round of p1 at that -- took this book to near-perfection. you can't see it well, because this perfection sits smack-dab in the middle of literally _hundreds_ of errors "injected" by an incompetent content provider, and literally _hundreds_ of meaningless and unnecessary changes, but if you clear away the senseless underbrush, there's a sparkling diamond underneath. in one short sentence, the p1 proofers are _awesome_. -bowerbird p.s. the 37 bad lines out of p1 (out of 161) which were _not_ autodetectable... stealth 003 -- 017.png -- =p1=> "Enough to till a good-sized daily, I should 003 -- 017.png -- =p3=> "Enough to fill a good-sized daily, I should ***abbyy 003 -- 017.png -- diff> ===========^================================ 004 -- 018.png -- =p1=> expensive piece of property, my son," he relied. 004 -- 018.png -- =p3=> expensive piece of property, my son," he replied. ***scan 004 -- 018.png -- diff> ===========================================^^^^^ 033 -- 063.png -- =p1=> the judge mischievously. "It you boys propose 033 -- 063.png -- =p3=> the judge mischievously. "If you boys propose ***abbyy 033 -- 063.png -- diff> ===========================^================= 087 -- 160.png -- =p1=> Paul lingered the bill nervously. Fifty dollars! 087 -- 160.png -- =p3=> Paul fingered the bill nervously. Fifty dollars!
***abbyy 087 -- 160.png -- diff> =====^========================================== 090 -- 164.png -- =p1=> money and government notes are line examples 090 -- 164.png -- =p3=> money and government notes are fine examples ***abbyy 090 -- 164.png -- diff> ===============================^============ 094 -- 176.png -- =p1=> press rooms for striking oil proof when the 094 -- 176.png -- =p3=> press rooms for striking off proof when the ***abbyy 094 -- 176.png -- diff> ==========================^^=============== 103 -- 187.png -- =p1=> This east is then fitted upon the rollers 103 -- 187.png -- =p3=> This cast is then fitted upon the rollers ***abbyy 103 -- 187.png -- diff> =====^=================================== 106 -- 190.png -- =p1=> a small space allowed it; N, too, is not much in 106 -- 190.png -- =p3=> a small space allowed it; X, too, is not much in ***abbyy 106 -- 190.png -- diff> ==========================^===================== 108 -- 192.png -- =p1=> large metal sections that lit on the two halves of 108 -- 192.png -- =p3=> large metal sections that fit on the two halves of ***abbyy 108 -- 192.png -- diff> ==========================^======================= 122 -- 206.png -- =p1=> bid good-by to the familiar balls of the school, 122 -- 206.png -- =p3=> bid good-by to the familiar halls of the school, ***abbyy 122 -- 206.png -- diff> ============================^=================== 141 -- 225.png -- =p1=> hollowing them out and tilling them up again 141 -- 225.png -- =p3=> hollowing them out and filling them up again ***abbyy 141 -- 225.png -- diff> =======================^==================== 143 -- 226.png -- =p1=> loyally refusing to peach on his churns. That 143 -- 226.png -- =p3=> loyally refusing to peach on his chums. That ***abbyy 143 -- 226.png -- diff> ====================================^^^^^^^^^ 145 -- 227.png -- =p1=> "They say there always has to be a fist time. 145 -- 227.png -- =p3=> "They say there always has to be a first time. 
***abbyy 145 -- 227.png -- diff> =====================================^^^^^^^^ 152 -- 234.png -- =p1=> "And I oh yours, Mr. Carter. Melville is a 152 -- 234.png -- =p3=> "And I on yours, Mr. Carter. Melville is a ***abbyy 152 -- 234.png -- diff> ========^================================= 159 -- 237.png -- =p1=> course, the far-tamed March Hare. Its advent 159 -- 237.png -- =p3=> course, the far-famed March Hare. Its advent ***abbyy 159 -- 237.png -- diff> ================^=========================== missing/excess words -- hard to detect 014 -- 036.png -- =p1=> of being shrewd, close-fisted, and 014 -- 036.png -- =p3=> reputation of being shrewd, close-fisted, and ***scan 014 -- 036.png -- diff> ^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^^ 021 -- 046.png -- =p1=> kings, bishops, and persons of rank could 021 -- 046.png -- =p3=> many kings, bishops, and persons of rank could ***scan 021 -- 046.png -- diff> ^^=^^^^^^=^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^ 026 -- 051.png -- =p1=> and were sold to of the Church or to 026 -- 051.png -- =p3=> and were sold to dignitaries of the Church or to ***tess 026 -- 051.png -- diff> =================^^^^^^^^^^^^^^^^^^^ 028 -- 053.png -- =p1=> ??line missing here...?? 028 -- 053.png -- =p3=> "Yes." ***tess 028 -- 053.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^ 038 -- 070.png -- =p1=> I have already explained, care much for reading; 038 -- 070.png -- =p3=> have already explained, care much for reading; ***scan 038 -- 070.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=^^^^^^^^^ 051 -- 084.png -- =p1=> a under the ropes." 051 -- 084.png -- =p3=> under the ropes." 
***scan 051 -- 084.png -- diff> ^^^^^^^^^^^^^^^^^^^ 056 -- 090.png -- =p1=> is public that desired to read -- which this one did 056 -- 090.png -- =p3=> public that desired to read -- which this one did ***scan 056 -- 090.png -- diff> ^^^^^^^^^^=^^^^^^^^^^^=^^^^^^^=^^^^=^^=^^^^^^^^^^^^^ 057 -- 092.png -- =p1=> More than one dignified resident of town struggled 057 -- 092.png -- =p3=> More than one dignified resident of the town struggled ***tess 057 -- 092.png -- diff> =====================================^^^^^^^^^^^^^ 064 -- 119.png -- =p1=> author the prey of vultures who 064 -- 119.png -- =p3=> author was the prey of vultures who ***tess 064 -- 119.png -- diff> =======^^^=^^=^^^^^^^^^^^^^^^^^ 101 -- 186.png -- =p1=> their days." 101 -- 186.png -- =p3=> their days. "I'm going to take you upstairs ***tess 101 -- 186.png -- diff> ===========^ 102 -- 186.png -- =p1=> Mr. Hawley said briskly. "We may 102 -- 186.png -- =p3=> first," Mr. Hawley said briskly. "We may ***tess 102 -- 186.png -- diff> ^^^^^^^^^^^^^^^^^^^=^^^^^^^^^^^^ 116 -- 202.png -- =p1=> publishers." I 116 -- 202.png -- =p3=> publishers." ***tess 116 -- 202.png -- diff> ============^^ 124 -- 209.png -- =p1=> "Because -- well-it Would be so yellow," 124 -- 209.png -- =p3=> "Because -- well -- it would be so darn yellow," ***tess 124 -- 209.png -- diff> ================^^^=^^^^^^^^=^^=^^^^^^^^ punctuation -- hard to detect 016 -- 038.png -- =p1=> pay too." 016 -- 038.png -- =p3=> pay, too." ***scan 016 -- 038.png -- diff> ===^^^=^^ 023 -- 049.png -- =p1=> "This book was illuminated, bound, and 023 -- 049.png -- =p3=> "'This book was illuminated, bound, and ***p1 mistake 023 -- 049.png -- diff> =^^^^^^^=^^^^^^^^=^^^^^^^^^^^^^^^^^^^^ 032 -- 059.png -- =p1=> Cameron." Call them up this minute and nail 032 -- 059.png -- =p3=> Cameron. 
"Call them up this minute and nail ***p1 mistake 032 -- 059.png -- diff> ========^^================================= 060 -- 096.png -- =p1=> was the principle of it is identical with that 060 -- 096.png -- =p3=> was, the principle of it is identical with that ***scan 060 -- 096.png -- diff> ===^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 114 -- 201.png -- =p1=> periodicals, "Mr. Hawley managed to shout 114 -- 201.png -- =p3=> periodicals," Mr. Hawley managed to shout ***tess 114 -- 201.png -- diff> ============^^=========================== 121 -- 205.png -- =p1=> and two of Burminghams graduates 121 -- 205.png -- =p3=> and two of Burmingham's graduates ***p1 mistake 121 -- 205.png -- diff> =====================^^^^^^^^^^^ 144 -- 227.png -- =p1=> the five hundredth-time Don had been caught 144 -- 227.png -- =p3=> the five hundredth -- time Don had been caught ***p1 mistake 144 -- 227.png -- diff> ==================^^^^^^^^^^^^^^^^^^^^^^^^^ 153 -- 235.png -- =p1=> In fact," he continued, lapsing into seriousness," 153 -- 235.png -- =p3=> In fact," he continued, lapsing into seriousness, ***dehyphenation 153 -- 235.png -- diff> =================================================^ 154 -- 235.png -- =p1=> the younger generation teaches us 154 -- 235.png -- =p3=> "the younger generation teaches us ***dehyphenation 154 -- 235.png -- diff> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ************** Create a Home Theater Like the Pros. Watch the video on AOL Home. (http://home.aol.com/diy/home-improvement-eric-stromer?video=15& ncid=aolhom00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/4ba8f62f/attachment-0001.htm From ajhaines at shaw.ca Mon Mar 31 11:27:30 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 31 Mar 2008 11:27:30 -0700 Subject: [gutvol-d] parallel -- paul and the printing press -- 14 References: Message-ID: <000f01c8935c$e17c8e90$6b01a8c0@ahainesp2400> See PG FAQ V.105 for its discussion of spacey contractions. It's allowed that they get closed up, but it's up to the volunteer, so it's probably better to not close them up automatically. I've done books where the OCR spaced some contractions and not others, and it wasn't easy to tell from the book whether the contractions were meant to be spaced or not. If spacing was obvious, I went with it; if not, I didn't. In either case, I went with consistency--if most were not spaced, I despaced any others, and vice versa. I've also encountered in a book (but only once): "wasn 't". Since this is obviously wrong, whether the fault of the author or the typesetter, it got despaced. Contractions can be spaced away from their companion word, but they *cannot* themselves be split. On a separate note, I notice my challenge to bowerbird (see the "the living room of the project gutenberg library" thread) has gone unanswered. Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Monday, March 31, 2008 10:44 AM Subject: [gutvol-d] parallel -- paul and the printing press -- 14 ok, i checked up on juliet's contention that some projects over at d.p. are auto-dehyphenated during preprocessing, and have found that to be the case. in a random check of a good number, i found that many went into p1 dehyphenated. indeed, although i still found some that had not been, a clear _majority_ had been done. i salute this as one very real step of progress by d.p. this makes big_bill's bellowing even more inexplicable... (and he's continued, even exacerbated, that bellowing.)
does he proof? doesn't he know about this development? [remainder of quoted message trimmed; it duplicates the post above verbatim] ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/85edd045/attachment-0001.htm From Bowerbird at aol.com Mon Mar 31 12:43:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 31 Mar 2008 15:43:21 EDT Subject: [gutvol-d] parallel -- paul and the printing press -- 14 Message-ID: al said: > See PG FAQ V.105 for its discussion of spacey contractions. um, yeah... um, no... > It's allowed that they get closed up, but it's up to the volunteer, > so it's probably better to not close them up automatically. see, this is why the p.g. rules don't mean much to me. there are far too many things left "up to the volunteer". that means p.g. has become a mere collection of works -- many of which have inconsistencies with each other -- rather than achieving a coherence making it _a_library_... > I've done books where the OCR spaced some contractions > and not others, and it wasn't easy to tell from the book > whether the contractions were meant to be spaced or not. well, here's my take on all that, al... last _year_ d.p. digitized 2,345 books. google scans that many books every _day_... ..._before_lunch_... if d.p. -- and p.g. -- want to have _any_ hope of keeping up, (or anything close), it's going to become _necessary_ to stop wasting time sweating differences that don't make a difference. this is one of those differences... spacey contractions "look funny" to today's reader. maybe at some time in the past they had _meaning_ -- probably to indicate a certain pattern of speech -- but whatever it was, it's now largely lost on people, so we need to stop spending time fretting over it... so i have my tool automatically close up contractions, so digitizers can move on to more important things... because i think it's _important_ that we keep up with google.
because if we don't, people will soon forget about the various _advantages_ which digital text offers over a plain old scan-set, because there will be so few books (out of millions of scan-sets) for which they actually have the _luxury_ of having digital text. the e-book of the future will become a scan-set, _by_default_... > If spacing was obvious, I went with it; if not, I didn't. > In either case, I went with consistency--if most were > not spaced, I despaced any others, and vice versa. well, if you're following a rule fairly consistently like that, then _that_ could be programmed... but -- to be frank -- none of it really "matters". spacey contractions look funny. so even though you spent all that decision-making time, i'm gonna take your e-text and close up the contractions. just like i turn all the 4-dot ellipses from d.p. into 3-dots. yeah, yeah, i know someone spent lots of time _deciding_ whether they occurred at the end of a sentence, or not... blah. so what? who cares? it's still a freaking _ellipsis_, and it still means the same thing, 3-dots or 4-dots, so i'm _sorry_ you wasted your time. _i_ can't be bothered. instead, i'm going to spend my time _productively_, on the things that _do_ make a difference to my readers, and that's why they will use _my_ library and not _yours_. > On a separate note, I notice my challenge to bowerbird > (see the "the living room of the project gutenberg library" > thread) has gone unanswered. not in the slightest. my reply has already grown lengthy, and i'm not even done yet, but it'll be coming along soon. i generally like to hold off long posts during a weekend. the "parallel #14" post i made today was ready on friday. (plus i have a preference for on-topic versus personal... which leads me to ask if you have any _real_ response to the _data_ on _digitization_ that i presented in that post. i mean, the analysis ended with a _startling_ conclusion, so i'd expect that people would have _something_ to say.)
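(the two normalizations mentioned above -- closing up contractions, and collapsing 4-dot ellipses to 3 -- amount to a couple of regexes. here's a rough sketch; the rules are my own guesses at what such a tool might do, not the actual implementation:)

```python
import re

def normalize(text):
    # close up spacey contractions: "we 're" -> "we're"
    # (the suffix list is an assumption; a real tool would need a fuller set)
    text = re.sub(r"(\w) '(re|ve|ll|d|s|t|m)\b", r"\1'\2", text)
    # collapse 4-or-more-dot ellipses (spaced or not) down to 3 dots,
    # erasing the end-of-sentence vs. mid-sentence distinction on purpose
    text = re.sub(r"\.(\s*\.){3,}", "...", text)
    return text
```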
but you can expect a reply by today, tomorrow at the latest. one of the things in my reply was my focus on the library _as_ a library, rather than as a "mere" collection of works, so you've gotten an introduction to that point right here... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/e7d788b1/attachment.htm From nwolcott2ster at gmail.com Mon Mar 31 13:18:52 2008 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Mon, 31 Mar 2008 15:18:52 -0500 Subject: [gutvol-d] Googles denial of service messages References: <002001c89281$d8776940$660fa8c0@atlanticbb.net> <47EFE73E.3000004@perathoner.de> Message-ID: <001501c8936c$8dfedba0$660fa8c0@atlanticbb.net> I agree they get to make the rules, which say no automated bots allowed. My random use over a period of a few hours, and I don't know what part of the site they are objecting to my using, was certainly not a bot. And as Bowerbird says conversing with google can be difficult if not impossible. Perhaps if i clicked on some of their ads I would get better treatment. Maybe the "full text" key search bothered them. As to the "they get to make the rules" argument. Yes they should abide by THEIR rules. However they have contracted with various libraries to allow books to be scanned onto their site, and part of the reasons they were allowed into these libraries was that they would make out of copyright books available to the public. I don't think they ever said to Harvard college that no more than ten books could be downloaded per week/day/year by the public when they pitched their agreement. And also they make no guarantee of the "usefulness" of their scans which are often pretty bad.
And what is worse, Google offers inducements to use their site such as gmail, "my library", toolbars, etc. And I have never had a denial of service message for using google's general search engine. Obviously they want me to use that as much as I can. And I never subscribed to any "rules of service" agreement. My reason for changing my IP was only that I was being unfairly hit by google with no opportunity to explain I was not a bot. If they were more congenial there would be no problem. I would note that some libraries have refused to do business with google, and are using Internet Archive instead. And Internet Archive will accept your own scans of public domain works "in the highest resolution you can provide" and make them available to the public without restrictions. But thanks for the tip on the IP shifts. nwolcott2 at post.harvard.edu ----- Original Message ----- From: "Marcello Perathoner" To: "Project Gutenberg Volunteer Discussion" Sent: Sunday, March 30, 2008 2:17 PM Subject: Re: [gutvol-d] Googles denial of service messages > Norm Wolcott wrote: > > > I hope some of you experts out there can offer some suggestions to > > fighting back with Uncle Google. Of course google will be monitoring > > my email here as well, but maybe I should refer to them as "foogle" > > or maybe you have a better suggestion. nwolcott2 at post.harvard.edu > > First, this is no denial of service. Google is a commercial enterprise, > they are offering this service at a considerable cost, so they get to > make the rules. If you don't like Google, don't use it. > > Second, the only IP Google can see is your router's / modem's IP. The > configuration of your internal LAN is completely irrelevant. Your best > bet is to power cycle your router / modem to get a new IP from your > provider.
> > > > -- > Marcello Perathoner > webmaster at gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From Bowerbird at aol.com Mon Mar 31 15:50:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 31 Mar 2008 18:50:58 EDT Subject: [gutvol-d] the living room of the project gutenberg library Message-ID: al said: > The library itself would be even neater, > and you'd be paying a compliment to Michael, > if you produced some (more) books. um, well gee, thank you for the suggestion, al... but i think i can decide the best use of my time, and how i will pay my compliments to michael... i do appreciate your _thoughtful_kindness_ in generating the suggestion, however. thanks... i'm extremely grateful to all of the people who digitize books, for project gutenberg and for other projects. they are doing a great service... my energy, however, is better devoted to the question of what happens to books _after_ they have been digitized. how do we create a _cyberlibrary_, and make it more _efficient_? how can we manage the correction of errors? what kind of _viewer-programs_ do we need? what _conversion-tools_ do people require? what kind of other tools should we give 'em? what is the ecosystem in which the e-texts exist, in relation to themselves and to the world at large, and how do we facilitate it? how do we enable users to _remix_ e-texts? and, of course, as i've made it clear by now, how can we make our digitization workflows more efficient, and what tools do we need? i could add little to the thousands of people who are capable of digitizing pages at d.p. on the other hand, those thousands are not capable of doing the things that i can do... they gain their power from their numbers, and it is a remarkable power that they have, and i cheer them loudly for the contribution.
but i gain my power from my unique skills, and it is a remarkable power that i have... programmers seem to be scarce in these parts... and have you seen how the d.p. people are approaching the analysis of their own data? it's very apparent to me that they need help. so i'm showing them how to do that analysis. nobody else seems capable of showing them. > Consider this a challenge i've developed my own challenges, thanks. :+) if i put the big one in a phrase, i want to develop a tool that will suck up the results of o.c.r. and -- after presenting some questions to the user so as to resolve any ambiguities -- then spit out a nicely-finished copy of the book as digital text, suitable for mounting for "continuous proofing". so asking me to do 25 books now -- manually -- is a bit like asking henry ford to stop building his assembly-line and put together 25 cars manually. if _my_ "assembly-line" works, al, i will eventually create _25_thousand_ e-texts, maybe 25 million. or maybe the guy that follows me will. or maybe the guy that follows him. or maybe it'll be google. at any rate, whoever makes that assembly-line will put most of you hand-crafters out of business. but hey, some people still build their cars by hand. > --divert some of that time and energy > you use in critiquing DP see, i don't really think that's a good idea at all. there's a reason i'm spending my time that way. and i think it's _vital_and_imperative_ for me to continue to spend my resources critiquing d.p. the reason is because d.p. is squandering what i believe to be an extremely important resource, namely the good will of well-meaning volunteers. by subjecting proofers to a _massively_inefficient_ workflow, d.p. burns them out unnecessarily, and chases them away, causing long-term damage. i know some people might not agree with me on it, but in _my_ view, this is *the* worst problem that confronts the world of volunteer digitization which michael hart created with love so many decades ago.
so i'm doing my best to combat that *worst*problem*. (i almost never use *bold*, but there you have it, al...) > and produce 25 ebooks, their titles not currently in PG, > by yourself, outside of DP, starting from real books of, > say, 250 pages or more each (not scansets gee, al, you mean you don't consider those scan-sets to be "real books"? really? they came from _libraries_... or is it that you think "real men" scan p-books themselves? ;+) > from Internet Archive, Google Books, or similar), personally, i think all of the e-books that are _not_ solidly connected with a scan-set from one of those major scanning projects will eventually be neglected, in favor of digitizations that _can_ be traced to them. the project gutenberg e-texts will be very difficult to compare visually with the major scan-sets, simply due to the fact that you've rewrapped the lines, and thus they'll come to be seen as _unnecessarily_unreliable_, in favor of versions which did _not_ re-wrap the lines. but even digitizations which did _not_ re-wrap lines will be discarded if they can't be linked to a scan-set that can be readily summoned from the big projects. and even if you provide the actual scans that you used, people won't trust you, because "who the heck are you?" if you look around cyberspace even at this early stage, for the classic books there are so many different e-texts floating around that it has become a nightmare to know exactly what you are dealing with, and the problem will only get worse. in order to make things uncomplicated, people will demand that an e-text be closely associated with a scan-set from one of the major scanning projects, and demand that the association can be confirmed easily, by a simple visual comparison of the text with the scans... so, to my mind, the _only_ raw content to use is stuff that comes "from internet archive, google books, or similar"... and that's why i'm glad d.p. is using these more and more.
i'd consider it an absolute waste of my time to scan a book; if someone else wants to do it, fine... but i would not do it, not unless there were some book that i just _had_ to have, and all of the books in that category for me are post-1923. so p.g. couldn't use them anyway... > using your tools and techniques, and have them > posted in PG, by the end of 2008. again, you seem to want me to go fishing. ok, but i, on the other hand, want to teach people to fish... i want to design and build boats, weave fishing nets, study aquaculture, and do research that improves our ecosystems, so we can have our fish and eat them too, because eating fish and fishing wisely makes us smart. i don't have anything against the people who "only" want to _go_ fishing. indeed, i'm trying to help 'em. and i believe i should continue trying to do just that. it's _really_ the very best use of my time and talent... but once again, al, i _do_ appreciate your suggestion... so if you have more of them, please keep 'em coming. or if you think there's more that i should consider, or know about your reason for _this_ suggestion, tell me... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080331/b1f8dd11/attachment-0001.htm From Bowerbird at aol.com Mon Mar 31 22:18:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 1 Apr 2008 01:18:42 EDT Subject: [gutvol-d] stopping perpetuity -- harder than it looks! Message-ID: take a look at the project page for iteration#6 of "planet strappers": > http://www.pgdp.net/c/project.php?id=projectID47dfd4f82feae it's chugging along, and about half of the pages done have a "diff"... that's right, you heard me correctly, about _half_ the pages!
:+) "but how can that be?", you might be asking. "already these pages went through 5 rounds, and there are _still_ changes being made?" yep. sure are. not _corrections_, mind you. just "changes"... meaningless changes... every last one of them meaningless... most of them having to do with ellipses. and these changes appear to have been done by new proofers (who else would tackle a project that has been in the rounds a half-dozen times?) who don't know the rules. (for example, they're replacing typos, ones where a note had been left.) heck, one (or more) is even putting spaces _between_ the ellipse dots! right after carlo, in a forum thread, said he had never seen that before. (but -- amazingly -- in strict accordance with the p.g. f.a.q. on ellipses, which has to be one of the most brain-dead p.g. rules devised thus far. spaces between the dots of an ellipsis will wreak havoc on any rewrap.) throw in a couple of runarounds on end-line hyphenates as well, with some people inserting hyphens or asterisks, and others removing 'em, and you've got one tasty "error-injection" stew boiling in your pot... this is crazy. i mean, it's an excellent demonstration of what will happen when you have "rules" that are interpreted and reinterpreted differently all the time, and confusing to boot... there are currently _several_ threads running in the d.p. forums dealing with ellipse confusion: > http://www.pgdp.net/phpBB2/viewtopic.php?t=31237 > http://www.pgdp.net/phpBB2/viewtopic.php?t=30521 further, the proofers doing these changes don't seem to realize five rounds of proofers have checked these pages before them. ("and just think, every _one_ of them missed _all_ these ellipses... on one page after the next... really very surprising, that, isn't it?") _hours_ of proofer time were spent bringing you this conclusion. just on this iteration. so far. and it ain't done. but what the heck, it's just _proofer_time_... and that ain't worth as much as peanuts...
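The complaint about meaningless diffs suggests an obvious filter: canonicalize ellipsis spacing and end-line hyphenate markup before comparing two rounds, so that only substantive corrections surface. A minimal sketch in Python; the function names and normalization rules here are illustrative assumptions, not DP's actual guidelines or tooling:

```python
import difflib
import re

def canonicalize(line: str) -> str:
    """Reduce a proofed line to a form where ellipsis-spacing and
    end-line-hyphen style differences disappear."""
    # ". . ." or "...." (any run of 3+ dots, spaced or not) -> "..."
    line = re.sub(r"\s*\.(?:\s*\.){2,}", "...", line)
    # end-line "word-*" vs "word-" -> "word-"
    line = re.sub(r"-\*?$", "-", line)
    return line.rstrip()

def substantive_diffs(round_a: list[str], round_b: list[str]) -> list[str]:
    """Diff two rounds after canonicalization; cosmetic changes vanish."""
    a = [canonicalize(l) for l in round_a]
    b = [canonicalize(l) for l in round_b]
    return [d for d in difflib.unified_diff(a, b, lineterm="")
            if d.startswith(("+", "-"))
            and not d.startswith(("+++", "---"))]

r5 = ["he paused . . . and went on.", "some inconven-*"]
r6 = ["he paused.... and went on.", "some inconven-"]
print(substantive_diffs(r5, r6))  # -> [] : every change was cosmetic
```

Running the real round-5 and round-6 page texts through a filter like this would show how many of the "diffs" reported on the project page survive canonicalization.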
*** oh, just in case you're wondering... this iteration#6 did _not_ catch the one remaining error, on p#33. we'll have to wait for iteration#7. -bowerbird p.s. however, iteration#6 _did_ find a p-book error that everyone thus far missed... the word "inconveniencies" for "inconveniences". what a shocker! how did everyone else manage to miss that so far? oh, ok, you big spoilsport, dictionary says either one is acceptable. but still, give that proofer a blue ribbon for great eyes trying hard! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080401/2cdd7d8c/attachment.htm