From Bowerbird at aol.com  Mon Mar  1 10:35:40 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 1 Mar 2010 13:35:40 EST
Subject: [gutvol-d] or a watch to take out of it
Message-ID: <1d231.6af0f3f0.38bd62fc@aol.com>

ok, so there's a new e-book viewer-program in town.

liza daly, who is affiliated with the o'reilly folks, has
brought out "ibisreader" for your reading enjoyment.

so i went to check it out...

at ibisreader.com, i clicked "get started" and then --
at "add a book" -- "feedbooks: popular public domain".

the 4th book down is "alice's adventures in wonderland",
and since the movie is coming out this friday, i got that.

feedbooks, as you probably know, is a site that takes
e-books from various sources, like project gutenberg,
and makes some very nice-looking versions of them...

since feedbooks re-works the books, they don't have
as many books as some of the other sites, but they are
preferred by many people because their books look nice.

so i start looking at the text, and i find this:

>    There was nothing so very remarkable in that; 
>    nor did Alice think it so very much out of the way 
>    to hear the Rabbit say to itself ?Oh dear! Oh dear! 
>    I shall be too late!? (when she thought it over 
>    afterwards, it occurred to her that she ought to 
>    have wondered at this, but at the time it all seemed 
>    quite natural); but when the Rabbit actually 
>    took a watch out of its waistcoat-pocket, 
>    and looked at it, and then hurried on, 
>    Alice started to her feet, for it flashed across her mind 
>    that she had never before seen a rabbit with 
>    either a waistcoat-pocket, or a watch to take out of it, 
>    and, burning with curiosity, she ran across the field after it, 
>    and was just in time to see it pop down a large rabbit-hole 
>    under the hedge.

well, gee.   if you're intimately familiar with this book, you know 
>    took a watch out of its waistcoat-pocket,
is a phrase that is _italicized_ in the book.   but not on this file...

here's where you can see a copy of the original:
>    http://www.archive.org/stream/alicesadventur00carr#page/2/mode/2up

you'll have to take my word for it that it's not italicized in the
feedbooks copy that is being used by ibisreader, since i don't
see any convenient way for me to link to a specific page there.
(but it's book #22 from feedbooks, if you wanna look yourself.)

it's pretty clear that what has happened here is that feedbooks
has taken pg#11 and used it as its source.   what is a sad thing,
because -- even after all the years it coulda been "improved" --
pg#11 _still_ doesn't have proper italics in it.   it was "updated"
in 2005 (leaving no trace behind) and then once again in 2008
(when an .html version was added).   but it still has zero italics.
>    http://www.gutenberg.org/files/11/11-h/11-h.htm
>    http://www.gutenberg.org/files/11/11.txt

the p.g. version _does_ have italics rendered as all-uppercase.
so there is _some_ indication of them.   but this is ambiguous,
since there are places in the book where uppercase is used too:
>    http://www.archive.org/stream/alicesadventur00carr#page/4/mode/2up
(see the reference to "orange marmalade", in all-uppercase.)

if i would've been feedbooks, i would've converted _all_ of the
uppercased words to italics.   but of course then they would've
been changing things like chapter headers and first-words too.

and of course, feedbooks could've _left_ words in all-uppercase.
i don't know why they didn't, but i assume it's because they take
_pride_ in their typography, and all-uppercase looks like crap...

(but it's true that the tell-tale examples of _actual_ uppercase,
namely "orange marmalade", "drink me", and "eat me", are all
still rendered in uppercase in feedbooks#22, so i am stumped.)

and yes, folks, i know there are other versions of "alice" posted,
including some with italics correctly specified.   so let us look...

pg#19033 _does_ have italics in it, but it does _not_ italicize
the phrase about the rabbit, and his watch, and his pocket...
whether it's a version-difference -- it _is_ another version --
or whether it's just a digitization mistake, i simply don't know.
(and since this is clearly not the explanation for the feedbooks
discrepancies, i have no interest in determining the reasons.)

pg#928 -- an .html version only -- _does_ have italics.   yay!

pg#28885 also has the italics, and it has the images as well!
although sometimes things don't go right, as shown here:
>    http://z-m-l.com/misc/alice-glitch.png

the feedbooks version has _no_ italics at all, as far as i can see,
not even the different set of italics from pg#19033.   so, at first,
i had thought they'd used pg#11 as their source text, but now,
i'm not so sure.   (they have some strange contractions, which
could indicate that they might've done their own digitization.)

at any rate, the point still stands that pg#11 lacks proper italics.
all-uppercase is _not_ a substitute.   and since the italics _have_
been done -- in pg#928 -- the changes should be incorporated
into pg#11.   the presence of other versions, at higher numbers,
won't change the fact that pg#11 is considered as "the original."

so, c'mon, let's see someone who talks about pg/dp "quality"
and "incremental improvement" actually _back_up_ the claim.
let's get this classic and legendary e-text cleaned up now, ok?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100301/4507835a/attachment-0001.html>

From Bowerbird at aol.com  Mon Mar  1 10:55:59 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 1 Mar 2010 13:55:59 EST
Subject: [gutvol-d] nathan hale -- 001
Message-ID: <1ec34.2b548214.38bd67bf@aol.com>

ok, in discussing one of the books on rfrank's roundless test-site,
i remarked that the scans were badly done, uncharacteristic of roger.

much more in keeping with his typical level of quality are the scans
from a new book on the site, a biography about nathan hale, so i've
scraped and remounted those scans on my site, and will do the book.

>    http://z-m-l.com/go/nhale/nhalep123.html
>    http://z-m-l.com/go/nhale/nhale.zml

as i've been doing lately, the last page of the book shows the changes
that i made to the file to clean it up, to show people how simple it is...

this book has a lot of correspondence in it (so it has salutations and
signatures and stuff like that), and i haven't done such books before,
so i'm gonna have to figure out how to handle all of that in z.m.l.,
which means looking at a bunch of p.g. e-texts to see how people
have represented them in the .txt versions up to this point in time...

meanwhile, i'm just dealing with them in a presentational manner,
either left-justifying or centering or right-justifying, as appropriate.
it should be easy to see how i've indicated that by viewing the pages.
(i simply used underbars and equal-signs to indicate each of them.)

the book still needs some work.   but even now, it's a demonstration
of how quickly one person can make a book available to the public...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100301/6ec5cbd4/attachment.html>

From jimad at msn.com  Mon Mar  1 21:19:25 2010
From: jimad at msn.com (Jim Adcock)
Date: Mon, 1 Mar 2010 21:19:25 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <201002260020.25754.donovan@abs.net>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<201002260020.25754.donovan@abs.net>
Message-ID: <SNT120-DS22B55534B4C198BE7386ECAE3B0@phx.gbl>

>Those 6200+ works already are available to the public, at minimum in
scanned 
pages form, and most of them with OCR available. The argument that these
works 
are "trapped" is a red herring stemming from frustration over how long it
now 
takes the DP process to produce a "finished" version of the text.

Sorry, but this is NOT a "red herring".  Looking at DP's own statistics on
this subject, the release rate is about 2/3rds the project start rate -- and
has been for many years.  Why does this matter -- "eventually all projects
will get released?"  Yes, but by the time "eventually" happens enough more
new books will be stuck on queues that it will continue to be true that the
release rate is about 2/3rds the project start rate.  This means DP is
running in a "self similar" mode where effectively 1/3 of all projects that
get started DON'T get released.  Which means that 1/3 of all volunteer
effort is being wasted.  One might say "OK, let's just slow down the project
start rate."  If you do that then P1s do not have interesting projects to
work on and they get frustrated and go do something else with their time.
But DP NEEDS to have the P1s because DP grows those -- eventually -- to be
the P3s and the F2s and the PPs necessary to get the queues unstuck.  But
the queues can't get unstuck because increasing the start rate to attract
the P1s in turn clogs the queues.  So again, what is the solution?  1)
Increase the number of P3s, F2s, and PPs by reducing the qualifications.  Or
2) improve the tools available to P3s, F2s, and PPs to make them more
productive. DP can't fix the problem without changing.

If you don't understand this, please take a closer look at the plot that DP
makes available at:

http://www.pgdp.net/c/stats/stats_central.php

where you can see that one third of projects created DO NOT get released
because they are stuck on queues.  As more books get released it is also
true that more books get stuck on queues and the ratio remains the same: 1/3
of books DO NOT get released because they are stuck on queues. Which means
that 1/3 of volunteer efforts are being wasted by a flawed process.


From jimad at msn.com  Mon Mar  1 21:41:01 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Mar 2010 21:41:01 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<B592E6447AAB45C891E108507D6811FA@alp2400>	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>	<A871E93708224C1FAB0E62A643357D64@alp2400>	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>	<AA902F07008A4680BD1E6D65699E82CD@alp2400>	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
Message-ID: <SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>

>> I think we're worried about the fact that the only version available
>> is one that you have to BUY that's based on our volunteer labor.

>If that's at least an option, why not?  Nobody forces you to buy it,
>though.

 
Again, I as an unpaid volunteer don't appreciate having my time and effort
converted into a for-profit enterprise before my public domain efforts have
reached fruition through DP.  The end result is that I get turned off of DP
and go "solo" instead.  When I go "solo" I admittedly create works that are
*somewhat* more buggy than DP claims to make.  The difference it that my
efforts see the light of day this month rather than three and a half years
from now.  When my NFP volunteer efforts are used poorly then I find
somewhere else to volunteer my time and efforts.  Why should DP care?  Well,
which "DP" are we talking about?  The DP made up of volunteers who get
frustrated by the inefficiencies and leave?   Or the DP made up of lifers
who don't want to see change?


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100301/f7fc20b1/attachment.html>

From dakretz at gmail.com  Mon Mar  1 21:55:26 2010
From: dakretz at gmail.com (don kretz)
Date: Mon, 1 Mar 2010 21:55:26 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <SNT120-DS22B55534B4C198BE7386ECAE3B0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<201002260020.25754.donovan@abs.net>
	<SNT120-DS22B55534B4C198BE7386ECAE3B0@phx.gbl>
Message-ID: <627d59b81003012155w1f6b5c87n79213695a34a9574@mail.gmail.com>

It's worse than that. We all know there is a large invisible queue of
projects that
aren't being posted at all because of the daunting prospect of possibly
never
seeing your project complete in your own lifetime. And we keep adding tricky
new loops and spins for the benefit  of one or another deserving category of

workers or project types, making the ability to forecast the schedule for
*your* project highly speculative.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100301/f8c04baa/attachment.html>

From walter.van.holst at xs4all.nl  Mon Mar  1 22:33:28 2010
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Tue, 02 Mar 2010 07:33:28 +0100
Subject: [gutvol-d] Re:
 =?utf-8?q?=5BSPAM=5D_Re=3A_Re=3A_the_d=2Ep=2E_opinion_on_=22p?=
 =?utf-8?q?rerelease=22_of=09e-texts?=
In-Reply-To: <SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
References: "\"<20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<B592E6447AAB45C891E108507D6811FA@alp2400>	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>	<A871E93708224C1FAB0E62A643357D64@alp2400>	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>	<AA902F07008A4680BD1E6D65699E82CD@alp2400>	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>"
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>"
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
Message-ID: <a95d02ffa62d3e499d445f1560d54909@xs4all.nl>

On Mon, 1 Mar 2010 21:41:01 -0800, "James Adcock" <jimad at msn.com> wrote:

> Again, I as an unpaid volunteer don't appreciate having my time and
> effort converted into a for-profit enterprise before my public domain
> efforts have reached fruition through DP. The end result is that I get
> turned off of DP and go "solo" instead. When I go "solo" I admittedly
> create works that are *SOMEWHAT* more buggy than DP claims to make. The
> difference it that my efforts see the light of day this month rather
than
> three and a half years from now. When my NFP volunteer efforts are used
> poorly then I find somewhere else to volunteer my time and efforts. Why
> should DP care? Well, which "DP" are we talking about? The DP made up of
> volunteers who get frustrated by the inefficiencies and leave? Or the DP
> made up of lifers who don't want to see change?

The end result will still be in the public domain and can be scooped up by
any entity, commercial or non-commercial. I don't really see the point you
are trying to make.

Regards,

 Walter

From sankarrukku at gmail.com  Mon Mar  1 22:50:27 2010
From: sankarrukku at gmail.com (Sankar Viswanathan)
Date: Tue, 2 Mar 2010 12:20:27 +0530
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
Message-ID: <e45c9fe71003012250j64e1267s3a88318fdc53b0@mail.gmail.com>

Why
> should DP care? Well, which "DP" are we talking about? The DP made up of
> volunteers who get frustrated by the inefficiencies and leave? Or the DP
> made up of lifers who don't want to see change?

The above two categories form a very small percentage of D.P volunteers. The
vast majority (who are silent) are continuing to work in D.P. They are aware
of the problems and hope that solutions would be found shortly. They are
convinced that the D.P Board would implement changes for effecting a better
flow of the books.

-- 
Sankar

Service to Humanity is Service to God
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/eec229a5/attachment.html>

From jimad at msn.com  Tue Mar  2 08:37:42 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Mar 2010 08:37:42 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
References: "\"<20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<B592E6447AAB45C891E108507D6811FA@alp2400>	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>	<A871E93708224C1FAB0E62A643357D64@alp2400>	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>	<AA902F07008A4680BD1E6D65699E82CD@alp2400>	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>"	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>"	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
Message-ID: <SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>

>The end result will still be in the public domain and can be scooped up by
any entity, commercial or non-commercial. I don't really see the point you
are trying to make.

The "end result" to date is that a commercial company has taken my
not-for-profit work off DP at SR time and redistributed it under DRM such
that it cannot to date be "scooped up" by any other entity, commercial or
non-commercial.  The "end result" to date is that the donation of my time
and effort to a non-profit activity has been privatized for other's profit
without any contribution to the non-profit community.  This is typically
called "conversion" and is typically considered at least morally to be theft
of non-profit contributions.  If I wanted to work for profit I would do so
in the first place -- and would do so for my own profit rather that of
bottom feeders who prey on DP. Again, if "DP" [whoever that is] doesn't care
about these issues, *I DO*, and so I will put my volunteer efforts elsewhere
-- where my volunteer efforts WILL go in fact into NFP, and where my
volunteer efforts WILL make a positive impact on the world in a finite
amount of time.


From grythumn at gmail.com  Tue Mar  2 08:59:36 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Tue, 2 Mar 2010 11:59:36 -0500
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
Message-ID: <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>

On Tue, Mar 2, 2010 at 11:37 AM, Jim Adcock <jimad at msn.com> wrote:
> The "end result" to date is that a commercial company has taken my
> not-for-profit work off DP at SR time and redistributed it under DRM such
> that it cannot to date be "scooped up" by any other entity, commercial or
> non-commercial. ?The "end result" to date is that the donation of my time
> and effort to a non-profit activity has been privatized for other's profit
> without any contribution to the non-profit community. ?This is typically
> called "conversion" and is typically considered at least morally to be theft
> of non-profit contributions. ?If I wanted to work for profit I would do so
> in the first place -- and would do so for my own profit rather that of
> bottom feeders who prey on DP. Again, if "DP" [whoever that is] doesn't care
> about these issues, *I DO*, and so I will put my volunteer efforts elsewhere
> -- where my volunteer efforts WILL go in fact into NFP, and where my
> volunteer efforts WILL make a positive impact on the world in a finite
> amount of time.

I'm not sure if you understand what "Public Domain" means. It is not
not-for-profit... it means there is _no_ restriction on further use of
the text. Someone can reprint it, use it for derivative works, fold,
spindle, mutilate, write slash, whatever, at any point[0]. There is no
copyright restriction attached, and *no legal way to prevent
redistribution*[1]. It also works the other way... the independent
commercial entity that republished the text on Amazon has no way to
prevent us from putting the final, polished text up *for free* at PG
once it finishes PP/PPV. Also, it can indeed be "scooped up" by anyone
else who wishes to at DP before that point.

DP, the organization, is a not-for-profit. The material that the
organization works upon are Public Domain in the US.

R C
[0] Technically there is an automatic copyright on the annotations
that the proofers insert... they'd have to strip the [**] notes.
[1] Trademarks can turn up in specific cases, but that's another issue entirely.

From jimad at msn.com  Tue Mar  2 09:43:34 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Mar 2010 09:43:34 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<AA902F07008A4680BD1E6D65699E82CD@alp2400>	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
Message-ID: <SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>

>I'm not sure if you understand what "Public Domain" means. 

I certainly understand what it means.  I volunteer my not-for-profit efforts
to make public domain works.  Those works IN PRACTICE enter the public
domain when PG makes them available to the public, not before then. When
books get stuck on DP queues "forever" then for-profits pick them up from SR
and distribute them under DRM at which point in time the book still IN
PRACTICE fails to enter the public domain.  This makes me unhappy, not
principally because a for-profit has picked up the book but rather because
DP continues to fail to recognize that their current queuing system and work
rules are busted, such that effectively one third of the effort contributed
to DP never in practice reaches the public domain, which in turn wastes my
time and effort when I volunteer there -- not to mention more importantly
the time and effort of 1000's of others who volunteer there. But, instead of
recognizing that the current system is busted and that people there need to
fix it what happens instead is that DP'ers insult the intelligence of people
who try to point out to them that the current system is in fact busted.
Again, under the current DP system for every three books started two books
get released.  This means that about 1/3 of the DP volunteers efforts are
effectively being wasted.


From klofstrom at gmail.com  Tue Mar  2 10:16:27 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Tue, 2 Mar 2010 08:16:27 -1000
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
Message-ID: <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>

On Tue, Mar 2, 2010 at 7:43 AM, Jim Adcock <jimad at msn.com> wrote:

> But, instead of recognizing that the current system is busted and that people there need to fix it what happens instead is that DP'ers insult the intelligence of people who try to point out to them that the current system is in fact busted.

Jim, we've known that it's busted for quite some time. You don't need
to scream at us and tell us we're idiots and fools if we don't do what
YOU order us to do, immediately. The negative reaction you're getting
is to your tone and tactics, not your news flash.

The problem is knowing just how the fix the beast while it's careering
along -- like fixing your car while it's in motion. Because I'm not a
programmer, I can't contribute to the solution, but I have high hopes
that someone will code a system that can be shown (by experiment, in
practice) to work better. Once there's a working prototype, you'll see
movement.

--
Karen Lofstrom
aka Zora

From grythumn at gmail.com  Tue Mar  2 10:29:41 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Tue, 2 Mar 2010 13:29:41 -0500
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
Message-ID: <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>

On Tue, Mar 2, 2010 at 12:43 PM, Jim Adcock <jimad at msn.com> wrote:
>>I'm not sure if you understand what "Public Domain" means.
> I certainly understand what it means. ?I volunteer my not-for-profit efforts
> to make public domain works. ?Those works IN PRACTICE enter the public
> domain when PG makes them available to the public, not before then. When
> books get stuck on DP queues "forever" then for-profits pick them up from SR
> and distribute them under DRM at which point in time the book still IN
> PRACTICE fails to enter the public domain. ?This makes me unhappy, not
> principally because a for-profit has picked up the book but rather because
> DP continues to fail to recognize that their current queuing system and work
> rules are busted, such that effectively one third of the effort contributed
> to DP never in practice reaches the public domain, which in turn wastes my
> time and effort when I volunteer there -- not to mention more importantly
> the time and effort of 1000's of others who volunteer there. But, instead of
> recognizing that the current system is busted and that people there need to
> fix it what happens instead is that DP'ers insult the intelligence of people
> who try to point out to them that the current system is in fact busted.
> Again, under the current DP system for every three books started two books
> get released. ?This means that about 1/3 of the DP volunteers efforts are
> effectively being wasted.

Copyright works have to be in the public domain before any at DP
touches it. It's still in the public domain while at DP, and it is in
the public domain when it leaves DP for PG. We can try[1] to restrict
access to intermediate stages by technical means, but we do NOT have
any legal means to prevent redistribution short of trying something
with contract law (a EULA or such).[2]

You also seem to believe there is a black hole at DP where 1 out of 3
books fall into, never to emerge. This is a patent fallacy. Some books
DO get shortstopped in the middle of the process (for missing pages
and other issues) but it is nowhere near 1 in 3 and there is
significant effort (the project hospital) to push these back into the
active process. The closest thing to a black hole is PP: Available,
where books can indeed sit indefinitely... but most don't.

I'm not going to argue this any further with you, though. People have
long been aware of the problem, and it is clear that nothing I say
will influence you.

R C
[1] It would be a bad idea IMO, but it has been tried in the past.
[2] Which would be both impractical, and against the principles of
trying to get public domain works accessible, again IMO.

From marcello at perathoner.de  Tue Mar  2 11:16:13 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Mar 2010 20:16:13 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
 of	e-texts
In-Reply-To: <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
Message-ID: <4B8D63FD.5020102@perathoner.de>

Robert Cicconetti wrote:

> Copyright works have to be in the public domain before any at DP
> touches it. It's still in the public domain while at DP, and it is in
> the public domain when it leaves DP for PG. We can try[1] to restrict
> access to intermediate stages by technical means, but we do NOT have
> any legal means to prevent redistribution short of trying something
> with contract law (a EULA or such).[2]

What???

Are you saying everybody can steal everybody's else's files if they 
contain only PD material?

If you *publish* PD material, everybody can take it and re-use it as 
they see fit. To publish something means to make it available to everybody.

If you keep PD material on a workgroup server which is not accessible to 
the public at large and somebody grabs this material without your 
permission, then the material is *stolen* and you can prosecute them. 
(Provided you can prove that it was indeed your file, which should not 
be difficult because the scanno pattern is practically a watermark.)


-- 
Marcello Perathoner
webmaster at gutenberg.org

From grythumn at gmail.com  Tue Mar  2 11:31:10 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Tue, 2 Mar 2010 14:31:10 -0500
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <4B8D63FD.5020102@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de>
Message-ID: <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>

On Tue, Mar 2, 2010 at 2:16 PM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> Robert Cicconetti wrote:
> What???
>
> Are you saying everybody can steal everybody's else's files if they contain
> only PD material?
>
> If you *publish* PD material, everybody can take it and re-use it as they
> see fit. To publish something means to make it available to everybody.
>
> If you keep PD material on a workgroup server which is not accessible to the
> public at large and somebody grabs this material without your permission,
> then the material is *stolen* and you can prosecute them. (Provided you can
> prove that it was indeed your file, which should not be difficult because
> the scanno pattern is practically a watermark.)

We're not talking about computer trespassing; the discussion is in
regards to publicly available public domain material, not locked up on
someone's personal computer or server.  PG has procedures for
establishing whether a random etext found online is public domain
work, and allowing people to republish it at PG.

http://www.gutenberg.org/wiki/Gutenberg:Copyright_Confirmation_How-To

Random scannos do not establish a new copyrightable work, nor does
sweat-of-brow. (Under current US law, etc etc.)

R C

From dakretz at gmail.com  Tue Mar  2 11:51:44 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 2 Mar 2010 11:51:44 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
Message-ID: <627d59b81003021151m60c0b2d3ue0a5f18981d22a69@mail.gmail.com>

The queues also seem to have the effect of promoting the release
of short, easier projects at the expense of longer, more challenging
ones. Consequently some of the more significant works are delayed.

In June of 2005, the nine volumes of The Works of William Shakespeare -
Cambridge Editionwere submitted. This was before the queues era, and the
records
aren't clear, but the first volume (processed as 6 separate projects,
1 play per project) were completed and became available by
the end of 2006.

Volumes 2 to 8 are sitting in the F2 queue, waiting to be released
so they can be formatted as the last step before post-processing
and eventual submission to PG. The first of them has yet to make
its way completely through since the introduction of queueing.

(I can't tell where Volume 9 is - it may not have been submitted yet.)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/9daffc83/attachment-0001.html>

From marcello at perathoner.de  Tue Mar  2 12:02:47 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Mar 2010 21:02:47 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
 of	e-texts
In-Reply-To: <15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>	<4B8D63FD.5020102@perathoner.de>
	<15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>
Message-ID: <4B8D6EE7.5070409@perathoner.de>

Robert Cicconetti wrote:
> On Tue, Mar 2, 2010 at 2:16 PM, Marcello Perathoner
> <marcello at perathoner.de> wrote:
>> Robert Cicconetti wrote:
>> What???
>>
>> Are you saying everybody can steal everybody's else's files if they contain
>> only PD material?
>>
>> If you *publish* PD material, everybody can take it and re-use it as they
>> see fit. To publish something means to make it available to everybody.
>>
>> If you keep PD material on a workgroup server which is not accessible to the
>> public at large and somebody grabs this material without your permission,
>> then the material is *stolen* and you can prosecute them. (Provided you can
>> prove that it was indeed your file, which should not be difficult because
>> the scanno pattern is practically a watermark.)
> 
> We're not talking about computer trespassing; the discussion is in
> regards to publicly available public domain material, not locked up on
> someone's personal computer or server.

We are talking about files that are sitting in some queue on a DP 
server. The DP server is not publicly accessible: It asks for a 
password. Taking a file out of a password-protected site and making it 
public without the site owner's permission is illegal. It is irrelevant 
if the file contains PD material or not.

Try an art collector's home and explain to him that you have a *right* 
to enter and photograph his Monet because it happens to be in the public 
domain...


-- 
Marcello Perathoner
webmaster at gutenberg.org

From greg at durendal.org  Tue Mar  2 12:01:14 2010
From: greg at durendal.org (Greg Weeks)
Date: Tue, 2 Mar 2010 15:01:14 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on
 "prerelease" of e-texts
In-Reply-To: <15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1003021459460.16217@durendal.durendal.org>

On Tue, 2 Mar 2010, Robert Cicconetti wrote:

> redistribution*[1]. It also works the other way... the independent
> commercial entity that republished the text on Amazon has no way to
> prevent us from putting the final, polished text up *for free* at PG
> once it finishes PP/PPV. Also, it can indeed be "scooped up" by anyone
> else who wishes to at DP before that point.

Well no it can't. Mostly they put DRM on it, so it's a felony in the US to 
do anything with it. Now if someone like manybooks gets it I don't care.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From grythumn at gmail.com  Tue Mar  2 12:07:48 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Tue, 2 Mar 2010 15:07:48 -0500
Subject: [gutvol-d] Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on
	"prerelease" of e-texts
In-Reply-To: <alpine.DEB.2.00.1003021459460.16217@durendal.durendal.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<alpine.DEB.2.00.1003021459460.16217@durendal.durendal.org>
Message-ID: <15cfa2a51003021207v51a17093g11b62a73bd91df2e@mail.gmail.com>

On Tue, Mar 2, 2010 at 3:01 PM, Greg Weeks <greg at durendal.org> wrote:
> On Tue, 2 Mar 2010, Robert Cicconetti wrote:
>
>> redistribution*[1]. It also works the other way... the independent
>> commercial entity that republished the text on Amazon has no way to
>> prevent us from putting the final, polished text up *for free* at PG
>> once it finishes PP/PPV. Also, it can indeed be "scooped up" by anyone
>> else who wishes to at DP before that point.
>
> Well no it can't. Mostly they put DRM on it, so it's a felony in the US to
> do anything with it. Now if someone like manybooks gets it I don't care.

"Also, it can indeed be "scooped up" by anyone else who wishes to at
DP before that point."

Note I said it is accessible at DP, not suggesting that one break DRM.

-Bob

From greg at durendal.org  Tue Mar  2 12:04:35 2010
From: greg at durendal.org (Greg Weeks)
Date: Tue, 2 Mar 2010 15:04:35 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on
 "prerelease" of e-texts
In-Reply-To: <4B8D6EE7.5070409@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de>
	<15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>
	<4B8D6EE7.5070409@perathoner.de>
Message-ID: <alpine.DEB.2.00.1003021502120.16217@durendal.durendal.org>

On Tue, 2 Mar 2010, Marcello Perathoner wrote:

> We are talking about files that are sitting in some queue on a DP server. The 
> DP server is not publicly accessible: It asks for a password. Taking a file 
> out of a password-protected site and making it public without the site 
> owner's permission is illegal. It is irrelevant if the file contains PD 
> material or not.

I suspect that wouldn't fly in the US. There's no restriction on getting 
an account, so it's likely there was no trespass. Maybe a TOS violation, 
but I don't think there's anything preventing this in the DP TOS, and I 
don't think there should be in general. Even if it does sometimes irritate 
me.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From marcello at perathoner.de  Tue Mar  2 12:29:08 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Mar 2010 21:29:08 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on
 "prerelease" of e-texts
In-Reply-To: <alpine.DEB.2.00.1003021502120.16217@durendal.durendal.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>	<4B8D63FD.5020102@perathoner.de>	<15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>	<4B8D6EE7.5070409@perathoner.de>
	<alpine.DEB.2.00.1003021502120.16217@durendal.durendal.org>
Message-ID: <4B8D7514.6080705@perathoner.de>

Greg Weeks wrote:
> On Tue, 2 Mar 2010, Marcello Perathoner wrote:
> 
>> We are talking about files that are sitting in some queue on a DP 
>> server. The DP server is not publicly accessible: It asks for a 
>> password. Taking a file out of a password-protected site and making it 
>> public without the site owner's permission is illegal. It is 
>> irrelevant if the file contains PD material or not.
> 
> I suspect that wouldn't fly in the US. There's no restriction on getting 
> an account, so it's likely there was no trespass. Maybe a TOS violation, 
> but I don't think there's anything preventing this in the DP TOS, and I 
> don't think there should be in general. Even if it does sometimes 
> irritate me.

That would very well fly. I don't believe the DP TOS allow you to take a 
file out and publish it on your own. And if they allow that, I don't 
understand all the fuss they are making against a PG preprint distribution.

Oh, and all those signs that say you can't take any pictures in US. 
museums, don't they fly?


-- 
Marcello Perathoner
webmaster at gutenberg.org

From greg at durendal.org  Tue Mar  2 12:36:19 2010
From: greg at durendal.org (Greg Weeks)
Date: Tue, 2 Mar 2010 15:36:19 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p.
 opinion on "prerelease" of e-texts
In-Reply-To: <4B8D7514.6080705@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de>
	<15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>
	<4B8D6EE7.5070409@perathoner.de>
	<alpine.DEB.2.00.1003021502120.16217@durendal.durendal.org>
	<4B8D7514.6080705@perathoner.de>
Message-ID: <alpine.DEB.2.00.1003021533010.16481@durendal.durendal.org>

On Tue, 2 Mar 2010, Marcello Perathoner wrote:

> Greg Weeks wrote:
>> On Tue, 2 Mar 2010, Marcello Perathoner wrote:
>> 
>>> We are talking about files that are sitting in some queue on a DP server. 
>>> The DP server is not publicly accessible: It asks for a password. Taking a 
>>> file out of a password-protected site and making it public without the 
>>> site owner's permission is illegal. It is irrelevant if the file contains 
>>> PD material or not.
>> 
>> I suspect that wouldn't fly in the US. There's no restriction on getting an 
>> account, so it's likely there was no trespass. Maybe a TOS violation, but I 
>> don't think there's anything preventing this in the DP TOS, and I don't 
>> think there should be in general. Even if it does sometimes irritate me.
>
> That would very well fly. I don't believe the DP TOS allow you to take a file 
> out and publish it on your own. And if they allow that, I don't understand 
> all the fuss they are making against a PG preprint distribution.

It's generally been admitted that they can't stop it. It's if it should be 
officially sanctioned or not.

> Oh, and all those signs that say you can't take any pictures in US. museums, 
> don't they fly?

Only to the extent that if they ask you to leave and if you don't comply 
you are trespassing. They cannot make you delete any pictures you've 
taken. They can't stop you from doing anything with the picture you want 
if the art doesn't currently have a copyright.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From grythumn at gmail.com  Tue Mar  2 12:51:18 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Tue, 2 Mar 2010 15:51:18 -0500
Subject: [gutvol-d] Re: [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on
	"prerelease" of e-texts
In-Reply-To: <4B8D7514.6080705@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de>
	<15cfa2a51003021131q721ca6baq7aa8a28ff1efef8f@mail.gmail.com>
	<4B8D6EE7.5070409@perathoner.de>
	<alpine.DEB.2.00.1003021502120.16217@durendal.durendal.org>
	<4B8D7514.6080705@perathoner.de>
Message-ID: <15cfa2a51003021251g75da7277g975f569048bf06c1@mail.gmail.com>

On Tue, Mar 2, 2010 at 3:29 PM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> That would very well fly. I don't believe the DP TOS allow you to take a
> file out and publish it on your own. And if they allow that, I don't
> understand all the fuss they are making against a PG preprint distribution.

The difference is between something that is tolerated, and an
officially sanctioned central repository. Also, I think the arguments
for posting text and HTML separately got confused with the arguments
about posting earlier in the process. phpBB's threading is...
suboptimal.

Personally, I'm in the pre-publish camp (after it passes each round,
by preference. There's little point in splitting TXT and HTML posting
at PP). As well as making p1->p1 opt-out, p3 opt in, parallel f1
opt-out[1], and f2 opt in.

R C
[1] Means a little more work for the PM to do the merge, but worth it
IMO for simpler works. Would need some relatively minor tool or dev
support.

From Bowerbird at aol.com  Tue Mar  2 14:12:50 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Mar 2010 17:12:50 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <77bdd.26df5c3.38bee762@aol.com>

bob said, to jim-
>   I'm not going to argue this any further with you, though.

truth be told, you've haven't provided any argumentation anyway.

you ignored jim's main point, to argue some legalistic crap which
jim knows quite well and was never in dispute.

indeed, it is precisely the troubling fact that material which _is_
"in the public domain" in a _legal_ sense, but is only _available_
for sale, because d.p. can't get it out the door, that's the point...

and if you have nothing to say in regard to that point, then it's
probably a good thing that you stop posting replies of any type.

(except that you've illustrated jim's point about d.p. apologists.)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/be3991a4/attachment-0001.html>

From gbnewby at pglaf.org  Tue Mar  2 14:16:37 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 2 Mar 2010 14:16:37 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <4B8D63FD.5020102@perathoner.de>
References: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de>
Message-ID: <20100302221637.GA27060@pglaf.org>

On Tue, Mar 02, 2010 at 08:16:13PM +0100, Marcello Perathoner wrote:
> Robert Cicconetti wrote:
> 
> >Copyright works have to be in the public domain before any at DP
> >touches it. It's still in the public domain while at DP, and it is in
> >the public domain when it leaves DP for PG. We can try[1] to restrict
> >access to intermediate stages by technical means, but we do NOT have
> >any legal means to prevent redistribution short of trying something
> >with contract law (a EULA or such).[2]
> 
> What???
> 
> Are you saying everybody can steal everybody's else's files if they
> contain only PD material?
> 
> If you *publish* PD material, everybody can take it and re-use it as
> they see fit. To publish something means to make it available to
> everybody.
> 
> If you keep PD material on a workgroup server which is not
> accessible to the public at large and somebody grabs this material
> without your permission, then the material is *stolen* and you can
> prosecute them. (Provided you can prove that it was indeed your
> file, which should not be difficult because the scanno pattern is
> practically a watermark.)

These don't seem like strongly conflicting statements.  Our "no sweat of
the brow how-to" gives a similar view.

IF someone were to gain illicit access to files at DP or elsewhere,
regardless of whether they were public domain, various legal remedies
could be applied.  (Quite a few, and most countries have their own set
of remedies ranging from contracts, to EULAs, to things like computer
fraud & abuse or misappropriation of resources.)

But as Robert mentioned, that doesn't change that the public domain
content is still public domain...no matter how much value has been added
through scanning, OCR, proofreading, etc.  What happens if such content
mysterioulsy, untraceably extracts itself from DP and becomes available
elsewhere?  Well, it's still public domain.

(Bonus reading assignment: Steven Levy's "Crypto," which describes how
the PGP software, which was ineligible for export from the US, found its
way into other countries -- where it was perfectly legal to use.)

  -- Greg

PS: Over the years, I've been involved in various efforts to bring
legal remedies to online incidents.  It is very hard to do, especially
when there is little or no money involved.  Doubly-especially if any
of the actors are in different countries.  Robert's emphasis on technical
measures, versus more legalistic ones, is more likely to give satisfaction.

From jimad at msn.com  Tue Mar  2 14:41:56 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 2 Mar 2010 14:41:56 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
Message-ID: <SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>

>The negative reaction you're getting
is to your tone and tactics, not your news flash.

Sorry, but *my* negative reactions are based on DP people who say:

a) That there is no problem having books stuck on queues for an average of
3.5 years now.

And/or

b) Offer "solutions" which will not in fact reduce the size of the queues
and how long books sit there.

Again:

a) There IS a problem with having books stuck on queues, including that fact
that 1/3 of the volunteers' time and energy is being wasted currently.

b) Any proposed "solution" has to in fact act to reduce the size of the
queues and how long books sit there.  And it needs to do so without chasing
away any class of volunteers including P1s -- since P1s represent the future
of DP.

One simple suggestion to start with would be to start by changing the stated
"Goals" for P3 and F2 and PP to be larger than the Goals for P2 and F1.  To
do otherwise is to have DP suggesting that they want the queues to be even
longer than they are now. Right now the stated goals for P2 and F1 are
larger than the stated goals for P3 and PP -- which will only make the
queuing situation worse. The fact that the "Goals" are inverted would seem
to imply that the powers that be do not understand the nature of the problem
-- in which case how can they fix it?


From dakretz at gmail.com  Tue Mar  2 14:42:55 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 2 Mar 2010 14:42:55 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <20100302221637.GA27060@pglaf.org>
References: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org>
Message-ID: <627d59b81003021442j1ff97b5bk4b0dc96604c02c22@mail.gmail.com>

And what's the message that we send when we use someone else's work (the
book)
that someone else scans, and someone else collects, posts, and manages (TIA)
and a bunch of other people proof and/or format, and then keep that
accumulated
and integrated value  that's been generously and freely provided for us to
usel
locked away exclusively for several years for one Post Processor to work on,

when they get around to it?

On Tue, Mar 2, 2010 at 2:16 PM, Greg Newby <gbnewby at pglaf.org> wrote:

> On Tue, Mar 02, 2010 at 08:16:13PM +0100, Marcello Perathoner wrote:
> > Robert Cicconetti wrote:
> >
> > >Copyright works have to be in the public domain before any at DP
> > >touches it. It's still in the public domain while at DP, and it is in
> > >the public domain when it leaves DP for PG. We can try[1] to restrict
> > >access to intermediate stages by technical means, but we do NOT have
> > >any legal means to prevent redistribution short of trying something
> > >with contract law (a EULA or such).[2]
> >
> > What???
> >
> > Are you saying everybody can steal everybody's else's files if they
> > contain only PD material?
> >
> > If you *publish* PD material, everybody can take it and re-use it as
> > they see fit. To publish something means to make it available to
> > everybody.
> >
> > If you keep PD material on a workgroup server which is not
> > accessible to the public at large and somebody grabs this material
> > without your permission, then the material is *stolen* and you can
> > prosecute them. (Provided you can prove that it was indeed your
> > file, which should not be difficult because the scanno pattern is
> > practically a watermark.)
>
> These don't seem like strongly conflicting statements.  Our "no sweat of
> the brow how-to" gives a similar view.
>
> IF someone were to gain illicit access to files at DP or elsewhere,
> regardless of whether they were public domain, various legal remedies
> could be applied.  (Quite a few, and most countries have their own set
> of remedies ranging from contracts, to EULAs, to things like computer
> fraud & abuse or misappropriation of resources.)
>
> But as Robert mentioned, that doesn't change that the public domain
> content is still public domain...no matter how much value has been added
> through scanning, OCR, proofreading, etc.  What happens if such content
> mysterioulsy, untraceably extracts itself from DP and becomes available
> elsewhere?  Well, it's still public domain.
>
> (Bonus reading assignment: Steven Levy's "Crypto," which describes how
> the PGP software, which was ineligible for export from the US, found its
> way into other countries -- where it was perfectly legal to use.)
>
>  -- Greg
>
> PS: Over the years, I've been involved in various efforts to bring
> legal remedies to online incidents.  It is very hard to do, especially
> when there is little or no money involved.  Doubly-especially if any
> of the actors are in different countries.  Robert's emphasis on technical
> measures, versus more legalistic ones, is more likely to give satisfaction.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/269b1e60/attachment.html>

From Bowerbird at aol.com  Tue Mar  2 15:10:54 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Mar 2010 18:10:54 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <9889.63fb9605.38bef4fe@aol.com>

jim quoted someone as saying:
>   The negative reaction you're getting
>    is to your tone and tactics, not your news flash.

now, i didn't see this quote when it came up originally,
meaning that it probably came from someone who is
in my spam folder, which would mean marcello or zora.

i'm betting it's zora, the supreme apologist here for d.p.

this is the kind of "blame-the-messenger" crap that they
_love_ to do over at d.p.   they can't argue with the message,
so they talk about your "tone" instead, when it's their own
damn fault that you had to adopt that tone in the first place,
because they're bound and determined to ignore you totally.

and that's because they are incapable of solving anything.

and it's interesting to see _why_ d.p. can't solve anything,
as the d.p. people here -- right on up to board member
newby -- are unable to avoid dragging a thread off-topic.
(although we must give marcello credit for a serious detour,
by raising a phantom that files are being _stolen_ from d.p.)

i mean, seriously, you want to witness something _amazing_,
just take a look at recent posts in this thread, were _jim_ is
the one who manages to (a) stay on topic! and (b) make sense.

jim!, for crying out loud, the same jim who often has difficulty
arguing his way out of a wet paper bag, and he's the one here
who is doing the _best_, absolutely outshining all of the rest!

so, of course, let's attack jim, and his "tone and tactics"...

let me break it down to a nutshell...

d.p. has thousands of proofers doing p1, the first proof pass.
d.p. has hundreds of proofers in p2, the more-careful pass.
d.p. has dozens of proofers for p3, the "final final pass" pass.

i don't know about you, but i'd expect that a "final pass" will
take a closer reading (and thus more time) than a first-pass,
but assume p1 and p2 and p3 proofers all take the same time.

somehow, however, the fundamental workflow at d.p. expects 
dozens of p3 people to keep up with thousands of p1 people...

this is ridiculous on the face of it.   and this is the main problem.

or perhaps the _main_ problem is that the d.p. "powers that be"
could seriously install such a ridiculous-on-its-face workflow...

whichever way we look at it, it's purely and absolutely ludicrous.
(quick, somebody please give me more synonyms for "ridiculous.")

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/88dd68bc/attachment.html>

From jimad at msn.com  Tue Mar  2 15:26:12 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 2 Mar 2010 15:26:12 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
Message-ID: <SNT120-DS2AB43A630E55FA5DB5E37AE3B0@phx.gbl>

>You also seem to believe there is a black hole at DP where 1 out of 3
books fall into, never to emerge. This is a patent fallacy. 

The fallacy is in assuming that the only way DP can waste volunteer efforts
is to never ship some particular book.  On the contrary large and increasing
queue sizes can as effectively waste volunteer efforts as much as never
shipping some particular book.  Again, consider the Russian Roulette test:
DP managers randomly shoot 1/3 of the projects at DP (prior to PP).  How do
these murders affect the shipping rate out of DP?  Answer: They don't change
the shipping rate out of DP.  Conclusion: If you can destroy 1/3 of the
projects at DP without affecting the productivity rate out of DP then 1/3 of
the productivity at DP is being wasted.  How is that productivity being
wasted?  By sticking it on large and increasing queues.  Consider a factory
that only ships 2/3rds of what it ever starts to make.  Does the unfinished
inventory represent value or not? Well, the factory only *realizes* value by
shipping product.  The shipped product has value, and eventually every piece
of product gets shipped, but as long as the factory only ships 2/3rds of
everything it ever makes the fact remains that the cost of manufacturing is
50% higher than it need be.  IE the factory is only running at 2/3rds of its
potential productivity. That unfinished inventory *might* be considered to
have value, but only if new owners buy out the old owners, and change the
manufacturing process such that you don't have unshipped inventory plugging
up the factory anymore.  Or if buyers get tired of paying 50% more for
products than they should be and stop buying, then the factory has an
opportunity to work off that unfinished inventory, realizing its value --
assuming they can lure back buyers at the new now lower price that doesn't
include the wasted 50% markup for product started but not yet shipped. In
the DP case what this analogy means is that DP gets a chance to work off the
inventory if and when P1s get tired of DP wasting their time and energy and
thus stop putting new work into the head of the DP queue.  But DP needs P1s
since they represent the future of DP.

Now how can it be that a factory only ships 2/3rds of what it makes but at
the same time it eventually ships every item?  Consider for simplicity that
the factory makes rolls of toilet paper and ships those rolls out to
customers based on a "First In First Out" FIFO toilet paper roll queuing
system.  Does every roll of toilet paper eventually get shipped?  Yes.  But
the problem is is that the queues are constantly getting larger, and as they
do so they consume 1/3rd of the factory's resources.  Consider if we changed
to a "Last In First Out" queuing system.  Does that change the nature of the
problem?  NO -- a roll of toilet paper is a roll of toilet paper.  But now,
based on LIFO it becomes obvious that some rolls of paper never do get
shipped -- the 1/3rd of the older toilet paper rolls at any given time never
get shipped -- 1/3 of all toilet paper rolls every made, and the situation
keeps getting worse.  But the choice of FIFO vs. LIFO queuing system in no
way changes the nature of the problem -- a toilet paper roll is a toilet
paper roll.  Thus, on the contrary to the previously stated hypothesis, it
is NOT necessary to have a "black hole" in order to waste time and effort.
All that is necessary is to have a large and increasing queuing system --
whether that queuing system is LIFO or FIFO. Or stated another way, large
queuing systems ARE the black hole.  The mere fact that any given book
eventually makes it out of the queue is not sufficient to keep the large
queuing systems from being a black hole -- as long as the black hole
continues to suck in more than it spits out.


From ke at gnu.franken.de  Tue Mar  2 15:27:57 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Wed, 03 Mar 2010 00:27:57 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl> (James Adcock's
	message of "Tue, 2 Mar 2010 14:41:56 -0800")
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
Message-ID: <m28waaxq0y.fsf@gnu.franken.de>

"James Adcock" <jimad at msn.com> writes:

> a) That there is no problem having books stuck on queues for an average of
> 3.5 years now.

It's a storage "problem"--nothing more, nothing less.  There are books
waiting in the google cache for more than x years.  Not to mention all
the libraries...

The problem is you and me, who don't want to understand that is
impossible to read all the books in livetime.

-- 
Karl Eichwalder

From jimad at msn.com  Tue Mar  2 15:39:18 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 2 Mar 2010 15:39:18 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on
	"prerelease"	of	e-texts
In-Reply-To: <m28waaxq0y.fsf@gnu.franken.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
Message-ID: <SNT120-DS8B6835146941519282237AE3B0@phx.gbl>

>The problem is you and me, who don't want to understand that is
impossible to read all the books in livetime.

By the same argument volunteers should stop working on DP because there are
more books at PG than can be read in a lifetime...

...In fact there are more books stuck on the queues at DP than can be read
in a lifetime....


From pterandon at gmail.com  Tue Mar  2 15:56:13 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Tue, 2 Mar 2010 18:56:13 -0500
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <a0bf3e961003021556i1882a602p3daca82ce7d51466@mail.gmail.com>

From: don kretz <dakretz at gmail.com>
> And what's the message that we send when we use someone
> else's work (the book) that someone else scans, and
> someone else collects, posts, and manages (TIA) and a
> bunch of other people proof and/or format, and then keep
> that accumulated and integrated value  that's been generously
> and freely provided for us to use locked away exclusively
> for several years for one Post Processor to work on,
> when they get around to it?


Is something else happening to the work during this time-- like papers,
etc., being written on it?


-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/8052da0c/attachment.html>

From Bowerbird at aol.com  Tue Mar  2 16:49:19 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Mar 2010 19:49:19 EST
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <10608.54ae9312.38bf0c0f@aol.com>


so, carel, i hope i didn't scare you away...             ;+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100302/4dc5d4ec/attachment.html>

From marcello at perathoner.de  Tue Mar  2 23:04:32 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 03 Mar 2010 08:04:32 +0100
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <20100302221637.GA27060@pglaf.org>
References: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>	<4B8D63FD.5020102@perathoner.de>
	<20100302221637.GA27060@pglaf.org>
Message-ID: <4B8E0A00.7070502@perathoner.de>

Greg Newby wrote:

> But as Robert mentioned, that doesn't change that the public domain
> content is still public domain...no matter how much value has been added
> through scanning, OCR, proofreading, etc.  What happens if such content
> mysterioulsy, untraceably extracts itself from DP and becomes available
> elsewhere?  Well, it's still public domain.

But you would sue them for trespass, not for copyright infringement.

> PS: Over the years, I've been involved in various efforts to bring
> legal remedies to online incidents.  It is very hard to do, especially
> when there is little or no money involved.  Doubly-especially if any
> of the actors are in different countries.  Robert's emphasis on technical
> measures, versus more legalistic ones, is more likely to give satisfaction.

Amazon would be an US company though. And sueing Amazon would bring some 
interesting facts to the public attention as to the provenience of some 
material they DRM.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From richfield at telkomsa.net  Tue Mar  2 23:22:24 2010
From: richfield at telkomsa.net (Jon Richfield)
Date: Wed, 03 Mar 2010 09:22:24 +0200
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <alpine.DEB.2.00.1002201628580.13867@mail.pglaf.org>
References: <4B76F4C6.3030006@teksavvy.com>	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>	<4B785C40.5000304@xs4all.nl>	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>	<SNT120-DS93F6FB34B82E76AE781E9AE450@phx.gbl>
	<alpine.DEB.2.00.1002201628580.13867@mail.pglaf.org>
Message-ID: <4B8E0E30.4090905@telkomsa.net>

Sorry, I have been out and had email problems etc...

I strongly urge you to follow up this line of thought. There are several 
sites on the internet doing fine work by making valuable material 
available, much of which is either full of scanning errors or even in 
scanned form. Is it satisfactory? Certainly not.  Is it worth making 
available against the time that someone else improves it, if ever? MOST 
certainly. Is it consonant with our dignity to prefer making perfection 
available? Certainly. Is it consonant with our dignity to sit on 
material in case bairns and fools think that the job should do itself?

Think about it.

Make it available first, and let anyone dissatisfied get busy and make 
it satisfactory.

Cheers,

Jon
>
> Let's just forget the whole idea of error free texts. . . .
>
> Ever since I started Project Gutenberg I've never seen even
> one book I read, even most articles and essays, without big
> bluders you would think could never be published.
>
> I would prefer just to get these materials in circulation--
> then worry about approaching perfection along with Xeno.
>
> Does anybody have a serious objection to putting the 8,000,
> or so, books that were listed earlier as being in limbo, in
> something like our "PrePrints" section, where we put eBooks
> that are admittedly not ready for prime time???
>
> Please. . . .
>


From gbnewby at pglaf.org  Tue Mar  2 23:34:54 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 2 Mar 2010 23:34:54 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <4B8E0A00.7070502@perathoner.de>
References: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org>
	<4B8E0A00.7070502@perathoner.de>
Message-ID: <20100303073454.GA12104@pglaf.org>

On Wed, Mar 03, 2010 at 08:04:32AM +0100, Marcello Perathoner wrote:
> Greg Newby wrote:
> 
> >But as Robert mentioned, that doesn't change that the public domain
> >content is still public domain...no matter how much value has been added
> >through scanning, OCR, proofreading, etc.  What happens if such content
> >mysterioulsy, untraceably extracts itself from DP and becomes available
> >elsewhere?  Well, it's still public domain.
> 
> But you would sue them for trespass, not for copyright infringement.

Right.  That was the point I was making.  But finding a lawyer to take
the case is tough.  Getting the case before a judge is tougher.
Pursuing yourself (i.e., in small claims court) is possible for people
with time on their hands, but it limited in various ways.

> >PS: Over the years, I've been involved in various efforts to bring
> >legal remedies to online incidents.  It is very hard to do, especially
> >when there is little or no money involved.  Doubly-especially if any
> >of the actors are in different countries.  Robert's emphasis on technical
> >measures, versus more legalistic ones, is more likely to give satisfaction.
> 
> Amazon would be an US company though. And sueing Amazon would bring
> some interesting facts to the public attention as to the provenience
> of some material they DRM.

Amazon is an interesting and somewhat unique example (Google,
Apple and Microsoft are also interesting, and unique in their
own ways).  

You are right that PG or DP could sue Amazon.  Some days, I think
we should (they sell a lot of Project Gutenberg titles - with
the "small print" intact, in various illegitimate ways).  

What we're talking about, though, is intentional tresspass on
DP.  I would be surprised if Amazon or the other big companies
were interested in that.
 
  -- Greg


From schultzk at uni-trier.de  Wed Mar  3 00:46:48 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 09:46:48 +0100
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <77bdd.26df5c3.38bee762@aol.com>
References: <77bdd.26df5c3.38bee762@aol.com>
Message-ID: <C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>

Hold on a sec!


Am 02.03.2010 um 23:12 schrieb Bowerbird at aol.com:

> bob said, to jim-
> >   I'm not going to argue this any further with you, though.
> 
> truth be told, you've haven't provided any argumentation anyway.
> 
> you ignored jim's main point, to argue some legalistic crap which
> jim knows quite well and was never in dispute.
> 
> indeed, it is precisely the troubling fact that material which _is_
> "in the public domain" in a _legal_ sense, but is only _available_
> for sale, because d.p. can't get it out the door, that's the point...

	There is a difference between a text being copyright free and in the
	public domain..
	One can put a copyright and have it be still in the public domain.

	Personally, as I see it PG text are more or less copyright free and
	in the public domain.
	I can use the PG text as I wish as long aas I give them credit. Which I would do.

	Yet, there is actually no practical way of stopping me from taking a PG text
	removing all hints to it. Reformatting it and publishing it( even in paper form)
	under copyright and thereby protecting my WORK. 

	Naturally, I would not do this, but others due. 

	Even if someone puts a text up for sale and copyrights it from PG or DP,
	there is NOTHING THEY could due against PG or DP from publishing
	their own version! You see, PG/DP is working with the rights of the law
	as they can prove where their material is coming from an it was obtained
	legally and they had not infringed on the copyright.
	
	As an example NOBODY in the world is going to get a copyright on
	Shakespeares works, so that somebody else can not produce Shakepeares
	works on their own!!! 

	So once the original copyright expires that text is a free for all. Nobody,
	can get a copyright that will stop anybody else from publishing that text.
	What they can get is protection for their wok and only their work/book/publication.

	To come back to the point of prereleasing texts. The best way of catching someone
	is by using texts that are NOT error-free, Since that error just might propagate.
	One has then a indisputable MARK to identify your work.


	regards
		Keith.
 
	 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/b7308c33/attachment-0001.html>

From schultzk at uni-trier.de  Wed Mar  3 01:01:52 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 10:01:52 +0100
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <20100302221637.GA27060@pglaf.org>
References: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>
	<4B8D63FD.5020102@perathoner.de> <20100302221637.GA27060@pglaf.org>
Message-ID: <498D1B7A-6EB0-4872-B654-46F29F072D51@uni-trier.de>


Am 02.03.2010 um 23:16 schrieb Greg Newby:

> On Tue, Mar 02, 2010 at 08:16:13PM +0100, Marcello Perathoner wrote:

> These don't seem like strongly conflicting statements.  Our "no sweat of
> the brow how-to" gives a similar view.
> 
> IF someone were to gain illicit access to files at DP or elsewhere,
> regardless of whether they were public domain, various legal remedies
> could be applied.  (Quite a few, and most countries have their own set
> of remedies ranging from contracts, to EULAs, to things like computer
> fraud & abuse or misappropriation of resources.)
> 
> But as Robert mentioned, that doesn't change that the public domain
> content is still public domain...no matter how much value has been added
> through scanning, OCR, proofreading, etc.  What happens if such content
> mysterioulsy, untraceably extracts itself from DP and becomes available
> elsewhere?  Well, it's still public domain.
	As I have mention in another post. In the public domain and
	copyrighted are to different animals. I can put source code of a program
	in the public domain and still maintain a copyright. The same goes for texts.
> 
> (Bonus reading assignment: Steven Levy's "Crypto," which describes how
> the PGP software, which was ineligible for export from the US, found its
> way into other countries -- where it was perfectly legal to use.)
> 
>  -- Greg
> 
> PS: Over the years, I've been involved in various efforts to bring
> legal remedies to online incidents.  It is very hard to do, especially
> when there is little or no money involved.  Doubly-especially if any
> of the actors are in different countries.  Robert's emphasis on technical
> measures, versus more legalistic ones, is more likely to give satisfaction.
	Thats what DRM is. Now, how can they be applied to texts.
	It can only be done in the file itself. The only way to reach this is
	with special  format for the file that can only be read our own tools,
	and those tools source should not be publicly available. 
	With most readers available one can still extract the text and their
	by defeating its protection. This has be done with music. Effectively
	defeating DRM and that is why iTunes music is now DRM free.

	The saying still goes, if they is a will there is a way.

	regards
		Keith.
   

From schultzk at uni-trier.de  Wed Mar  3 01:18:19 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 10:18:19 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
Message-ID: <BE3C5C25-44E0-4AD8-8857-A4BF65DED639@uni-trier.de>

The way I look at is that its DP ball.

Yet, If the queues are so stuck up, then
DP has to shift its work force.

That is get volunteers trained and motivated so that
they can help clear the queues. 
This is simple economics. No production company
can afford to produce parts for a product and not
produce the end product.  

The only way for a company to to survive is to
out-source. Which would be prerelease.

Naturally, DP is not interested in making money,
yet the analogy holds true, for their goals.

regards
	Keith.

Am 02.03.2010 um 23:41 schrieb James Adcock:

>> The negative reaction you're getting
> is to your tone and tactics, not your news flash.
> 
> Sorry, but *my* negative reactions are based on DP people who say:
> 
> a) That there is no problem having books stuck on queues for an average of
> 3.5 years now.
> 
> And/or
> 
> b) Offer "solutions" which will not in fact reduce the size of the queues
> and how long books sit there.
> 
> Again:
> 
> a) There IS a problem with having books stuck on queues, including that fact
> that 1/3 of the volunteers' time and energy is being wasted currently.
> 
> b) Any proposed "solution" has to in fact act to reduce the size of the
> queues and how long books sit there.  And it needs to do so without chasing
> away any class of volunteers including P1s -- since P1s represent the future
> of DP.
> 
> One simple suggestion to start with would be to start by changing the stated
> "Goals" for P3 and F2 and PP to be larger than the Goals for P2 and F1.  To
> do otherwise is to have DP suggesting that they want the queues to be even
> longer than they are now. Right now the stated goals for P2 and F1 are
> larger than the stated goals for P3 and PP -- which will only make the
> queuing situation worse. The fact that the "Goals" are inverted would seem
> to imply that the powers that be do not understand the nature of the problem
> -- in which case how can they fix it?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/76d0cca5/attachment.html>

From schultzk at uni-trier.de  Wed Mar  3 01:29:42 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 10:29:42 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <m28waaxq0y.fsf@gnu.franken.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
Message-ID: <EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>

A decade or so ago I pulled the whole PG repository via ftp.
I have not gotten through it. What a waste of my time???


On the other side, lets just shut everthing down as most
of the consumer computers have some way of displaying
scans. 

So we are just wasting everybodies time?

regards
	Keith.
  
Am 03.03.2010 um 00:27 schrieb Karl Eichwalder:

> "James Adcock" <jimad at msn.com> writes:
> 
>> a) That there is no problem having books stuck on queues for an average of
>> 3.5 years now.
> 
> It's a storage "problem"--nothing more, nothing less.  There are books
> waiting in the google cache for more than x years.  Not to mention all
> the libraries...
> 
> The problem is you and me, who don't want to understand that is
> impossible to read all the books in livetime.


From Bowerbird at aol.com  Wed Mar  3 01:38:42 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 04:38:42 EST
Subject: [gutvol-d] do not listen, pray
Message-ID: <30a2.40c90764.38bf8822@aol.com>

do not listen to non-lawyers discussing legal matters.

do not even listen to lawyers discussing legal matters,
not unless you are paying them.

do not pay lawyers, if there's any way you can help it.

***

do not listen to people who are talking about "theft".
or "trespass".   or any other stupid crap such as that.
this is project gutenberg, where we transcend via gift.

***

do not listen to the people who treat d.p. as if it is a
factory, where "parts" are assembled into "products".
the improper metaphor will only distract from truth.

the queues are not the problem, they are an _effect_
of the problem.   treating symptoms is bad strategy.

the queues cause problems of their own, but _those_
problems are not the cause either; do not forget that.

the problem is you cannot expect dozens of people
to match the output created by thousands of people.
remember what the problem is.   treat the problem.

***

pray for the volunteers whose time and energy is wasted.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/465d48e9/attachment.html>

From schultzk at uni-trier.de  Wed Mar  3 01:40:48 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 10:40:48 +0100
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <4B8E0A00.7070502@perathoner.de>
References: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<15cfa2a51003021029v3ba5cdej74f65e1154af17b7@mail.gmail.com>	<4B8D63FD.5020102@perathoner.de>
	<20100302221637.GA27060@pglaf.org> <4B8E0A00.7070502@perathoner.de>
Message-ID: <8AD2D132-1286-45BC-83CD-1281C1A814AD@uni-trier.de>


Am 03.03.2010 um 08:04 schrieb Marcello Perathoner:

> Greg Newby wrote:
> 
>> But as Robert mentioned, that doesn't change that the public domain
>> content is still public domain...no matter how much value has been added
>> through scanning, OCR, proofreading, etc.  What happens if such content
>> mysterioulsy, untraceably extracts itself from DP and becomes available
>> elsewhere?  Well, it's still public domain.
> 
> But you would sue them for trespass, not for copyright infringement.
	So how do you prove they did it. You have to prove that they did
	indeed trespass. Not an easy job to do!!! 
> 
>> PS: Over the years, I've been involved in various efforts to bring
>> legal remedies to online incidents.  It is very hard to do, especially
>> when there is little or no money involved.  Doubly-especially if any
>> of the actors are in different countries.  Robert's emphasis on technical
>> measures, versus more legalistic ones, is more likely to give satisfaction.
> 
> Amazon would be an US company though. And sueing Amazon would bring some interesting
> facts to the public attention as to the provenience of some material they DRM.
	DRM is not there to protect copyright, but to protect their investment into
	 the work they have done.
	Besides, in is not that hard to remove DRM, nowadays. 

	regards
		Keith.


From Bowerbird at aol.com  Wed Mar  3 02:31:29 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 05:31:29 EST
Subject: [gutvol-d] roundlessness -- 009
Message-ID: <41e5.5c7d201.38bf9481@aol.com>

in our "glass-is-one-quarter-full" news today, i note that rfrank
has this to say about using reg-ex tests on his roundless site:

>    It seems to be a big win to make REs that usually are used 
>    during post-processing available to users during proofing.

now if roger would realize those reg-ex checks would be even
_more_ useful if they were done in book-wide preprocessing,
we could award him the "glass-is-three-quarters-full" prize...

but let's be thankful for the huge progress he's made thus far.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/5a505e14/attachment.html>

From schultzk at uni-trier.de  Wed Mar  3 04:50:34 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 13:50:34 +0100
Subject: [gutvol-d] Re: do not listen, pray
In-Reply-To: <30a2.40c90764.38bf8822@aol.com>
References: <30a2.40c90764.38bf8822@aol.com>
Message-ID: <E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>


Am 03.03.2010 um 10:38 schrieb Bowerbird at aol.com:

> 
> do not listen to the people who treat d.p. as if it is a
> factory, where "parts" are assembled into "products".
> the improper metaphor will only distract from truth.
	Oh, puppi-cock!
	You do not even know the difference between an analogy and
	a methaphor!
	DPs approach is that of an assembly line.
	Scans of pages are processed, put together, processed further,
	go through further processes and eventually a final product
	comes out.
	

> the queues are not the problem, they are an _effect_
> of the problem.  treating symptoms is bad strategy.
	They are part of the system and assembly line!
> 
> the queues cause problems of their own, but _those_
> problems are not the cause either; do not forget that.
	Especially, if input and output are not balanced. 
> 
> the problem is you cannot expect dozens of people
> to match the output created by thousands of people.
> remember what the problem is.  treat the problem.
	So you suggest slowing down the work of the volunteers, stopping
	them? Come on you are smarter than that. 
	The queues definately are not the problem. There are just to few
	handling the output, or input depending from which side you look at them.
	As you claim there are thousands creating output. That output becomes
	input. At some stage SOMEONE has to process that output to finalize
	it. 
	So QED. There need to be more volunteers working in the latter
	stages of the production. 

	Nice of you to prove my point!!
	
	Cheers
		Keith.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/443e79ed/attachment-0001.html>

From hart at pglaf.org  Wed Mar  3 07:41:48 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 07:41:48 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
References: <77bdd.26df5c3.38bee762@aol.com>
	<C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
Message-ID: <alpine.DEB.2.00.1003030736570.24264@mail.pglaf.org>


Keith Schultz asked why text and not scans?

Here are the most obvious advantages of text over scans:

1.  Speed

2.  Storage

3.  Searching

4.  Quotations

5.  Corrections


The Details


1.  Speed

Reading from online scans can be a real pain as changing
pages involves downloading another large file of scanning.

Flipping through the pages becomes virtually impossible.

If you have the time you can download the whole thing and
then start reading, but flipping through the pages might
still be a pain if they are not relationally linked, and
many places still seem to forget that THEIR links do not
work on YOUR SYSTEM unless the links are proper for that.


2.  Storage

You can store about a million eBooks of about a million
character each on a terabyte drive at minimal cost and
with very little hassle setting up the drive, even just
a pocket terabyte drive will do, though it is slower.

However, storing a million scans of books is virtually
impossible for the everyday person, not to mention the
problems reading them listed above.

More terabytes and more cables than the average person
is really willing to put up with, even for a library.


3.  Searching

In my own personal and professional opinion the greatest
advantage to having text versus scans is searchability.

I won't go into every kind of file pretending to be text
but the plain text files are the most searchable and the
storage space required is the least, particularly in the
.zip or similar compressed formats.

All the other formats seem to create errors that we have
all seen where the search program can't find a word that
is right there in front of us on the screen.

Pretty much ANY editor or reader program does .txt files
without much hassle, both for reading and searching.


4.  Quotations

I can cut and paste any text quotation into this article
without any hassle at all from text files, but you can't
do that from a scan.

Same for cutting and pasting into your emails, Twitter &
other IM formats, and even into .pdf files.

For those who never quote anything, not a problem.

However, when someone recommends I read something I will
likely ask for a few choice quotations to evaluate.


5.  Corrections


It's difficult in the extreme to correct a scan error...
you literally have to do it somethingm like Photoshop as
if you were changing pixels, which you really are.

It's still not easy to make those same corrections in an
Adobe "Portable Document File" as they are NOT PORTABLE!
Just try it a few times and you will understand.

The more elevated the format, the harder is correction.


///


Also, about copyright and public domain. . . .

No, you can't have it both ways. . . .

You can do a number of things like the PG and GNU, even
the EFF, stuff like various forms of "Copyleft," but it
is either copyrighted and with permission or it has the
legal status of public domain to give everyone a legal,
if not totally understood right to redistribute.

Some of these give you ONLY the right to your own copy,
without the right to hand out other copies.

This means you have to read the fine print.

With PG's license there is no difficulty:

ALL PG eBOOKS CAN BE REDISTRIBUTED WITHOUG PG HASSLE--
there may be other laws in other countries that apply,
but not from the PG license.


mh

From lee at novomail.net  Wed Mar  3 08:15:51 2010
From: lee at novomail.net (Lee Passey)
Date: Wed, 03 Mar 2010 09:15:51 -0700
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
References: <77bdd.26df5c3.38bee762@aol.com>
	<C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
Message-ID: <4B8E8B37.3050707@novomail.net>

On 3/3/2010 1:46 AM, Keith J. Schultz wrote:

> Hold on a sec!

[snip]

> There is a difference between a text being copyright free and in the
> public domain..
> One can put a copyright and have it be still in the public domain.

On 3/3/2010 2:38 AM, Bowerbird at aol.com wrote:

 > do not listen to non-lawyers discussing legal matters.

Good advice.

Mr. Schultz, you are wrong. If something is in the public domain, by 
definition it cannot have a copyright, and vice-versa.

There is, in fact, no such legally recognized entity as "the public 
domain." The phrase is simply shorthand for "those works for which 
copyright has expired or is otherwise unenforceable."

I have heard it argued (by lawyers) that under the Berne convention one 
cannot create a copyrightable work and then dedicate it to the public 
domain. Under Berne, a copyright attaches automatically, instantaneously 
and unavoidably at the moment of creation. Because there is no real 
entity called "the public domain," the automatic copyright cannot be 
transferred to it. At best you have a promise on the part of the 
creator, unsupported by any consideration, not to sue. If no one has 
placed detrimental reliance on the promise, the creator can revoke it at 
any time, putting us back to square one.

Just one of the noxious (and perhaps unintended) consequences of the 
Berne convention.

From schultzk at uni-trier.de  Wed Mar  3 08:21:03 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 17:21:03 +0100
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <alpine.DEB.2.00.1003030736570.24264@mail.pglaf.org>
References: <77bdd.26df5c3.38bee762@aol.com>
	<C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
	<alpine.DEB.2.00.1003030736570.24264@mail.pglaf.org>
Message-ID: <EC9C5CB8-7198-4855-ABC8-8DFC99F07850@uni-trier.de>

Hi Michael,
	
	You did not quite catch the irony in my message.

	regards
		Keith.

Am 03.03.2010 um 16:41 schrieb Michael S. Hart:

> 
> Keith Schultz asked why text and not scans?
> 
> Here are the most obvious advantages of text over scans:
> 
> 1.  Speed
> 
> 2.  Storage
> 
> 3.  Searching
> 
> 4.  Quotations
> 
> 5.  Corrections
> 
> 
> 
> The Details
> 
> 
> 1.  Speed
> 
> Reading from online scans can be a real pain as changing
> pages involves downloading another large file of scanning.
> 
> Flipping through the pages becomes virtually impossible.
> 
> If you have the time you can download the whole thing and
> then start reading, but flipping through the pages might
> still be a pain if they are not relationally linked, and
> many places still seem to forget that THEIR links do not
> work on YOUR SYSTEM unless the links are proper for that.
> 
> 
> 2.  Storage
> 
> You can store about a million eBooks of about a million
> character each on a terabyte drive at minimal cost and
> with very little hassle setting up the drive, even just
> a pocket terabyte drive will do, though it is slower.
> 
> However, storing a million scans of books is virtually
> impossible for the everyday person, not to mention the
> problems reading them listed above.
> 
> More terabytes and more cables than the average person
> is really willing to put up with, even for a library.
> 
> 
> 
> 3.  Searching
> 
> In my own personal and professional opinion the greatest
> advantage to having text versus scans is searchability.
> 
> I won't go into every kind of file pretending to be text
> but the plain text files are the most searchable and the
> storage space required is the least, particularly in the
> .zip or similar compressed formats.
> 
> All the other formats seem to create errors that we have
> all seen where the search program can't find a word that
> is right there in front of us on the screen.
> 
> Pretty much ANY editor or reader program does .txt files
> without much hassle, both for reading and searching.
> 
> 
> 
> 4.  Quotations
> 
> I can cut and paste any text quotation into this article
> without any hassle at all from text files, but you can't
> do that from a scan.
> 
> Same for cutting and pasting into your emails, Twitter &
> other IM formats, and even into .pdf files.
> 
> For those who never quote anything, not a problem.
> 
> However, when someone recommends I read something I will
> likely ask for a few choice quotations to evaluate.
> 
> 
> 
> 5.  Corrections
> 
> 
> It's difficult in the extreme to correct a scan error...
> you literally have to do it somethingm like Photoshop as
> if you were changing pixels, which you really are.
> 
> It's still not easy to make those same corrections in an
> Adobe "Portable Document File" as they are NOT PORTABLE!
> Just try it a few times and you will understand.
> 
> The more elevated the format, the harder is correction.
> 
> 
> ///
> 
> 
> Also, about copyright and public domain. . . .
> 
> No, you can't have it both ways. . . .
> 
> You can do a number of things like the PG and GNU, even
> the EFF, stuff like various forms of "Copyleft," but it
> is either copyrighted and with permission or it has the
> legal status of public domain to give everyone a legal,
> if not totally understood right to redistribute.
> 
> Some of these give you ONLY the right to your own copy,
> without the right to hand out other copies.
> 
> This means you have to read the fine print.
> 
> With PG's license there is no difficulty:
> 
> ALL PG eBOOKS CAN BE REDISTRIBUTED WITHOUG PG HASSLE--
> there may be other laws in other countries that apply,
> but not from the PG license.
> 
> 
> mh
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From schultzk at uni-trier.de  Wed Mar  3 08:39:50 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 3 Mar 2010 17:39:50 +0100
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <4B8E8B37.3050707@novomail.net>
References: <77bdd.26df5c3.38bee762@aol.com>
	<C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
	<4B8E8B37.3050707@novomail.net>
Message-ID: <EAFEB407-C974-4F86-96C5-815779B3F121@uni-trier.de>

Hi Lee,

	For one the term is "in the public domain".
	Furthermore, putting something in the public domain
	is if you care to be technical a license of use.
	How far that license goes depends on the statements of
	the author.

	The coining of the terminology was not originally used
	in copyright law, but in the protection of intellectual property.
	It was adopted to by the internet users and publishers to texts.

	Secondly you ought to get your own facts straight. How can a 
	lawyer argue that said property not be dedicated to the public domain
	if not said entity is not defined!!
	S/He could not. 

	regards
		Keith.
	
Am 03.03.2010 um 17:15 schrieb Lee Passey:

> On 3/3/2010 1:46 AM, Keith J. Schultz wrote:
> 
>> Hold on a sec!
> 
> [snip]
> 
>> There is a difference between a text being copyright free and in the
>> public domain..
>> One can put a copyright and have it be still in the public domain.
> 
> On 3/3/2010 2:38 AM, Bowerbird at aol.com wrote:
> 
> > do not listen to non-lawyers discussing legal matters.
> 
> Good advice.
> 
> Mr. Schultz, you are wrong. If something is in the public domain, by definition it cannot have a copyright, and vice-versa.
> 
> There is, in fact, no such legally recognized entity as "the public domain." The phrase is simply shorthand for "those works for which copyright has expired or is otherwise unenforceable."
> 
> I have heard it argued (by lawyers) that under the Berne convention one cannot create a copyrightable work and then dedicate it to the public domain. Under Berne, a copyright attaches automatically, instantaneously and unavoidably at the moment of creation. Because there is no real entity called "the public domain," the automatic copyright cannot be transferred to it. At best you have a promise on the part of the creator, unsupported by any consideration, not to sue. If no one has placed detrimental reliance on the promise, the creator can revoke it at any time, putting us back to square one.
> 
> Just one of the noxious (and perhaps unintended) consequences of the Berne convention.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From cmiske at ashzfall.com  Wed Mar  3 10:03:07 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 03 Mar 2010 11:03:07 -0700
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <20100303110307.0dedd0f3f91314fbc67db20f64e304ca.09cdf66229.wbe@email05.secureserver.net>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/a2733f22/attachment.html>

From jimad at msn.com  Wed Mar  3 10:24:24 2010
From: jimad at msn.com (Jim Adcock)
Date: Wed, 3 Mar 2010 10:24:24 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on
	"prerelease"	of	e-texts
In-Reply-To: <EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
Message-ID: <SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>

>On the other side, lets just shut everthing down as most
of the consumer computers have some way of displaying
scans. 

>So we are just wasting everybodies time?

Yes and no.  The Google "photocopies" of books available at books.google.com
aka their PDF downloads which are just page images ARE useful, I can even
read many of them successfully on my Kindle DX.  There is even some charm in
reading books in their original layout -- and some charm in seeing the
occasional scanner's thumb.  Reading pages and pages that have been
scribbled on by 200 years of students is not very charming, IMHO.  And the
Google page images have the blotchy blurry heavy-font characteristics of bad
photocopies.  Even some of Google's EPUB files, which are just OCRs of these
same books with all the scannos intact, can sometimes be an interesting
read.

The question is, in my mind, is Google preserving the books, and doing so
for the public good or not?  I suspect when Google digitizes the book the
original is then trashed by the college library -- the whole point being
they do not want to have to pay to maintain physical library books in
various states of decay.  Google then becomes the sole repository for this
information -- excepting a smallish number of copies at TIA.  Further, is
Google dedicated to trying to keep this work public, or on the contrary is
Google hoping for changes in the copyright law so that they can fully
privatize these digitizations?

Compare to what happens when volunteers at DP or PG correct a text and
publish it in electronic form.  Publically available?  Yes.  Available from
a huge variety of redundant sources?  Yes.  Suitable to be republished
easily on paper by either NFPs or For-Profit publishers?  Yes.  Reflowable
so that it can be read comfortably on a wide variety of devices by people
with differently aged eyes including by people with little or no vision?
Yes. Yes. Yes. Etc.

However, The DP/PG approach is extremely expensive compared to what Google
is doing.  Consider: Google Books == about 10 million books photo scanned.
DP/PG == 30,000 books "fully restored." So Google's approach is about 300X
faster than the DP/PG approach.  My Conclusion: In the best of all world's
there would be some measure of VALUE in choosing which books DP/PG chooses
to put effort into fully restoring -- the idea that somehow DP/PG is going
to be able to fully restore all the world's books is surely false.  When
someone at DP chooses to introduce a book that is expensive to do and the
end result has relatively little value to society, that means other more
important books will not be restored.  It is not simply a question of "First
Come First Serve" because on DP a worthy book can easily become stuck on the
queues behind a less worthy book, such that the more worthy book is not
allowed to be worked on by anybody.  How does one measure "worthy vs.
non-worthy?"  Not a trivial matter, I admit.  But to my mind one measure is
obvious:  Books that real people do not in practice want to read we should
not bother to restore!  I don't care if it's a book on ancient Sanskrit.  If
1000 people want to read it, it's worth doing.  If only 6 people want to
read it, it's not worth doing.  As a simple measure at least the total
amount of time people spend reading the book has to exceed the amount of
time volunteers spend preparing the book, or it's a loss to society.  Again,
the most popular books on PG are read 100,000 times more often than the
least popular books.  Now it's hard to find one of these most popular books
to tackle today.  But it is trivial to find a book to work on that will be
50X more popular than the average book DP finishes.

Let Google deal with the unpopular books, and let DP/PG work on books that
people actually *want* to read.


From jimad at msn.com  Wed Mar  3 10:29:56 2010
From: jimad at msn.com (Jim Adcock)
Date: Wed, 3 Mar 2010 10:29:56 -0800
Subject: [gutvol-d] Re: do not listen, pray
In-Reply-To: <E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>
References: <30a2.40c90764.38bf8822@aol.com>
	<E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>
Message-ID: <SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>


 >There need to be more volunteers working in the latter stages of the
production. 

And under the current DP "high priesthood" system the only way to get more
volunteers working in the latter stages of the production is to get new
people working on the earlier stages of production, which then perpetuates
the problem.  You have to be willing to adjust or modify the "high
priesthood" system.


From hart at pglaf.org  Wed Mar  3 10:39:39 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 10:39:39 -0800 (PST)
Subject: [gutvol-d] Re: do not listen, pray
In-Reply-To: <SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
References: <30a2.40c90764.38bf8822@aol.com>
	<E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>
	<SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
Message-ID: <alpine.DEB.2.00.1003031033470.3022@mail.pglaf.org>


On Wed, 3 Mar 2010, Jim Adcock wrote:

>
>  >There need to be more volunteers working in the latter stages of the
> production.
>
> And under the current DP "high priesthood" system the only way to get more
> volunteers working in the latter stages of the production is to get new
> people working on the earlier stages of production, which then perpetuates
> the problem.  You have to be willing to adjust or modify the "high
> priesthood" system.

I wrote an entire essay on this subject overnight, but was uncertain
as to whether I should send it or not, for obvious reasons.

However, this brings up at least one point I wanted to make:

TO BE EFFICIENT YOU HAVE TO ADJUST YOUR HIGHER LEVELS TO LOWER LEVELS:

Meaning that what the higher levels do, and how they do it, the time a
higher level person is given, has to be in proportion to lower levels,
or you will be inefficient, either due to to much or too little, going
through the higher levels. . .it's like the gas to air ratio, driving.
You get the most mileage AND the most power when the mixture is right.

If people are interested, I will post at least part of that essay, I'm
afraid it was VERY late at night, and I got carried away at the end.


Please advise,

Many thanks!!!


Michael


From cmiske at ashzfall.com  Wed Mar  3 11:01:11 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 03 Mar 2010 12:01:11 -0700
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <20100303120111.0dedd0f3f91314fbc67db20f64e304ca.b320c83f4e.wbe@email05.secureserver.net>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/d2305b02/attachment.html>

From cmiske at ashzfall.com  Wed Mar  3 11:06:08 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 03 Mar 2010 12:06:08 -0700
Subject: [gutvol-d] Re: do not listen, pray
Message-ID: <20100303120608.0dedd0f3f91314fbc67db20f64e304ca.71441d184d.wbe@email05.secureserver.net>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/72103503/attachment.html>

From Bowerbird at aol.com  Wed Mar  3 11:17:08 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 14:17:08 EST
Subject: [gutvol-d] Re: do not listen, pray
Message-ID: <23668.53e77d2.38c00fb4@aol.com>

i said:
>    thousands/hundreds/dozens

and just to remind everybody, the solution is very clear,
and has been for a very long time, ever since we learned
-- by my careful analysis of the many d.p. experiments --
that p1 proofers find as many errors in subsequent proofings
as p2 proofers and even p3 proofers do.   (indeed, of the 3,
the p2 proofers were the least good at locating the errors.)

so it's obvious that we can move text to perfection by simply
running it through p1 repeatedly.   one problem with that --
as we've already found -- is that sometimes p1 proofers will
change correct text to incorrect text.   that problem can be
eliminated easily with a policy to review and reconcile diffs.
(this policy is easy to implement roundlessly, and will also
serve to train up your low-quality proofers, so it's win-win.)
the other problem currently with repeated p1 is that d.p.
hasn't created an unambiguous set of proofing instructions
-- i know, you'd think the need for that would be obvious --
and thus sometimes proofers "cycle through" corrections...
(e.g., a first proofer dehyphenates, a second rehyphenates,
a third asterisks the hyphen, a fourth dehyphenates, etc.)

i haven't discussed the f1/f2 problem, because it's a mirror
of the p1/p2/p3 problem.   a quick-and-easy confirmation
of the f1 by a subsequent f1 view, and we're off to the races.

likewise, i have not discussed the postprocessing problem,
not explicitly for the most part, because it's the microcosm
of the thousands/hundreds/dozens problem, it certainly is.

and once again, the problem is the workflow.   by the time
all of its pages have been proofed and formatted, the book
should fall in place more or less naturally and automatically.

the fact that it does not, in the d.p. workflow, is a shortfall...
it indicates that the workflow is deficient in some major way.
but correcting the postprocessing problems is relatively easy.
you simply need to analyze each page that needs "finishing"
and determine how the proofers could've provided that for it,
and you modify the proofing instructions appropriately. easy.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/5bfec821/attachment.html>

From hart at pglaf.org  Wed Mar  3 11:31:11 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 11:31:11 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <EAFEB407-C974-4F86-96C5-815779B3F121@uni-trier.de>
References: <77bdd.26df5c3.38bee762@aol.com>
	<C3390D38-2C6C-4F37-B6C6-E4C9800D8F15@uni-trier.de>
	<4B8E8B37.3050707@novomail.net>
	<EAFEB407-C974-4F86-96C5-815779B3F121@uni-trier.de>
Message-ID: <alpine.DEB.2.00.1003031127250.3022@mail.pglaf.org>


IRRC, the public domain existed, and included nearly everything in print,
etc., long before copyright was implented 300 years ago in Western law.

The terminology may have varied over the years, but the concept is there.

Copyright [Western] was invented 250 years earlier to stifle Gutenberg's
Press as a threat to The Stationers' Guild's historic monopoly.

They wanted it back.

And, finally, with the weak queen, Anne, they got it.

And we have been stuck with it ever since!!!


On Wed, 3 Mar 2010, Keith J. Schultz wrote:

> Hi Lee,
>
> 	For one the term is "in the public domain".
> 	Furthermore, putting something in the public domain
> 	is if you care to be technical a license of use.
> 	How far that license goes depends on the statements of
> 	the author.
>
> 	The coining of the terminology was not originally used
> 	in copyright law, but in the protection of intellectual property.
> 	It was adopted to by the internet users and publishers to texts.
>
> 	Secondly you ought to get your own facts straight. How can a
> 	lawyer argue that said property not be dedicated to the public domain
> 	if not said entity is not defined!!
> 	S/He could not.
>
> 	regards
> 		Keith.
>
> Am 03.03.2010 um 17:15 schrieb Lee Passey:
>
> > On 3/3/2010 1:46 AM, Keith J. Schultz wrote:
> >
> >> Hold on a sec!
> >
> > [snip]
> >
> >> There is a difference between a text being copyright free and in the
> >> public domain..
> >> One can put a copyright and have it be still in the public domain.
> >
> > On 3/3/2010 2:38 AM, Bowerbird at aol.com wrote:
> >
> > > do not listen to non-lawyers discussing legal matters.
> >
> > Good advice.
> >
> > Mr. Schultz, you are wrong. If something is in the public domain, by definition it cannot have a copyright, and vice-versa.
> >
> > There is, in fact, no such legally recognized entity as "the public domain." The phrase is simply shorthand for "those works for which copyright has expired or is otherwise unenforceable."
> >
> > I have heard it argued (by lawyers) that under the Berne convention one cannot create a copyrightable work and then dedicate it to the public domain. Under Berne, a copyright attaches automatically, instantaneously and unavoidably at the moment of creation. Because there is no real entity called "the public domain," the automatic copyright cannot be transferred to it. At best you have a promise on the part of the creator, unsupported by any consideration, not to sue. If no one has placed detrimental reliance on the promise, the creator can revoke it at any time, putting us back to square one.
> >
> > Just one of the noxious (and perhaps unintended) consequences of the Berne convention.
> > _______________________________________________
> > gutvol-d mailing list
> > gutvol-d at lists.pglaf.org
> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From Bowerbird at aol.com  Wed Mar  3 11:39:16 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 14:39:16 EST
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <2516d.7515f64.38c014e4@aol.com>

carel said:
>   Yes, we have been doing a semantic dance

ok, that's kinda what i thought.


>    I just feel that a final 'proofing' stage before release 
>    would assist in locating?errors that were either 
>    missed or introduced by the processing. 

well, having "one more proofing" is _always_ a great thing.

provided that somebody else is willing to _do_ it, that is...

the acid test is whether you deem it to be so necessary that
you will do it yourself.   that's a nice way to help you decide
whether the _cost_ of that additional proofing is _worth_it_,
whether it will provide enough _benefit_ in the text accuracy.

once you gain enough trust in your tool and its performance,
believe me that you'll decide that it performs "well enough"...

but yes, it is important to gauge the accuracy of your tool...

my goal is less-than-1-error-every-10-pages, and my tool
and workflow consistently delivers better results than that.


>    A human will do the processing and humans 
>    can make mistakes and some of the mistakes 
>    that could be made in what would be both?
>    error and formatting processes could be quite grand.

i don't worry about errors that are "quite grand"...
they're easy to spot, and obvious to debug and fix.

my experience is small errors are more troubling...


>    I feel that a second set of eyes can never be 
>    a bad thing when it comes to something like this.

that's easy to say until we ask you to be "the second set"
on a million e-texts, all of which are almost perfect now.

you -- and anyone else we ask -- will say "good enough".

at some point, the benefit of greater accuracy isn't worth it.

and then we say, "if the people reading this book because
they _want_ to read it cannot find any errors in it, then
that's their problem, but we cannot spend any more time
having innocent people re-proof this book _once_again_
simply because there _might_ still be an error in the thing."

again, i draw the line quite specifically.   if a page has been
looked at by 2 people in a row who could not find an error,
then i certify that page as "good enough for the public" and
stop looking at it.   you can make it 1 person, or 3 people,
or 4 people or 8 people or 22 people, whatever you like, but
nobody would ever suggest we keep proofing a book forever.

now, let me be clear that i understand that you only said
"a second set of eyes" and not 22 sets of them.   i agree...
and that's specifically why i use the comparison method,
because it gives us two sets of eyes on a book, essentially.


>    And, those with less experience (or no experience) 
>    in shaping the output of a text may feel more confident 
>    about doing the process if they know someone else will 
>    provide a checksum for their work before it goes public.

except that the _public_ provides that checksum for them.

they constitute your "second set of eyes", your 3rd, your 23rd.


>    the text would be released to PG and then should be 
>    placed in some environment that allows for editing it 
>    to 'perfection.' 

it would be nice if p.g. did this.   or d.p.   but neither does.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/a4bc1fad/attachment-0001.html>

From pjb at informatimago.com  Wed Mar  3 11:42:28 2010
From: pjb at informatimago.com (Pascal J. Bourguignon)
Date: Wed, 3 Mar 2010 20:42:28 +0100
Subject: [gutvol-d] DP/PG vs. Google
In-Reply-To: <SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
Message-ID: <DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>


On 2010-03-03, at 19:24, Jim Adcock wrote:
> However, The DP/PG approach is extremely expensive compared to what  
> Google
> is doing.  Consider: Google Books == about 10 million books photo  
> scanned.
> DP/PG == 30,000 books "fully restored." So Google's approach is  
> about 300X
> faster than the DP/PG approach.  My Conclusion: In the best of all  
> world's
> there would be some measure of VALUE in choosing which books DP/PG  
> chooses
> to put effort into fully restoring -- the idea that somehow DP/PG is  
> going
> to be able to fully restore all the world's books is surely false.

I think that the bet made by Google, is that sooner or later,  
sufficiently
smart AI and OCR technology will be developed to allow to process its  
scans
and do the job of PG automatically.

The only question is when it will happen, and some think that  
singularity
will occur within 20 years.

But this is probably not a reason to stop working on PG! :-)
-- 
__Pascal Bourguignon__
http://www.informatimago.com/


From jimad at msn.com  Wed Mar  3 13:05:50 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 3 Mar 2010 13:05:50 -0800
Subject: [gutvol-d] Re: DP/PG vs. Google
In-Reply-To: <DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
Message-ID: <SNT120-DS31593A09B17A9795E04F5AE3A0@phx.gbl>

>I think that the bet made by Google, is that sooner or later,  
Sufficiently smart AI and OCR technology will be developed to
allow to process its scans and do the job of PG automatically.

I would think that anyone who has worked on OCR, or automated grammars, or
AI, or in making books for PG can tell you they would lose that bet!

(Not that a lot can't be done to get rid of 90% of the errors
"automagically!")


From lee at novomail.net  Wed Mar  3 13:49:50 2010
From: lee at novomail.net (Lee Passey)
Date: Wed, 03 Mar 2010 14:49:50 -0700
Subject: [gutvol-d] The conundrum of FOSS projects (was Re: do not listen,
	pray)
In-Reply-To: <SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
References: <30a2.40c90764.38bf8822@aol.com>	<E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>
	<SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
Message-ID: <4B8ED97E.5020105@novomail.net>

On 3/3/2010 11:29 AM, Jim Adcock wrote:

> You have to be willing to adjust or modify the "high
> priesthood" system.

I have participated, or attempted to participate, in a number of FOSS 
projects over my career as a programmer, and I have a few observations 
which you may find relevant.

Every successful FOSS project I have ever observed has started with the 
vision of a single individual. In the years leading up to 1995, Eric A. 
Young single-handedly managed to implement the full suite of 
cryptosystems used in SSL, and in that year made it available on the 
internet for free. This effort became the foundation of OpenSSL. Until 
he was lured away by RSA, Mr. Young was the driving force behind 
OpenSSL. Today, the role of visionary is played by Ralf Engelschall and 
Ben Laurie.

In 1991 Andrew Tridgell, another Australian needed to mount disk space 
from a Unix server to his DOS PC. Using a packet sniffer he was able to 
reverse engineer the System Message Block protocol used by IBM's NetBIOS 
system, and which was the basis for DOS and Windows networking. This 
work eventually became Samga, a Unix/Linux software suite that provides 
file and print services to Windows-based clients. Mr. Tridgell still 
participates, and is the driving force behind the Samba open source project.

While many have criticized his alleged heavy-handedness, I believe that 
the success of the Linux kernel is primarily due to the fact that Linux 
Torvalds still has absolute authority over what changes go into that kernel.

Michael Hart plays the same role at Project Gutenberg that these 
programming giants played in the development of their respective 
software projects. Project Gutenberg was the brainchild of Mr. Hart, and 
he continues to be the driving force and visionary behind the project. 
While he, with uncharacteristic modesty, primarily credits the 
volunteers for the nature of Project Gutenberg, I disagree. For better 
or for worse, Project Gutenberg is the product of Mr. Harts vision and 
tenacity.

Distributed Proofreaders was founded in 2000 by Charles Franks to assist 
in the production of electronic texts specifically to be distributed by 
Project Gutenberg. According to my recollection, Mr. Franks' theory was 
that production of e-texts was hampered by the fact that few people were 
willing to take on the task of producing an entire e-text, particularly 
through the arduous text proofreading process. His vision was to take a 
text and break it up into discrete units (in this case, pages) so that 
many people could be involved in the proofreading process and lightening 
the burden. Thus, the one time DP catch-phrase, "Proofread a page a day, 
that's all we ask." The volunteers at Distributed Proofreaders have 
become very good at proofreading texts.

I have also seen any number of FOSS projects which have attempted to 
begin through consensus and team building. I can't name any of these 
projects for you, because they have all either failed or were still-born.

I think I have learned this lesson from my observations of these 
projects: to be successful you must have one single visionary who 
controls, more or less, the project. Having that visionary will not 
guarantee success, but not having it will surely doom it. At if a 
project loses its visionary, or marginalizes him or her to the point 
where he or she no longer controls the vision, the project will become 
increasingly ineffective and inefficient, and will descend into 
in-fighting and turf wars as others try to control the vision.

Vision cannot be obtained by consensus.

When someone criticizes Project Gutenberg for supposed failings, or the 
inability or unwillingness to keep up with the times, and Michael Hart 
responds with his now inevitable suggestion to "JUST GO FOR IT," what he 
is saying is "what you are suggesting does not match my vision. If you 
feel your vision is better than mine I encourage you to go elsewhere to 
pursue it. We can offer some infrastructure support (disk space) and you 
are welcome to invite Project Gutenberg volunteers to go help you 
actualize your vision, but I will not substitute your vision for mine."

I am not prepared to say that Distributed Proofreaders has lost its 
vision. It is still proofreading a lot of pages every day. It is clearly 
/not/ an efficient process, but efficiency was not one of the project 
goals. We are all familiar with the old saw that while one woman can 
have a baby in 9 months that doesn't mean that 9 women can have a baby 
in one month. I don't believe that DP is saying "if one person can 
proofread a text in 10 days, then 10 people can proofread it in one 
day," but they are saying "100 people can proofread it in two days." 
Distributed Proofreaders goal was to increase the speed that texts would 
be proofread, to lighten the load from any one individual and to make 
the process more fault-tolerant (if one volunteer quit, the project 
would not need to be restarted).

What has happened is that the needs of the consumer has changed. I'm 
fairly certain that the proofread texts now sitting in DP's 
Post-Processing queue would meet Michael Hart's standards (or lack 
thereof, as he is continually telling me he has no standards) and could 
be released to Project Gutenberg as is. Other consumers, however, have 
higher standards, and Distributed Proofreaders is now trying to satisfy 
those standards as well, and those new standards require post-processing 
of a work as a single unit by a single person.

DP's vision and expertise is in the area of distributed proofreading, 
not in the area of efficient e-book creation. This is why texts languish 
in the Post-Processing queue.

Your problem, Mr. Adcock, is that you believe you can change the vision 
underlying either of these organization through rational argument. 
Vision is an intuitive, almost religious, experience, and blind faith is 
immune to rationality. It is virtually impossible that you will be able 
to change the vision either of Mr. Hart or whomever is currently the 
visionary at Distributed Proofreaders.

I suspect that this is why Roger Frank has created his own web site for 
"roundless proofing;" his vision differs from that of Distributed 
Proofreaders, and it was simply easier to go his own way than to try and 
change someone else's vision.

I believe I agree with every criticism you have leveled at both Project 
Gutenberg and Distributed Proofreaders, which is to say, I believe I 
accept your vision. So let me mimic the words of Michael Hart:

GO FOR IT!

Put together your own project to complete high-quality public domain 
e-books. You could certainly harvest all of the files currently in the 
DP post-processing queue to start with. You might be able to grab the 
HTML files from PG if you can find scans to go with them. Take advantage 
of the hardware resources that Mr. Newby has offered. Post messages here 
and at DP inviting volunteers to help you out. No need to return the 
e-books to either of those organizations; if they want them they will 
know where to find them. I will help out as much as possible.

But please stop trying to convince Distributed Proofreaders or Project 
Gutenberg to accept a new vision. They are old and are set in their 
ways. They represent the last internet generation, not the current one. 
Show us the way forward, and let sleeping dogs lie.

From Bowerbird at aol.com  Wed Mar  3 14:17:41 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 17:17:41 EST
Subject: [gutvol-d] Re: DP/PG vs. Google
Message-ID: <3103a.2b657b5c.38c03a05@aol.com>

jim said:
>    they would lose that bet!

then jim said:
>    (Not that a lot can't be done to 
>    get rid of 90% of the errors "automagically!")

so you won't grant 100%, but you will grant 90%.

well, google is probably betting they can
get rid of 97% of the errors automatically.

do you want to bet against google?

because i'll take that bet against you.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/41fb3a36/attachment.html>

From jimad at msn.com  Wed Mar  3 14:57:58 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 3 Mar 2010 14:57:58 -0800
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
	listen, pray)
In-Reply-To: <4B8ED97E.5020105@novomail.net>
References: <30a2.40c90764.38bf8822@aol.com>	<E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>	<SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
	<4B8ED97E.5020105@novomail.net>
Message-ID: <SNT120-DS6331DA5A3CAB3D5EA4EF2AE3A0@phx.gbl>

>I have participated, or attempted to participate, in a number of FOSS 
projects over my career as a programmer, and I have a few observations 
which you may find relevant.

Sorry, but by a "high priesthood" system I mean the typical pattern of a
tech organization, the same way that DP is organized, where a newbie starts
at the "grunt" level, and by playing the game and following the rules
advances to the roll of "Lord High Pooh-Bah."  My only objection to this
organization at DP is that they are not getting the right number people in
each of the various roles, and don't seem to understand (or be willing to
accept) what changes they would have to make in order to get the right
number of people in any particular role.

>Every successful FOSS project I have ever observed has started with the 
vision of a single individual

And every (continuing to be) successful organization eventually must grow
past that individual.
 
>I don't believe that DP is saying "if one person can 
proofread a text in 10 days, then 10 people can proofread it in one 
day," but they are saying "100 people can proofread it in two days." 

On the contrary, the problem is that an individual, such as myself, can
create a decent book in about 40 hours work over the course of one month
which consists of about 720 hours elapsed time.  The average book passing
through DP nowadays takes over 30,000 hours elapsed time, with an average of
20 volunteers working on each book. I think we know from previous analysis
that doing a book through DP takes at least 1.5X as much hands-on time as
doing it "solo."  Whether that is a problem or not depends on what you think
about volunteers and their time.  I look at it and say gee, we could be
getting an additional 10,000 books out of DP if we got the system tweaked
right.  That seems like a change worth doing to me. Now the fact that doing
a book through DP takes 40X more elapsed time than doing it "solo" is that a
problem or not?  Obviously some people think that taking that long
corresponds to "quality" -- a project needs to age on the queues like an old
cheese.  Other people like me find waiting for our projects to "go live"
again for a few days or weeks once or twice a year a bore and a nuisance.
Some DP insiders agree that getting "scooped" by others posting that which
DP is still sitting on can be disheartening -- but there seems to be a
misunderstanding for who there is to blame when this happens.

>Your problem, Mr. Adcock, is that you believe you can change the vision 
underlying either of these organization through rational argument. 

I wouldn't think that having a wrong number of people in any particular role
at a particular point in time would be a big-enough deal as to qualify as a
"vision statement".  But if does then I agree this would be a problem. I
would certainly agree based on personal experience that NFP organizations
that run into difficulties are frequently not very receptive to rational
analysis!  "My problem", if we have to talk about my problems of which there
are many, is that I submitted two books in good faith to DP which are now
stuck there indefinitely after I contributed many many hours of my own time
and tears, and I have no way to get those books back out.

>GO FOR IT!

I am.  I create books for PG "solo."  Are they are high quality as DP?  No,
probably not quite there.  Are they created much more efficiently?  Yes,
much more efficiently.  I have created at least one tool that makes this
much more efficient for me.  Others are welcome to try it if they wish.


From jimad at msn.com  Wed Mar  3 15:03:34 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 3 Mar 2010 15:03:34 -0800
Subject: [gutvol-d] Re: DP/PG vs. Google
In-Reply-To: <3103a.2b657b5c.38c03a05@aol.com>
References: <3103a.2b657b5c.38c03a05@aol.com>
Message-ID: <SNT120-DS1FA3CD8A7EE5AE1BBB9D6AE3A0@phx.gbl>

>do you want to bet against google?

>because i'll take that bet against you.

 
Sure, I'd be happy to take that bet, if I am allowed to win it or lose it in
a finite amount of time - such as a decade.  What I think is much more
likely in a decade is that Google is either gives up or they figure out how
to post much more attractive page images.  I actually don't think they have
much of any interest in posting higher quality automatic OCR transcriptions.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/3c0aee85/attachment-0001.html>

From sly at victoria.tc.ca  Wed Mar  3 15:14:04 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Wed, 3 Mar 2010 15:14:04 -0800 (PST)
Subject: [gutvol-d] Re: The conundrum of FOSS projects
In-Reply-To: <4B8ED97E.5020105@novomail.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>
	<SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
	<4B8ED97E.5020105@novomail.net>
Message-ID: <Pine.GSO.4.58.1003031509300.6156@vtn1.victoria.tc.ca>

Thanks for the thought-provoking post Lee.

That helped put things in a new context for me.

--Andrew

On Wed, 3 Mar 2010, Lee Passey wrote:

> I have participated, or attempted to participate, in a number of FOSS
> projects over my career as a programmer, and I have a few observations
> which you may find relevant.
>

[snip]

From Bowerbird at aol.com  Wed Mar  3 15:16:57 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 18:16:57 EST
Subject: [gutvol-d] roundlessness -- 010
Message-ID: <34f8a.2b48e537.38c047e9@aol.com>

by the way, i just thought i would reiterate an old screenshot...

perhaps some of the people planting bugs in rfrank's ears
will consider pointing him in the directions suggested here:
>    http://z-m-l.com/3column-zml.jpg

that's a big screenshot, because i've got a big screen, but
the idea is that the proofer does the word-by-word scan
_not_ against a web-page's textfield version of the page,
but rather against an .html-realized version of the page...

(if proofers wanted, you could even use the d.p. font on it.)

the main benefit is that you free 'em from having to look
at the markup, because that's an unnecessary distraction.
they see actual rendered italics, not the markup for italics.

it's also possible this way to red-flag any possible scannos,
as well as capitalization and punctuation improbabilities...

you can also colorize quotations, which helps locate any
missing or incorrect quotemarks.

like i said, i have a big screen, so i can put up 3 pages --
the textfield, the original scan, and the .html version, but
for a smaller screen, you'd put up the scan and the .html.

then, only if there are changes to be made will the proofer
summon the textfield for editing.

i also show lots of buttons on the screen.   some are there
to add words to the book's custom dictionary, so that you
don't have to have them flagged the next time they appear.

the others are just marked with numbers, to indicate things
that they could be used for, such as italicizing selected text.

again, proofing the .html version is easier than the textfield.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/8e5cf8f9/attachment.html>

From hart at pglaf.org  Wed Mar  3 15:44:18 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 15:44:18 -0800 (PST)
Subject: [gutvol-d] Re: DP/PG vs. Google
In-Reply-To: <SNT120-DS1FA3CD8A7EE5AE1BBB9D6AE3A0@phx.gbl>
References: <3103a.2b657b5c.38c03a05@aol.com>
	<SNT120-DS1FA3CD8A7EE5AE1BBB9D6AE3A0@phx.gbl>
Message-ID: <alpine.DEB.2.00.1003031539110.15331@mail.pglaf.org>


Google's plan, from the outset, a year before we ever heard about it
via the media, was to create the most "eBooks" for the cheapest cost
and to generate the most media blitz public relations they could; it
really had very little to do with creating high quality eBooks, tho,
even I must admit, some came out better than I expected.

When it comes to comparisons to PG/DP, Google is a paper tiger quite
literally when it comes to quality, but when it comes to quantity it
is PG/DP that is the dead tree big stripey cat.

All in all, it won't hurt either way, and the ends will hit middles,
with greater numbers of eBooks and greater quality.

Don't forget The Internet Archive, etc.


On Wed, 3 Mar 2010, James Adcock wrote:

>
> >do you want to bet against google?
>
> >because i'll take that bet against you.
>
> ?
>
> Sure, I?d be happy to take that bet, if I am allowed to win it or lose it in a finite
> amount of time ? such as a decade.? What I think is much more likely in a decade is that
> Google is either gives up or they figure out how to post much more attractive page images.?
> I actually don?t think they have much of any interest in posting higher quality automatic
> OCR transcriptions.
>
>
>
>
>

From hart at pglaf.org  Wed Mar  3 16:06:59 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 16:06:59 -0800 (PST)
Subject: [gutvol-d] Re: Processing eTexts
In-Reply-To: <2516d.7515f64.38c014e4@aol.com>
References: <2516d.7515f64.38c014e4@aol.com>
Message-ID: <alpine.DEB.2.00.1003031606230.15331@mail.pglaf.org>


For those who worry about BB and "perfect" eBooks,
the following should ease that worry greatly!!!


On Wed, 3 Mar 2010, Bowerbird at aol.com wrote:

> carel said:
> >? Yes, we have been doing a semantic dance
>
> ok, that's kinda what i thought.
>
>
> >?? I just feel that a final 'proofing' stage before release
> >?? would assist in locating?errors that were either
> >?? missed or introduced by the processing.
>
> well, having "one more proofing" is _always_ a great thing.
>
> provided that somebody else is willing to _do_ it, that is...
>
> the acid test is whether you deem it to be so necessary that
> you will do it yourself.? that's a nice way to help you decide
> whether the _cost_ of that additional proofing is _worth_it_,
> whether it will provide enough _benefit_ in the text accuracy.
>
> once you gain enough trust in your tool and its performance,
> believe me that you'll decide that it performs "well enough"...
>
> but yes, it is important to gauge the accuracy of your tool...
>
> my goal is less-than-1-error-every-10-pages, and my tool
> and workflow consistently delivers better results than that.
>
>
> >?? A human will do the processing and humans
> >?? can make mistakes and some of the mistakes
> >?? that could be made in what would be both?
> >?? error and formatting processes could be quite grand.
>
> i don't worry about errors that are "quite grand"...
> they're easy to spot, and obvious to debug and fix.
>
> my experience is small errors are more troubling...
>
>
> >?? I feel that a second set of eyes can never be
> >?? a bad thing when it comes to something like this.
>
> that's easy to say until we ask you to be "the second set"
> on a million e-texts, all of which are almost perfect now.
>
> you -- and anyone else we ask -- will say "good enough".
>
> at some point, the benefit of greater accuracy isn't worth it.
>
> and then we say, "if the people reading this book because
> they _want_ to read it cannot find any errors in it, then
> that's their problem, but we cannot spend any more time
> having innocent people re-proof this book _once_again_
> simply because there _might_ still be an error in the thing."
>
> again, i draw the line quite specifically.? if a page has been
> looked at by 2 people in a row who could not find an error,
> then i certify that page as "good enough for the public" and
> stop looking at it.? you can make it 1 person, or 3 people,
> or 4 people or 8 people or 22 people, whatever you like, but
> nobody would ever suggest we keep proofing a book forever.
>
> now, let me be clear that i understand that you only said
> "a second set of eyes" and not 22 sets of them.? i agree...
> and that's specifically why i use the comparison method,
> because it gives us two sets of eyes on a book, essentially.
>
>
> >?? And, those with less experience (or no experience)
> >?? in shaping the output of a text may feel more confident
> >?? about doing the process if they know someone else will
> >?? provide a checksum for their work before it goes public.
>
> except that the _public_ provides that checksum for them.
>
> they constitute your "second set of eyes", your 3rd, your 23rd.
>
>
> >?? the text would be released to PG and then should be
> >?? placed in some environment that allows for editing it
> >?? to 'perfection.'
>
> it would be nice if p.g. did this.? or d.p.? but neither does.
>
> -bowerbird
>
>

From Bowerbird at aol.com  Wed Mar  3 16:48:36 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Mar 2010 19:48:36 EST
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <3985b.3dbb423c.38c05d64@aol.com>

michael said:
>    For those who worry about BB and "perfect" eBooks,
>    the following should ease that worry greatly!!!

i must admit i can no longer tell when you are serious,
when you are misreading me, and when you are joking.

but i've been _perfectly_ clear, and consistent, all along.

i say a book that has 1-error-or-less-every-10-pages
is _perfectly_ ready for release to the general public,
with the explicit understanding that we do all we can
to encourage and make it easy for that general public
to help us in moving the books toward _perfection_...

i hear lots of chestbeating about quality -- both by those
who argue for it, and those who argue otherwise -- but i
see very little activity productively engaged in attaining it.

i've also done scads of research on how to develop tools
and processes that will help us improve on our accuracy,
and means by which we can have the general public help.
precious little of my progress has been utilized by anyone.

i'm not hung up on perfection -- it's nigh unattainable --
but neither have i ever been willing to have any other goal.

i consider my position to be a fully reasonable one, and
i've been perfectly clear on it, and preached it consistently,
since the start.   if you heard anything else, you misheard.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/18a0ae76/attachment.html>

From hart at pglaf.org  Wed Mar  3 17:14:39 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 17:14:39 -0800 (PST)
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
 listen, pray)
In-Reply-To: <4B8ED97E.5020105@novomail.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<E7C160A1-4B56-4C00-A815-8EF157EFD574@uni-trier.de>
	<SNT120-DS15FE8ADC78D1DA1664ED39AE3A0@phx.gbl>
	<4B8ED97E.5020105@novomail.net>
Message-ID: <alpine.DEB.2.00.1003031641400.17633@mail.pglaf.org>


While Lee's comments are pretty great, there are a few comments/corrections:


On Wed, 3 Mar 2010, Lee Passey wrote:

> On 3/3/2010 11:29 AM, Jim Adcock wrote:
>
> > You have to be willing to adjust or modify the "high
> > priesthood" system.
>
> I have participated, or attempted to participate, in a number of FOSS
> projects over my career as a programmer, and I have a few observations which
> you may find relevant.
>
> Every successful FOSS project I have ever observed has started with the
> vision of a single individual. In the years leading up to 1995, Eric A.
> Young single-handedly managed to implement the full suite of cryptosystems
> used in SSL, and in that year made it available on the internet for free.
> This effort became the foundation of OpenSSL. Until he was lured away by
> RSA, Mr. Young was the driving force behind OpenSSL. Today, the role of
> visionary is played by Ralf Engelschall and Ben Laurie.
>
> In 1991 Andrew Tridgell, another Australian needed to mount disk space from
> a Unix server to his DOS PC. Using a packet sniffer he was able to reverse
> engineer the System Message Block protocol used by IBM's NetBIOS system, and
> which was the basis for DOS and Windows networking. This work eventually
> became Samga, a Unix/Linux software suite that provides file and print
> services to Windows-based clients. Mr. Tridgell still participates, and is
> the driving force behind the Samba open source project.
>
> While many have criticized his alleged heavy-handedness, I believe that the
> success of the Linux kernel is primarily due to the fact that Linux Torvalds
> still has absolute authority over what changes go into that kernel.
>
> Michael Hart plays the same role at Project Gutenberg that these programming
> giants played in the development of their respective software projects.
> Project Gutenberg was the brainchild of Mr. Hart, and he continues to be the
> driving force and visionary behind the project. While he, with
> uncharacteristic modesty, primarily credits the volunteers for the nature of
> Project Gutenberg, I disagree. For better or for worse, Project Gutenberg is
> the product of Mr. Harts vision and tenacity.
>
> Distributed Proofreaders was founded in 2000 by Charles Franks to assist in
> the production of electronic texts specifically to be distributed by Project
> Gutenberg. According to my recollection, Mr. Franks' theory was that
> production of e-texts was hampered by the fact that few people were willing
> to take on the task of producing an entire e-text, particularly through the
> arduous text proofreading process. His vision was to take a text and break
> it up into discrete units (in this case, pages) so that many people could be
> involved in the proofreading process and lightening the burden. Thus, the
> one time DP catch-phrase, "Proofread a page a day, that's all we ask." The
> volunteers at Distributed Proofreaders have become very good at proofreading
> texts.
>
> I have also seen any number of FOSS projects which have attempted to begin
> through consensus and team building. I can't name any of these projects for
> you, because they have all either failed or were still-born.

Sadly to say, this is all too true, both locally and nationally,
not to mention internationally.


> I think I have learned this lesson from my observations of these projects:
> to be successful you must have one single visionary who controls, more or
> less, the project. Having that visionary will not guarantee success, but not
> having it will surely doom it. At if a project loses its visionary, or
> marginalizes him or her to the point where he or she no longer controls the
> vision, the project will become increasingly ineffective and inefficient,
> and will descend into in-fighting and turf wars as others try to control the
> vision.

I would like think that Project Gutenberg, and Distributed Proofreaders will
continue on without me until they can't find anything more to do on eBooks,
and perhaps then even continue on to something else.


> Vision cannot be obtained by consensus.

I suppose I have been lucky enough to have managed this once or twice.


> When someone criticizes Project Gutenberg for supposed failings, or the
> inability or unwillingness to keep up with the times, and Michael Hart
> responds with his now inevitable suggestion to "JUST GO FOR IT," what he is
> saying is "what you are suggesting does not match my vision. If you feel
> your vision is better than mine I encourage you to go elsewhere to pursue

"I encourage you to go elsewhere to pursue it" is not quite correct,
even though there is some amerlioration below.  We are more than happy
to house any free eBooks efforts right here at Project Gutenberg, with
or without our gutenberg.org or pglaf.org domain being associated, it's
pretty much up the the people in question, and if they don't want some
asscociation with PG we will provide readingroo.ms, etc., etc., etc.

We will provide ALL of the infrastructure possible, and ask volunteers
to help, but, being volunteers, it is really up to them.

To lead here at Project Gutenberg you have to lead by example.

DO SOMETHING!!!  [You'll probably have to do it a couple dozen times.]

Then ask others to get on the bandwagon with you and do it some more.

When this works it is like starting an avalanche with snowballs.

///

I think if Mr. Bowerbird had been willing to follow such a plan
and to post an example of a completed book he did once a month,
or even once every two or three months, he/we would have dozens
of them online by now and there would no longer be arguments of
such hypothetical types, but much more concretized.

I must state for the record that I have encouraged him to this,
pretty much every single year he has been here.

I would encourage anyone/everyone else to do the same.

It's all you would have to do to wrest "control" of PG from me,
and then I could go invent something else.


> it. We can offer some infrastructure support (disk space) and you are
> welcome to invite Project Gutenberg volunteers to go help you actualize your
> vision, but I will not substitute your vision for mine."


Not quite right:

What I will not do, as asked so many times, is to state for official record
that YOU are the official boss of Project Gutenberg and that YOUR method IS
THE ONLY OFFICIAL METHOD OF PROJECT GUTENBERG.


> I am not prepared to say that Distributed Proofreaders has lost its vision.
> It is still proofreading a lot of pages every day. It is clearly /not/ an
> efficient process, but efficiency was not one of the project goals. We are
> all familiar with the old saw that while one woman can have a baby in 9
> months that doesn't mean that 9 women can have a baby in one month. I don't

No, but a group of women can have an average of one baby per month.

When you are dealing with larger numbers it's not exactly the same.


> believe that DP is saying "if one person can proofread a text in 10 days,
> then 10 people can proofread it in one day," but they are saying "100 people
> can proofread it in two days." Distributed Proofreaders goal was to increase
> the speed that texts would be proofread, to lighten the load from any one
> individual and to make the process more fault-tolerant (if one volunteer
> quit, the project would not need to be restarted).

Actually, 10 people CAN do that kind of job in one day, and have!!!

However, it is nice to have both someone at the wheel and a substitute.


> What has happened is that the needs of the consumer has changed. I'm fairly
> certain that the proofread texts now sitting in DP's Post-Processing queue
> would meet Michael Hart's standards (or lack thereof, as he is continually
> telling me he has no standards)

Again not quite right:

It's not that I have no standards, I just don't force them on people.

Even when it comes down to hard and fast accuracy percentages, I will
state the accuracy level I hope for at any given time.

Right now it is 99.975%

Earlier it was 99.95% [co-opted by the Library of Congress, hee hee!]

Before that it was 99.9%, but that was when I started with a version 0.1
not a version 1.0, and worked up to 1.0.


> and could be released to Project Gutenberg as is. Other consumers, however,
> have higher standards, and Distributed Proofreaders is now trying to satisfy
> those standards as well, and those new standards require post-processing of
> a work as a single unit by a single person.

We always had a single person as the last post-processor.

First it was me, then Judy Boss, then me again, then Greg Newby,
then me again, then Newby again, etc., etc., etc.


> DP's vision and expertise is in the area of distributed proofreading, not in
> the area of efficient e-book creation. This is why texts languish in the
> Post-Processing queue.
>
> Your problem, Mr. Adcock, is that you believe you can change the vision
> underlying either of these organization through rational argument.

Personally, I believe in rational argument, with stated premises followed
by stated conclusions, stacked on top of each other to final conclusions.

However, as many of you have undoubtedly note bened, when such arguments
are put forth, the opposition ignores them in "fair and balanced" ways.

[Just to make sure those who never heard of "fair and balanced" look it up]


> Vision is an intuitive, almost religious, experience, and blind faith is
> immune to rationality. It is virtually impossible that you will be able to
> change the vision either of Mr. Hart or whomever is currently the visionary
> at Distributed Proofreaders.

While my faith in the whole of the eBook movmement and Open Source is
pretty much unshakeable, it is a rational faith, not blind, based on
the simple cost/benefit ratio.

In then end just plain individuals can do all the eBooks and post them
where seach engines can find them.

It's nice to have large collections, but not necessary.


> I suspect that this is why Roger Frank has created his own web site for
> "roundless proofing;" his vision differs from that of Distributed
> Proofreaders, and it was simply easier to go his own way than to try and
> change someone else's vision.

And so too could anyone else, with less effort, and more cooperation.

However, doing it yourself has certain inalienable advantages!!!


> I believe I agree with every criticism you have leveled at both Project
> Gutenberg and Distributed Proofreaders, which is to say, I believe I accept
> your vision. So let me mimic the words of Michael Hart:
>
> GO FOR IT!
>
> Put together your own project to complete high-quality public domain
> e-books. You could certainly harvest all of the files currently in the DP
> post-processing queue to start with. You might be able to grab the HTML
> files from PG if you can find scans to go with them. Take advantage of the
> hardware resources that Mr. Newby has offered. Post messages here and at DP
> inviting volunteers to help you out. No need to return the e-books to either
> of those organizations; if they want them they will know where to find them.
> I will help out as much as possible.
>
> But please stop trying to convince Distributed Proofreaders or Project
> Gutenberg to accept a new vision. They are old and are set in their ways.
> They represent the last internet generation, not the current one. Show us
> the way forward, and let sleeping dogs lie.

I'm still interested in new visions, but just not those that tell me to do
something YOU should be doing, even though I am willing to help.

I am willing to help!!!

Period.

That's the bottom line.

And you don't even have to give me or PG any credit. . . .


From prosfilaes at gmail.com  Wed Mar  3 17:30:30 2010
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 3 Mar 2010 20:30:30 -0500
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
Message-ID: <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>

On Wed, Mar 3, 2010 at 1:24 PM, Jim Adcock <jimad at msn.com> wrote:
>?I suspect when Google digitizes the book the
> original is then trashed by the college library

That would be silly. When you have endowments the size of the Harvard
University, you have no need to do that; since they're already built
book-storage buildings where books can stored in more space-efficient
forms than browseable stacks and are retrievable only with a couple
days' notice, you're simply more free to exile them there.

> -- the whole point being
> they do not want to have to pay to maintain physical library books in
> various states of decay.

The whole point of this thing is that Google thought that digitalizing
this material would be valuable, and the universities all thought that
it would be valuable to have digital copies of their collection, and
that it would further their mission to spread knowledge.

> ?Google then becomes the sole repository for this
> information

No. The universities all have copies of all the scans made from their books.

> But to my mind one measure is
> obvious: ?Books that real people do not in practice want to read we should
> not bother to restore!

Then we aren't doing enough porn. If your sole measure of worthiness
is the number of hits, then forget about doing the works of Sarah Orne
Jewett, let's start digging up all that erotica published in the 20s
and 30s under the table and watch the Google hits come flying it.

> As a simple measure at least the total
> amount of time people spend reading the book has to exceed the amount of
> time volunteers spend preparing the book, or it's a loss to society.

It's not a loss to society to take time that would be used for
watching TV and use it to restore books. It's not a loss to society if
we make a work accessible to the right scholar, or if we inspire the
right person.

> But it is trivial to find a book to work on that will be
> 50X more popular than the average book DP finishes.

First, looking at the puerile crap (no offense intended) that comes up
as done by you, I'm not sure you can find it. The first Slashdotting
of DP, someone complained that among the little material we had
available was my scan of "From October to Brest-Litovsk", but to this
day, I think that book--history written with lightning--was one of the
more important works I did, and probably more read too (someone did it
for Librivox).

In some sense, the single most popular work PG has has to be the 1913
Webster's, which has been borrowed as the basis of just about every
online free dictionary, and referred to by people who don't even know
that PG exists.

And another major point is, what do DPers actually want to work on?
Hard material tends to go through slowly, where as junk fiction tends
to go through pretty quickly. That has nothing to do with the
popularity or worthiness of the text. We could toss out a bunch of the
"less worthy" books in exchange for the OED or porn, but I doubt that
will increase DP production overall.

-- 
Kie ekzistas vivo, ekzistas espero.

From hart at pglaf.org  Wed Mar  3 17:59:35 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 3 Mar 2010 17:59:35 -0800 (PST)
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
 e-texts
In-Reply-To: <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1003031758120.17633@mail.pglaf.org>


Sorry, but lots of libraries are doing JUST that!!!

Selling the books after digitizing. . . .

I bought several volumes of the NY Herald when this was done.

I will probably buy more.


On Wed, 3 Mar 2010, David Starner wrote:

> On Wed, Mar 3, 2010 at 1:24 PM, Jim Adcock <jimad at msn.com> wrote:
> >?I suspect when Google digitizes the book the
> > original is then trashed by the college library
>
> That would be silly. When you have endowments the size of the Harvard
> University, you have no need to do that; since they're already built
> book-storage buildings where books can stored in more space-efficient
> forms than browseable stacks and are retrievable only with a couple
> days' notice, you're simply more free to exile them there.
>
> > -- the whole point being
> > they do not want to have to pay to maintain physical library books in
> > various states of decay.
>
> The whole point of this thing is that Google thought that digitalizing
> this material would be valuable, and the universities all thought that
> it would be valuable to have digital copies of their collection, and
> that it would further their mission to spread knowledge.
>
> > ?Google then becomes the sole repository for this
> > information
>
> No. The universities all have copies of all the scans made from their books.
>
> > But to my mind one measure is
> > obvious: ?Books that real people do not in practice want to read we should
> > not bother to restore!
>
> Then we aren't doing enough porn. If your sole measure of worthiness
> is the number of hits, then forget about doing the works of Sarah Orne
> Jewett, let's start digging up all that erotica published in the 20s
> and 30s under the table and watch the Google hits come flying it.
>
> > As a simple measure at least the total
> > amount of time people spend reading the book has to exceed the amount of
> > time volunteers spend preparing the book, or it's a loss to society.
>
> It's not a loss to society to take time that would be used for
> watching TV and use it to restore books. It's not a loss to society if
> we make a work accessible to the right scholar, or if we inspire the
> right person.
>
> > But it is trivial to find a book to work on that will be
> > 50X more popular than the average book DP finishes.
>
> First, looking at the puerile crap (no offense intended) that comes up
> as done by you, I'm not sure you can find it. The first Slashdotting
> of DP, someone complained that among the little material we had
> available was my scan of "From October to Brest-Litovsk", but to this
> day, I think that book--history written with lightning--was one of the
> more important works I did, and probably more read too (someone did it
> for Librivox).
>
> In some sense, the single most popular work PG has has to be the 1913
> Webster's, which has been borrowed as the basis of just about every
> online free dictionary, and referred to by people who don't even know
> that PG exists.
>
> And another major point is, what do DPers actually want to work on?
> Hard material tends to go through slowly, where as junk fiction tends
> to go through pretty quickly. That has nothing to do with the
> popularity or worthiness of the text. We could toss out a bunch of the
> "less worthy" books in exchange for the OED or porn, but I doubt that
> will increase DP production overall.
>
> --
> Kie ekzistas vivo, ekzistas espero.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From dakretz at gmail.com  Wed Mar  3 18:04:57 2010
From: dakretz at gmail.com (don kretz)
Date: Wed, 3 Mar 2010 18:04:57 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>
Message-ID: <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com>

 And another major point is, what do DPers actually want to work on?
Hard material tends to go through slowly, where as junk fiction tends
to go through pretty quickly. That has nothing to do with the
popularity or worthiness of the text. We could toss out a bunch of the
"less worthy" books in exchange for the OED or porn, but I doubt that
will increase DP production overall.

This is at least due to the urgency DP places on moving people out of their
comfort zone.

New people at every level are encouraged to choose (naturally enough) easy
projects to climb the learning curve; and since virtually everyone is being
encouraged to advance, this material comprises a larger portion than it
would otherwise. To assist this, easy projects are released from the queues
more quickly (again to encourage new skills).

I've mentioned that no Shakespeare play has been released into F2 or
processed
into PG for several years, despite sitting in the F2 queue much of that
time.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100303/d4774735/attachment.html>

From donovan at abs.net  Wed Mar  3 18:21:49 2010
From: donovan at abs.net (D Garcia)
Date: Wed, 3 Mar 2010 21:21:49 -0500
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
	listen, pray)
In-Reply-To: <SNT120-DS6331DA5A3CAB3D5EA4EF2AE3A0@phx.gbl>
References: <30a2.40c90764.38bf8822@aol.com> <4B8ED97E.5020105@novomail.net>
	<SNT120-DS6331DA5A3CAB3D5EA4EF2AE3A0@phx.gbl>
Message-ID: <201003032121.50581.donovan@abs.net>

Jim/James:

re:
>... I submitted two books in good faith to DP which are now
>stuck there indefinitely after I contributed many many hours of my own time

Of your two projects, the first went from creation to completion of all rounds 
solely through the normal operation of the queues in two months, and is being 
post-processed/verified. If you have concerns as PM about specifics of the 
project or its status, you should contact the post-processor or the PP-verifier 
of the project via the several means available to you on the DP site.

The second project has also gone from creation to completion of P1, P2, P3 and 
F1 in two months, also solely through the normal operation of the queues, and 
has been waiting in F2 for seven months. In the normal operation of the 
queues, this project would release about six weeks from now. That's pretty far 
from "indefinitely."

I have taken the liberty of releasing this project into F2 where a group of F2 
volunteers are focusing their efforts on it and will easily complete it before 
day's end, possibly before this post reaches the list.

re:
>... and I have no way to get those books back out.

Since you are the project manager, you could have assigned yourself as post-
processor and requested that it skip F2. However, as the F2'ers are finding and 
correcting formatting and other errors, it's probably better that you didn't.

Project managers have options within the DP process, including, but not 
limited to those mentioned above, either of which could have progressed your 
project. Any of the DP project facilitators, db-req, dp-help, or admins could 
have heard your concerns and and discussed options with you, which could have 
saved much of the frustration which you've expressed on this list, had you 
only asked.

David (donovan)

James Adcock wrote:
>"My problem", if we have to talk about my problems of which there
>are many, is that I submitted two books in good faith to DP which are now
>stuck there indefinitely after I contributed many many hours of my own time
>and tears, and I have no way to get those books back out.

From schultzk at uni-trier.de  Thu Mar  4 01:36:41 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 4 Mar 2010 10:36:41 +0100
Subject: [gutvol-d] Re: DP/PG vs. Google
In-Reply-To: <DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
Message-ID: <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>

Hi All,

	I step in here and reply to a couple posts at one time.

Am 03.03.2010 um 20:42 schrieb Pascal J. Bourguignon:

> 
> On 2010-03-03, at 19:24, Jim Adcock wrote:
>> However, The DP/PG approach is extremely expensive compared to what Google
>> is doing.  Consider: Google Books == about 10 million books photo scanned.
>> DP/PG == 30,000 books "fully restored." So Google's approach is about 300X
>> faster than the DP/PG approach.  My Conclusion: In the best of all world's
>> there would be some measure of VALUE in choosing which books DP/PG chooses
>> to put effort into fully restoring -- the idea that somehow DP/PG is going
>> to be able to fully restore all the world's books is surely false.
	Google produces scan sets. Sure they put some in amore pleasurable form,
	But they are not interested producing books or even conserving them.
	The quality of the work is proof in fact. 

	My personal opinion is that Google is simply interested in producing revenue,
	by what even means! That does not mean that Google does not have any merit.

	DP want to produce pleasurable eBooks. Personally, DP/PG has more value.

> I think that the bet made by Google, is that sooner or later, sufficiently
> smart AI and OCR technology will be developed to allow to process its scans
> and do the job of PG automatically.
	I doubt this very much. 
		AI proper has been dead since the failing of the ELISA project.
		Yes, the term is still used today to refer to anything that a computer
		does that seems to be intelligent. But it is hardly AI.

		In the 80s machine translation was all the fad. The japanese said
		they would  have a MT system that would translate in real time your telephone
		conversations by the 90s. Well, here we are some 20 years later and
		we can have the most horrific translation made online. The standard is that
		of my introductory class I had in the 80s. Googles service does not even use the
		half of the developments made  MT. 
	
> 
> The only question is when it will happen, and some think that singularity
> will occur within 20 years.
	BB if it was realistic I would take you up on your bet. In 50 years their will not be a 
	system finished that will do job of creating proper output anything above 95%
	fully automatically That is without any human interaction whatsoever..

	Already, in the 90s it was said that with faster computers and cheaper storage
	the problems of knowledge engineering. Again, here we are and all is vaporware.
	It has been proven already in the 80s that human language Type 0 is and it is know
	that Type 0 can not be processed completely automatically by a computer. So the emphasis
	has change to simulate as much as possible. Yet, this will be always far from prefect.

	Sorry, for being here more than OT. But it was needed to prove the fact
	that anything having to do with language can not be handle by a computer
	program by itself.
	
	regards
		Keith.

From hart at pglaf.org  Thu Mar  4 07:02:07 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 4 Mar 2010 07:02:07 -0800 (PST)
Subject: [gutvol-d] !@! I Take That Bet! Re:  Re: DP/PG vs. Google
In-Reply-To: <7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
	<7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>
Message-ID: <alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>


BB if it was realistic I would take you up on your bet. In 50 years their will
not be a system finished that will do job of creating proper output anything
above 95% fully automatically That is without any human interaction whatsoever..

_I_ will take that bet!!!

Even thought there are no realistic odds I will be here to collect.

I will be only too glad to have the proceeds go to PG, or In Memoriam.

The bet is that a Xerox machine type of scanning and OCR will produce
a 95% accurate copy of certain pages selected from an average set of
books, magazines, etc.  Just go to a library and ask for samples.

Fair enough???


Michael


From Bowerbird at aol.com  Thu Mar  4 08:00:40 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Mar 2010 11:00:40 EST
Subject: [gutvol-d] Re:
 =?iso-8859-1?q?!=40!_I_Take_That_Bet!_Re=3A=A0_Re=3A_DP/PG_vs?=
 =?iso-8859-1?q?=2E_Google?=
Message-ID: <5b5e0.56a961e8.38c13328@aol.com>

michael said:
>    The bet is that a Xerox machine type of scanning and OCR 
>    will produce a 95% accurate copy of certain pages selected 
>    from an average set of books, magazines, etc.? 
>    Just go to a library and ask for samples.

that's not the bet at all.

the bet is whether google can increase accuracy to 96% or 97%.

we're not talking about the limits of scanning and o.c.r., people.

we're talking about what a company with virtually unlimited funds
and lots and lots and lots and lots of expertise with handling text
can do _after_ they've scanned books and done o.c.r. on the scans,
in order to improve the accuracy of that text.

folks, we're talking about how well they can clean up their o.c.r.

and i'm conservative by saying 96% or 97%...   quite conservative.

i've shown how useful it can be to compare two book digitizations.

but for some editions of some books, google will have _many_
different digitizations, involving different physical copies taken
from different physical libraries throughout the country, scanned
by different machines, and perhaps processed using different o.c.r.

they will certainly experiment with despeckling and resolution,
and other variables, and should hit on a comparison combination
which -- for their particular scans -- works remarkably effectively.

they will also have tons of data on the types of errors that are made
by their equipment, and knowing that _will_ help them fix the errors.

but mostly just having _multiple_digitizations_ of the same edition
of a book gives them the chance to raise accuracy through the roof.

you guys want to tie google's hands in the same way yours are tied.

but google's money and expertise mean they are _miles_ ahead...

and eventually probably even light-years ahead...

-bowerbird

p.s.   and the limitation on the bet that google can't use humans?
why not?   they have billions of pageviews every single day, not?
why do you think they bought recaptcha and hired luis von ahn?
they're not limited by the shackles that you want them to wear...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/377032fe/attachment.html>

From lee at novomail.net  Thu Mar  4 08:50:17 2010
From: lee at novomail.net (Lee Passey)
Date: Thu, 04 Mar 2010 09:50:17 -0700
Subject: [gutvol-d] Re: roundlessness -- 010
In-Reply-To: <34f8a.2b48e537.38c047e9@aol.com>
References: <34f8a.2b48e537.38c047e9@aol.com>
Message-ID: <4B8FE4C9.5090605@novomail.net>

On 3/3/2010 4:16 PM, Bowerbird at aol.com wrote:

[snip]

> the idea is that the proofer does the word-by-word scan
> _not_ against a web-page's textfield version of the page,
> but rather against an .html-realized version of the page...
...
> the main benefit is that you free 'em from having to look
> at the markup, because that's an unnecessary distraction.
> they see actual rendered italics, not the markup for italics.

FWIW, I like this idea very much. I think it meshes quite nicely with 
Mr. Frank's notion that markup should never be separated from its 
associated text.

> it's also possible this way to red-flag any possible scannos,
> as well as capitalization and punctuation improbabilities...
>
> you can also colorize quotations, which helps locate any
> missing or incorrect quotemarks.

This would have to be done subtly, so as not to influence users to make 
changes where they are inappropriate, but I think the idea has merit.

[snip]

> then, only if there are changes to be made will the proofer
> summon the textfield for editing.

The Kupu editor which is part of the Plone and Apache Lenya projects 
could be a very nice choice for this editor (good software engineers 
never want to reinvent wheels). In fact, I think it may be possible to 
use Lenya to build a prototype of this very sort of application.

I'll do a little research and get back to you.

From Bowerbird at aol.com  Thu Mar  4 09:18:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Mar 2010 12:18:58 EST
Subject: [gutvol-d] let them eat cake
Message-ID: <61369.187e84e6.38c14582@aol.com>

michael said:
>   Sorry, but lots of libraries are doing JUST that!!!
>    Selling the books after digitizing. . . .
>    I bought several volumes of the NY Herald when this was done.
>    I will probably buy more.

i'm afraid this situation is already bad, and will certainly get worse.

universities nationwide are coming into a huge money crunch and
it costs a lot of money to house books.   and the benefits of doing so
are becoming ever less clear, since usage has dropped precipitously.

(ends up if the kids can't get it online, they won't trudge to a library,
they'll just do without.   and not just the kids, but the _faculty_ too!)

so yes, universities are building book warehouses for collections.
(the u.c. southern regional library facility is on-campus at u.c.l.a.)

but there's even more to the story.

some google library partners like _michigan_ are forming co-ops
(e.g., the hathi trust) that will offer scansets to other institutions.

right now, obviously, they're aiming at colleges and universities,
but it's fairly clear they will soon target research institutions and
private schools, public schools in big cities, and big city libraries.

every entity that is now funding a library (which is probably quite
limited in scope and fairly expensive to maintain) will soon find
they can instead get access to a much bigger corpus of material
for a much cheaper price by subscribing to these rent-a-libraries.

so they will all get rid of their paper-books.   (can't afford both!)

so the problem is not just the libraries where scans are made,
but every single library across the entire country.

now, in an ideal world, that would be great!, because we would all
agree that everyone should have unlimited access to this library,
just like they have unlimited access to their neighborhood library.

but that's not what the moneychangers have in mind, no siree...

this isn't a chance for society to save money.

to the contrary, it's a way for the moneychangers to rob society.

in addition to the fee they'll extract from each overall institution,
they'll likely want to charge a fee to each individual user as well,
perhaps even a per-page fee for every page every user views...

some of you might think that that would be totally reasonable.

you're mistaken.   you're badly mistaken.   you're very badly mistaken.

the reason you're mistaken is that sharing these scans is a process
that has very little variable cost.   most of the costs were fixed costs.
scanning, for instance, was a fixed cost, and a one-time cost at that.

by the time these scans have been pushed out a dozen times, they
will have paid their fixed costs...   everything after that, and there'll
be much usage after that, will be profit, pure profit, _excess_ profit.

the moneychangers want to get paid over and over and over again.

when you consider that these books were purchased and housed
at _public_ expense -- some of 'em for well over a century -- this
profiteering against the public's pocket is totally unconscionable.

still, librarian bureaucrats have proven time and time and time again
that they're complete idiots who will cut their own throats long-term
to get even a questionable good in the short-term.   so they will play
along into this little con-game being played by the moneychangers,
and the public will once again be left holding the bag of bills to pay...

and the final upshot?   we'll pay even more for books than we pay now,
and the poor among us will find that their access is sharply curtailed...

but, hey, what do poor people need books for anyway?   let 'em eat cake.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/d8a2962b/attachment.html>

From jimad at msn.com  Thu Mar  4 10:02:14 2010
From: jimad at msn.com (Jim Adcock)
Date: Thu, 4 Mar 2010 10:02:14 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
	of	e-texts
In-Reply-To: <627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>	<6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>
	<627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com>
Message-ID: <SNT120-DS2062FBFAF37454DACA911AAE390@phx.gbl>

Well, I guess I should stop complaining now because one of my DP texts has
made it to PP and I was able to snag it back myself.  But, I will point out
its statistics on the latest round, and people can judge for themselves:

This book sat on the F2 queue for 7,200 hours.

It then went "live" in F2 status for 3 hours, which is how long it took 14
F2 volunteers to do all the pages.  Since about 3 volunteers were working on
the book at any given time, the total volunteer-hours spent on F2 was about
10.

So the ratio of [time sitting on queue]/[volunteer-hours working on text] is
about 700 to one.

Is this a well-designed system?

PS: This book WAS classified as "porn" when it first came out -- which may
explain WHY the volunteers are interested in tackling it.  I did tag it as
containing material related to sexuality and infidelity in case anyone
didn't want to work on those subjects.  Nowadays the "porn" label would be a
joke and the book is considered a classic of modern American literature.  In
defense of the DP volunteers the other book I have stuck in DP was tackled
even more voraciously by DP volunteers -- and that one was never considered
"porn."

>Hard material tends to go through slowly, where as junk fiction tends to go
through pretty quickly. 

Material can be hard AND junk.  I am perfectly happy to work on hard stuff
if it will actually get used by anyone.  I spent some time proofing a hard
book on DP [that was labeled "Easy"] that should have been titled "How to
Torture a Horse." Put up OED and I will help tackle it. I would also be
happy to put up "Outline of Science Vol. II" which is hard AND popular -- if
DP were willing to get it out the door in say a year or less.

>I've mentioned that no Shakespeare play has been released into F2 or
processed into PG for several years, despite sitting in the F2 queue much of
that time.

And I would also be willing to work on the bard. Again, I won't be one
processing him *into* DP unless I have some assurance that he's ever going
to come *out* again!


From Bowerbird at aol.com  Thu Mar  4 10:09:32 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Mar 2010 13:09:32 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <64a88.4b99698d.38c1515c@aol.com>

jim said:
>    Well, I guess I should stop complaining now

why?   i thought your complaints were made on a principle.
you were just bellyaching for a speed-up exception favor?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/fe2064ec/attachment.html>

From jimad at msn.com  Thu Mar  4 10:25:34 2010
From: jimad at msn.com (Jim Adcock)
Date: Thu, 4 Mar 2010 10:25:34 -0800
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do
	not	listen, pray)
In-Reply-To: <201003032121.50581.donovan@abs.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<4B8ED97E.5020105@novomail.net>	<SNT120-DS6331DA5A3CAB3D5EA4EF2AE3A0@phx.gbl>
	<201003032121.50581.donovan@abs.net>
Message-ID: <SNT120-DS2204076B44A6EAEC127BA1AE390@phx.gbl>

>Of your two projects, the first went from creation to completion of all
rounds 
solely through the normal operation of the queues in two months, and is
being 
post-processed/verified. If you have concerns as PM about specifics of the 
project or its status, you should contact the post-processor or the
PP-verifier 
of the project via the several means available to you on the DP site.

I have certainly done so, and have been told that it is "normal" for a PP to
take a long time at DP and that it would not be nice to keep asking the PP
every three months or so "how's it going."

>Any of the DP project facilitators, db-req, dp-help, or admins could 
have heard your concerns and and discussed options with you, which could
have 
saved much of the frustration which you've expressed on this list, had you 
only asked.

I did ask, and I was told that there was nothing I could do to expedite the
process and that these delays are normal, and what I should do is spend my
time and energy sticking more projects into the front end of the queue.


From jimad at msn.com  Thu Mar  4 10:34:09 2010
From: jimad at msn.com (Jim Adcock)
Date: Thu, 4 Mar 2010 10:34:09 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <64a88.4b99698d.38c1515c@aol.com>
References: <64a88.4b99698d.38c1515c@aol.com>
Message-ID: <SNT120-DS14FD681B6C206374EF60FEAE390@phx.gbl>

>jim said:
>>   Well, I guess I should stop complaining now
>
>why?  i thought your complaints were made on a principle.
>you were just bellyaching for a speed-up exception favor?

...said tongue in cheek. I haven't stopped complaining that the system
really doesn't work the way it's currently designed.  Although I don't see
why pointing out that the system doesn't work as designed should be
considered "bellyaching" any more than telling a webmaster that their server
is down is considered "bellyaching"!  Agreed that the phone company
considers that I am "bellyaching" when I call them to say that my phone
service isn't working and then I tell them that *their* phone service isn't
working either! [which typically takes about three hours because their
queuing systems don't work either....]


From marcello at perathoner.de  Thu Mar  4 10:50:44 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 04 Mar 2010 19:50:44 +0100
Subject: [gutvol-d] Re: !@! I Take That Bet! Re:  Re: DP/PG vs. Google
In-Reply-To: <alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>	<7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>
	<alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>
Message-ID: <4B900104.2020702@perathoner.de>

Michael S. Hart wrote:
> 
> BB if it was realistic I would take you up on your bet. In 50 years their will
> not be a system finished that will do job of creating proper output anything
> above 95% fully automatically That is without any human interaction whatsoever..
> 
> _I_ will take that bet!!!
> 
> Even thought there are no realistic odds I will be here to collect.
> 
> I will be only too glad to have the proceeds go to PG, or In Memoriam.
> 
> The bet is that a Xerox machine type of scanning and OCR will produce
> a 95% accurate copy of certain pages selected from an average set of
> books, magazines, etc.  Just go to a library and ask for samples.

Accuracy of OCR already exceeds 99%.

Send me the money.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From lee at novomail.net  Thu Mar  4 10:57:18 2010
From: lee at novomail.net (Lee Passey)
Date: Thu, 04 Mar 2010 11:57:18 -0700
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
 listen, pray)
In-Reply-To: <201003032121.50581.donovan@abs.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<4B8ED97E.5020105@novomail.net>	<SNT120-DS6331DA5A3CAB3D5EA4EF2AE3A0@phx.gbl>
	<201003032121.50581.donovan@abs.net>
Message-ID: <4B90028E.4020103@novomail.net>

On 3/3/2010 7:21 PM, D Garcia wrote:

[snip]

> I have taken the liberty of releasing this project into F2 where a group of F2
> volunteers are focusing their efforts on it and will easily complete it before
> day's end, possibly before this post reaches the list.

Man, you've got to love it!

Mr. Adcock points out that the production process at Distributed 
Proofreaders is broken, and offers a sample demonstrating /how/ it is 
broken. In response, Mr. Garcia removes the sample from the standard 
process and deals with it as a special case. In other words, instead of 
trying to fix the broken process, Mr. Garcia has simply tried to 
neutralize the complaint!

I've worked at a number of few different companies for which this was 
just Standard Operating Procedure ... most of whom are no longer with us.


From marcello at perathoner.de  Thu Mar  4 11:02:24 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 04 Mar 2010 20:02:24 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
 of	e-texts
In-Reply-To: <SNT120-DS2062FBFAF37454DACA911AAE390@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>	<6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>	<627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com>
	<SNT120-DS2062FBFAF37454DACA911AAE390@phx.gbl>
Message-ID: <4B9003C0.9050106@perathoner.de>

Jim Adcock wrote:

> PS: This book WAS classified as "porn" when it first came out -- which may
> explain WHY the volunteers are interested in tackling it.

Personally I'd like to see more porn on PG, we still lack most of De Sade.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From lee at novomail.net  Thu Mar  4 11:21:06 2010
From: lee at novomail.net (Lee Passey)
Date: Thu, 04 Mar 2010 12:21:06 -0700
Subject: [gutvol-d] Re: !@! I Take That Bet! Re:  Re: DP/PG vs. Google
In-Reply-To: <4B900104.2020702@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>	<7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>	<alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>
	<4B900104.2020702@perathoner.de>
Message-ID: <4B900822.9070006@novomail.net>

On 3/4/2010 11:50 AM, Marcello Perathoner wrote:

> Michael S. Hart wrote:

[snip]

>> The bet is that a Xerox machine type of scanning and OCR will produce
>> a 95% accurate copy of certain pages selected from an average set of
>> books, magazines, etc. Just go to a library and ask for samples.
>
> Accuracy of OCR already exceeds 99%.

Absolutely.

According to what I learned in typing class (yes, I really am that old) 
a standard typewritten sheet of paper averages 72 lines of 66 characters 
each, resulting in 4752 characters per page. Based solely on a per 
character basis 99% accuracy would allow 47 errors per page. Modern OCR, 
even that POS that IA uses, gives better accuracy than that.

If you choose to look at words instead of characters, it is generally 
accepted that the average word length is 6 characters, for an average of 
9.5 words per line (I have omitted spaces which is why it is not 11 
words per line). This results in an average of 679 words per page, which 
at 99% accuracy would allow for 6 misrecognized /words/ per page. That 
is still well within the recognition accuracy of modern OCR.

Personally, I find bowerbird's stated goal of 1 error per 10 pages a 
worthwhile goal. This is actually an accuracy rate (based upon words) of 
99.9998527%. So maybe the bet ought to be when automated OCR will exceed 
four 9s of accuracy (basically one word error per page). Some of the 
recent work I have done, from my own scans, already reaches that threshold.

(Accuracy will, of course, vary depending on the quality of the scanned 
image. YMMV and all that jazz.)

From hart at pglaf.org  Thu Mar  4 12:13:37 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 4 Mar 2010 12:13:37 -0800 (PST)
Subject: [gutvol-d] Re: !@! I Take That Bet! Re:  Re: DP/PG vs. Google
In-Reply-To: <4B900104.2020702@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
	<7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>
	<alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>
	<4B900104.2020702@perathoner.de>
Message-ID: <alpine.DEB.2.00.1003041210560.20842@mail.pglaf.org>


Marcello don't you ever READ anything before replying???!!!

Still???!!!

"In 50 years there will NOT be as system. . .above 95%. . ."

I took that bet, betting there WILL be. . .SHOW ME THE MONEY!!!

How do you expect anyone to EVER take you seriously when you do
this kind of thing over, and over, and over. . .???!!!


On Thu, 4 Mar 2010, Marcello Perathoner wrote:

> Michael S. Hart wrote:
> >
> > BB if it was realistic I would take you up on your bet. In 50 years their
> > will
> > not be a system finished that will do job of creating proper output anything
> > above 95% fully automatically That is without any human interaction
> > whatsoever..
> >
> > _I_ will take that bet!!!
> >
> > Even thought there are no realistic odds I will be here to collect.
> >
> > I will be only too glad to have the proceeds go to PG, or In Memoriam.
> >
> > The bet is that a Xerox machine type of scanning and OCR will produce
> > a 95% accurate copy of certain pages selected from an average set of
> > books, magazines, etc.  Just go to a library and ask for samples.
>
> Accuracy of OCR already exceeds 99%.
>
> Send me the money.
>
>
>

From schultzk at uni-trier.de  Thu Mar  4 12:37:29 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 4 Mar 2010 21:37:29 +0100
Subject: [gutvol-d] Re: !@! I Take That Bet! Re:  Re: DP/PG vs. Google
In-Reply-To: <alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<DAF927CE-3A14-4DD0-960B-77B84B48050C@informatimago.com>
	<7A4BA002-C18C-4B65-8606-A895FFB2CC91@uni-trier.de>
	<alpine.DEB.2.00.1003040656330.6569@mail.pglaf.org>
Message-ID: <A460A75E-3749-40C1-8BBD-70BC93C178EB@uni-trier.de>

	One big problem,

	You dio not stil a a PG or DP text ebook.
	You do have any markup what so even!

	Plus, what happens if you give them the Google scan
	sets!!


	I have work with OCR that will get me 100%
	text accuracy, but it took a hell alot of training,
	aka human interaction. 

	Also, OCR today achieves their accuracy from 
	dictionaries and guessing at the correct spelling.

	Which under many circumstances this type of heuristics
	causes a quite a few errors.

	regards
		Keith.
		
Am 04.03.2010 um 16:02 schrieb Michael S. Hart:

> 
> 
> BB if it was realistic I would take you up on your bet. In 50 years their will
> not be a system finished that will do job of creating proper output anything
> above 95% fully automatically That is without any human interaction whatsoever..
> 
> _I_ will take that bet!!!
> 
> Even thought there are no realistic odds I will be here to collect.
> 
> I will be only too glad to have the proceeds go to PG, or In Memoriam.
> 
> The bet is that a Xerox machine type of scanning and OCR will produce
> a 95% accurate copy of certain pages selected from an average set of
> books, magazines, etc.  Just go to a library and ask for samples.
> 
> Fair enough???
> 
> 
> Michael
> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From donovan at abs.net  Thu Mar  4 12:41:18 2010
From: donovan at abs.net (D Garcia)
Date: Thu, 4 Mar 2010 15:41:18 -0500
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
	listen, pray)
In-Reply-To: <4B90028E.4020103@novomail.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<201003032121.50581.donovan@abs.net>
	<4B90028E.4020103@novomail.net>
Message-ID: <201003041541.18286.donovan@abs.net>

Lee Passey wrote:

>Mr. Adcock points out that the production process at Distributed
>Proofreaders is broken, and offers a sample demonstrating how it is
>broken. In response, Mr. Garcia removes the sample from the standard
>process and deals with it as a special case. In other words, instead of
>trying to fix the broken process, Mr. Garcia has simply tried to
>neutralize the complaint!

It's telling that based on zero knowledge you first assume (wrongly) that I am 
not working on improving the DP process, and then compound the error by 
assuming that addressing a volunteers issue constitutes "neutralizing" a 
complaint, all the while ignoring the rest of the message which outlined the 
full situation instead of the narrowly spun perspective you present.

I'm sorry you believe that DP has nefarious intent in responding to a 
situation where a volunteer believed they had no recourse. Since I can't 
believe that you think ignoring that issue would have somehow been better, I 
am forced to conclude that your only concern is the spin you've tried to put 
on it.

Congratulations on winning a bet for me that someone would attempt to do 
exactly that. :)

David (donovan)

From dakretz at gmail.com  Thu Mar  4 13:07:54 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 4 Mar 2010 13:07:54 -0800
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
	listen, pray)
In-Reply-To: <201003041541.18286.donovan@abs.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<201003032121.50581.donovan@abs.net> <4B90028E.4020103@novomail.net>
	<201003041541.18286.donovan@abs.net>
Message-ID: <627d59b81003041307o2461de16ye52446d30fa19b51@mail.gmail.com>

I will in this case vouch for at least part of the representation
given by David (donovan). What you experienced is in fact the
primary method employed by the DP process managers for
trying to ameliorate the consequences of their system. When
a perceived deficiency is detected, it is defined as a "special
case" and given "special treatment".

So your first project probably qualifies as a "First Project",
and therefore has access to a good deal of standard "special
treatment" that you might not have been aware of (though it
was your responsibility to be so, unfortunately.)

Your second project may have also been qualified for another
standard "special treatment"; I'm not very familiar with all the
nuances, but put he certainly is - as he points out, he is one
of those primarily responsible for it. (In tact, it's also true that
he is one of the primary gatekeeepers for innovation and
process improvement generally.)

It's too bad you had the misfortune to be advised by someone
not familiar with the proper navigation of the dp process. As is
apparently also the case for whoever is responsible for the
Shakespeare projects.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/8f941d98/attachment-0001.html>

From Bowerbird at aol.com  Thu Mar  4 14:13:40 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Mar 2010 17:13:40 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <75210.438f65cb.38c18a94@aol.com>

jim said:
>    Although I don't see why pointing out that the system 
>    doesn't work as designed should be considered "bellyaching"

...also said tongue in cheek...          ;+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/5ae30105/attachment.html>

From Bowerbird at aol.com  Thu Mar  4 15:10:32 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Mar 2010 18:10:32 EST
Subject: [gutvol-d] Re:
 =?iso-8859-1?q?!=40!_I_Take_That_Bet!_Re=3A=A0_Re=3A_DP/PG_vs?=
 =?iso-8859-1?q?=2E_Google?=
Message-ID: <78ae8.531a7bed.38c197e8@aol.com>

keith said:
>   One big problem,
>    You dio not stil a a PG or DP text ebook.

there seems to be a transcription difficulty there...
that's ok, it happens even to the human among us.


>    You do have any markup what so even!

who needs markup?

o.c.r. can manage italics, not so well, agreed, but still.

as for the structural aspects of a text, like chapter-heads
and block-quotes and stuff like that, google is already
showing that they are capable of handling such things...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/cda866ad/attachment.html>

From lee at novomail.net  Thu Mar  4 15:15:12 2010
From: lee at novomail.net (Lee Passey)
Date: Thu, 04 Mar 2010 16:15:12 -0700
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
 listen, pray)
In-Reply-To: <201003041541.18286.donovan@abs.net>
References: <30a2.40c90764.38bf8822@aol.com>	<201003032121.50581.donovan@abs.net>	<4B90028E.4020103@novomail.net>
	<201003041541.18286.donovan@abs.net>
Message-ID: <4B903F00.3050704@novomail.net>

On 3/4/2010 1:41 PM, D Garcia wrote:

> I'm sorry you believe that DP has nefarious intent in responding to a
> situation where a volunteer believed they had no recourse.

I would never suggest than anyone at DP has nefarious intent without 
incontrovertible evidence, and I do not do so now; I'm a firm believer 
in Hanlon's razor.

> Since I can't
> believe that you think ignoring that issue would have somehow been better, I
> am forced to conclude that your only concern is the spin you've tried to put
> on it.

First of all, I don't think it is the responsibility of anyone at 
Distributed Proofreaders to make sure that any particular volunteer is 
satisfied. I'm confident that Mr. Adcock is a competent producer of 
e-texts, and if you were to have told him, "look, we're doing the best 
we can here, but if you want to pick up the project on your own here's 
where you can get everything we've done up until now," that would have 
been sufficient.

However, I certainly don't believe that ignoring the issue would have 
been a better option; instead I believe that addressing the issue 
head-on would have been better. There is an adage in Washington that 
"sunshine is the best disinfectant." Likewise, I believe that 
transparency is the best defense.

First of all, we must recognize that the problem that Mr. Adcock was 
complaining of was /not necessarily/ that two of his projects remained 
in the Post-Processing queue for an unduly long time. Rather the problem 
he identified is that somehow the current production processes allow 
/any and all/ projects to become backed up in that queue. Given that 
problem statement, I would have liked to have seen something more like 
one of the following responses:

1. "There are no problems with the processes at Distributed 
Proofreaders. If you don't like the way we do things here, you don't 
have to participate."

or,

2. "We recognize that there is a problem but we can't seem to agree upon 
the cause. We'll keep you informed as to the results of our inquiry. In 
the meantime, here's where you or the public-at-large can retrieve /all/ 
of the pieces of the stuck projects so you can take one and move it 
forward outside of the aegis of DP if you like."

or,

3. "We recognize that there is a problem in our production process and 
we think we have identified the cause, which is [fill in the cause 
here]. As of yet we have not agreed on the best way to reform the 
process, but we'll keep you informed as to our progress. In the 
meantime, here's where you or the pubic-at-large, etc..."

or,

4. "We have identified a problem in our production process, and believe 
it can be resolved by [fill in the proposed resolution here]. Please be 
patient while we see if this proposal resolves the backlog. If it does 
not, we will resume our search for the underlying problem, and in the 
meantime, here's where you or the public-at-large, etc..."

Instead we saw a response more along the lines of:

"We can not confirm or deny the existence of any problems in the 
production processes of Distributed Proofreaders, nor can we confirm or 
deny that we have identified any of the causes for these problems which 
may or may not exist. We may or may not have agreed upon what may or may 
not be a solution to these unidentified, alleged problems, but there is 
a possibility that we might change our process in unspecified ways. Or 
not. But as a special favor to you we will extract the two projects you 
are interested in to route around the damage, which may or may not 
exist, and process them using an entirely different procedure so that 
you will be satisfied."

It seems to me that this kind of response is designed, in fact, to 
ignore the issue at hand, which is that changes need to be made at D.P. 
to increase the throughput of e-texts.

Now it very well may be that this problem has already been recognized by 
The Powers That Be, and that a solution will be in place Real Soon Now. 
In that case, wouldn't it have been better to just say so?

> Congratulations on winning a bet for me that someone would attempt to do
> exactly that. :)

I'm always happy to help. ;-)

From sly at victoria.tc.ca  Thu Mar  4 15:23:53 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Thu, 4 Mar 2010 15:23:53 -0800 (PST)
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
 e-texts
In-Reply-To: <4B9003C0.9050106@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<6d99d1fd1003031730v633bef5ck431409a9d8b36e21@mail.gmail.com>
	<627d59b81003031804s6f679594hbe7a3ed26be58ac8@mail.gmail.com>
	<SNT120-DS2062FBFAF37454DACA911AAE390@phx.gbl>
	<4B9003C0.9050106@perathoner.de>
Message-ID: <Pine.GSO.4.58.1003041520570.9618@vtn1.victoria.tc.ca>


On Thu, 4 Mar 2010, Marcello Perathoner wrote:

> Jim Adcock wrote:
>
> > PS: This book WAS classified as "porn" when it first came out -- which may
> > explain WHY the volunteers are interested in tackling it.
>
> Personally I'd like to see more porn on PG, we still lack most of De Sade.
>

I recall seeing something just recently, as I was looking up author
names... it's in German too... Here we go:
Josefine Mutzenbacher
http://www.gutenberg.org/etext/31284

Looks like it came through DP.

--Andrew

From Bowerbird at aol.com  Thu Mar  4 15:37:44 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Mar 2010 18:37:44 EST
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
	listen, pray)
Message-ID: <7a9ea.45b89c9.38c19e48@aol.com>

dkretz said:
>   What you experienced is in fact the primary method 
>    employed by the DP process managers for
>    trying to ameliorate the consequences of their system. 
>    When a perceived deficiency is detected, it is 
>    defined as a "special case" and given "special treatment".

it's actually become quite humorous to see all the efforts
to "route around the damage" caused by the bad workflow.

they're repeating rounds, skipping rounds, limiting people,
cajoling people, it's a cavalcade of exceptions to the rule...

and hey, it makes perfect sense to "make someone happy"
(i.e., shut them up) if you can do it by doing them a favor.

meanwhile, if you're not a "special case" and you don't get
"special treatment", you'd better enjoy the end of the line.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100304/63c71bfd/attachment.html>

From donovan at abs.net  Thu Mar  4 18:07:42 2010
From: donovan at abs.net (D Garcia)
Date: Thu, 4 Mar 2010 21:07:42 -0500
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do not
	listen, pray)
In-Reply-To: <4B903F00.3050704@novomail.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<201003041541.18286.donovan@abs.net>
	<4B903F00.3050704@novomail.net>
Message-ID: <201003042107.43123.donovan@abs.net>

Lee Passey wrote:

>... Rather the problem [Jim] identified is that somehow the current
> production processes allow  /any and all/ projects to become backed up in
> that queue. Given that problem statement, I would have liked to have seen
> something more like one of the following responses:

[Lee's four hypothetical responses omitted]

>Instead we saw a response more along the lines of:

[Paraphrase of my actual response omitted]
I addressed Jim's issue here solely because gutvol-d is where he raised it.

>It seems to me that this kind of response is designed, in fact, to
>ignore the issue at hand, which is that changes need to be made at D.P.
>to increase the throughput of e-texts.

I don't believe that there is anyone at DP at any level of participation who 
is unaware of the need for improvements in the process. However, the variously 
proposed "solutions" run the gamut from the obviously naive/simplistic, 
through horribly manual kludges, all the way up to byzantine complexities 
requiring considerable effort from the entire volunteer base.

Your statement even reflects a common barrier to getting a grip on the issues: 
"to increase the throughput of e-texts" is actually a statement of goal. While 
increasing the throughput of the process should be and is a component of DP's 
long-term goals, the specific problems and their underlying causes need to be 
identified first in order to effectively address them. 

>Now it very well may be that this problem has already been recognized by
>The Powers That Be, and that a solution will be in place Real Soon Now.
>In that case, wouldn't it have been better to just say so?

I'm certain some of these problems have been identified and the underlying 
causes and potential solutions are being examined, but I'm equally certain 
that no consensus can be achieved within the DP community as to what the 
causes are, much less what solutions are feasible, achievable, or even 
desirable. Whatever solutions do eventually result, interim or otherwise, some 
fraction of volunteers at DP will disagree.

By no means am I ignoring the broader issues at DP--but I feel those are 
usually better discussed in a more appropriate venue. Overall, experience has 
shown that any discussion of DP on gutvol-d generally and unfortunately serves 
little productive purpose. While positive and insightful comments do occur, 
(and are read and appreciated!), they are easily lost in the background of 
posts which far too often contain derision, belittlement, accusation, and 
misrepresentation.

One almost wonders whatever happened to basic respect. But then I remember the 
synergistic relationship between media and popular culture (of which old books 
are an excellent reminder). :)

David (donovan)

From jimad at msn.com  Thu Mar  4 18:28:08 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 4 Mar 2010 18:28:08 -0800
Subject: [gutvol-d] Re: The conundrum of FOSS projects (was Re: do
	not	listen, pray)
In-Reply-To: <201003042107.43123.donovan@abs.net>
References: <30a2.40c90764.38bf8822@aol.com>	<201003041541.18286.donovan@abs.net>	<4B903F00.3050704@novomail.net>
	<201003042107.43123.donovan@abs.net>
Message-ID: <SNT120-DS25C5E647E92D39FD1AE5C0AE380@phx.gbl>


>I addressed Jim's issue here solely because gutvol-d is where he raised it.

I also raised the issue on two DP forums which were discussing the issue,
where my points have been discussed, with less heat generated perhaps, but
also generating less light, and certainly no less action.  Sorry to find
these issues are still so controversial in NFPs -- these issues were
controversial in industry when Deming first applied them to Japan quality
issues in the 1950s, and again in US industries in the 1980s -- nowadays
these principles are almost universally applied: JIT means no investment
locked up unused, and no place for a LACK of quality to hide. JIT also keeps
people busy rather than idled.


From prosfilaes at gmail.com  Thu Mar  4 18:58:28 2010
From: prosfilaes at gmail.com (David Starner)
Date: Thu, 4 Mar 2010 21:58:28 -0500
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <SNT120-DS14FD681B6C206374EF60FEAE390@phx.gbl>
References: <64a88.4b99698d.38c1515c@aol.com>
	<SNT120-DS14FD681B6C206374EF60FEAE390@phx.gbl>
Message-ID: <6d99d1fd1003041858i252eddeev1a29eb697d834482@mail.gmail.com>

On Thu, Mar 4, 2010 at 1:34 PM, Jim Adcock <jimad at msn.com> wrote:
> Although I don't see
> why pointing out that the system doesn't work as designed should be
> considered "bellyaching" any more than telling a webmaster that their server
> is down is considered "bellyaching"!

Go, hyperbole! Repeatedly complaining about anything that works, but
is too complex to work as designed, is bellyaching. DP is not down; it
does work. It in fact works a heck of a lot better than originally
designed.

-- 
Kie ekzistas vivo, ekzistas espero.

From hart at pglaf.org  Thu Mar  4 19:36:57 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 4 Mar 2010 19:36:57 -0800 (PST)
Subject: [gutvol-d] Was FOSS:  Heinlein's Razor
In-Reply-To: <4B903F00.3050704@novomail.net>
References: <30a2.40c90764.38bf8822@aol.com>
	<201003032121.50581.donovan@abs.net>
	<4B90028E.4020103@novomail.net>
	<201003041541.18286.donovan@abs.net>
	<4B903F00.3050704@novomail.net>
Message-ID: <alpine.DEB.2.00.1003041934420.13853@mail.pglaf.org>


Robert A. Heinlein said it 1941.

Not original with Robert J. Hanlon
[many suspect error from Robert Heinlein to Robert Hanlon]

Napoleon might have said something like it.


From vze3rknp at verizon.net  Fri Mar  5 07:16:56 2010
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Fri, 05 Mar 2010 10:16:56 -0500
Subject: [gutvol-d] Re: Preservation in the big scanning projects
In-Reply-To: <SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
Message-ID: <4B912068.6060003@verizon.net>


On 3/3/2010 1:24 PM, Jim Adcock wrote:
>
> The question is, in my mind, is Google preserving the books, and doing so
> for the public good or not?  I suspect when Google digitizes the book the
> original is then trashed by the college library -- the whole point being
> they do not want to have to pay to maintain physical library books in
> various states of decay.  Google then becomes the sole repository for this
> information -- excepting a smallish number of copies at TIA.
This is absolutely not true. First of all, part of every agreement 
between a library and Google is that the library gets a copy of all the 
scans that Google makes. Depending on the exact contract, there may or 
may not be some restrictions on what the library can do with the scans, 
but they definitely get them.

Further, the libraries do not get rid of the books. In fact, they are 
very protective of their books, which is why a face-up, human controlled 
scanning method is used (thus resulting in the occasional hand or finger 
in the scan). All books are returned to the libraries with as little 
wear as possible. For logistical reasons, both Google and the Internet 
Archive started with books that were in off-site repositories, but those 
repositories are not being removed. The librarians in charge of the 
scanning projects all understand that what Google is providing is a 
search tool, not preservation. The Internet Archive is much closer to 
doing archival quality work, but the libraries are still keeping the 
books. Remember, these librarians were burned by the promise of 
microfilm and microfiche as more compact storage formats for periodicals 
and such.

A bunch of major libraries have put together a consortium called the 
Hathi Trust which has the explicit purpose of making sure that book 
scans are not lost. It provides off-site, secure storage for what the 
participant libraries want to put there. This includes the libraries' 
copies of the Google scans, as well as whatever else they decide to 
include. The last I was aware, the Hathi Trust did not do much, if 
anything, to provide public access to those scans, since that is not its 
purpose. I mention it here only to make folks aware that the libraries 
are making provision for storage even if places like Google, the 
Internet Archive, or, indeed, one of their own members, should disappear.

I now return you to your arguments about DP.

Juliet Sutherland

From vze3rknp at verizon.net  Fri Mar  5 07:52:39 2010
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Fri, 05 Mar 2010 10:52:39 -0500
Subject: [gutvol-d] Re: DP/PG vs. Google
In-Reply-To: <SNT120-DS1FA3CD8A7EE5AE1BBB9D6AE3A0@phx.gbl>
References: <3103a.2b657b5c.38c03a05@aol.com>
	<SNT120-DS1FA3CD8A7EE5AE1BBB9D6AE3A0@phx.gbl>
Message-ID: <4B9128C7.8040504@verizon.net>


On 3/3/2010 6:03 PM, James Adcock wrote:
>
> >do you want to bet against google?
>
> >because i'll take that bet against you.
>
> Sure, I'd be happy to take that bet, if I am allowed to win it or lose 
> it in a finite amount of time -- such as a decade.  What I think is 
> much more likely in a decade is that Google is either gives up or they 
> figure out how to post much more attractive page images.  I actually 
> don't think they have much of any interest in posting higher quality 
> automatic OCR transcriptions.
>
Wrong again. Google is funding development of open source OCR software 
via project called ocropus. I believe a beta version is due out shortly.

Further, Google bought ReCaptcha. That's the company and software that 
make you prove you are human on many websites. They provide two scanned 
words, one known and one not. The human types in both. This works well 
because what is hard for OCR software, eg a computer, is often easy for 
a human. Over millions of comparisons they are able to build up a pretty 
good version of the text. Since they don't address punctuation, and 
because capital and non-capital letters, and some blobs, can be hard to 
recognize out of context, they won't get the text perfect. But they can 
turn something from total gibberish into readable text.

I believe that there will always be a place for humans in preparing 
etext versions of some books. But, just as OCR eventually became good 
enough to start with, eventually technology will improve enough humans 
will add value only on very difficult texts, or by contributing semantic 
information. I don't know when that will happen, but it is certainly coming.

Juliet Sutherland
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100305/c911914d/attachment.html>

From Bowerbird at aol.com  Fri Mar  5 09:31:57 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 5 Mar 2010 12:31:57 EST
Subject: [gutvol-d] roundlessness -- 011
Message-ID: <14b8f.f185aa8.38c29a0d@aol.com>

ack!

rfrank has started "archiving" the books from his roundless site...
it sounds like he'll be deleting the scans when a book posts to p.g.,
even though, so far, he has _not_ posted those scans to p.g. as well.

please won't somebody tell him p.g. will mount those files for him?

(and there are lots of i.s.p. who offer huge amounts of storage
and bandwidth nowadays at a very cheap price, like dreamhost;
there is absolutely no need to delete files for "space" reasons.)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100305/bad4bc9f/attachment.html>

From hart at pglaf.org  Fri Mar  5 18:07:45 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Fri, 5 Mar 2010 18:07:45 -0800 (PST)
Subject: [gutvol-d] Re: Preservation in the big scanning projects
In-Reply-To: <4B912068.6060003@verizon.net>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<SNT120-DS133E025EB3C4C35C855628AE3B0@phx.gbl>
	<a95d02ffa62d3e499d445f1560d54909@xs4all.nl>
	<SNT120-DS5B9E247D0CC33543A36AAAE3B0@phx.gbl>
	<15cfa2a51003020859j552162b2t567987f8d5437462@mail.gmail.com>
	<SNT120-DS8BFCFEC98C41EDF17615AAE3B0@phx.gbl>
	<1e8e65081003021016r4661a445n8f5ddbba28639414@mail.gmail.com>
	<SNT120-DS2413EC12D351358E4F5DD0AE3B0@phx.gbl>
	<m28waaxq0y.fsf@gnu.franken.de>
	<EDA87561-329E-4E85-9AEC-310EA5C8FAB2@uni-trier.de>
	<SNT120-DS2336BAF4CC8229C5E6609EAE3A0@phx.gbl>
	<4B912068.6060003@verizon.net>
Message-ID: <alpine.DEB.2.00.1003051803120.21875@mail.pglaf.org>


If you actually visit the library archives working with Google,
you should be able to find out that what was promised is not an
entirely true case when it comes to reality. . .at least in POV
of the librarians who will speak to you freely.

Of course, I will also be the first to admit that you can get a
number of librarians from the same institution who will say all
is perfectly well.

But it's not perfect. . .not down at the lower level realities,
not where the rubber meets the road.

I do note that the ones who say all is well and dandy are those
with political and academic aspirations, and those who tell you
things are not what they should be are more street level.

We have plenty of both here at the University of Illinois.

;=)


On Fri, 5 Mar 2010, Juliet Sutherland wrote:

>
>
> On 3/3/2010 1:24 PM, Jim Adcock wrote:
> >
> > The question is, in my mind, is Google preserving the books, and doing so
> > for the public good or not?  I suspect when Google digitizes the book the
> > original is then trashed by the college library -- the whole point being
> > they do not want to have to pay to maintain physical library books in
> > various states of decay.  Google then becomes the sole repository for this
> > information -- excepting a smallish number of copies at TIA.
> This is absolutely not true. First of all, part of every agreement between a
> library and Google is that the library gets a copy of all the scans that
> Google makes. Depending on the exact contract, there may or may not be some
> restrictions on what the library can do with the scans, but they definitely
> get them.
>
> Further, the libraries do not get rid of the books. In fact, they are very
> protective of their books, which is why a face-up, human controlled scanning
> method is used (thus resulting in the occasional hand or finger in the scan).
> All books are returned to the libraries with as little wear as possible. For
> logistical reasons, both Google and the Internet Archive started with books
> that were in off-site repositories, but those repositories are not being
> removed. The librarians in charge of the scanning projects all understand that
> what Google is providing is a search tool, not preservation. The Internet
> Archive is much closer to doing archival quality work, but the libraries are
> still keeping the books. Remember, these librarians were burned by the promise
> of microfilm and microfiche as more compact storage formats for periodicals
> and such.
>
> A bunch of major libraries have put together a consortium called the Hathi
> Trust which has the explicit purpose of making sure that book scans are not
> lost. It provides off-site, secure storage for what the participant libraries
> want to put there. This includes the libraries' copies of the Google scans, as
> well as whatever else they decide to include. The last I was aware, the Hathi
> Trust did not do much, if anything, to provide public access to those scans,
> since that is not its purpose. I mention it here only to make folks aware that
> the libraries are making provision for storage even if places like Google, the
> Internet Archive, or, indeed, one of their own members, should disappear.
>
> I now return you to your arguments about DP.
>
> Juliet Sutherland
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From hart at pglaf.org  Fri Mar  5 18:08:40 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Fri, 5 Mar 2010 18:08:40 -0800 (PST)
Subject: [gutvol-d] Re: roundlessness -- 011
In-Reply-To: <14b8f.f185aa8.38c29a0d@aol.com>
References: <14b8f.f185aa8.38c29a0d@aol.com>
Message-ID: <alpine.DEB.2.00.1003051808130.21875@mail.pglaf.org>


Just to make it "official". . .we will save all scans sent.


On Fri, 5 Mar 2010, Bowerbird at aol.com wrote:

> ack!
>
> rfrank has started "archiving" the books from his roundless site...
> it sounds like he'll be deleting the scans when a book posts to p.g.,
> even though, so far, he has _not_ posted those scans to p.g. as well.
>
> please won't somebody tell him p.g. will mount those files for him?
>
> (and there are lots of i.s.p. who offer huge amounts of storage
> and bandwidth nowadays at a very cheap price, like dreamhost;
> there is absolutely no need to delete files for "space" reasons.)
>
> -bowerbird
>
>

From Bowerbird at aol.com  Sat Mar  6 12:07:29 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 6 Mar 2010 15:07:29 EST
Subject: [gutvol-d] Re: roundlessness -- 011
Message-ID: <58190.5b1478bd.38c41001@aol.com>

michael said:
>   Just to make it "official". . .we will save all scans sent.

roger doesn't appear to be interested in submitting scans yet...
(d.p. reluctance on this perverts all who come in contact with it.)

would you mount scans for his books that _i_ sent in?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100306/e39cbdec/attachment-0001.html>

From ajhaines at shaw.ca  Sat Mar  6 13:29:15 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Sat, 6 Mar 2010 13:29:15 -0800
Subject: [gutvol-d] Re: roundlessness -- 011
References: <58190.5b1478bd.38c41001@aol.com>
Message-ID: <2335A1A5E23445C2AD8B4E25E21FB2B7@alp2400>

Seems to me that if Roger has files of any kind on his personal (or personally paid for) server, those files are his, to do with as he wishes.  Their content is irrelevant.  If he chooses to submit them to PG, fine; if not, that's his choice.  

IMO - what bowerbird is proposing is outright theft.  His apparent "holier-than-thou" attitude doesn't make up for that.

Speaking both personally and as a Whitewasher, I wouldn't touch such files with the proverbial ten-foot pole.

Al

  ----- Original Message ----- 
  From: Bowerbird at aol.com 
  To: gutvol-d at lists.pglaf.org ; bowerbird at aol.com 
  Sent: Saturday, March 06, 2010 12:07 PM
  Subject: [gutvol-d] Re: roundlessness -- 011


  michael said:
  >   Just to make it "official". . .we will save all scans sent.

  roger doesn't appear to be interested in submitting scans yet...
  (d.p. reluctance on this perverts all who come in contact with it.)

  would you mount scans for his books that _i_ sent in?

  -bowerbird


------------------------------------------------------------------------------


  _______________________________________________
  gutvol-d mailing list
  gutvol-d at lists.pglaf.org
  http://lists.pglaf.org/mailman/listinfo/gutvol-d
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100306/437dcbfb/attachment.html>

From Bowerbird at aol.com  Sun Mar  7 12:26:26 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 7 Mar 2010 15:26:26 EST
Subject: [gutvol-d] al and his 10-foot pole
Message-ID: <89722.3405ffe2.38c565f2@aol.com>

al said:
>    IMO - what bowerbird is proposing is outright theft.

i guess it's silly season here on the project gutenberg listserve.
people are quite casual lately tossing about the "t" word ("theft!").

won't somebody please come train the firehose on this attack-dog
so i don't have to?   some elementary public-domain f.a.q. will do.

in the meantime, as i have "confessed" before, i've mounted some
of roger's scansets on my own site, so if roger has a problem with
me doing that, he should send me a cease-and-desist.   (kidding,
of course, which should be obvious since i hate lawyers so much;
roger can just send an e-mail, and we can discuss it all friendly.)


>    His apparent "holier-than-thou" attitude doesn't make up for that.

well, at least when i accuse someone of a moral shortcoming,
i provide _evidence_, instead of just making some bogus claim.


>    Speaking both personally and as a Whitewasher, 
>    I wouldn't touch such files with the proverbial ten-foot pole.

gee, al, we certainly wouldn't want you, either personally or as
a capital-w whitewasher, to get involved with _criminal_activity._

one day you're letting your parking meter run out prematurely,
and the next day you're dealing in public-domain scansets, and
the next day the russian mafia has got your sorry ass in a sling.
can't be too careful.   good thing you have a 10-foot pole handy.

***

donovan said:
>    experience has shown that any discussion of DP on gutvol-d 
>    generally and unfortunately serves little productive purpose. 
>    While positive and insightful comments do occur, 
>    (and are read and appreciated!), they are easily lost 
>    in the background of posts which far too often contain 
>    derision, belittlement, accusation, and misrepresentation.

and again, i wish i could just toss out a phalanx of terms like
"derision, belittlement, accusation, and misrepresentation"
with a wave of the hand.   but when i say something like that,
i feel a very strong need to back the charges with _evidence_.

because when the derision (and the belittlement) is _deserved_,
then the "accusations" are not a "misrepresentation", but indeed
a clear-cut indictment.

beware the person who doesn't want to discuss the charges,
but merely have them dismissed as a "misrepresentation"...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100307/ddb0334f/attachment.html>

From Bowerbird at aol.com  Sun Mar  7 14:42:43 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 7 Mar 2010 17:42:43 EST
Subject: [gutvol-d] the best proofers
Message-ID: <8dea8.69848de4.38c585e3@aol.com>

here are some of the findings from my analyses
of the various "experiments" done over at d.p.

the best proofers miss between 5% and 25% of the
errors over the course of proofing an entire book.
the worst proofers miss an even higher percentage,
but it is not all that much higher, probably 10-40%.

there is no evidence for the position that proofers
"get bored" and therefore miss a higher percentage
if the text they proof is clean (i.e., has few errors)...

p3 proofers are no better than p2 or p1 proofers.

some errors withstood over 5 rounds of proofing;
there was nothing obviously "difficult" about them.

the best predictor of whether a page is now "clean"
is how many people proof it without finding an error.

if the last person to proof a page found an error,
then you cannot reliably predict it to be error-free,
no matter how confident the proofer believes that...

if anyone wants to dispute or discuss these findings,
i'd be open, and will ask about your supportive data.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100307/87c5f67b/attachment.html>

From lee at novomail.net  Sun Mar  7 15:45:50 2010
From: lee at novomail.net (Lee Passey)
Date: Sun, 07 Mar 2010 16:45:50 -0700
Subject: [gutvol-d] Re: roundlessness -- 011
In-Reply-To: <2335A1A5E23445C2AD8B4E25E21FB2B7@alp2400>
References: <58190.5b1478bd.38c41001@aol.com>
	<2335A1A5E23445C2AD8B4E25E21FB2B7@alp2400>
Message-ID: <4B943AAE.6070701@novomail.net>

> His apparent "holier-than-thou" attitude doesn't make up for that.

said the pot to the kettle...

From schultzk at uni-trier.de  Mon Mar  8 01:40:28 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 8 Mar 2010 10:40:28 +0100
Subject: [gutvol-d] Re: al and his 10-foot pole (parts OT)
In-Reply-To: <89722.3405ffe2.38c565f2@aol.com>
References: <89722.3405ffe2.38c565f2@aol.com>
Message-ID: <49F9B423-6D68-4E3D-B64C-10CEA67DD6DE@uni-trier.de>

HI All,

	Theft and copyright infringement are interresting
	things in the internet world.

	In one sense anything on the net is up for grabs.
	That is anybody can download it. That is not theft.
	it is part of the web.

	On the other side if you use that copy on the web you need
	permission unless, already given per se.

	An interresting case is where you just use the URLs
	to access the entiity.  You are effectively citing it!!
	You have given reference to the source. You are not using a copy.

	A sad development here in Germany is that it is now 
	considered ownership of child pornography once it is 
	loaded into main memeory! Please do not get me wrong
	I am against child pronograohy.

	But, visting a web-site thereby constitutes onwnership.
	What a bag of bad worms. 
	Then again anything I load from the web I own, to use
	as I please privately!!! 

	Cool ? !!

	regards
		Keith.
	
	 
Am 07.03.2010 um 21:26 schrieb Bowerbird at aol.com:

> al said:
> >   IMO - what bowerbird is proposing is outright theft.
> 
> i guess it's silly season here on the project gutenberg listserve.
> people are quite casual lately tossing about the "t" word ("theft!").
> 
> won't somebody please come train the firehose on this attack-dog
> so i don't have to?  some elementary public-domain f.a.q. will do.
> 
> in the meantime, as i have "confessed" before, i've mounted some
> of roger's scansets on my own site, so if roger has a problem with
> me doing that, he should send me a cease-and-desist.  (kidding,
> of course, which should be obvious since i hate lawyers so much;
> roger can just send an e-mail, and we can discuss it all friendly.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100308/0004a15d/attachment.html>

From Bowerbird at aol.com  Mon Mar  8 10:25:49 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 8 Mar 2010 13:25:49 EST
Subject: [gutvol-d] taking stock on march 8th
Message-ID: <20efd.68f1c528.38c69b2d@aol.com>

march 8th is international women's day.   rock on, estrogen!        :+)

***

congratulations to d.p. on the posting of their 17,000th e-text!
the volunteers who digitize these books are _awesome_people_!

***

speaking of women, and d.p. volunteers, juliet sutherland is
the volunteer who has donated d.p. the most time and energy,
so a special shout-out to her on this day.   in a recent post to
the d.p. forums, juliet admits to being frustrated by many of
the "volunteers versus admins" threads on those d.p. forums.

she also notes that her frustration is multiplied whenever she
considers that the site does have many problems and that she,
as the former top dog, bears a good deal of the responsibility.

i've criticized juliet a lot, because -- frankly -- she deserved it.

but it's not fun to see that anyone is frustrated, ready to quit...

so i'd urge juliet to hang in there...

juliet also said she often finds dkretz comments "off the wall".
here's a quick note, juliet: he's usually right, and you're wrong.
so try to see it from don's perspective as you hang in there...

***

i'm in the middle of lots of different threads here,
so i'll try and work on finishing them up this week.

in no particular order...

***

i want to finish my work on gardner's e-text, showing
how it can now be auto-converted into various formats,
as a way to demonstrate my version of "postprocessing",
which seems to be far more direct than the d.p. version.

***

i have some messages to post in response to carel...

i've started notes about creating a proofing system,
for carel and anyone else contemplating doing that.

***

i have a pair of posts on the good and bad aspects of
roger's roundlessness experiment...

i will be finishing up the handful of books that i have
taken from roger's site, comparing 'em with his output.

***

i'm also going to show you a little perl script i coded
which summons up the various pieces of each page
of a book on roger's roundless site, and stitches 'em
on one webpage, and also lets you   "thumb through"
the pages of the books on his site, to check them...

here's the first draft of that script:
>    http://z-m-l.com/go/showbarebones.pl

it pulls in the scan and text for each page, obviously,
as well as the tweet (if any) and the log file information.

under roger's current system per se, it's not convenient
to do this on page after page.   (although a person _can_
access each of the pieces of information independently.)

but really, there's no reason why it shouldn't be easy to
"walk through" the pages of a book as it's being worked.

the script works for my cinema screen, but i'm not sure
how well it will fit on smaller screens.   but i'm unlikely
to improve it in that regard, since i believe it's unlikely
anyone will spend much time actually using the thing...
(not that it's not useful; everything i program is useful;
it's just that few people here actually use what i create.)
i wrote it just to get myself back into some perl coding.

notice that i _am_ willing to smooth out this code _if_
anyone really wants to use it, but -- other than that --
i'm just gonna add a few refinements to it and it's done.

***

there you go...   plenty of meat for this week...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100308/961e12b7/attachment-0001.html>

From Bowerbird at aol.com  Mon Mar  8 14:01:17 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 8 Mar 2010 17:01:17 EST
Subject: [gutvol-d] roundlessness -- 012
Message-ID: <335f9.5e85a104.38c6cdad@aol.com>

rfrank's roundless experiment is proving to be _very_ interesting...

and, as you might expect, there is good news and there is bad news.

let's talk about the good news here in post #12, the bad in post #13.

***

first of all, rfrank is showing that it's not all that difficult to set up
a proofing site.   in just 2 months, he's put together a critical mass,
and that's quite an achievement.   if he shares his code with others,
they'll be able to move even faster.   (if he doesn't, i've got a little
code that'll do the trick for people who want a bit of a head-start.)

it's another matter to pull workers to the site, of course.   however,
if project gutenberg chose to steer people to these _other_ sites,
instead of funneling all the volunteers to distributed proofreaders
(who -- truth be told -- don't even _want_ new people nowadays),
it wouldn't be hard at all for these sites to get enough volunteers.

but even with his low numbers of volunteers, what rfrank is doing is
_head_and_shoulders_ more interesting than anything d.p. is doing.
his site is dynamic, while d.p. has been too moribund for too long...

***

in the last week, rfrank installed a spellcheck capability to his site.
after a mere 2 months.   d.p. went about 5 or 6 _years_ without it.

***

moreover, when d.p. finally got a programmer to code spellcheck,
the process was plagued by a forum discussion that ran 30 pages.

at 15 messages per page, that's 450 messages.   and most of 'em
were from people who didn't know what they were talking about,
and thus just added a buncha noise and confusion to the process.

which is why it's probably not surprising that it was coded wrong.
well, "wrong" is perhaps a bit strong.   but the decision was made
to do spellcheck using "aspell", because "it's open-source code".

which would be fine, if you needed a full-fledged spellcheck...

but that's not what a proofing site needs, because the object is
_not_ to have another word "suggested" (which is the hard part
about coding spellcheck), but merely to _flag_suspicious_words_
(a ridiculously easy task consisting of searching a dictionary to
ascertain whether the word you're checking is included therein)
so that all the suspicious words can be compared to the scan...

i'm guessing rfrank did his spellcheck the simple way.

***

rfrank also installed a capacity for a "good" and "bad" wordlist,
necessary since that customizes the dictionary for each book,
and -- like d.p. -- lets the proofers suggest words to include.

unlike d.p., however, under rfrank's system, whenever a person
"suggests" a word, it's _automatically_ included _immediately_.

at d.p., a suggestion must be considered by a superior, who
might or might not agree, and might or might not be timely.
this is a signal of the disgust with which d.p. treats proofers.

it also means that rfrank's system throws far fewer false flags,
which means it provides much greater value to the proofers...

i worked very hard, in the confines of that 30-page thread,
to have d.p. give the proofers an automatic capability to add
words to the good and bad lists, but they just wouldn't do it.
rfrank did.   good for rfrank. he's smarter than the d.p. crowd.

***

rfrank has also included reg-ex checks, and scanno checks, so
his list of helpful tools is already very impressive, 2 months in.

***

rfrank has also shown he's willing to do global changes to text,
which is one of those things that d.p. has been unwilling to do,
in spite of the fact that i've pointed out the utility of it for years.

d.p. would rather have individual proofers correct every instance
of a global error -- one by one by one, painstakingly -- instead
of fixing 'em all immediately, with one global change.   shameful.

***

rfrank also showed considerable independence when he decided
he would have his people do proofing and formatting together...

it's unclear to me whether the d.p. split between those two tasks
is effective or not, but the _religion_ at d.p. is that it has been...

so it is quite courageous of rfrank to test that accepted "wisdom".

***

rfrank also seems committed to using diffs to train up volunteers.
this, of course, is one of the benefits a roundless system offers,
so it's natural that he'd take advantage, but it's still a good thing.

***

rfrank has given workers a way to make comments _about_ a page
without actually putting them _inside_ the text, which is fantastic.
(he calls this feature "page tweets".)

at d.p., they have a "project thread" in the forums (as does rfrank),
but the only way to make comments about a page is to put them
_inside_ the text.   but of course then someone later down the line
has to _remove_ them from there.   that's a sign of a bad workflow,
when someone later on must undo something that was done earlier.

***

in looking at some of the projects, it seems that rfrank has finally
started doing more aggressive preprocessing of the o.c.r. itself...
for instance, the number of spacey quotes has dropped remarkably.
there are still some, but nowhere near the number he had before...

since this is an area that i know to be _so_ important, any progress
toward enlightenment at all is the sign of a very good development.

***

so, all in all, there's lots of positive aspects to rfrank's experiment.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100308/24865cf6/attachment.html>

From Bowerbird at aol.com  Mon Mar  8 16:46:27 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 8 Mar 2010 19:46:27 EST
Subject: [gutvol-d] roundlessness -- 013
Message-ID: <3ed15.48c76ac1.38c6f463@aol.com>

ok, i talked about the positive parts of rfrank's roundless experiment.
now it's time to review the "bad news" -- the not-so-positive parts...

many of the _negative_ parts are quibbles about the implementation
of the positive aspects, so i will discuss them at the end of this post...

there are, however, a few points that are almost fully negative, still.

***

the very first thing i do with a set of files that i start working with
is to make sure that they're named _correctly_and_intelligently_...

that is, a filename must explain, all by itself, the file's _contents_,
and the filename must contain the pagenumber from the p-book.

moreover, every file must have a _unique_ filename, and every
file associated with the same page should share a similar name
(e.g., the name will be the same, but with a different extension.)

i've done extensive work with datasets that follow these rules and
with datasets that do not, and i can say without any hesitation that
the datasets which do not follow these rules are much more clumsy,
and waste bits of my time that are small but cumulate to significance.

and that's why i now no longer will even work with badly-named files.
it's just unnecessary frustration.

people who use badly-named files will tell you they have "adapted"
to the naming.   that's pure and simple crap.   they don't know better
because they haven't worked long and hard with both kinds of data.
they are handicapped, and they just don't know they are handicapped.

rfrank names his files incorrectly.   maybe someday he will learn...

***

rfrank does a lot of things right.   his scans are extremely well-done,
which indicates that he is very careful and meticulous when scanning.
it's quite likely he also does some refinements on the scans, such as
straightening them, centering them, and perhaps despeckling them.
they look quite nice, and they are generally a pleasure to work with...

however, all this care seems to be dropped once he's done the o.c.r.

his preprocessing routines used to be abysmal.   they're better now,
but they still have considerable room for improvement.   i'm hopeful
that he's learned the lesson.   he has pushed many of his checks back,
from postprocessing to the proofing stage.   so now he just needs to
push them back from the proofing stage to the preprocessing stage.

it would perhaps be very helpful in this regard if _somebody_ who is
working at the fadedpage.com site would _volunteer_ to do the step
of nondistributed preprocessing, thus freeing rfrank from doing it...
he's probably feeling very overwhelmed at the moment, so an offer
like that would probably be something that he would accept readily,
and it would make a remarkable difference in evolving his progress.

***

we've already discussed recently that rfrank should submit his scans
along with his postings to p.g.   alas, he's picked up the bad habit of
failing to do that from his distributed proofreader upbringing.   pity.

***

it would also be good if rfrank kept the linebreaks and pagebreaks of
the original p-book when he submitted the book to project gutenberg.

but hey, that's unlikely, isn't it?

what _is_ more likely, however, is that he would keep the linebreaks
_consistent_ between the various versions that he submits to p.g.
but on one file i checked, the 7-bit version was wrapped differently
than the 8-bit version, which was wrapped differently than the .html.
this is madness, if/when it comes to doing long-term version-control.

***

rfrank also picked up the bad habit from d.p. of "clothing" em-dashes.

of all the stupid things d.p. does, this is among the most stupid of all.
and yet rfrank, who showed the ability to rethink proofing/formatting
and roundlessness per se, failed to grasp the basic stupidity of this...

***

ditto with unhyphenating the end-of-line hyphenates.

i take it all back about "clothing hyphens" being the most stupid thing...

dehyphenating has to be the _most_ stupid, because when the proofers
do this, they actually destroy the evidence that a computerized routine
would use to do the job _properly_, which is _on_a_book-wide_basis_...

again, the failure of rfrank to rethink such an obvious stupidity is sad...

(kudos, however, to one of his members, for spelling it out in a post.
let's just hope that that reasoning will soak in to rfrank's busy brain.)

***

again, repeating a d.p. flaw, rfrank strips runheads and pagenumbers
from his o.c.r.   and perhaps fate is trying to teach him a lesson on this,
because he has had several problems where text on a page was deleted,
or replaced with text from some other page.   these types of problems
can be detected and prevented when each page contains its pagenumber.

in general, you want to _retain_ this information because it "earmarks"
each page of text, making it clear what book it comes from, and where.
it also serves as the "suspenders" in a "belt and suspenders approach"
along with the filename, which will contain the very same information,
and thus the two make it very easy to crosscheck and confirm each other.

the silliness of naming all your scansets "001.png" through "999.png"
and expecting their subdirectory name to distinguish them is _stark_...

(and it has caused all kinds of grief for people in the past, i assure 
you.)

***

rfrank hasn't really installed any instructions of his own, just letting 
his
members rely on their d.p. training, so he has no policy of his own on
ellipses, at least that i've been able to detect.   but it would be 
refreshing
if he decided to avoid the merry-go-round of never-ending changes that
sometimes happens at d.p., and went _exclusively_ with the 3-dot ellipse.

(it's funny, because many of his books don't even seem to _have_ ellipses!)

***

rfrank is putting a lot of stock in "c.i.p."   except, to confuse 
_everyone_,
to him, "c.i.p." means "confidence in proofer", not "confidence in page",
which is how everyone else defined the term, up to this point in time...

now, me?, i don't think you can put much stock in "confidence in proofer".

even the best proofers miss errors, and they don't know when they miss,
so i don't think that you can trust their judgment and get perfect pages.

rfrank's big mistake here is that he's not necessarily looking for 
"perfect",
since he sees himself, as the postprocessor, as the last line of 
correction,
and he's willing to take a non-perfect page if he can get it a little 
faster...

even if that's fine for him, i don't think it's a good way to build a 
system.

but even then, i just don't think "confidence in proofer" will actually 
work.

or, to be more accurate, i think it'll work just well enough that rfrank 
will
put lots of energy into it before he finds out it doesn't work well enough.

or, worst case scenario, he'll convince himself that it really _is_ 
working,
and other people believe him, and we all end up with non-perfect pages.

on the other hand, rfrank has shown in some cases that he _can_ learn
from the data, and change his mind on something he held dearly, so...

***

ok, now we're down to the implementation quibbles...

***

first, i'll repeat that it's sad that rfrank is "archiving" his finished 
projects.
it would help all of us learn more about roundlessness if he left them up.
i offer webspace if rfrank needs it.   and project gutenberg has offered 
too.

***

it's still "in-process", so i expect that it might improve, but the 
spellcheck
display that rfrank is offering would benefit by retaining the linebreaks, 
so
the search for "unresolved words" on the pagescan was much easier to do.

i'd also like to see each unresolved word in _clickable-button_ form, for
both the good-word and bad-word lists, so a button-push would do that.
(in the current form, a person has to copy-and-paste each of the words.)

i must add, though, that the ability to include these words immediately
is a _tremendous_improvement_ over the d.p. method, one that shows
its value to the proofer right away, and is thus very robust and valuable.
empowering the proofers to benefit themselves is a remarkable asset...

***

in this regard, an ability for a _proofer_ to execute a global change
would be a mind-blowing step, and thus a very brave thing to try...

of course, bear in mind that i believe that all global changes should
have every occurrence verified, so take the suggestion appropriately.

and i believe that any global changes that might be required _should_
be sussed out during the preprocessing, before proofers even see text.

but nonetheless, putting such a powerful tool in the hands of proofers
would speak _volumes_ on the responsibilities you entrust them with,
and thus make a tremendous statement that would _embolden_ them...

even if they never ever used it...

***

as it is now, though, rfrank does the global changes himself, and he has
been a little reluctant to do the job in the way that he really "should"...

at least he was in one case -- where he declined to fix a contraction --
but perhaps that was not representative of his feelings more generally,
so i'll let it go for now...

***

as i said before, i don't know whether the d.p. separation between
proofing and formatting is a good thing or not.   i see the arguments
in favor of it, and they seem compelling to some degree, but i also
know that the vast majority of pages have little or no formatting, so
i'm reluctant to lay another step on the overall process for no benefit.

so, in cases like this where the answer is unclear, i'd do an experiment.

luckily, rfrank is doing an experiment.   it's not a well-controlled one,
and we're not really privy to all of the data, so it's far from being 
ideal,
but at least we're engaged in the active questioning of an unknown...

still, it would be nicer if we were doing the experiment _properly_...

***

rfrank does a pretty good job of showing proofers their diffs, _except_
that you must visit each project-page to see your diffs for that book...

it would be far better if you were presented with all of 'em on one page.

(and i would emphasize that page by presenting it to the user _first_,
when they return for more proofing, so they'll realize its importance.)

there's also the slightly troubling aspect that if you mark a page "done",
the odds are lowered that it will be proofed again, so you don't obtain
the satisfaction of getting a "no-diff" result on that page.   i do believe
that's counterproductive, and i'd like to see every page reproofed once,
even after the page was marked "done", even by a high-c.i.p. proofer...

and, of course, having the page reproofed, and having the "done" status
confirmed by a "no-diff" by the subsequent proofer, would also raise the
"confidence-in-page" for that page, and thereby serve a double benefit...

(conversely, if the next proofer finds an error, they rescued a false 
done.)

***

the "page tweet" idea is a good one.   (the astute observer might realize
that this is the same idea i always use on the bottom of my web-pages,
where a person can leave a comment about that specific p-book page.)

however, a way to _consolidate_ the tweets for a book would be useful.
(and easy to code.)   that way, a proofer could look at all of the "tweets"
and perhaps answer some of the questions being posed, or fix some of
the problems being reported, or take some other kind of positive action
(such as finding a person who _can_ fix the problem if you cannot do it).

also, it would be good if there were some dedicated buttons on the page,
so it would be easy to say things like "difficult formatting, please check"
or "foreign language specialist needed on this page", or stuff like that...
again, that way a person perusing all of the tweets for a book will know
exactly what needs to be done among this list of possible specific tasks,

***

and -- just to finish up this post by taking it back to the beginning --
i note with amusement that rfrank uses the term "page" throughout
his system.   he lists the "pages" that you've done, and calls the notes
you attach "page tweets", and the diffs are listed by "page", and so on.

so it is ironic that when he talks about "page 123", he's not _really_
talking about _page_123_ at all!   he's really talking about _.png_ 123!

and the file named "123.png" probably isn't about page 123 at all!

indeed, let's review the 7 files named "123.png" rfrank has up now:

>    http://fadedpage.com/p/201002140505/d/123.png
>    http://fadedpage.com/p/201002140533/d/123.png
>    http://fadedpage.com/p/201002270757/d/123.png
>    http://fadedpage.com/p/201002280257/d/123.png
>    http://fadedpage.com/p/201003020840/d/123.png
>    http://fadedpage.com/p/201003040537/d/123.png
>    http://fadedpage.com/p/201003070309/d/123.png

what we actually find are pages 120, 112, 118, 106, 82, 124!, and 118,
respectively.   that's quite a range of pages, but alas, none are page 123.

so any reference to a "page" number on the faded.com website is gonna
frustrate anyone who wants to know what _page_ was being talked about,
once rfrank has gone and deleted all of those files.   which is a real 
pity...

but alas, here i am talking about filenaming conventions again.   help me!
time to draw this to a close...

***

while i'm letting myself discuss the negatives without feeling any guilt,
i might add that it'd be nice if rfrank shared data from his experiments.

of particular interest are all the intermediate files, such as the various
pages as saved by individuals, and the concatenated text file at various
"checkpoints" along the way, notably before and after postprocessing...
without such data, we really have no way of evaluating the experiment!

rfrank comes from a world of engineers working in private companies,
where data is closely guarded, and he doesn't seem to have the attitude
that is prevalent in the scientific world that data belongs to the public,
and that sunshine is the best disinfectant, and open data is a positive...

in this regard, i read and liked this article:
>    
http://flowingdata.com/2010/03/04/think-like-a-statistician-without-the-math/

i sure could learn a lot more if rfrank were open with sharing his data,
and my guess is that lots of other people could learn lots more as well.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100308/45ad68d1/attachment-0001.html>

From Bowerbird at aol.com  Tue Mar  9 13:01:50 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 9 Mar 2010 16:01:50 EST
Subject: [gutvol-d] the comparison method does indeed work, and work well
Message-ID: <c989.10a9942c.38c8113e@aol.com>

a little while back, i pointed out that a person can
compare two independent digitizations to find errors
in both of them, and that this method works very well.

carel said:
>    That depends on a lot of factors   including the assumption
>    that two OCR programs would not make the same mistake

that's a good point.

if the two digitizations have errors in common, then
the comparison method won't be able to find them,
and thus its effectiveness will be lessened somewhat.

there's no argument with that.

what's surprising to me, however, is how many people
are completely defeated by this _possible_ shortcoming.
upon learning that there _might_ be a problem with the
comparison method, they dismiss it with no other thought.

not me.   i set out to actually _test_ the assumption.

i documented the results in a thread in the d.p. forums.
you can search for "revolutionary o.c.r. proofing".   it's at:
>    http://www.pgdp.net/phpBB2/viewtopic.php?t=24008

as i note there, i presented the data earlier elsewhere, at:
>    http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&
post=2005-10-03,3

so yeah, that's right, my findings are over 4 years old now.

***

what my research found is that there were virtually _no_
errors-found-in-common between the two digitizations.

and this finding was replicated, and replicated once again.

in other words, the effectiveness of the comparison method
is _not_ lessened by this possible shortcoming.   no indeed,
the evidence says it is not even affected in the slightest way.

the clarity of the results was striking; they are unforgettable.

if you doubt the data, i encourage you to repeat the research.

because repeating the possible problem, without any data,
won't get anyone very far in the future, not if i'm listening...

***

here's a quick-and-dirty experiment, for anyone willing...

i just used the comparison method on gardner's e-text,
and found 159 differences between his work and mine...
i then resolved the differences by consulting the scans...

79 differences were due to errors in his work.

77 were due to errors in mine.

3 were due to errors in _both_ his and mine.

now, of course, any errors-in-common will still reside in
both his and mine.   why don't you see if you can find any?

>    http://z-m-l.com/go/gardn/gardn.zml
>    http://z-m-l.com/go/gardn/gardnp123.html

i'll be waiting.   but i won't be holding my breath...

***

carel said:
>    I feel that a human looking at 
>    a smaller subset of a large document 
>    is a good thing in the error finding process. 
>    You apparently do not think it is. 

if the comparison method has already found all the errors,
why waste the time and energy of a human rechecking that?


>    Neither of us is right or wrong: 
>    It is a matter of perspective and opinion.

unless i get a good answer to the question that i just asked,
my opinion will continue to be that i am _absolutely_right_,
and you're wrong because you're wasting human resources.

that's my perspective, and i'm not changing it...           :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100309/9a176b66/attachment.html>

From Bowerbird at aol.com  Tue Mar  9 15:46:35 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 9 Mar 2010 18:46:35 EST
Subject: [gutvol-d] Re: roundlessness -- 013
Message-ID: <379bb.624c0680.38c837db@aol.com>

i said:
>    even then, i just don't think "confidence in proofer" will actually 
work.
>
>    or, to be more accurate, i think it'll work just well enough that 
rfrank will
>    put lots of energy into it before he finds out it doesn't work well 
enough.
>
>    or, worst case scenario, he'll convince himself that it really _is_ 
working,
>    and other people believe him, and we all end up with non-perfect 
pages.

good thing i posted that yesterday.

because today rfrank posted his first "informal analysis".

and it looks like i was right...

rfrank did his analysis on 32 pages that were marked as "done" but
then subsequently proofed again, as is done for a random sample...

he admits this is a small number of pages, and that there are also
"many factors at play", but then goes on to draw conclusions anyway.

of the 32 pages, two had added proofer notes, and 1 error was fixed.

he doesn't tell us if either (or both) of the proofer notes were good,
in the sense that they pointed out something of value, so we'll have to
assume that they were meaningless and just added noise to the text.

but even then, we have 1 error missed in 32 pages.

on the face of it, that means that 3% of the "done" pages had an error.
so, for a 200-page book, that would cumulate to a total of 6 mistakes.

again, by my 1-error-every-10-pages criterion, that's fully acceptable.

but by the (unrealistic) standards of _most_ of the volunteers, it's not.

rfrank concludes that "this seems to say that making sure every page
is seen by two proofers is not warranted"...   so that's his take on this.

***

partly the decision rests on the abundance of proofers.

if you have lots and lots of proofers, like d.p., then you can afford
to send a page through them 2 times or 3 times, even 4 or 5 times.

but if your proofers are scarce, like they are over at fadedpage.com,
then you might be reluctant to have them view a page even twice...

i think i'm pretty good about making sure proofers are used _wisely_.
i don't think i abuse their contribution, or that i take 'em for granted;
neither am i afraid to use their resources if it is responsible to do so.

and i think having 2 people verify a page as clean is responsible use.

***

the other thing, though, in evaluating all these experiments, is that
you need to know how many errors there _really_ were on each page.
only _then_ can you accurately access the accuracy of the proofers...

remember that there are lots of pages in these books that have _no_
errors on them, none at all.   is it any surprise, then, that they were
_actually_ "done" when they were _marked_ as "done"?   not hardly...

likewise, it isn't really a surprise when a page with _one_ error on it
has that error fixed, is then marked as "done", and is _really_ done.

what you have to pay attention to, in such cases, are the pages where
an error is _not_ found by the first person, who marks it "done", but
is then found by the second person.

rfrank isn't making nearly enough information available that we
can analyze the results in a reasonable way.   so i guess we just
have to "trust" him.   i just wish i had more faith in his reasoning.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100309/727e01ed/attachment.html>

From Bowerbird at aol.com  Wed Mar 10 15:52:02 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 10 Mar 2010 18:52:02 EST
Subject: [gutvol-d] any arguments against "free-range" proofing?
Message-ID: <36c3a.27b91b5.38c98aa2@aol.com>

the d.p. proofing system locks each page to a single proofer.
(there's one and only one p1 proofer, p2 proofer, and so on.)

so does rfrank's roundless system; once a page has been
assigned to a proofer, it's semi-difficult to even look at it.

and if someone else has reproofed it _after_ that person,
then the old version is stored somewhere i can't figure out,
so tracking the diffs simply cannot be done by an outsider.

(the d.p. system at least allows you to do that tracking, and
even has a routine that will show you round-to-round diffs.)

it is by analyzing these round-to-round diffs very closely
that you can get a sense for how a page progresses from
the initial o.c.r. to its final -- hopefully perfect -- stage...

***

the question i have today is whether there is a good reason
why a page needs to be assigned-and-locked to one person.

is there any reason why you shouldn't allow any proofer to
go and proof any page in a book?   yes, it would mean that
some pages might be proofed several times, but so what?
that's not necessarily a _bad_ thing, is it?

i'm writing code now to build my own proofing system, and
i'm curious about this particular aspect.

i think it would be important to inform a proofer how many
previous people have proofed each specific page, so as to
let that proofer choose whether to do an additional proof,
but if they _want_ to do it, is there any reason to disallow it?

***

partly this ties into _incentives_...

most people like _finding_and_fixing_ errors, so there'll be
a good incentive for people to work in the "first" proofing...

but even in that first proofing, there are a lot of pages that
are _already_ perfect, so there are no errors to find or fix...

and in the second and third proofings, the number of errors
that are left will be small, even collected over a whole book.

so i feel it's very important to reward people for _certifying_
a page -- i.e., confirming that the page is indeed error-free.

if i was to put this in terms of a "point" system, it'd be this:

>    5 points for fixing all of the remaining errors on a page.
>    4 points for doing the first "certification" of a clean page.
>    3 point for doing the second "certification" of a page.
>    2 point for doing the third "certification" of a page.
>    1 point for fixing _some_ (but not all) errors on a page.

if you certify a page clean, and someone later finds an error,
the points turn _negative_.   so make sure of your certification!

if you gather enough points, you win _a_million_dollars_!        ;+)

***

there are a few things you need to stipulate for such a system:

1.   there is one -- and only one -- "correct" way to do a page.
2.   which means there are no ambiguous guidelines in place.
3.   and whitespace is significant.
4.   which means there are _no_ "insignificant" diffs.
5.   all diffs are reviewed, and can be challenged for correctness.
6.   so when a page comes out of proofing, that page is _done_.
7.   which means "postprocessing" is a largely automatic thing.

***

you can discuss any aspect of this post, but what i'm seeking are
any arguments people can think of _against_ free-range proofing.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100310/dca40180/attachment.html>

From schultzk at uni-trier.de  Thu Mar 11 00:01:55 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 11 Mar 2010 09:01:55 +0100
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <36c3a.27b91b5.38c98aa2@aol.com>
References: <36c3a.27b91b5.38c98aa2@aol.com>
Message-ID: <A9FE60E0-11D8-45EB-86E3-8BC3B029EE45@uni-trier.de>

Hi BB,

	I do not see anything truely speaking against such a system.
	The only problems are the administrative tasks involved.

	1) you have to track all this.
	2) keep everything store somewhere
	3) keep everything in sync

	The other question that comes to mind is you will need
	an authority/ies that finally certify that a page satisfies
	your criteria as being done. 

	Some may call it a administrative nightmare, but it should be
	workable.

	regards
		Keith.

Am 11.03.2010 um 00:52 schrieb Bowerbird at aol.com:

[snip, snip]

> it is by analyzing these round-to-round diffs very closely
> that you can get a sense for how a page progresses from
> the initial o.c.r. to its final -- hopefully perfect -- stage...
> 
> ***
> 
> the question i have today is whether there is a good reason
> why a page needs to be assigned-and-locked to one person.
> 
> is there any reason why you shouldn't allow any proofer to
> go and proof any page in a book?  yes, it would mean that
> some pages might be proofed several times, but so what?
> that's not necessarily a _bad_ thing, is it?
> 
> i'm writing code now to build my own proofing system, and
> i'm curious about this particular aspect.
> 
> i think it would be important to inform a proofer how many
> previous people have proofed each specific page, so as to
> let that proofer choose whether to do an additional proof,
> but if they _want_ to do it, is there any reason to disallow it?
> 
> ***
> 
> partly this ties into _incentives_...
> 
> most people like _finding_and_fixing_ errors, so there'll be
> a good incentive for people to work in the "first" proofing...
> 
> but even in that first proofing, there are a lot of pages that
> are _already_ perfect, so there are no errors to find or fix...
> 
> and in the second and third proofings, the number of errors
> that are left will be small, even collected over a whole book.
> 
> so i feel it's very important to reward people for _certifying_
> a page -- i.e., confirming that the page is indeed error-free.
> 
> if i was to put this in terms of a "point" system, it'd be this:
> 
> >   5 points for fixing all of the remaining errors on a page.
> >   4 points for doing the first "certification" of a clean page.
> >   3 point for doing the second "certification" of a page.
> >   2 point for doing the third "certification" of a page.
> >   1 point for fixing _some_ (but not all) errors on a page.
> 
> if you certify a page clean, and someone later finds an error,
> the points turn _negative_.  so make sure of your certification!
> 
> if you gather enough points, you win _a_million_dollars_!       ;+)
> 
> ***
> 
> there are a few things you need to stipulate for such a system:
> 
> 1.  there is one -- and only one -- "correct" way to do a page.
> 2.  which means there are no ambiguous guidelines in place.
> 3.  and whitespace is significant.
> 4.  which means there are _no_ "insignificant" diffs.
> 5.  all diffs are reviewed, and can be challenged for correctness.
> 6.  so when a page comes out of proofing, that page is _done_.
> 7.  which means "postprocessing" is a largely automatic thing.
> 
> ***
> 
> you can discuss any aspect of this post, but what i'm seeking are
> any arguments people can think of _against_ free-range proofing.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/0ed52125/attachment-0001.html>

From Bowerbird at aol.com  Thu Mar 11 10:51:18 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 11 Mar 2010 13:51:18 EST
Subject: [gutvol-d] a question for dkretz about twisted
Message-ID: <6d556.783644ab.38ca95a6@aol.com>

ok, so i'm coding a proofing system.

and wow, i'm impressed with myself
and how far i've gotten in just 2 days.
i've got a solid engine going already...

in programming, the saying goes that
the first 90% of a project takes 90% of
the time, and the remaining 10% takes
the other 90% of the time.   and it's true.

but still, to have a solid engine after
just 2 days means i think i can have
a pretty smooth system in 2 weeks...

but before i go reinvent the wheel...

a question for dkretz on "twisted"...

it was coded in "air", so in _theory_
anyway, it will run on a web-server.

so, don, can you make it do that?

is there any place where it _is_
running on a web-server now?

if someone (like me) wanted to
run it on their server, would you
make the app available to them?

i'd love to see it run in a browser.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/c150396c/attachment.html>

From dakretz at gmail.com  Thu Mar 11 13:39:37 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 11 Mar 2010 13:39:37 -0800
Subject: [gutvol-d] Re: a question for dkretz about twisted
In-Reply-To: <6d556.783644ab.38ca95a6@aol.com>
References: <6d556.783644ab.38ca95a6@aol.com>
Message-ID: <627d59b81003111339y2604c47fmd4261ea72894bb14@mail.gmail.com>

Close, but I don't think close enough.

After working with Adobe/Actionscript/Flex/AIR for a while,
I appreciate what Steve Jobs said recently when he was
asked why Apple doesn't want to work closely with them.
He said they were lazy. I think he meant that they have
had an unchallenged franchise (with PDF and Flash)
for so long that everything is just "good enough" and
will be really ready in the next version.

What I'd recommend you consider is building on the
WordPress blog engine. Almost every variation of user
input technique gets implemented early and often
because text input is such a core requirement. You
get built-in user validation, text-versioning, etc and
the free support community is huge.

On Thu, Mar 11, 2010 at 10:51 AM, <Bowerbird at aol.com> wrote:

> ok, so i'm coding a proofing system.
>
> and wow, i'm impressed with myself
> and how far i've gotten in just 2 days.
> i've got a solid engine going already...
>
> in programming, the saying goes that
> the first 90% of a project takes 90% of
> the time, and the remaining 10% takes
> the other 90% of the time.  and it's true.
>
> but still, to have a solid engine after
> just 2 days means i think i can have
> a pretty smooth system in 2 weeks...
>
> but before i go reinvent the wheel...
>
> a question for dkretz on "twisted"...
>
> it was coded in "air", so in _theory_
> anyway, it will run on a web-server.
>
> so, don, can you make it do that?
>
> is there any place where it _is_
> running on a web-server now?
>
> if someone (like me) wanted to
> run it on their server, would you
> make the app available to them?
>
> i'd love to see it run in a browser.
>
> -bowerbird
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/b0784827/attachment.html>

From dakretz at gmail.com  Thu Mar 11 13:43:21 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 11 Mar 2010 13:43:21 -0800
Subject: [gutvol-d] Re: a question for dkretz about twisted
In-Reply-To: <627d59b81003111339y2604c47fmd4261ea72894bb14@mail.gmail.com>
References: <6d556.783644ab.38ca95a6@aol.com>
	<627d59b81003111339y2604c47fmd4261ea72894bb14@mail.gmail.com>
Message-ID: <627d59b81003111343t30e2d63eq4cf510fced410c2b@mail.gmail.com>

Another reasonable alternative might te Google Docs/Google Apps.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/a7ecaa22/attachment.html>

From Morasch at aol.com  Thu Mar 11 15:04:43 2010
From: Morasch at aol.com (Morasch at aol.com)
Date: Thu, 11 Mar 2010 18:04:43 EST
Subject: [gutvol-d] Re: a question for dkretz about twisted
Message-ID: <81cb4.31b86b2e.38cad10b@aol.com>

don said:
>    Close, but I don't think close enough.

ok, cool.   thank you.   just thought i'd ask.

i'd be interested in playing with it, though,
if you've got it up and available somewhere,
or if i can install it on my own site, just to see
exactly how "close" it comes...


>    I appreciate what Steve Jobs said recently

i've hated adobe for a long, long, long time...
but yeah, at one time, i did respect their work.
now it just seems shoddy.   bloated and shoddy.


>    What I'd recommend you consider is 
>    building on the WordPress blog engine.

sounds like too much overhead cruft to me...
i like to be close to the metal.


>     Another reasonable alternative 
>    might te Google Docs/Google Apps.

sounds like more cruft.   i'll stick with perl...

thanks again.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/d6838442/attachment.html>

From dakretz at gmail.com  Thu Mar 11 15:23:00 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 11 Mar 2010 15:23:00 -0800
Subject: [gutvol-d] Re: a question for dkretz about twisted
In-Reply-To: <81cb4.31b86b2e.38cad10b@aol.com>
References: <81cb4.31b86b2e.38cad10b@aol.com>
Message-ID: <627d59b81003111523s24bfe613lf8987abef5a50d14@mail.gmail.com>

If you go to the same site where you download Twister, the
source is all there too.

http://code.google.com/p/dp50/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/9a736c4d/attachment.html>

From jimad at msn.com  Thu Mar 11 15:24:14 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 11 Mar 2010 15:24:14 -0800
Subject: [gutvol-d] New Tool "pgdiff"
In-Reply-To: <379bb.624c0680.38c837db@aol.com>
References: <379bb.624c0680.38c837db@aol.com>
Message-ID: <SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>

I have created a new command line tool "pgdiff" along the lines of what BB
has been talking about, which compares two independently OCR'ed texts on a
word-by-word basis, so as to find and flag errors.  In this regards it is
similar to "worddiff", as opposed to "diff" which is the approach BB has
been talking about, which compares on a per-line basis.  But my new tool has
several tricks that haven't been seen before:

 
It can be used with two different versions or editions of the text as long
as there are not really long differences in the texts. IE the two texts do
not have to have their linebreaks at the same locations. It tries to retain
the linebreak locations of the first input text in preference to the second
input text. IE the first input text should represent the target text you are
trying to create.

 
This means it can also be used for "versioning" - for example using a copy
of a PG text from one version or edition of a text to help fix and create a
text from a different version or edition of the text.

 
It can also be used to recover linebreak information, where linebreak
information has been lost, for example to take an older PG text and recover
linebreak information in order to allow, for example, the resubmission of
that PG text back to DP for a clean-up pass.

 
In normal mode when if finds an mismatch it outputs the mismatch like this {
it'll | it'11 } within the body of the text so that given a regex compatible
editor it is very quick to search for and fix the errors found.

 
As BB says, having tried this approach, the manual approach of trying to
visually spot errors seems pretty painful and silly.

 
I find that finding differences on a word basis rather than a line basis
makes it quicker and easier to fix the errors in general.

 
You do want to do some regex punc normalization on the two OCRs to try to
remove the trivial differences prior to running the tool, in order to cut
down the number of trivial errors it finds that you have to fix. 

 
Source and a compiled windows version at
http://www.freekindlebooks.org/Dev/StringMatch

 
It is based on traditional Levenshtein Distances where the token is taken to
be the non-white part of a "word" as opposed to measuring distances between
lines of text or on individual characters.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/54c15ce8/attachment-0001.html>

From jimad at msn.com  Thu Mar 11 15:40:16 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 11 Mar 2010 15:40:16 -0800
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <A9FE60E0-11D8-45EB-86E3-8BC3B029EE45@uni-trier.de>
References: <36c3a.27b91b5.38c98aa2@aol.com>
	<A9FE60E0-11D8-45EB-86E3-8BC3B029EE45@uni-trier.de>
Message-ID: <SNT120-DS169F48C47349B5EEA546BBAE320@phx.gbl>

First, it depends on what you mean by "locked to a particular person."
Typical of DB type stuff having two people editing the same record (in this
case the same page) at the same time is typically taken to not be a good
thing.  Assuming you are not suggesting doing away with the typical DB
convention of only having one person editing a record (the same page) at a
given time, then the remaining problem is "fix thrashing" which we already
see happening some in DP land.  IE P1 introduces a fix, and then P2 says no
*I* think it should be fixed this way and then P3 says no *I* think it
should be fixed THIS way.  At least in DP land P1, P2, and P3 are different
people, so the "fix" may not converge but at least its not thrashing -
meaning that there is only three rounds of time-wasting going on.  With
roundlessness you could potentially run into "proofer wars."  Well, actually
in DP land you can run into proofer wars too - trust me - its just that the
proofers have to run to a "higher authority" to engage in fix thrashing -
the DP system doesn't seem to me to directly allow proofer wars to happen.

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/07259af7/attachment.html>

From gbuchana at teksavvy.com  Thu Mar 11 15:40:35 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 11 Mar 2010 18:40:35 -0500
Subject: [gutvol-d] Re: a question for dkretz about twisted
In-Reply-To: <6d556.783644ab.38ca95a6@aol.com>
References: <6d556.783644ab.38ca95a6@aol.com>
Message-ID: <4B997F73.6040700@teksavvy.com>

On 11-Mar-2010 13:51, Bowerbird at aol.com wrote:
>
> a question for dkretz on "twisted"...
>
> it was coded in "air", so in _theory_
> anyway, it will run on a web-server.


Adobe AIR is a kind of stand-alone container for Flash and Flex.
While Flash and Flex are mostly thought of as "web" technologies
what is actually going on is that the application is being sent
to your browser and executed there.  The net is that the whole
shebang is essentially a client-side proposition, and an AIR
application does not translate easily into a "just a browser"
server-hosted application.

What could be done is to have a Flex/LiveCycle/Blaze Data Services
app on the server that could manage and dish out page images and
whatnot, to allow collaborative operation.

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From jimad at msn.com  Thu Mar 11 16:09:10 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 11 Mar 2010 16:09:10 -0800
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
References: <379bb.624c0680.38c837db@aol.com>
	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
Message-ID: <SNT120-DS5A5795E416896C478CE0AAE310@phx.gbl>

PS: To help clarify what I am talking about I enclose below an except of the
output of this tool

(being used for versioning, error-flagging and linebreak recovery)

 
=====

 
got to going away so much, too, and locking me in. 

Once he locked me in and was gone three days. It 

was dreadful lonesome. I judged he had got 

{ drowned, | drownded, } and I wasn't ever going to get out any 

more. I was scared. I made up my mind I would 

fix up some way to leave there. I had tried to get 

out of that cabin many a time, but I couldn't find 

no way. There { warn't | wam't } a window to it big enough 

for a dog to get through. I couldn't get up the 

chimbly; it was too narrow. The door was thick, 

solid oak slabs. Pap was pretty careful not to leave 

a knife or anything in the cabin when he was away; 

I reckon I had { hunted | himted } the place over as much as a 

{ hundred | himdred } times; well, I was most all the time at it, 

because it was about the only way to put in the time. 

But this time I found something at { last ; | last; } I found an 

old rusty wood-saw { without | v/ithout } any handle; it was laid 

in between a rafter and the clapboards of the roof. 

I greased it up and went to work. There was an old 

horse-blanket nailed against the logs at the far end 

of the cabin behind the table, to keep the wind 

from blowing through the chinks and putting the 

candle out. I got under the table and raised the 

blanket, and went to work to { saw | sav/ } a section of the 

big bottom log { out - big | out--big } enough to let me through. 

Well, it was a good long job, but I was getting 

{ towards | toward } the end of it when I heard pap's gun in the 

woods. I got rid of the signs of my work, and 

dropped the blanket and hid my saw, and pretty 

soon pap come in.

 
=====

 
One input file has line breaks that look like this:

 
.it. I was all over welts.  He got to going away so much, too, and locking

me in.  Once he locked me in and was gone three days.  It was dreadful

lonesome.  I judged he had got drowned, and I wasn't ever going to get.

 
=====

 
The other input file has line breaks that look like this:

 
.got to going away so much, too, and locking me in.

Once he locked me in and was gone three days. It

was dreadful lonesome. I judged he had got.

 
But it doesn't matter, the algorithm will still find the word differences.

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/dcc1ecdd/attachment.html>

From Bowerbird at aol.com  Thu Mar 11 17:58:09 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 11 Mar 2010 20:58:09 EST
Subject: [gutvol-d] Re: a question for dkretz about twisted
Message-ID: <8cfa2.79442125.38caf9b1@aol.com>

gardner said:
>   an AIR application does not translate easily into 
>    a "just a browser" server-hosted application.

so, gardner, i think you're telling me that "it can't be done",
in regard to running "twister" in a browser.   if i'm mistaken,
and it can be done -- i.e., _you_ think that _you_ can do it --
do please let me know...

thanks.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/debdbccd/attachment.html>

From Bowerbird at aol.com  Thu Mar 11 18:29:47 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 11 Mar 2010 21:29:47 EST
Subject: [gutvol-d] [SPAM] re: New Tool "pgdiff"
Message-ID: <8ee0f.65afe1b7.38cb011b@aol.com>

jim said:
>    To help clarify what I am talking about 
>    I enclose below an except of the output of this tool

personally, i find that output to be obtuse and hard to read.
and editing it would be very problematic, and error-ridden.

giving people a tool that would present the choices to them,
and let them click a button for the correct one, would make
this far more easier to work with.

***

jim said:
>    I have created a new command line tool ?pgdiff?

good for you, jim!

i assume you wrapped the "wdiff" routine in an .exe?
that'll make it easier to use for normal windows users.


>    In this regards it is similar to ?worddiff?, as opposed
>    to ?diff? which is the approach BB has been 
>    talking about, which compares on a per-line basis.

well, i use "diff" as a generic term.   whether you use
"diff" or "wdiff" depends largely on whether the lines
are broken in a similar way, or have been rewrapped.

i usually find it's worthwhile to fix the linebreaks so
they are identical in the files, and match the p-book.

that's because to resolve many of these differences,
you have to look at the actual page, and that job is
infinitely easier if your linebreaks match the page...


>    But my new tool has several tricks
>    that haven?t been seen before:

um, ok...


>    It can be used with two different versions or editions of
>    the text as long as there are not really long differences

ok, but that's something that's been "seen before"...


>    This means it can also be used for ?versioning? ?
>    for example using a copy of a PG text from one version
>    or edition of a text to help fix and create a text
>    from a different version or edition of the text.

i'm not sure i understand what you're talking about here.
if there are differences, how do you know if the differences
are edition differences or o.c.r. differences?   you'd have to
refer to the page-scans for one version or the other, right?


>    It can also be used to recover linebreak information,
>    where linebreak information has been lost, for example
>    to take an older PG text and recover linebreak information
>    in order to allow, for example, the resubmission of that
>    PG text back to DP for a clean-up pass.

again, not something that hasn't been seen before...

but i'd love to see this in action.   carlo has _posted_ that
people could use wdiff to do this chore automatically, but
when asked to explain the procedure, he failed to follow up.


>    In normal mode when if finds an mismatch it outputs
>    the mismatch like this { it?ll | it?11 } within the body of
>    the text so that given a regex compatible editor it is
>    very quick to search for and fix the errors found.

i'd really like to learn the reg-ex that makes this "very quick".

i assume you'd search for the first half of the pair, and erase
it if it's incorrect.   then you'd do the same for the second half.
then you'd go back and globally remove the excess characters.

but i'd sure like to see that in action.

and i don't think it would be very fast.   or feel very easy.

especially when -- for an error like '11 -- a global change
within each of the files would end up being more efficient.

it's also the case that, as i mentioned up above, you _need_
to have the scan available for viewing to resolve some diffs,
so the ability of the tool to present those scans is _crucial_.


>    I find that finding differences on a word basis rather than
>    a line basis makes it quicker and easier to fix the errors

if you've looked at the diffs i've presented, the _indicator_line_
narrows your focus down to a single word (if that's the diff),
or even a single _character_ (like a comma, if that's the diff).

it's just showing you the entire line so you have the _context_,
and so you can _find_that_line_ more easily on the page-scan.


>    Source and a compiled windows version at

i'll take a look, as soon as i happen to be around a windows box.

in the meantime, congratulations for programming a tool!       :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100311/3c4b3021/attachment-0001.html>

From gbuchana at teksavvy.com  Thu Mar 11 19:33:03 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 11 Mar 2010 22:33:03 -0500
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS5A5795E416896C478CE0AAE310@phx.gbl>
References: <379bb.624c0680.38c837db@aol.com>	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
	<SNT120-DS5A5795E416896C478CE0AAE310@phx.gbl>
Message-ID: <4B99B5EF.8000100@teksavvy.com>


On 11-Mar-2010 19:09, James Adcock wrote:
 > PS: To help clarify what I am talking about I enclose below an except of
 > the output of this tool
 >

This suits me.  I have a project on the go that I will try this on
pretty promptly.  I will let you know what I come up with.

Thank you!

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From traverso at posso.dm.unipi.it  Thu Mar 11 20:30:57 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 12 Mar 2010 05:30:57 +0100 (CET)
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS5A5795E416896C478CE0AAE310@phx.gbl> (jimad@msn.com)
References: <379bb.624c0680.38c837db@aol.com>
	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
	<SNT120-DS5A5795E416896C478CE0AAE310@phx.gbl>
Message-ID: <20100312043057.B1BDBFFC5@cardano.dm.unipi.it>

>>>>> "James" == James Adcock <jimad at msn.com> writes:


    James> PS: To help clarify what I am talking about I enclose below
    James> an except of the output of this tool

    James> (being used for versioning, error-flagging and linebreak
    James> recovery)

It seems very much similar to wdiff output, may you please show where
your tool gives something basically different from wdiff?

Carlo Traverso

From ke at gnu.franken.de  Thu Mar 11 23:13:00 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Fri, 12 Mar 2010 08:13:00 +0100
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <20100312043057.B1BDBFFC5@cardano.dm.unipi.it> (Carlo Traverso's
	message of "Fri, 12 Mar 2010 05:30:57 +0100 (CET)")
References: <379bb.624c0680.38c837db@aol.com>
	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
	<SNT120-DS5A5795E416896C478CE0AAE310@phx.gbl>
	<20100312043057.B1BDBFFC5@cardano.dm.unipi.it>
Message-ID: <m2k4tidncz.fsf@gnu.franken.de>

traverso at posso.dm.unipi.it (Carlo Traverso) writes:

> It seems very much similar to wdiff output, may you please show where
> your tool gives something basically different from wdiff?

Or ediff, coming with Emacs.

-- 
Karl Eichwalder

From Bowerbird at aol.com  Fri Mar 12 11:05:14 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 12 Mar 2010 14:05:14 EST
Subject: [gutvol-d] any arguments against automatic good-word listing?
Message-ID: <1da4d.5875f551.38cbea6a@aol.com>

i recently praised rfrank for simplifying the procedure of
adding a word to the "good-words list", compared to d.p.

when a word which _should_ be on that list is missing,
proofers have to struggle through the false flagging of
that word, and that decreases the efficiency of flagging.

for instance, when the name of a character in the book is
flagged every time it appears, it can cause "flag fatigue",
by virtue of its frequent nature.   moreover, it can cause
you to miss the cases where the name was misrecognized,
because you've grown accustomed to skipping that flag...
(if the name is in the good-words list, only an _incorrect_
version of the name gets flagged, which is what we want.)

and once your good-words list is _complete_, you can
do spellcheck on the book and have it come out _clean_.
this is extremely valuable, because it means that you can
repeat that spellcheck after any major editing operation
(or at any milestone that you decide during the workflow)
to make sure that your processing didn't introduce errors.

so it's in everyone's best interest to have a good-words
list which actually contains all "good words" in the book.

which is why rfrank's simple-and-immediate procedure
is far superior to the d.p. way, which is hard and slow...

but there are methods even better than rfrank's...

one of the most useful tools in my arsenal is one that
takes text as its input -- up to the entirety of a book --
and quickly spits out a list of words not in its dictionary.

thus it gives me a list of words that i'll need to check...

but _many_ of these words, primarily _names_, but also
words that appear a relatively large number of times, are
ones that will go onto the good-words list for the book.

so this tool can be used in _preprocessing_ to create a
good-words list that is actually compellingly complete.

but let's say you did not use this tool in preprocessing,
and your good-words list is still missing lots of words...

now let's look at a case where a proofer has just finished
a page that had a number of words flagged on it, because
they were not in the dictionary or on the good-words list.

even if the proofer didn't take the time to add the flagged
words to the good-words list, should they be auto-added?

because, if they're ok on this page, they should be added!

in other words, why make the proofer go to _any_ trouble
to add a word to the good-words list?   just analyze the
page they've saved, as "good", finding all words that are
not on the good-words list, and adding them automatically?

as far as i can see, there are 2 problems that might result.

the first is that the proofer made an error, and failed to
catch a flagged word that was incorrect.   in such a case,
that would mean that any other occurrences of that word
would not be flagged.   that's unfortunate, of course, but
is it a great tragedy?   i think not.   proofers need to know
that an unflagged word _might_ be incorrect, n'est pas?
of course they do.   that's the essence of a stealth scanno.

the second problem is that the word might be correct on
_this_ page, but is _incorrect_ if it appeared elsewhere,
and would need to be flagged there.   again, i do not take
a failure to flag every bad word as being necessarily bad.
but directly to the point of this second possible problem,
i simply don't think this situation occurs all that often...

so i would like to issue a challenge, to the people who
look at more books-in-progress than i do, to _locate_
this situation, where a non-dictionary word is _correct_
on _one_ page, yet _incorrect_ on some _other_ page...

after all, this is the raison d'etre for the d.p. procedure,
which insists that a word nominated for the good-words
list must be inspected/approved by the project manager.

so -- if this situation is more common than i believe --
project managers should have _lots_ of examples for me.
so let's hear them.

anyway, that's the suggestion, that when a page is proofed,
all the words which had been flagged are _automatically_
added to the good-words list.   and yes, i will also add that
i believe such additions should be _screened_ by someone,
but that's in keeping with my overall plan that any and all
changes that are made will be doublechecked by someone.

-bowerbird

p.s.   if you were quick enough to grok a flip-side suggestion
that any word on the good-words list which was _changed_
on any page should automatically be _removed_ from the
good-words list, give yourself a gold star for a sharp mind.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100312/e98cfd9c/attachment.html>

From jimad at msn.com  Fri Mar 12 15:58:40 2010
From: jimad at msn.com (James Adcock)
Date: Fri, 12 Mar 2010 15:58:40 -0800
Subject: [gutvol-d] Re: [SPAM] re: New Tool "pgdiff"
In-Reply-To: <8ee0f.65afe1b7.38cb011b@aol.com>
References: <8ee0f.65afe1b7.38cb011b@aol.com>
Message-ID: <SNT120-DS256EB4C728F4FD4B9D5C74AE310@phx.gbl>

>giving people a tool that would present the choices to them,
and let them click a button for the correct one, would make
this far more easier to work with.

 
Sorry ? I guess I assumed people know how to use a regex editor!


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100312/4186302c/attachment.html>

From dakretz at gmail.com  Fri Mar 12 20:53:57 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 12 Mar 2010 20:53:57 -0800
Subject: [gutvol-d] Re: [SPAM] re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS256EB4C728F4FD4B9D5C74AE310@phx.gbl>
References: <8ee0f.65afe1b7.38cb011b@aol.com>
	<SNT120-DS256EB4C728F4FD4B9D5C74AE310@phx.gbl>
Message-ID: <627d59b81003122053p28ca57c5wef6a9f618c81ef0d@mail.gmail.com>

The good news is probably all of them who do and also have any interest in
proofreading are probably right here on this mailing ist.

On Fri, Mar 12, 2010 at 3:58 PM, James Adcock <jimad at msn.com> wrote:

>  >giving people a tool that would present the choices to them,
> and let them click a button for the correct one, would make
> this far more easier to work with.
>
>
>
> Sorry ? I guess I assumed people know how to use a regex editor!
>
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100312/d239e4fb/attachment-0001.html>

From Bowerbird at aol.com  Fri Mar 12 22:08:56 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 13 Mar 2010 01:08:56 EST
Subject: [gutvol-d] Re: New Tool "pgdiff"
Message-ID: <43aab.ad1a0fd.38cc85f8@aol.com>

jim said:
>   Sorry ? I guess I assumed people know how to use a regex editor!

um, you just shot yourself in the foot, jim.

you assume that people know how to use a reg-ex editor, but
you also assume they need to have wdiff wrapped in an .exe?

not much logic there, i'm afraid...

but hey, for the sake of completeness of the thread, how about
you quickly run through how i'd "use a reg-ex editor" for this?
because i honestly don't know.   (and i _do_ know how to wdiff.)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100313/0edc107e/attachment.html>

From lee at novomail.net  Sat Mar 13 11:34:49 2010
From: lee at novomail.net (Lee Passey)
Date: Sat, 13 Mar 2010 12:34:49 -0700
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
References: <379bb.624c0680.38c837db@aol.com>
	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
Message-ID: <4B9BE8D9.5020904@novomail.net>

On 3/11/2010 4:24 PM, James Adcock wrote:

> I have created a new command line tool ?pgdiff? along the lines of what
> BB has been talking about, which compares two independently OCR?ed texts
> on a word-by-word basis, so as to find and flag errors.

[snip]

I think this will be a very useful tool moving forward, at least to me. 
I particularly like the fact that the code is not derived from the GNU 
diff program. wdiff, of which Mr. Traverso is so fond, is actually just 
a front end to diff; it takes the input files and rewrites them so that 
each word is on a separate line, and then passes the rewritten lines to 
diff. Once you have the diff output it somehow figures out how to merge 
the results back with the originals, but I actually lost interest in 
figuring out the code when I realized in required the GNU diff program 
to work.

One of the reasons I wanted to avoid GNU diff and wdiff is because of 
the restrictive, viral GPL. I have no problem /using/ GPLed programs, 
but I have no interest in extending or improving them -- which leads me 
to wonder about your own claims to intellectual property in this code. 
Here in the United States I don't think any author can avoid a copyright 
even if he or she doesn't want one. Copyright is created and attached by 
operation of law, and there is no actual legal entity called "the public 
domain" that you can assign your copyright to.

I think it would be nice to have a non-profit organization whose mission 
is solely to hold copyrights and refuse to enforce them. In the 
meantime, here is the verbiage I use on my code; I'm not completely 
convinced it will actually work, but you might want to adopt it as well:

/*
   Copyright-Only Dedication (based on United States law)

   The person or persons who have associated their work with this
   document (the "Dedicators") hereby dedicate whatever copyright they
   may have in the work of authorship herein (the "Work") to the
   public domain.

   Dedicators make this dedication for the benefit of the public at
   large and to the detriment of Dedicators' heirs and successors.
   Dedicators intend this dedication to be an overt act of
   relinquishment in perpetuity of all present and future rights
   under copyright law, whether vested or contingent, in the Work.
   Dedicators understand that such relinquishment of all rights
   includes the relinquishment of all rights to enforce (by lawsuit
   or otherwise) those copyrights in the Work.

   Dedicators recognize that, once placed in the public domain, the
   Work may be freely reproduced, distributed, transmitted, used,
   modified, built upon, or otherwise exploited by anyone for any
   purpose, commercial or non-commercial, and in any way, including
   by methods that have not yet been invented or conceived.
*/

I suspect that your own code may need to be "hardened" against 
particularly ill-formed files, and might possibly be enhanced to satisfy 
other needs, or could even become the back end for a visual tool for 
those users who need it. I'd be happy to route enhancements or bug fixes 
back to you if I have permission to use the code in other ways.

From jon.ingram at gmail.com  Sat Mar 13 12:06:31 2010
From: jon.ingram at gmail.com (Jon Ingram)
Date: Sat, 13 Mar 2010 20:06:31 +0000
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <4B9BE8D9.5020904@novomail.net>
References: <379bb.624c0680.38c837db@aol.com>
	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
	<4B9BE8D9.5020904@novomail.net>
Message-ID: <4baf53721003131206gde7e5cl761bb0f8706adc32@mail.gmail.com>

On 13 March 2010 19:34, Lee Passey <lee at novomail.net> wrote:

> <snip>


> In the meantime, here is the verbiage I use on my code; I'm not completely
> convinced it will actually work, but you might want to adopt it as well:
>
> /*
>  Copyright-Only Dedication (based on United States law)
>
>  The person or persons who have associated their work with this
>  document (the "Dedicators") hereby dedicate whatever copyright they
>  may have in the work of authorship herein (the "Work") to the
>  public domain.
>
>  Dedicators make this dedication for the benefit of the public at
>  large and to the detriment of Dedicators' heirs and successors.
>  Dedicators intend this dedication to be an overt act of
>  relinquishment in perpetuity of all present and future rights
>  under copyright law, whether vested or contingent, in the Work.
>  Dedicators understand that such relinquishment of all rights
>  includes the relinquishment of all rights to enforce (by lawsuit
>  or otherwise) those copyrights in the Work.
>
>  Dedicators recognize that, once placed in the public domain, the
>  Work may be freely reproduced, distributed, transmitted, used,
>  modified, built upon, or otherwise exploited by anyone for any
>  purpose, commercial or non-commercial, and in any way, including
>  by methods that have not yet been invented or conceived.
> */
>
This sounds quite similar to the 'Creative Commons Zero' licence:

 http://creativecommons.org/publicdomain/zero/1.0/


-- 
Jon Ingram
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100313/027145ef/attachment.html>

From jimad at msn.com  Sat Mar 13 17:51:20 2010
From: jimad at msn.com (James Adcock)
Date: Sat, 13 Mar 2010 17:51:20 -0800
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <43aab.ad1a0fd.38cc85f8@aol.com>
References: <43aab.ad1a0fd.38cc85f8@aol.com>
Message-ID: <SNT120-DS1285014BB98D7D86E927A8AE2F0@phx.gbl>

>but hey, for the sake of completeness of the thread, how about
you quickly run through how i'd "use a reg-ex editor" for this?
because i honestly don't know.  (and i _do_ know how to wdiff.)


On Vim I type:

 
:/[{|}]/

 
Which highlights the edits and takes me to the next set of edits to choose from, thereafter I just type ?n? to move to the next set of fixes that I need to deal with.  I like Vim because I can just keep my fingers on the keyboard where they belong while editing and not have to mess with the mouse.

 
When you are versioning it is frequently not as simple as ?choose A? or ?choose B? but often a mix of both that you have to edit.  And I like seeing each next to each other in context to help figure out what the ?correct? editing moves are. IE if A is the target then maybe it has a word with a scanno, and B has the word without the scanno but with an incorrect capitalization.  For example if B is an old PG text you are versioning then it may have ?italics? in ALL CAPS whereas A had it in real italics which increases the chance of the A OCR making a scanno. Also these are often the edits are a mixture of inserts, deletes, and substitutions. 

 
PS: You criticize me for doing that which the creator of wdiff said he would do if only he had the gumption.

 
PPS: How do you use wdiff to recover lost linebreaks?

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100313/dc9a0e6d/attachment.html>

From jimad at msn.com  Sat Mar 13 17:58:41 2010
From: jimad at msn.com (James Adcock)
Date: Sat, 13 Mar 2010 17:58:41 -0800
Subject: [gutvol-d] [SPAM] RE:  Re: New Tool "pgdiff"
In-Reply-To: <4B9BE8D9.5020904@novomail.net>
References: <379bb.624c0680.38c837db@aol.com>	<SNT120-DS1221A0C992C29520117AEFAE320@phx.gbl>
	<4B9BE8D9.5020904@novomail.net>
Message-ID: <SNT120-DS227B7326587C185BD27519AE2F0@phx.gbl>

I decline to attach any verbiage at all.  I tell you I wrote it and you can
use it any way you like -- at your own risk and amusement, obviously. If you
need to get more serious than that contact me by email and we can talk about
it. If you find bugs in it or difficulties porting to other platforms I
would like to know about it.

I recommend the code not be used by NASA.  I have written code that others
potentially depended on for life and limb and I would rather not have to go
there again.


From Bowerbird at aol.com  Sun Mar 14 14:42:38 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 14 Mar 2010 17:42:38 EDT
Subject: [gutvol-d] [SPAM] re: any arguments against "free-range" proofing?
Message-ID: <c92e.7307eef5.38ceb24e@aol.com>

nobody came up with many objections to "free-range" proofing...

***

keith said:
>    I do not see anything truely speaking against such a system.
>    The only problems are the administrative tasks involved.

thanks for the feedback keith...


>    1) you have to track all this.

i think that's pretty easy.   when they hit the "save/confirm" button,
if there were changes made, then they have _saved_ a new version,
which will then set up a "diff appointment" with any prior proofers.

or, if there were no changes made, it's registered as a "certification".
once a page has two consecutive certifications, it's marked as "done".
(any free-range proofers can still proof the page, of course, but once
all the pages are marked as "done", the book is ready to be finished.)

the "diff appointment" can be resolved in one of three different ways.
the first proofer can say "i goofed", or the second one can say "oops!"
either of these actions mean one person loses points, and one gains...

the third resolution -- when they can't come to a mutual agreement --
comes via a referee, who decides a winner and rewrites documentation
so the issue doesn't come up again.   points might or might not be lost.
(or deducted points might be doubled, if a ref was called unnecessarily.)

the purpose of diff review is simply to train correct proofing and coding.
people who continue making bad changes might be asked to leave, but
i don't anticipate it happening very often.   people like to do a good job.


>    2) keep everything store somewhere

each subsequent saved-text will be stored, for subsequent diff tests.

upon saving, it will be compared to all of the earlier saved versions,
to see if it is a revert to an earlier save.   if it is, it will be dealt 
with
appropriately, depending on the resolution of that earlier version...

any proofer will be able to step through all the versions of each file,
viewing which changes were made.   i expect that some proofers will
specialize in this particular tactic, making sure every change is good.


>    3) keep everything in sync

i've had my share of sync problems in the past, coding e-book
authoring-tools, so i think i know where all the pitfalls are now.       
:+)

which is not to say i won't fall in some of 'em again sometimes.
but i can usually figure out pretty quickly now what i did wrong.


>    you will need an authority/ies that finally certify that 
>    a page satisfies your criteria as being done.?

that's easy.   if the text is .zml that creates .html which
looks like the page-scan, then that satisfies the criteria.
the proofers view .html output, so can see for themselves.


>    Some may call it a administrative nightmare, 
>    but it should be workable.

yes, i think i can make it work.   again, thanks for the feedback.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100314/eea5a4b4/attachment-0001.html>

From Bowerbird at aol.com  Sun Mar 14 15:58:29 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 14 Mar 2010 18:58:29 EDT
Subject: [gutvol-d] Re: New Tool "pgdiff"
Message-ID: <103fd.1a43a953.38cec415@aol.com>

jim said:
>    I decline to attach any verbiage at all.
>    I tell you I wrote it and you can use it any way you like
>    -- at your own risk and amusement, obviously.

except that some of that "verbiage" was people asking
just how exactly your program differs from one that
they've been using all along.   don't you wanna tell 'em?

and as for me, perhaps you noticed i congratulated you
for programming a tool.   was that just "verbiage" to you?

in addition, i will analyze any new tool, to check how well
it performs the job for which it is intended. it's fine if you
don't want to discuss it, but such a review is not "verbiage".
it's necessary to take an objective look at our tools to see
if they do the job, how they can do it better, and so on...

you specifically said your tool helps in 3 areas:
1.   line-break recovery
2.   error-flagging
3.   versioning

you even said:
>    my new tool has several tricks
>    that haven?t been seen before

(if anything has been "verbiage" in this thread, it's that!)

so, at the end of this post, i'll begin to look at those 3 areas.


>    If you need to get more serious than that
>    contact me by email and we can talk about it.

imagine the d.p. people had told you to make your complaints
"via e-mail".   i'd venture a guess that you would laugh at that...

***


>    On Vim I type:
>    :/[{|}]/
>    Which highlights the edits and takes me to the next set of edits

but that selects both the options, and the surrounding characters.

that's not really what you want -- what _most_people_ would want.

and it involves typing.   either typing or a lot of delicate deleting.
both of which increase the probability that errors are introduced.


>    When you are versioning it is frequently not as simple as
>    ?choose A? or ?choose B? but often a mix of both that
>    you have to edit.

i'm sure i know the reality much better than you do, jim, because
i've actually _done_ this resolution job, for lots and lots of books.

but maybe rather than schooling me personally, you've said this
for the benefit of the lurkers who might not have thought about it
very much, if at all.   (and that's an entirely appropriate thing to do.)

but if we're going to enlighten them, let's do it properly, ok?

your word "frequently" is simply (but completely) out of place.

in the vast majority of cases (96%) where there is a difference
between the two versions, _one_ of the versions is _correct_...

there _are_ some cases where both are incorrect, meaning that
you need to do some editing, but such cases are relatively rare.

in the last book for which i did a comparison, gardner's text,
there were 159 differences.   there were only _3_ cases where
_both_ versions were incorrect.   so yes, it happens, but rarely.


>    And I like seeing each next to each other in context
>    to help figure out what the ?correct? editing moves are.

oh yeah, the context is _crucial_.

but i'm not sure that your _display_ is the optimal one...
it takes a lot of visual parsing to figure out a diff like this:
>    no way. There { warn't | wam't } a window to it big enough

personally, i find this display _much_ easier to understand:
>    no way. There warn't a window to it big enough
>    no way. There wam't a window to it big enough
>    ================^^============================

(i hope the monospaced font came through.   if so,
you'll see the "^^" markers line up with the diff.)

and i believe most users would agree that this display is better.

but, you know, if some users like _your_ display better, _fine!_    :+)

oh, and one more note on "context".   sometimes it can fool you.
the choice that looks right might not be what was in the book...
that's why it's vitally important that your tool show you the scan.
otherwise, you're doing your edits blind...


>    PS: You criticize me for doing that which the creator of wdiff
>    said he would do if only he had the gumption.

you'll need to provide a little more information to be understood.


>    How do you use wdiff to recover lost linebreaks?

i don't use wdiff for that.   i wrote my own program.

i asked carlo to explain how _he_ does it, but he never answered.
i found it humorous he was willing to come out to challenge you,
but isn't willing to come out when he is challenged...

***

anyway, in order to "kick the tires" on your pgdiff program, jim,
i'll set up some files that we can compare.   (real books, real files,
and not of my own choosing, either, but from rfrank's test-site.)

i'll run the files when i next find myself around a p.c. machine...
or, if you feel like it, jim, you can run them and post your output.

once i have some real output to look at, i'll be able to do a much
more thorough review of this new tool.

while you're waiting for that, though, here's a screenshot of a
tool that i wrote that makes it easier to work with jim's output.

>    http://z-m-l.com/misc/jim-tool-addon-screenshot.png

basically, when it finds a line with a diff in it, it presents the
options to the user, who can then click a button to choose one,
or enter a number -- 1 or 2 -- to activate the appropriate button.

in the case where editing is needed, either option can be edited
before the button is clicked to select it.   the "stop loop" button
will stop the loop that presents the next diff display; otherwise,
the app loops through the entire file, jumping to the next diff.

so, you see jim, i'm really trying to _help_ you in your quest here.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100314/8a181366/attachment.html>

From schultzk at uni-trier.de  Mon Mar 15 00:40:10 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 15 Mar 2010 08:40:10 +0100
Subject: [gutvol-d] Re: [SPAM] re: any arguments against "free-range"
	proofing?
In-Reply-To: <c92e.7307eef5.38ceb24e@aol.com>
References: <c92e.7307eef5.38ceb24e@aol.com>
Message-ID: <16186B3D-DA3B-44E8-9B60-4435264D4779@uni-trier.de>

Hi BB,

You are basically, taking the standard approach to the problem.
You did not need to explain. 
More interesting would be how you track everything. I believe
you will need some form of a database for the points system,
who made what version, when, is a page done, etc..

I wish you luck. Looks promising.

regards
	Keith.

Referees will have to special rights for changing it. 

Am 14.03.2010 um 22:42 schrieb Bowerbird at aol.com:

> nobody came up with many objections to "free-range" proofing...
> 
> ***
> 
> keith said:
> >   I do not see anything truely speaking against such a system.
> >   The only problems are the administrative tasks involved.
> 
> thanks for the feedback keith...
> 
> 
> >   1) you have to track all this.
> 
> i think that's pretty easy.  when they hit the "save/confirm" button,
> if there were changes made, then they have _saved_ a new version,
> which will then set up a "diff appointment" with any prior proofers.
> 
> or, if there were no changes made, it's registered as a "certification".
> once a page has two consecutive certifications, it's marked as "done".
> (any free-range proofers can still proof the page, of course, but once
> all the pages are marked as "done", the book is ready to be finished.)
> 
> the "diff appointment" can be resolved in one of three different ways.
> the first proofer can say "i goofed", or the second one can say "oops!"
> either of these actions mean one person loses points, and one gains...
> 
> the third resolution -- when they can't come to a mutual agreement --
> comes via a referee, who decides a winner and rewrites documentation
> so the issue doesn't come up again.  points might or might not be lost.
> (or deducted points might be doubled, if a ref was called unnecessarily.)
> 
> the purpose of diff review is simply to train correct proofing and coding.
> people who continue making bad changes might be asked to leave, but
> i don't anticipate it happening very often.  people like to do a good job.
> 
> 
> >   2) keep everything store somewhere
> 
> each subsequent saved-text will be stored, for subsequent diff tests.
> 
> upon saving, it will be compared to all of the earlier saved versions,
> to see if it is a revert to an earlier save.  if it is, it will be dealt with
> appropriately, depending on the resolution of that earlier version...
> 
> any proofer will be able to step through all the versions of each file,
> viewing which changes were made.  i expect that some proofers will
> specialize in this particular tactic, making sure every change is good.
> 
> 
> >   3) keep everything in sync
> 
> i've had my share of sync problems in the past, coding e-book
> authoring-tools, so i think i know where all the pitfalls are now.      :+)
> 
> which is not to say i won't fall in some of 'em again sometimes.
> but i can usually figure out pretty quickly now what i did wrong.
> 
> 
> >   you will need an authority/ies that finally certify that 
> >   a page satisfies your criteria as being done. 
> 
> that's easy.  if the text is .zml that creates .html which
> looks like the page-scan, then that satisfies the criteria.
> the proofers view .html output, so can see for themselves.
> 
> 
> >   Some may call it a administrative nightmare, 
> >   but it should be workable.
> 
> yes, i think i can make it work.  again, thanks for the feedback.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100315/fc3cb12e/attachment.html>

From schultzk at uni-trier.de  Mon Mar 15 01:12:47 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 15 Mar 2010 09:12:47 +0100
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <103fd.1a43a953.38cec415@aol.com>
References: <103fd.1a43a953.38cec415@aol.com>
Message-ID: <8949C23D-FED3-47F8-AB1F-5629CC201107@uni-trier.de>


Am 14.03.2010 um 23:58 schrieb Bowerbird at aol.com:

> jim said:
> >   I decline to attach any verbiage at all.
> >   I tell you I wrote it and you can use it any way you like
> >   -- at your own risk and amusement, obviously.

	[snip, snip]
> 
> your word "frequently" is simply (but completely) out of place.
> 
> in the vast majority of cases (96%) where there is a difference
> between the two versions, _one_ of the versions is _correct_...
> there _are_ some cases where both are incorrect, meaning that
> you need to do some editing, but such cases are relatively rare.
> 
> in the last book for which i did a comparison, gardner's text,
> there were 159 differences.  there were only _3_ cases where
> _both_ versions were incorrect.  so yes, it happens, but rarely.

	True enough. Yet, the arguement stands. 
	At least in my opinion. The trivial cases are easy to handle, yet it is
	always the RARE cases where tools can shine and set themselves
	apart from the rest. 
> 
> 
> >   And I like seeing each next to each other in context
> >   to help figure out what the ?correct? editing moves are.
> 
> oh yeah, the context is _crucial_.
> 
> but i'm not sure that your _display_ is the optimal one...
> it takes a lot of visual parsing to figure out a diff like this:
> >   no way. There { warn't | wam't } a window to it big enough
> 
> personally, i find this display _much_ easier to understand:
> >   no way. There warn't a window to it big enough
> >   no way. There wam't a window to it big enough
> >   ================^^============================
> 
> (i hope the monospaced font came through.  if so,
> you'll see the "^^" markers line up with the diff.)
> 
> and i believe most users would agree that this display is better.
> 
> but, you know, if some users like _your_ display better, _fine!_   :+)
	Actually, both methods are kind of primitive from a Human Interface standpoint.
	a better way would be having two windows containing two or more lines above and
	below the diff and marking each. 
	If you ever work with critical editions you will understand the cavet of this method.
	The changes can then be made in a third. All can be enhanced with colors and other neat
	features.

> 
> oh, and one more note on "context".  sometimes it can fool you.
> the choice that looks right might not be what was in the book...
> that's why it's vitally important that your tool show you the scan.
> otherwise, you're doing your edits blind...
	Very true. 

	regards
		Keith.

P.S. There will always more than one way to skin a cat!

		
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100315/a16f776b/attachment-0001.html>

From danweber at mindspring.com  Mon Mar 15 09:29:27 2010
From: danweber at mindspring.com (Dan Weber)
Date: Mon, 15 Mar 2010 12:29:27 -0400
Subject: [gutvol-d] (no subject)
Message-ID: <003301cac45c$af100b20$0d302160$@com>

To whom it may concern:

 
www.popsci.com

 
This site has 137 years of Popular Science magazine page scans online for
free.

 
danweber at mindspring.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100315/a6ef5738/attachment.html>

From Bowerbird at aol.com  Mon Mar 15 12:40:47 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Mar 2010 15:40:47 EDT
Subject: [gutvol-d] Re: New Tool "pgdiff"
Message-ID: <5bb98.38a53811.38cfe73f@aol.com>

keith said:
>    True enough. Yet, the arguement stands.

perhaps you didn't catch my entire gist.

_of_course_ one needs to allow for the possibility of
editing either version, since both might be incorrect.

as i pointed out, my tool (which supports jim's tool)
does exactly that.


>    At least in my opinion. The trivial cases are 
>    easy to handle, yet it is always the RARE cases 
>    where tools can shine and set themselves
>    apart from the rest.

but jim's methodology -- where his tool simply
_marks_ the differences after which a person uses
a reg-ex editor to actually _make_the_changes_ --
handles neither the rare cases nor the trivial well.

whereas my tool-in-support-of-his handles both,
equally well.

a reg-ex editor, by requiring manual editing even
in the "trivial" cases, handles neither trivial nor rare
very well, in my opinion.

my tool-in-support-of-his makes the "trivial" cases,
which are by far the most common, trivial to handle,
with a mere button-click or keypress.   and the user
only has to do manual editing in the rare case, where
it simply cannot be avoided.


>    Actually, both methods are kind of primitive 
>    from a Human Interface standpoint.

i always appreciate it when someone analyzes my tools.

so let's see what you have to say here, keith.


>    a better way would be having 
>    two windows containing two or more lines 
>    above and below the diff and marking each.?

a little bit of context can help elucidate the difference.
too much context can bury it, depending on the display.
i'd have to see exactly what you mean in order to decide.

in my-tool-in-support-of-jim's tool, the change-window
is a movable modal, so people can simply look back at the
main window if they need to see more than 1 line of context.

(i could also put multiple content lines in the top box of the
change-window, if feedback indicated people wanted them.)


>    If you ever work with critical editions 
>    you will understand the cavet of this method.

if it impossible for you to explain in words?


>    The changes can then be made in a third.

again, not sure what you really mean here...


>    All can be enhanced with colors and other neat features.

you can always "enhance" anything with "other neat features".
the hard part is _coming_up_ with those "other neat features".

***

we should remember that my tool-in-support-of-jim's tool
isn't how _i_ would do the job.   i was just trying to show how
to make his tool work better.   i've shown how i do the job...

here's how i showed diffs with gardner's book, on 23 february:
>    http://z-m-l.com/go/gardn/gardn-hybrid6.html

that laid out the entire book, with diffs in different colors...

here's a simple reworking of that file, which i just posted:
>    http://z-m-l.com/go/gardn/gardn-hybrid7.html

this version of the file lets you click a link to see each scan,
and gives you radio-buttons where you can select the correct
alternative for each diff.   (or choose neither if both are wrong.)

this is how i would approach this task with an _online_ thrust,
working in a collaborative manner.   but i'd probably prefer to
do it with an _offline_ app instead, since that's more efficient.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100315/bfa26e3c/attachment.html>

From Bowerbird at aol.com  Mon Mar 15 14:09:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Mar 2010 17:09:58 EDT
Subject: [gutvol-d] [SPAM] re: New Tool "pgdiff"
Message-ID: <6311e.21050e3b.38cffc26@aol.com>

ok, jim, here's some sample files for your tool...

i'm using the book "sitka" that rfrank used on his test-site.

here's the original text uploaded by rfrank for his proofers:
>    http://z-m-l.com/go/jimad/sitka0-ocr.txt

and here's the text after the proofers were done with it:
>    http://z-m-l.com/go/jimad/sitka1-pp.txt

if you can run that through your tool and share its output,
that would be great.

or i'll do it, when i next encounter a windows box.       :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100315/4231e8d8/attachment.html>

From gbnewby at pglaf.org  Mon Mar 15 19:14:03 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Mon, 15 Mar 2010 19:14:03 -0700
Subject: [gutvol-d] Newby/Hart at Illinois symposium April 15-16
Message-ID: <20100316021403.GA26102@pglaf.org>

For those in the region, this might be of interest:
  http://50years.lis.illinois.edu/

PGLAF CEO Greg Newby will join PG founder Michael Hart
at a symposium on the U. Illinois campus.  Registration
is free but limited.  The panel with Michael & Greg is
scheduled for Thursday April 15 from 1:30-3pm.

  -- Greg

From pterandon at gmail.com  Tue Mar 16 02:59:53 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Tue, 16 Mar 2010 05:59:53 -0400
Subject: [gutvol-d] Re: [SPAM] re: any arguments against "free-range"
	proofing?
Message-ID: <a0bf3e961003160259x6817add1xef3813b04568f844@mail.gmail.com>

From: "Keith J. Schultz" <schultzk at uni-trier.de>
>
> More interesting would be how you track everything. I
> believe you will need some form of a database for
> the points system,

Could points be *the* problem?

I don't think "points" works well in performance metrics for either
high-level professionals or in nonprofits (say, the parents of a Cub
Scout den expected to contribute so much volunteer effort per year).
For one, it creates the expectation that the reason one contributes to
humanity is to get recognition for their effort on a piecemeal basis.
That the person who puts in 40 hours a week of effort needs more
praise than he or she who put in 35, 30, or 10 hours.   Secondly, it
creates false hierarchies where you don't allow someone into
leadership until they've "reached a level".

It's just my philosophical bias that the best nonprofits are not those
with the best volunteer awards dinners.


-- 
Greg M. Johnson
http://pterandon.blogspot.com

From lee at novomail.net  Tue Mar 16 07:58:15 2010
From: lee at novomail.net (Lee Passey)
Date: Tue, 16 Mar 2010 08:58:15 -0600
Subject: [gutvol-d] Co-operative proofreading
Message-ID: <4B9F9C87.4080608@novomail.net>

Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I decided 
to try to implement my own vision of a co-operative proofreading 
process. Anyone wanting to watch me flail about can follow my work at 
www.ebookcooperative.com. Login as guest, no password.

Apparently the engineers at Microsoft have not yet figured out how to 
implement CSS percentages, and I haven't had the time (or inclination) 
to build an Internet Explorer-aware implementation yet, so visitors 
would be advised to use a different browser.

From dakretz at gmail.com  Tue Mar 16 08:58:25 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 16 Mar 2010 08:58:25 -0700
Subject: [gutvol-d] Re: Co-operative proofreading
In-Reply-To: <4B9F9C87.4080608@novomail.net>
References: <4B9F9C87.4080608@novomail.net>
Message-ID: <627d59b81003160858s7dc274a5w88c29adedcd607d0@mail.gmail.com>

Nice one!

I can see we're going through the classic learning curve! Let a thousand
flowers bloom!


On Tue, Mar 16, 2010 at 7:58 AM, Lee Passey <lee at novomail.net> wrote:

> Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I decided to
> try to implement my own vision of a co-operative proofreading process.
> Anyone wanting to watch me flail about can follow my work at
> www.ebookcooperative.com. Login as guest, no password.
>
> Apparently the engineers at Microsoft have not yet figured out how to
> implement CSS percentages, and I haven't had the time (or inclination) to
> build an Internet Explorer-aware implementation yet, so visitors would be
> advised to use a different browser.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/ffe66d77/attachment-0001.html>

From Bowerbird at aol.com  Tue Mar 16 12:06:49 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Mar 2010 15:06:49 EDT
Subject: [gutvol-d] Re: Co-operative proofreading
Message-ID: <a1b3f.2b8557c7.38d130c9@aol.com>

dkretz said:
>    Nice one!
>    I can see we're going through the classic learning curve! 
>    Let a thousand flowers bloom!

you seem like a nice-enough fellow, don.

so why are all those people over at d.p. throwing rocks at you?       ;+)

***

lee said:
>    Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch

who are all these people?

hey lee, you left out dkretz.   who has done more than anyone,
including rfrank.   don was instrumental in producing dp-canada.

i have not talked about dp-canada because i was banned from it
before it even started by one of the crazy people involved with it.

but it seems to be limping along just fine, as far as i know, so if
anyone wants to start a site, i'd advise you to look at dp-canada.

or talk to don.   no one has come nearly as close to me in inspiring
the d.p. rock-throwers as don; he must be doing something right.


>    I decided to try to implement 
>    my own vision of a co-operative proofreading process. 

good luck!

i'd give you some feedback, but since you're in my spam folder
and all, dialog would be stilted, so you're on your own, passey.

it would be nice if a thousand flowers bloomed, but my bet is
999 will die on the vine.   which might also be fine, i don't know.

i just know i won't bother to keep count until there's a shakeout...

but geez louise, even some people from d.p. itself now seem to be
newly motivated to _do_something_.   which would be a good sign,
except they're planning a facelift to a system with structural damage.
it might _look_ a little better, but it won't really _work_ any better.
but -- largely due to rfrank's competition -- they're now _trying_...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/06ec3a0a/attachment.html>

From jon.ingram at gmail.com  Tue Mar 16 12:39:05 2010
From: jon.ingram at gmail.com (Jon Ingram)
Date: Tue, 16 Mar 2010 19:39:05 +0000
Subject: [gutvol-d] Re: Co-operative proofreading
In-Reply-To: <4B9F9C87.4080608@novomail.net>
References: <4B9F9C87.4080608@novomail.net>
Message-ID: <4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com>

Interesting, and it's good to see someone using a rich text editor for the
text, rather than expecting proofers to mess around with <i>, etc.

I'm not sure how the page was supposed to look, however --

- I'm using a widescreen 1680x1050 monitor, and there was still material off
the bottom of the page. This is using Google Chrome.

- I couldn't see any way to resize the image, so as to see the page width
rather than the zoomed in image, which gives me about 4 words before I have
to scroll to the right

- I couldn't see any way to change the font in the text window, preferably
to dpcustommono, which is ugly, but is the best font I've yet used for
proofing.

- It would be nice to have some instructions. What exactly are you expecting
me to do to the page? Do you want headers/footers/page numbers to be kept?
Do you want end of line hyphens kept? Do you want paragraphs joined?

- It would be nice to have (the option of) a horizontal rather than vertical
layout. I used to really like the vertical layout, but found I was more
accurate at proofing with a horizontal one.

- I really prefer block paragraphs rather than indented ones for
computer-based text.

A very good implementation so far -- I'll await developments.

On 16 March 2010 14:58, Lee Passey <lee at novomail.net> wrote:

> Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I decided to
> try to implement my own vision of a co-operative proofreading process.
> Anyone wanting to watch me flail about can follow my work at
> www.ebookcooperative.com. Login as guest, no password.
>
> Apparently the engineers at Microsoft have not yet figured out how to
> implement CSS percentages, and I haven't had the time (or inclination) to
> build an Internet Explorer-aware implementation yet, so visitors would be
> advised to use a different browser.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/de146b82/attachment.html>

From dakretz at gmail.com  Tue Mar 16 12:50:53 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 16 Mar 2010 12:50:53 -0700
Subject: [gutvol-d] Re: Co-operative proofreading
In-Reply-To: <a1b3f.2b8557c7.38d130c9@aol.com>
References: <a1b3f.2b8557c7.38d130c9@aol.com>
Message-ID: <627d59b81003161250w56295732jf0724b4e7960b064@mail.gmail.com>

Whan that Aprill, with his shoures soote
The droghte of March hath perced to the roote
And bathed every veyne in swich licour,
Of which vertu engendred is the flour;
Whan Zephirus eek with his sweete breeth
Inspired hath in every holt and heeth
The tendre croppes, and the yonge sonne
Hath in the Ram his halfe cours yronne,
And smale foweles maken melodye,
That slepen al the nyght with open eye--
(So priketh hem Nature in hir corages);
Thanne longen folk to write proofreading software.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/bec9ed10/attachment.html>

From vze3rknp at verizon.net  Tue Mar 16 13:57:47 2010
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Tue, 16 Mar 2010 16:57:47 -0400
Subject: [gutvol-d] Re: Co-operative proofreading
In-Reply-To: <4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com>
References: <4B9F9C87.4080608@novomail.net>
	<4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com>
Message-ID: <4B9FF0CB.1070803@verizon.net>

I tried it with Chrome and couldn't get the text box to let me edit. It 
worked fine in Firefox, except that I didn't find a way to save the 
work, aside from asking for another page and then saying yes when it 
asked if I wanted to save. Also, it insisted on indenting the line at 
the top of the page, even when it wasn't the beginning of the paragraph.

I, too, find a horizontal interface works better for me.

JulietS

On 3/16/2010 3:39 PM, Jon Ingram wrote:
> Interesting, and it's good to see someone using a rich text editor for 
> the text, rather than expecting proofers to mess around with <i>, etc.
>
> I'm not sure how the page was supposed to look, however --
>
> - I'm using a widescreen 1680x1050 monitor, and there was still 
> material off the bottom of the page. This is using Google Chrome.
>
> - I couldn't see any way to resize the image, so as to see the page 
> width rather than the zoomed in image, which gives me about 4 words 
> before I have to scroll to the right
>
> - I couldn't see any way to change the font in the text window, 
> preferably to dpcustommono, which is ugly, but is the best font I've 
> yet used for proofing.
>
> - It would be nice to have some instructions. What exactly are you 
> expecting me to do to the page? Do you want headers/footers/page 
> numbers to be kept? Do you want end of line hyphens kept? Do you want 
> paragraphs joined?
>
> - It would be nice to have (the option of) a horizontal rather than 
> vertical layout. I used to really like the vertical layout, but found 
> I was more accurate at proofing with a horizontal one.
>
> - I really prefer block paragraphs rather than indented ones for 
> computer-based text.
>
> A very good implementation so far -- I'll await developments.
>
> On 16 March 2010 14:58, Lee Passey <lee at novomail.net 
> <mailto:lee at novomail.net>> wrote:
>
>     Inspired by Mr. Frank, Mr. Adcock, Ms. Miske and Mr. Morasch, I
>     decided to try to implement my own vision of a co-operative
>     proofreading process. Anyone wanting to watch me flail about can
>     follow my work at www.ebookcooperative.com
>     <http://www.ebookcooperative.com>. Login as guest, no password.
>
>     Apparently the engineers at Microsoft have not yet figured out how
>     to implement CSS percentages, and I haven't had the time (or
>     inclination) to build an Internet Explorer-aware implementation
>     yet, so visitors would be advised to use a different browser.
>     _______________________________________________
>     gutvol-d mailing list
>     gutvol-d at lists.pglaf.org <mailto:gutvol-d at lists.pglaf.org>
>     http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>    
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/896ae630/attachment.html>

From vze3rknp at verizon.net  Tue Mar 16 14:50:18 2010
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Tue, 16 Mar 2010 17:50:18 -0400
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <36c3a.27b91b5.38c98aa2@aol.com>
References: <36c3a.27b91b5.38c98aa2@aol.com>
Message-ID: <4B9FFD1A.8000409@verizon.net>

I do have some thoughts about "free-range" proofing.

The size of the corpus that is being proofed is important. The 
Australian Newspaper project (http://newspapers.nla.gov.au/ndp/del/home) 
allows a volunteer to proof any article from ~100 yrs of lots of 
newspapers. They built it so that their readers could improve articles 
when they found errors. It works very well for that purpose, but it also 
has several  problems. One is that they don't provide any information 
about whether or not someone has already proofed this article. The 
proofing interface is totally optional, so if a reader doesn't see any 
errors, then they don't invoke the interface. From that point of view, 
it works beautifully. But they didn't make provision for someone who 
just wants to proof an article, any article. There is no way to say 
"give me another article". I find it very hard to choose at random when 
the number of possibilities is so large. Also, since there's no 
information as to whether or not anyone has looked at (proofed) this 
article yet, there's no way to know if one is duplicating work already done.

Another problem with their system is one of completeness. For example, 
if they want to know whether an entire issue of a newspaper (1 day) is 
completely corrected (or at least that someone has edited every article) 
they can't do it. Part of this can be solved by them keeping track of 
this information. But, by the nature of their system, with efforts 
scattered all over the place, it is very unlikely that any one issue 
will be completely done. For their purposes, that doesn't matter. But 
when working on things that are meant to be read from beginning to end, 
it *does* matter.

All of this ties in to a sense of progress. If the unit of proofing 
produces a complete entity (as with an article in a newspaper) then one 
can count progress by counting how many articles have been done. But if 
the unit of proofing is not the complete entity (as with a page of a 
book), then matters change. The whole idea of distributing the work of 
proofreading is that no one has to feel like they must do an entire book 
by themselves. With the current systems, a volunteer knows that even if 
they can't do the entire book themselves, someone else will help out and 
it will get done. In a free-range system, there is no such assurance 
that anyone else will want to help finish that book.

I guess what I'm saying is that people who proof for the sake of 
proofing like to see progress. To have a sense of accomplishment while 
knowing that they contributed. The only way I can see to achieve that in 
a free-range environment is by limiting the number of books that are 
currently available. That is, concentrating the work somehow so that 
eventually a book is completely "done" (or, as good as it's going to get 
for now).

I think that there is a need for both kinds of systems. The free-range 
system is good for material that is short. It's also good for allowing 
casual readers to fix something that's wrong. I don't think it works 
very well as a system for producing entire corrected books.

Another issue with a free-range system has to do with abuse. If no one 
is likely to look again at whatever page I've just done, there is 
nothing to keep me from changing what it says. Think of it as a kind of 
graffiti. The Australian Newspaper project hasn't had trouble with that, 
but I believe that that is because they haven't been going long enough 
and haven't attracted a wide enough audience yet. I predict that they 
will have trouble with it eventually. Most people are well-meaning, but 
there's always the few who have to write "John was here" on a wall, or 
in an online book. And there will inevitably be a few fanatics who just 
have to substitute their view of the world, either by carefully changing 
a few words, or by simply putting an entire tract in place of the text 
that used to be there. One advantage of many people looking at a single 
page (or, at least 2) is that it becomes hard to get away with that kind 
of thing. As long as the proofing effort is relatively small, and not 
very high profile, a free-range system would probably not have trouble 
with vandalism. But if the effort were associated with a high profile 
organization (Google, say) it suddenly it would become much more 
interesting to folks who like to disrupt.

In summary, I think there are three issues that a free-range proofing 
system must address: choice, completeness, and vandalism. I'm not saying 
that a free-range system wouldn't work. It obviously can. But I do think 
that how well it works depends on what its purpose is.

JulietS

On 3/10/2010 6:52 PM, Bowerbird at aol.com wrote:
> the d.p. proofing system locks each page to a single proofer.
> (there's one and only one p1 proofer, p2 proofer, and so on.)
>
> so does rfrank's roundless system; once a page has been
> assigned to a proofer, it's semi-difficult to even look at it.
>
> and if someone else has reproofed it _after_ that person,
> then the old version is stored somewhere i can't figure out,
> so tracking the diffs simply cannot be done by an outsider.
>
> (the d.p. system at least allows you to do that tracking, and
> even has a routine that will show you round-to-round diffs.)
>
> it is by analyzing these round-to-round diffs very closely
> that you can get a sense for how a page progresses from
> the initial o.c.r. to its final -- hopefully perfect -- stage...
>
> ***
>
> the question i have today is whether there is a good reason
> why a page needs to be assigned-and-locked to one person.
>
> is there any reason why you shouldn't allow any proofer to
> go and proof any page in a book?  yes, it would mean that
> some pages might be proofed several times, but so what?
> that's not necessarily a _bad_ thing, is it?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/a6df2361/attachment-0001.html>

From Bowerbird at aol.com  Tue Mar 16 16:33:18 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Mar 2010 19:33:18 EDT
Subject: [gutvol-d] Re: Co-operative proofreading
Message-ID: <f033.49c8f24b.38d16f3e@aol.com>

jon said:
>   Interesting, and it's good to see someone 

do people really think lee's system is "interesting"?

unless i'm missing something, it's just a mockup?
it doesn't actually save the text, not that i can see.

and the reg-ex cleanup doesn't really work, does it?


>   Interesting, and it's good to see someone 
>    using a rich text editor for the text, rather than 
>    expecting proofers to mess around with <i>, etc.

except there is a tremendous conundrum at work...

because we don't want proofers to go "presentational",
do we?   we want them to make structural distinctions...

but with all those presentational w.y.s.i.w.y.g. buttons
littering the interface, how would proofers ignore them?


>    I'm using a widescreen 1680x1050 monitor, and 
>    there was still material off the bottom of the page. 
>    This is using Google Chrome.

i had similar problems in safari.   camino worked fine.


>   It would be nice to have (the option of) a horizontal 
>    rather than vertical layout. I used to really like
>    the vertical layout, but found I was more accurate 
>    at proofing with a horizontal one.

i don't understand the appeal of a horizontal interface
-- way too much scrolling for me! -- but lots of people
seem to prefer it.


>   I really prefer block paragraphs rather than 
>    indented ones for computer-based text.

but you want the display to look like the p-book, not?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/093e2bc5/attachment.html>

From Bowerbird at aol.com  Tue Mar 16 16:20:50 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Mar 2010 19:20:50 EDT
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
Message-ID: <e3ad.6447b2af.38d16c52@aol.com>

juliet said:
>    The Australian Newspaper project allows a volunteer to 
>    proof any article from ~100 yrs of lots of newspapers.

well, that's quite different from what i'm talking about,
which is to allow people to proof any page of a _book_,
one that is being actively worked on at the present time.


>    Also, since there's no information as to whether or not 
>    anyone has looked at (proofed) this article yet, there's 
>    no way to know if one is duplicating work already done.

again, quite different.   i will explicitly inform people of
the exact status of every page.   however, if they _want_
to "duplicate" work that is "already done" -- by proofing
a page that's already "finished" -- they can certainly do so.

indeed, up through the 3rd "confirmation" a page is "done",
the person would continue to receive "points" for doing so...

the main reason a person would go the "free-range" route,
i would think, would be so they could actually read the book
in the process of proofing.   i think that's a useful perspective.

another reason might be to do a "specialized" look at the book.
for instance, i think it'd be great for a person to look through
the entire book just to find cases of _italics_ and _formatting_.

even a pass checking the paragraph-starts at page-tops will
be a useful quality-control mechanism i'd want to encourage.


>     All of this ties in to a sense of progress.

indeed.


>    In a free-range system, there is no such assurance that 
>    anyone else will want to help finish that book.

i think it's just the opposite.   if i inform people which pages
haven't yet been proofed, many people would choose them.

if i show people which pages need to be confirmed, i think
some people will want to get their "points" that way instead.

and other people will want to read straight through the book,
without any regard for the state of any one particular page...

by letting them choose whatever they like, rather than just
_assigning_ them a page, with a choice to "take it or leave it",
i think they're going to do a good job of progressing a book.


>    The only way I can see to achieve that 
>    in a free-range environment is by 
>    limiting the number of books that are currently available.

which is, i think, a perfectly good way to achieve that goal.

the idea that d.p. seems to have settled upon is that anyone
can put a book in the system, without anybody knowing who
-- if anyone -- will be there at the end to pick up the pieces.
as a consequence, you now have two tons of half-done books.

and this has nothing to do with "free-range", as evidenced by
the fact that rfrank is using a "limited-number" approach in
his system, in order to make sure that books don't get beached.


>    If no one is likely to look again at whatever page I've just done, 
>    there is nothing to keep me from changing what it says.

ok, i guess you're talking about the australian newspaper thing.

which, again, has no applicability to what i'm talking about, so
i won't bother to address it here, except to say that my system
is _expressly_and_extensively_ geared to checking changes made.
there won't be any "graffiti" that won't be painted over very quickly.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100316/2cbd0d47/attachment.html>

From schultzk at uni-trier.de  Wed Mar 17 01:35:44 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 17 Mar 2010 09:35:44 +0100
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <5bb98.38a53811.38cfe73f@aol.com>
References: <5bb98.38a53811.38cfe73f@aol.com>
Message-ID: <3930D74D-95CB-412E-B201-4E25BC5963D0@uni-trier.de>

Hi BB, All,

	BB you wanted some clarification so I will try.

	First, I do not have the time to look closely at either
	tool nor overs, as it would take to long for me to
	properly analyze them, and due the analysis merit.
	So I will proppose a possible design. 
	Take the good, leave the bad.

	Second the design is taken from tools used for creating
	critical editions. A critical edition is where two or more
	versions of a text are put side by side and commented on.
	It is used most on historical texts, translations, etc.

	Some of this is possibly overkill. 

	In my opinion would be a tool with four windows/frames:
		1) the scan
		2) version 1
		3) version 2
		4) the proofed version

	Yes, BB,  there is the problem of screen clutter and
	the possibility of offering to much information.
	Yet, a single line for 2 and 3 may be to little as the
	co-text may offer hints for a possible correct version.
	Then again we are proofing and since we can assume
	that a proofer has done this regularly s/he will easily
	adjust.

	The scan (1) would be optimally syncrhonised with the 
	passage being checked. 

	2 and 3 should have at least three lines of text each. Generally,
	5 would be better. Though for the purpose of proofing
	less might do!

	4 would be contain the entire new proofed version but 
	sync at first to the conflict being investigated.

	Now, we have three cases to consider:
		a) version 1 is correct
		b) version 2 is correct
		c) neither 1 or 2 is correct
		d) the case where 1 and 2 are correct is actually not possible in our context
                     unless version 1 and 2 are comming from different edition. 
		     Still we can handle this in the same manner as c.

        For cases a and b you have a button to accept that version as correct.
	
	For case c we could simply fall back into the editor and let the proofer
	hand edit. Another possiblity would be to offer possible hints for a correction.
	These could come from:
			- spellchecker
			- list of changes already made in the text or entire scan set.
	The spell checker is trivial.
	The list is a can of worms by itself. Though it would be a compromise
	to your text wise change BB. That is we had this before and could be the case.

	When the proofer is done with the "diffs", fold up the windows of 1 and 2
	expand windows/frames for the scan and corrected version check for 
	other possible mistakes and save.

	Hope this helps. If not hit delete.

	regards
		Keith.  
	
Am 15.03.2010 um 20:40 schrieb Bowerbird at aol.com:

> keith said:
> >   True enough. Yet, the arguement stands.
> 
> perhaps you didn't catch my entire gist.
> 
> _of_course_ one needs to allow for the possibility of
> editing either version, since both might be incorrect.
> 
> as i pointed out, my tool (which supports jim's tool)
> does exactly that.
> 
> 
> >   At least in my opinion. The trivial cases are 
> >   easy to handle, yet it is always the RARE cases 
> >   where tools can shine and set themselves
> >   apart from the rest.

	[snip, snip]

> >   Actually, both methods are kind of primitive 
> >   from a Human Interface standpoint.
> 
> i always appreciate it when someone analyzes my tools.
> 
> so let's see what you have to say here, keith.
> 
> 
> >   a better way would be having 
> >   two windows containing two or more lines 
> >   above and below the diff and marking each. 
> 
> a little bit of context can help elucidate the difference.
> too much context can bury it, depending on the display.
> i'd have to see exactly what you mean in order to decide.
> 
> in my-tool-in-support-of-jim's tool, the change-window
> is a movable modal, so people can simply look back at the
> main window if they need to see more than 1 line of context.
> 
> (i could also put multiple content lines in the top box of the
> change-window, if feedback indicated people wanted them.)
> 
> 
> >   If you ever work with critical editions 
> >   you will understand the cavet of this method.
> 
> if it impossible for you to explain in words?
> 
> 
> >   The changes can then be made in a third.
> 
> again, not sure what you really mean here...
> 
> 
> >   All can be enhanced with colors and other neat features.
> 
> you can always "enhance" anything with "other neat features".
> the hard part is _coming_up_ with those "other neat features".
> 
> ***
> 
> we should remember that my tool-in-support-of-jim's tool
> isn't how _i_ would do the job.  i was just trying to show how
> to make his tool work better.  i've shown how i do the job...
> 
> here's how i showed diffs with gardner's book, on 23 february:
> >   http://z-m-l.com/go/gardn/gardn-hybrid6.html
> 
> that laid out the entire book, with diffs in different colors...
> 
> here's a simple reworking of that file, which i just posted:
> >   http://z-m-l.com/go/gardn/gardn-hybrid7.html
> 
> this version of the file lets you click a link to see each scan,
> and gives you radio-buttons where you can select the correct
> alternative for each diff.  (or choose neither if both are wrong.)
> 
> this is how i would approach this task with an _online_ thrust,
> working in a collaborative manner.  but i'd probably prefer to
> do it with an _offline_ app instead, since that's more efficient.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/1b24f160/attachment-0001.html>

From Bowerbird at aol.com  Wed Mar 17 09:35:53 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Mar 2010 12:35:53 EDT
Subject: [gutvol-d] Re: New Tool "pgdiff"
Message-ID: <bcaa2.20468cb9.38d25ee9@aol.com>

keith said:
>    BB you wanted some clarification so I will try.

um, ok, i guess.


>    First, I do not have the time to look closely at either
>    tool nor overs, as it would take to long for me to
>    properly analyze them, and due the analysis merit.

um, ok, i guess.        :+)


>    So I will proppose a possible design. 
>    Take the good, leave the bad.

um, ok, i guess.        ;+)


>    In my opinion would be a tool with four windows/frames:
>    1) the scan
>    2) version 1
>    3) version 2
>    4) the proofed version
...
>    The scan (1) would be optimally syncrhonised 
>    with the passage being checked. 
...
>    Now, we have three cases to consider:
>    a) version 1 is correct
>    b) version 2 is correct
>    c) neither 1 or 2 is correct
>    d) the case where 1 and 2 are correct is 
>    actually not possible in our context
>    unless version 1 and 2 are comming from different edition. 
>    Still we can handle this in the same manner as c.
...
>    Hope this helps. If not hit delete.

ok.

now i'm curious...           :+)

keith, how do you think i pull off all the comparisons i do?

how do you think i can sling around lists of diffs like i do?
>    http://z-m-l.com/go/gardn/gardn-hybrid6.html

how do you think i can mount entire books with diffs?
>    http://z-m-l.com/go/gardn/gardn-hybrid7.html

how do you think i resolve the diffs in all the books i do?

i can tell you how i do it!

i do it with tools i've programmed that do all the things
that you talk about, and more.   that's how i do it, keith.

so you don't have to do hypothetical writeups, keith,
especially if you're short on time, because i have a big
batch of post-hypothetical reality sitting in my toolbox.

it's not that your writeup doesn't "help".   it's that we are
past that point in time...   and we have been, for a while.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/01204207/attachment.html>

From richfield at telkomsa.net  Wed Mar 17 09:49:28 2010
From: richfield at telkomsa.net (Jon Richfield)
Date: Wed, 17 Mar 2010 18:49:28 +0200
Subject: [gutvol-d] Re: (no subject)
In-Reply-To: <003301cac45c$af100b20$0d302160$@com>
References: <003301cac45c$af100b20$0d302160$@com>
Message-ID: <4BA10818.4090209@telkomsa.net>

Dan, yes it concerns me, but I cannot find those scans on that page. 
Should I be looking more intelligently?

Cheers,

Jon

On 2010/03/15 18:29 PM, Dan Weber wrote:
>
> To whom it may concern:
>
> www.popsci.com <http://www.popsci.com>
>
> This site has 137 years of Popular Science magazine page scans online 
> for free.
>
> danweber at mindspring.com
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>    
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/ac1e6941/attachment.html>

From Bowerbird at aol.com  Wed Mar 17 11:44:02 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Mar 2010 14:44:02 EDT
Subject: [gutvol-d]  Re: Co-operative proofreading
Message-ID: <30421.2ce5a884.38d27cf2@aol.com>

i said:
>    unless i'm missing something, it's just a mockup?

while i don't think a mockup is very hard to do,
not when compared to the programming per se,
it's not like i'm mocking mockups...

so if you're interested in mockups, there are more.

here's one from dkretz:
>    
http://www.pgdp.org/~dkretz/c/editpagedemo.php?projectid=projectID45c572a149feb&pagecode=70298&round=F1&taskcode=F1

here's a page giving several from cpeel:
>    http://www.pgdp.org/~cpeel/prototypes/

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/f942053f/attachment.html>

From lee at novomail.net  Wed Mar 17 12:06:00 2010
From: lee at novomail.net (Lee Passey)
Date: Wed, 17 Mar 2010 13:06:00 -0600
Subject: [gutvol-d] Re: Co-operative proofreading
In-Reply-To: <4B9FF0CB.1070803@verizon.net>
References: <4B9F9C87.4080608@novomail.net>	<4baf53721003161239p7d1c5e81n7ca4163b5ffc0d87@mail.gmail.com>
	<4B9FF0CB.1070803@verizon.net>
Message-ID: <4BA12818.70101@novomail.net>

First of all, let me say that I am gratified by, and appreciative of 
those who have visited the site and offered feedback. Be aware that what 
you are seeing is more than a mock-up and less than a prototype; it is, 
in fact, my workbench. As my development process proceeds, I deploy 
software and files to that site for testing and evaluation. There is no 
guarantee that the behavior or appearance today will be the behavior or 
appearance tomorrow. What I intended was to provide a window into my 
development process. When I invited people to watch me flail about, that 
was exactly what I meant.

On 3/16/2010 2:57 PM, Juliet Sutherland wrote:

> I tried it with Chrome and couldn't get the text box to let me edit.
> It worked fine in Firefox, except that I didn't find a way to save the
> work, aside from asking for another page and then saying yes when it
> asked if I wanted to save.

The save button is the little "floppy disk" icon in the formatting 
toolbar, next to the "block type" drop down box.

> Also, it insisted on indenting the line at
> the top of the page, even when it wasn't the beginning of the paragraph.

This is an artifact of the OCR process. To the best of my knowledge, no 
OCR program is capable of starting a page and recognizing that the text 
is, in fact, a continuation of the text on a previous page.

As bowerbird has suggested, I have named my working files sequentially 
and in synchronization with the image files. My intent is to enhance my 
post-processing program a bit so that it will look at the first 
paragraph of one page together with the last paragraph on the preceding 
page. If the first does /not/ begin with a majuscule and the following 
does /not/ end with line terminating punctuation, I would mark the 
paragraph 'class="continuation".' The editor's CSS would not indent 
paragraphs of that class, and the merge program (which would create a 
single file of all the component files) would merge paragraphs when the 
class was encountered.

Of course, this algorithm could create false positives where the OCR 
drops punctuation, or doesn't recognize capitalization, and create false 
negatives where sentences, but not paragraphs, begin on a new page. 
There will need to be a yet-to-be-determined method for the user 
interface to allow a proofreader to make this distinction.

> I, too, find a horizontal interface works better for me.
>
> JulietS

So before continuing, let me explain a little of my strategy and tactics.

I am a firm believer in markup. Like Mr. Frank, I believe that the 
markup should be carried though with the text at every stage of the process.

I am a firm believer in internet standards, even unofficial, de-facto 
internet standards. No re-inventing any wheels for me.

Lastly, I am an extraordinarily lazy programmer. I'm not going to write 
any new code unless I absolutely have to.

I will not, however, use any code infected by the Gnu Public License. 
Standalone programs are fine, but I won't touch GPL code with Mr. 
Haines' ten-foot pole.

So...

There is nothing any nearer to a standard for e-books than HTML. I 
decided that the original OCR should produce HTML output and that the 
markup should stick with the text until the final single file is 
created. Because of this decision, the final single file could be 
created over and over as small tweaks to the component files were made; 
there would be no need for any concept of finality or "doneness."

I discovered that both the Plone and the Apache Lenya content management 
systems used a javascript-based visual HTML editor called Kupu. Kupu is 
now part of the Apache Lenya project and the source is available from 
apache.org under the Apache license.

The editor you see at my website is Kupu, unmodified except for 
modifications to the CSS file that governs how it is displayed. I am 
assuming that at some point I will have to make slight modifications to 
the Kupu code, but that will be among the last things I do. I need to 
get the underlying workflow nailed down first. If there is anyone who 
wants to help out by tackling the Kupu interface (cough, Carel, cough) I 
would welcome the help.

Your comments about the behavior of my site with Chrome makes me wonder 
how well Chrome is supported by the Apache Lenya project; maybe I should 
ping them to try it out.

I needed a repository to track all the individual files for each 
project, and the changes thereto. Well, there's tons of software and 
applications that supports CVS, so CVS it is. The current plan is to 
have /three/ mostly-identical CVS repositories for each project. As 
registered users select a project each will be assigned to that 
repository which has been least-used. While the editor contents can be 
saved as many times as a user wants, when the user leaves a page (or 
after an appropriate timeout) the file will be committed to its repository.

Upon commitment a file will be "diffed" against the other two 
repositories. When conflicts are found, a voting algorithm will resolve 
the conflict, if possible, and the changes will be committed to /all 
three/ repositories. The algorithm will not be a pure "two out of 
three," but will be weighted based on the number of users who have view 
a page. Hopefully, this kind of algorithm can minimize the problem of 
e-graffiti. If a "vote" is two close to call, both options will be 
placed in all the committed files in a manner similar to that proposed 
by Mr. Adcock. My biggest problem here is finding a "diff" application 
that can work as I need it to.

Hmm, if I'm going to make users register and login, and if I'm going to 
track things like which repositories they have been assigned to, I'm 
going to need some kind of data store. My site uses the Apache web 
server, and has MySQL installed. Apache's authdb module can use MySQL as 
the authentication database. I guess all the data I need to track will 
be stored in MySQL (and I haven't even /started/ to think about how to 
define the tables I need).

Now all I need is some glue to hold all the pieces together. I'm an 
accomplished Java programmer, familiar with JDBC and servlets. My site's 
server has Apache Tomcat installed and available. I guess that 
decision's a no-brainer.

So, there's my strategy and some of the tactics to the extent I have 
worked them out. Now a few specific responses.

> On 3/16/2010 3:39 PM, Jon Ingram wrote:
>> Interesting, and it's good to see someone using a rich text editor for
>> the text, rather than expecting proofers to mess around with <i>, etc.

As pointed out above, the editing window technically is not a rich text 
editor (which produces output in RTF format). It is the Kupu HTML 
editor, which I am still not very familiar with. But I agree that 
proofreaders need a tool where they can make the proofed text look like 
the scanned image. One of the things I like about Kupu is the little 
"scroll" button, which brings up a plain text editor where you /can/ 
edit the HTML source directly if you desire. I also need to add a method 
to add internal anchors, and a method to build tables of contents.

>> I'm not sure how the page was supposed to look, however --
>>
>> - I'm using a widescreen 1680x1050 monitor, and there was still
>> material off the bottom of the page. This is using Google Chrome.

It appears that either Chrome hasn't figured out how to use the CSS 
"percentage" value either, or perhaps it's understanding simply differs 
from that of Mozilla (from your description, I would guess the latter). 
I could go on at length about how /I/ think it should be implemented, 
but I won't.

>> - I couldn't see any way to resize the image, so as to see the page
>> width rather than the zoomed in image, which gives me about 4 words
>> before I have to scroll to the right

Resizing images is a problem. Right now, images are the size that 
FineReader exported them. Firefox autosizes the images into the 
constraining box, and provides a "zoom" function. Apparently Chrome does 
not have any sort of similar function (IE definitely does not), and 
Opera works even worse than IE. Maybe I can come up with an automated 
tool to resize the images into a set of standard resolutions (e.g. 25%, 
50%, 75%, 100%). Then each user could individually set the preferences 
for the image size that works best for her or him. If the editor and 
image boxes are going to be fixed sizes perhaps I could add those 
parameters to a set of preferences as well.

>> - I couldn't see any way to change the font in the text window,
>> preferably to dpcustommono, which is ugly, but is the best font I've
>> yet used for proofing.

Well, you wouldn't want to set the font face or size for the file being 
saved, as that is a highly subjective matter. Unfortunately I have yet 
to see a browser that allows a user to override a page's stylesheet 
decision (although Opera is getting close). I've no experience yet with 
Chrome, does it do so? What I envision is allowing a user to select 
among a set of standard CSS style sheets as a sticky preference, or to 
actually upload his or her own for personal use.

>> - It would be nice to have some instructions. What exactly are you
>> expecting me to do to the page? Do you want headers/footers/page
>> numbers to be kept? Do you want end of line hyphens kept? Do you want
>> paragraphs joined?

Guilty as charged. I'm thinking of adding a "Proofing Guidelines" button 
to each page, which would popup a separate window with those 
instructions. Of course, at this stage of development I have virtually 
no idea as to what those instructions would be, but it might be a good 
idea to add it now anyway, even if the instructions are as simple as "I 
know I have to add this in the future."

>> - It would be nice to have (the option of) a horizontal rather than
>> vertical layout. I used to really like the vertical layout, but found
>> I was more accurate at proofing with a horizontal one.

This could be handled by (yet another) user preferred stylesheet.

>> - I really prefer block paragraphs rather than indented ones for
>> computer-based text.

A user preferred stylesheet could handle this issue as well, although if 
you were to do it you would need to figure out how to insert a visual 
signal when a paragraph is a "continuation" paragraph as opposed to a 
"real" paragraph.

>> A very good implementation so far -- I'll await developments.

Thank you.

Just as a reminder, however, I suspect the user interface portion will 
not receive much attention until the latter stages of development; for 
now, I only need it to work well enough for me to test other parts of 
the workflow.


From jimad at msn.com  Wed Mar 17 12:09:59 2010
From: jimad at msn.com (Jim Adcock)
Date: Wed, 17 Mar 2010 12:09:59 -0700
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <103fd.1a43a953.38cec415@aol.com>
References: <103fd.1a43a953.38cec415@aol.com>
Message-ID: <SNT120-DS11756FD5F99B11AB99268EAE2C0@phx.gbl>

>except that some of that "verbiage" was people asking just how exactly your program differs from one that they've been using all along.  don't you wanna tell 'em?

You are attacking my reply re attaching licensing terms by attaching it to unrelated discussions.

>>   On Vim I type:
>>   :/[{|}]/
>>   Which highlights the edits and takes me to the next set of edits

>but that selects both the options, and the surrounding characters.

>that's not really what you want -- what _most_people_ would want.

>and it involves typing.  either typing or a lot of delicate deleting.
>both of which increase the probability that errors are introduced.

Again, you are assuming the problem presupposes a solution which is one of "Choose A" or "Choose B".  If you use the tool on other than trivial problems you will find out that life is not that simple, and that frequently both A and B have some degree of errors that need to be corrected and/or merged to get you where you want to go.  If one wanted to make a graphical tool to do this you would not only need the "Choose A" and "Choose B" options but "Edit in Context while displaying a copy of the original scanned page" and if one wants to make that kind of tool one would be better off to put the time and effort into figuring out a tool to display a scanned page a bit-mapped line at a time comparing to the OCR text as opposed to the DP current approach of displaying a bit-mapped page at a time compared to a OCR page at a time.  And then one would also have to tackle the problem of how one wants to deal with the portability issues of the differing graphics systems on different people's computers.  And one would have to build in an editing capability on par with the non-integrated editors that people currently choose to use and/or offer emulation of those editors in your editor offering.  These WOULD be good issues to tackle, I just don't feel like I am the right person to tackle these problems.  In practice, using pgdiff with Vim I find personally to be MUCH easier, less painful, and more productive than the DP approach, which is why I offer it for people to choose from.  You still need to compare to the page scans.

>in the vast majority of cases (96%) where there is a difference between the two versions, _one_ of the versions is _correct_...

This is not my experience, but in any case it should be obvious that the results are HIGHLY dependent on what kind of texts and OCRs you are working on.

>but, you know, if some users like _your_ display better, _fine!_   :+)

More importantly, since I post my code and it is reasonably portable without a lot of rigmarole and without stack hacks like wdiff people can edit it and put it into their choice of display or other code.

>you'll need to provide a little more information to be understood.

Read the wdiff documentation and you will see the author admits he would have written a stand-alone tool that doesn't depend on diff if he could figure out the algorithm.

>   http://z-m-l.com/misc/jim-tool-addon-screenshot.png

...

>so, you see jim, i'm really trying to _help_ you in your quest here.

Thank you. Post a portable version or one compiled for windows and I will tell you how it works for me in practice.  

PS: doesn't really help me with *MY* quest since I have the tools *I* need to do my job the way I want to do it, but granted perhaps other people would be happier with the GUI approach you are suggesting. Since I post the source code they can apply my work however they want to.


From richfield at telkomsa.net  Wed Mar 17 12:54:00 2010
From: richfield at telkomsa.net (Jon Richfield)
Date: Wed, 17 Mar 2010 21:54:00 +0200
Subject: [gutvol-d]  Re:To whom it may concern:
In-Reply-To: <003301cac45c$af100b20$0d302160$@com>
References: <003301cac45c$af100b20$0d302160$@com>
Message-ID: <4BA13358.9020207@telkomsa.net>

Dan, yes it concerns me, but I cannot find those scans on that page. 
Should I be looking more intelligently?

Cheers,

Jon

On 2010/03/15 18:29 PM, Dan Weber wrote:
>
> To whom it may concern:
>
> www.popsci.com <http://www.popsci.com>
>
> This site has 137 years of Popular Science magazine page scans online 
> for free.
>
> danweber at mindspring.com
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>    
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/6d89e155/attachment.html>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Attached Message Part
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/6d89e155/attachment.ksh>

From Bowerbird at aol.com  Wed Mar 17 13:12:20 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Mar 2010 16:12:20 EDT
Subject: [gutvol-d] Re: New Tool "pgdiff"
Message-ID: <8329d.6343aa8a.38d291a4@aol.com>

jim said:
>    You are attacking my reply re attaching licensing terms 
>    by attaching it to unrelated discussions.

jim, i'm not "attacking" anything.   let's steer clear of antagonistic 
language.

and if you had quoted what you were replying to, i would have _known_
the scope of your reply.   don't blame me because i'm not a mind-reader.


>   Again, you are assuming the problem presupposes a solution 
>    which is one of "Choose A" or "Choose B".

well, since my tool lets a person _edit_ either choice a _or_ choice b,
i think i've given the user plenty of leeway.


>    If you use the tool on other than trivial problems 
>    you will find out that life is not that simple

oh please, jim, wake up.   i've actually _done_ the kind of comparisons
and diff-resolutions that you're just _talking_about_.   moreover, i have
done them _many_times_.   so i think i have much better _experience_
with the actual situation than you do.   so don't try to "school me", ok?


>    frequently both A and B have some degree of errors that need
>    to be corrected and/or merged to get you where you want to go.

now you're just repeating your "frequently" term, which i have already
demonstrated to be false, using the most recent example i have shared.

in all the resolutions i've done, the vast majority of diffs involved cases
where _one_ of the versions was correct.   very rarely were both wrong...


>    If one wanted to make a graphical tool to do this 

i not only _wanted_ to make such a tool, i actually _programmed_ it...


>    If one wanted to make a graphical tool to do this 
>    you would not only need the "Choose A" and "Choose B" options 
>    but "Edit in Context while displaying a copy of the original scanned 
page" 

you need to read what i wrote, jim.   my tool gives people the option to
edit either choice a or choice b, before selecting that particular choice.

as for displaying the scan, does your pgdiff tool show that information?
because if your tool can show that info, my tool can display the scan...
but as far as i can see, from your limited example, you don't show that.

(and when i say "my tool" in this post, i mean the tool that i wrote that
_supports_ your tool.   in the tools that i've programmed for _my_use_,
i _always_ retain the page-scan information, so i can _display_ the scan.)


>    and if one wants to make that kind of tool one would be 
>    better off to put the time and effort into figuring out a tool

and here we wander off into the land of unnecessary complexity...


>   This is not my experience

i've shared many of my experiences in doing the comparison method.
if your experience doesn't match mine, you should share yours as well.


>   Post a portable version or one compiled for windows 
>    and I will tell you how it works for me in practice.?

i need to have some questions resolved before i put it out in public.
that's why i furnished you some sample texts to get pgdiff output on.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/87888af6/attachment.html>

From Bowerbird at aol.com  Wed Mar 17 14:17:53 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Mar 2010 17:17:53 EDT
Subject: [gutvol-d] a rose smells just as sweet
Message-ID: <6c883.e6e09c0.38d2a101@aol.com>

rfrank has informed people that
he doesn't consider his site to be
"competition" to the d.p. site...

so from now on i'll characterize it
instead as "an alternative" to d.p.

it's all good.

and a rose smells just as sweet...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/5457d357/attachment.html>

From jimad at msn.com  Wed Mar 17 15:57:09 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 17 Mar 2010 15:57:09 -0700
Subject: [gutvol-d] Re: [SPAM] re: New Tool "pgdiff"
In-Reply-To: <6311e.21050e3b.38cffc26@aol.com>
References: <6311e.21050e3b.38cffc26@aol.com>
Message-ID: <SNT120-DS16CBEF97C3746EDA9459EFAE2C0@phx.gbl>

>>here's the original text uploaded by rfrank for his proofers:
>>   http://z-m-l.com/go/jimad/sitka0-ocr.txt
>
>and here's the text after the proofers were done with it:
>>   http://z-m-l.com/go/jimad/sitka1-pp.txt
>
>if you can run that through your tool and share its output,
>that would be great.


OK, I put the output at:

 
http://www.freekindlebooks.org/Dev/StringMatch/BBoutput.txt

 
where I have changed the page separators on the two files to be identically
named, because I am assuming you wouldn't want to find all the file name
changes.  Please note that the problem domain you are applying the tool to
is not the same problem domain intended for the tool - so one shouldn't be
surprised then if you consider the results in some sense "suboptimal."  Even
these "simple" outputs however, show how often it really isn't simply a
problem of "Choose Word A" or "Choose Word B" but rather there a often lots
of other issues involved at the same time, such as whitespace issues, punc
issues, line break issues, etc, which complicate the design of the editor
interface - assuming one *wants* to design a custom editor.

 
Again, the problem this tool was designed to address was when you have two
"independent" OCR outputs and you want to compare them to find those words
or sections where a human being needs to perform an edit.  Or for
versioning.  The results after human editing then would be expected to be
about the quality of the output of a "P1" pass which then would have to be
further carefully checked by more passes.  And it is envisioned that even
during the "P1" pass the editor is comparing to the page images.  When
applied to the problem domain envisioned you have at least 2X as many errors
to deal with, and the resulting errors are more difficult than the ones in
your example input files.

 
Please see at:

 
http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt

 
what I think is a more reasonable example of the kinds problems this tool is
designed to address - here being used for versioning - an OCR from one
edition of a text is being compared to an existing but old copy of a
human-corrected PG text.  On this example ideally a smart de-hyphenator
ought to be run before making the comparison, but, its still interesting to
see what happens when this isn't done.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/8415c364/attachment-0001.html>

From jimad at msn.com  Wed Mar 17 16:59:07 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 17 Mar 2010 16:59:07 -0700
Subject: [gutvol-d] [SPAM] RE:  Re: [SPAM] re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS16CBEF97C3746EDA9459EFAE2C0@phx.gbl>
References: <6311e.21050e3b.38cffc26@aol.com>
	<SNT120-DS16CBEF97C3746EDA9459EFAE2C0@phx.gbl>
Message-ID: <SNT120-DS24F2794E3959FCC470DECEAE2C0@phx.gbl>

PS RE:

 
http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt

 
Attempting to hand-score the kinds of edits one would need to do on
hkdiff.txt, it seems to me like an intelligent editor could present "Choose
A" vs. "Choose B" alternatives about 85% of the time, whereas the other 15%
of the time a more complicated interface would have to be presented - or
else the editor just punts and points to the text and says "You Fix It!
(which is basically the approach my current choice of editor takes 100% of
the time ;-)

 
However, if the editor gives a "Choose A" vs. "Choose B" interface sometimes
the editor (and/or the user) is going to be deceived because what looks like
an A/B choice really ISN'T.  For example a hypothetical example:

 
.. one { must | MUST } be careful!

 
And the correct answer is neither A nor B but rather C == _must_

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/fbc2b14c/attachment.html>

From jimad at msn.com  Wed Mar 17 17:13:32 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 17 Mar 2010 17:13:32 -0700
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <bcaa2.20468cb9.38d25ee9@aol.com>
References: <bcaa2.20468cb9.38d25ee9@aol.com>
Message-ID: <SNT120-DS20BB754355AA38334FE315AE2B0@phx.gbl>

>i do it with tools i've programmed that do all the things
that you talk about, and more.  that's how i do it, keith.


Post your tools, including source code, and then let's talk about it.

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/b55e3131/attachment.html>

From danweber at mindspring.com  Wed Mar 17 17:23:09 2010
From: danweber at mindspring.com (Dan Weber)
Date: Wed, 17 Mar 2010 20:23:09 -0400
Subject: [gutvol-d] Popular Science back issues
Message-ID: <003901cac631$30b64830$9222d890$@com>

Sorry. The address is
http://www.popsci.com/announcements/article/2010-03/new-browse-137-years-pop
sci-archive-free

 
I had to grab the jpg files from my temp internet files folder (Win Vista)
Hopefully someone else will know a better way to get them

 
They say they are partnered with Google

 
danweber at mindspring.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/d6206a17/attachment.html>

From Bowerbird at aol.com  Wed Mar 17 17:53:34 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Mar 2010 20:53:34 EDT
Subject: [gutvol-d] Re: New Tool "pgdiff"
Message-ID: <7a50f.2606ca15.38d2d38e@aol.com>

jim said:
>   Post your tools, including source code, and then let?s talk about it.

maybe you're too new here to know this, but i don't post my source code.

even if i did, it wouldn't do you any good, unless you can use realbasic...

likewise, your code doesn't do me any good, because i don't deal with
whatever language you've posted it in, so you have done me no favors,
and thus i don't feel any need to "reciprocate" for your posted code...

but what _might_ do you some good is for me to talk in pseudo-code,
if you're interested in hearing that, which i am more than happy to do.

but none of this is difficult.   especially from a line-based perspective.

you read one file into one array, and the second file into a second array,
and then compare the two arrays, item-by-item.   it ain't rocket-science.

the main difficulty in any comparison routine is the re-sync process;
but if you work in a line-based way, your lines don't get out-of-sync.
(in the rare cases where they do, you can make a manual adjustment.)

so with this, the interface is more important than the underlying code.

but i'm even willing to post compiled versions of a comparison tool,
so that you can get a very good idea about the interface i am using,
provided you can get a handful of people -- i.e., 5 people -- to say
publicly, right on this listserve, that they would like to see my tool...

but i haven't gotten the impression that anyone here can code a g.u.i.
for an offline app.   i'd love to be wrong about that, so please please
do correct me, anyone listening out there, if you can indeed do that...

-bowerbird

p.s.   if you can't get 5 people to say "please", then you should read
the design description that keith wrote up last night, as it's decent.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/1a8bf348/attachment.html>

From jimad at msn.com  Wed Mar 17 17:56:38 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 17 Mar 2010 17:56:38 -0700
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <4B9FFD1A.8000409@verizon.net>
References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net>
Message-ID: <SNT120-DS185834508139F999ED1D34AE2B0@phx.gbl>

>With the current systems, a volunteer knows that even if they can't do the
entire book themselves, someone else will help out and it will get done. 

 
This statement is not true, but also to the extent it is true is also be a
statement of a problem:


DP has many examples of books that volunteer(s) start but which don't get
finished.  Hence the queuing system and the increasing wait times.
However, your thesis is also a statement of a problem: When volunteers start
something they assume that *someone else* needs to finish it!  In turn these
other volunteers may feel an obligation to finish something that someone
else has started when a better answer may be to NOT finish it!  Certainly in
the case of very difficult and time-consuming books that no one wants to
read, the right answer may be to NOT finish it.  One can easily show other
cases that are much more interesting: difficult books that people WOULD want
to read if they were finished and yet the right answer might STILL be that
it is better off NOT to finish it! [see for example: Bibliotheca Britannica]

 
When I volunteer at DP I often end up asking myself a simple question: Do
*I* think that if the person who started this project had to do it all
themselves would they do so?  If the answer is "NO" then I decide that my
efforts are being "freeloaded" upon and I go work on something else!

 
Conversely, one of my proposals for changes at DP is a simple one: if person
A starts a book and other volunteers do not want to finish it then at least
let person A finish it rather than leaving it stuck on queue "forever"!  One
simple measure of the "worthiness" of a project is that at least one person
in the world wants to finish it.  Unfortunately, DP fails even that test! -
the current system doesn't even allow a person who *wants* to finish a book
the right to do so!  At least put in a "time out" system or something where
if something gets stuck for a year or more then DP admits they are not going
to get it done in a timely manner and put it back up for grabs!


>I guess what I'm saying is that people who proof for the sake of proofing
like to see progress. 

 
To me personally "seeing progress" means seeing something I have worked on
posted to PG for others to read.  Agreed that means the book needs to get
"done." Each spot on a queue for a book to get stuck on is yet another
chance for a book to become not-done.

 
>Another issue with a free-range system has to do with abuse. If no one is
likely to look again at whatever page I've just done, there is nothing to
keep me from changing what it says. Think of it as a kind of graffiti. 

 
I have had problems with this on Wikipedia, where one posts science-based
answers to science-based questions and then people whose religion or
politics conflicts with the science hack the postings. Certainly when
someone is proofing something that they find offensive the temptation is
always to "edit."

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/886bd97d/attachment-0001.html>

From jimad at msn.com  Wed Mar 17 18:04:10 2010
From: jimad at msn.com (James Adcock)
Date: Wed, 17 Mar 2010 18:04:10 -0700
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <7a50f.2606ca15.38d2d38e@aol.com>
References: <7a50f.2606ca15.38d2d38e@aol.com>
Message-ID: <SNT120-DS25D84850C71C089388FB8AAE2B0@phx.gbl>

>but i haven't gotten the impression that anyone here can code a g.u.i.
for an offline app.  i'd love to be wrong about that, so please please
do correct me, anyone listening out there, if you can indeed do that...


I will readily admit to NOT being the world?s greatest GUI writer, but that is not the problem.  The problem is not having a portable GUI system that I?d be happy to write tools on that work on the variety of machines PG/DP people work on.  If you know of a portable GUI library system that you think is really really good let me know ? everything I dig into ends up disappointing me.

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/ac009ab6/attachment.html>

From dakretz at gmail.com  Wed Mar 17 18:10:31 2010
From: dakretz at gmail.com (don kretz)
Date: Wed, 17 Mar 2010 18:10:31 -0700
Subject: [gutvol-d] Re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS25D84850C71C089388FB8AAE2B0@phx.gbl>
References: <7a50f.2606ca15.38d2d38e@aol.com>
	<SNT120-DS25D84850C71C089388FB8AAE2B0@phx.gbl>
Message-ID: <627d59b81003171810g549fa6f6w1e3611bc30101465@mail.gmail.com>

I think realistically the only three durable options are 1.) Adobe Flex with
the AIR library - fairly portable to
W/M/L and their new release has one of the best text-layout libraries I've
seen; But also see my earlier
comments. 2.) Silverlight - with the obvious microsoft attributes, and 3.)
HTML 5 which looks like it may
be an option sooner rather than later.

On Wed, Mar 17, 2010 at 6:04 PM, James Adcock <jimad at msn.com> wrote:

>  *>*but i haven't gotten the impression that anyone here can code a g.u.i.
> for an offline app.  i'd love to be wrong about that, so please please
> do correct me, anyone listening out there, if you can indeed do that...
>
>  I will readily admit to NOT being the world?s greatest GUI writer, but
> that is not the problem.  The problem is not having a portable GUI system
> that I?d be happy to write tools on that work on the variety of machines
> PG/DP people work on.  If you know of a portable GUI library system that you
> think is really really good let me know ? everything I dig into ends up
> disappointing me.
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100317/d8e1f329/attachment.html>

From prosfilaes at gmail.com  Wed Mar 17 18:53:53 2010
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 17 Mar 2010 21:53:53 -0400
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <SNT120-DS185834508139F999ED1D34AE2B0@phx.gbl>
References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net>
	<SNT120-DS185834508139F999ED1D34AE2B0@phx.gbl>
Message-ID: <6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com>

On Wed, Mar 17, 2010 at 8:56 PM, James Adcock <jimad at msn.com> wrote:
> Certainly in
> the case of very difficult and time-consuming books that no one wants to
> read,

Unless I've missed something, you've never provided an example of
such. You've certainly never shown that they exist in significant
numbers at DP.

-- 
Kie ekzistas vivo, ekzistas espero.

From vlsimpson at gmail.com  Wed Mar 17 20:29:46 2010
From: vlsimpson at gmail.com (V. L. Simpson)
Date: Wed, 17 Mar 2010 22:29:46 -0500
Subject: [gutvol-d] Re: To whom it may concern:
In-Reply-To: <4BA13358.9020207@telkomsa.net>
References: <003301cac45c$af100b20$0d302160$@com>
	<4BA13358.9020207@telkomsa.net>
Message-ID: <bd09bf341003172029j69351815xab9ae5c8df892bc7@mail.gmail.com>

On Wed, Mar 17, 2010 at 2:54 PM, Jon Richfield <richfield at telkomsa.net> wrote:
> Dan, yes it concerns me, but I cannot find those scans on that page. Should
> I be looking more intelligently?
>
> Cheers,
>
> Jon
>
> On 2010/03/15 18:29 PM, Dan Weber wrote:
>
> To whom it may concern:
> www.popsci.com

> This site has 137 years of Popular Science magazine page scans online for
> free.

I typed archive in the search box on the site and got this:
http://www.popsci.com/archives

Then Google books advance search, title: popular science; check full
view and magazine buttons.

From Bowerbird at aol.com  Wed Mar 17 21:31:51 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 00:31:51 EDT
Subject: [gutvol-d] [SPAM] re:  Re: New Tool "pgdiff"
Message-ID: <7c687.59e88b2e.38d306b7@aol.com>

jim said:
>   If you know of a portable GUI library system 
>    that you think is really really good let me know

i use realbasic.   if i didn't think it was really good, i wouldn't use it.

it compiles to windows, mac (even back to classic), and linux too...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/a13789ba/attachment.html>

From Bowerbird at aol.com  Wed Mar 17 22:05:57 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 01:05:57 EDT
Subject: [gutvol-d] [SPAM] re: New Tool "pgdiff"
Message-ID: <520ca.55901046.38d30eb5@aol.com>

jim said:
>    And the correct answer is neither A nor B but rather C == _must_

well, one could easily program the tool to offer an italicized version
of an all-upper choice if you know you'll be processing p.g. e-texts.

indeed, a button that will italicize both choices is easy enough to code.

likewise with _any_ particular editing function that might be required...

for instance, i've programmed a routine that checks for a spacey-quote;
if it finds a spacey-quote in one of the choices, and the two choices are
otherwise identical, it auto-selects the option without the spacey-quote.

***

in reviewing the pgdiff output from the sitka files, i wanted to see if you
would do much preprocessing on the files.   it appears to me you did not.

in general, i'd _highly_ recommend preprocessing before a comparison.

the number of diffs can be significantly lowered by good preprocessing,
and preprocessing is typically a far more efficient way to make changes.

it also helps to know about the nature of the files that you're comparing.

for instance, one of the sitka files was a post-proofing file, meaning that
it was littered with artifacts of the d.p. workflow...   these include 
"notes"
the proofers leave for the post-processor.   it's far better to handle 
these
"notes" in an editor during preprocessing before you start the comparison.

another artifact of the d.p. workflow is asterisks on end-line (and 
end-page)
hyphenates.   i typically just delete these asterisks, as i have no use for 
them.
some of these were present in the o.c.r. file too, so i removed them as 
well...

after having deleted all the asterisks associated with "notes" and 
hyphenates,
the only asterisks left in the file were those that indicated _footnotes_ 
in the
o.c.r. file, so i did a monitored global change of them to footnote 
indicators.

that way, these footnote indicators wouldn't present a "spurious" 
difference...

(i could've done a global change to the characters that indicated the 
second
and third footnotes on a page, but i didn't bother, as there weren't too 
many.)

it also helps to know that rfrank marks "questionable" situations with an 
"@",
so you can search for those and deal with those before doing a comparison.

oh, and one other _big_ thing.   the o.c.r. file had the _pagenumbers_ in 
it.
they were enclosed in brackets, at the bottom of most pages, which is why
rfrank's preprocessing program probably didn't find them to delete them...

now, those pagenumbers were deleted by the proofers -- except in the 2
cases where the proofers failed to make the deletion -- so they were _not_
present in the second file.   so, to avoid the spurious diffs, you could 
have
eliminated them from the o.c.r. file easily, with a series of reg-ex 
changes.

on the other hand, since i _like_ pagenumbers, and want to _keep_ them,
i had my tool _inject_ them from the o.c.r. file back into the proofed 
file...

either way, it's best to eliminate as many of these "spurious" diffs as you 
can.

and i note here, jim, that you did eliminate one case of such "spurious" 
diffs
when you reformatted the page-scan references so they would be identical.
so i encourage you to take that general idea and run with it...

i _will_ talk further about the diffs that were generated anyway.

but i wanted to stress the importance of doing preprocessing...

***

jim, i looked at hkdiff.txt briefly.

i don't know what kind of sense to make of this diff at the end:
>    {|or|don't|you?-that's|the|idea.|Don't|you|reckon|12} 

i removed the whitespace so it'd fit on one line.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/ee695495/attachment-0001.html>

From Bowerbird at aol.com  Wed Mar 17 22:08:31 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 01:08:31 EDT
Subject: [gutvol-d] @!@!@!@!@!@!@! Re:  [SPAM] re: New Tool "pgdiff"
Message-ID: <521b3.74a3f49a.38d30f4f@aol.com>


why are these posts coming back with "spam" in the header?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/c74c507a/attachment.html>

From richfield at telkomsa.net  Wed Mar 17 23:52:51 2010
From: richfield at telkomsa.net (Jon Richfield)
Date: Thu, 18 Mar 2010 08:52:51 +0200
Subject: [gutvol-d] Re: To whom it may concern:
In-Reply-To: <bd09bf341003172029j69351815xab9ae5c8df892bc7@mail.gmail.com>
References: <003301cac45c$af100b20$0d302160$@com>	
	<4BA13358.9020207@telkomsa.net>
	<bd09bf341003172029j69351815xab9ae5c8df892bc7@mail.gmail.com>
Message-ID: <4BA1CDC3.6030408@telkomsa.net>


Thanks to Dan and  V. L. Simpson.

This worked. It is of course just a weeny bit data-greedy for routine 
work, but it is nice to be able to go there. My respects to Pop-Sci for 
an intelligent and public-spirited use of a valuable resource, and a 
vigorous site. A lot of other magazines could do worse than inspect 
their effort with respect.

Cheers,

Jon
> On Wed, Mar 17, 2010 at 2:54 PM, Jon Richfield<richfield at telkomsa.net>  wrote:
>    
>> Dan, yes it concerns me, but I cannot find those scans on that page. Should
>> I be looking more intelligently?
>>
>> Cheers,
>>
>> Jon
>>
>> On 2010/03/15 18:29 PM, Dan Weber wrote:
>>
>> To whom it may concern:
>> www.popsci.com
>>      
>    
>> This site has 137 years of Popular Science magazine page scans online for
>> free.
>>      
> I typed archive in the search box on the site and got this:
> http://www.popsci.com/archives
>
> Then Google books advance search, title: popular science; check full
> view and magazine buttons.
>
>    

From schultzk at uni-trier.de  Thu Mar 18 00:01:33 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 18 Mar 2010 08:01:33 +0100
Subject: [gutvol-d] Re: [SPAM] RE:  Re: [SPAM] re: New Tool "pgdiff"
In-Reply-To: <SNT120-DS24F2794E3959FCC470DECEAE2C0@phx.gbl>
References: <6311e.21050e3b.38cffc26@aol.com>
	<SNT120-DS16CBEF97C3746EDA9459EFAE2C0@phx.gbl>
	<SNT120-DS24F2794E3959FCC470DECEAE2C0@phx.gbl>
Message-ID: <A79A33CE-4835-4D12-8E81-ED63577533F4@uni-trier.de>

Hi All,

	It is interesting how most here are in love with their tools.
	I have notice how statistics and proofs are stated to show what makes
	THIER tools the best. 

	No, James this is not directly aimed at you, but all the others.

	I could show you all easily that diff is the wrong tool. It is inefficient
	been proven since the 60s or was that the 70s.

	But, who cares.

	Your example can only be handled by an ABLE PROOFER. Neither
	your tool nor anybodies elses is better.

	Come on, peolple! get productive.

	regards
		Keith.

Am 18.03.2010 um 00:59 schrieb James Adcock:
>  
> Attempting to hand-score the kinds of edits one would need to do on hkdiff.txt, it seems to me like an intelligent editor could present ?Choose A? vs. ?Choose B? alternatives about 85% of the time, whereas the other 15% of the time a more complicated interface would have to be presented ? or else the editor just punts and points to the text and says ?You Fix It! (which is basically the approach my current choice of editor takes 100% of the time ;-)
>  
> However, if the editor gives a ?Choose A? vs. ?Choose B? interface sometimes the editor (and/or the user) is going to be deceived because what looks like an A/B choice really ISN?T.  For example a hypothetical example:
>  
> ?. one { must | MUST } be careful!
>  
> And the correct answer is neither A nor B but rather C == _must_
>  
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/67db8cf8/attachment.html>

From gbnewby at pglaf.org  Thu Mar 18 07:24:15 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Thu, 18 Mar 2010 07:24:15 -0700
Subject: [gutvol-d] Re: @!@!@!@!@!@!@! Re:  [SPAM] re: New Tool "pgdiff"
In-Reply-To: <521b3.74a3f49a.38d30f4f@aol.com>
References: <521b3.74a3f49a.38d30f4f@aol.com>
Message-ID: <20100318142415.GA2676@pglaf.org>

On Thu, Mar 18, 2010 at 01:08:31AM -0400, Bowerbird at aol.com wrote:
> 
> why are these posts coming back with "spam" in the header?
> 
> -bowerbird

The pglaf mailer includes a few spam filters, one
of which adds the [SPAM] string to the subject header.

Responders should edit them out, not propagate them.
  -- Greg

From Bowerbird at aol.com  Thu Mar 18 08:29:56 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 11:29:56 EDT
Subject: [gutvol-d] i love my tools!  because the proof is in the pudding!
Message-ID: <7e817.4149442d.38d3a0f4@aol.com>

guilty as charged, keith!   i _do_ love my tools!

all of them!   most especially the ones that i coded myself,
since i put a lot of blood, sweat, and tears into each one!

but even the ones that other people made, i love those too!
because they make my life easier for me!   and i love "easier"!

and it's not that stupid kind of blind love, either.   no sir.

because -- don't forget it! -- the proof is in the pudding!

i can tell you exactly why i love each of my tools, and those
reasons are good, solid reasons with sound, logical backing.

i analyze my needs, and my tools, extremely closely so that
i _know_ -- with _certainty_ -- just exactly what i need and
why i need it, and then i make sure that my tools deliver it,
whether they were programmed by someone else (which i
strongly prefer, because i _did_ mention that i love "easier")
or programmed by me (due to the blood, sweat, and tears).

i'm happy to share my analyses of my needs, and my tools,
too, because that makes all of us smarter about all of that,
and i'm happy when other people share _their_ analyses too.

but yes yes yes yes yes yes yes, i _do_ love my tools!   i do!

and it's easy as pie to know why!   the proof is in the pudding!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/782ff9ed/attachment.html>

From Bowerbird at aol.com  Thu Mar 18 10:14:39 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 13:14:39 EDT
Subject: [gutvol-d] jim, i have some questions about pgdiff output
Message-ID: <4e132.3aad1a5d.38d3b97f@aol.com>

jim, here are 55 cases where your tool seems to give us
more than just the 2 choices that i would expect to see...

in some cases, such as the second one listed (hot springs),
it's because one of the proofer's notes contained a "|" in it.
you'll want to screen the input for your significant characters,
i.e., any "{" or "}" or "|", and eliminate them to avoid confusion.

i have some other questions as well, but let's see if the fix for
this issue provides answers for those questions as well...

-bowerbird


>    { <i>Seattle, | 
>    Seattle, Washington</i> | 
>    Washington##} [Illustration]#


>    Hot { Springs;[**probable printer's error. Should be ,|next: no, ; is 
fine] | 
>    Springs; } each with its individual charm;#


>    connect the Sitka of the past, the { <i>Novo | 
>    Novo Ark-*angelsk</i> | 
>    Arkangelsk#} of the great Russian American Company#


>    and their fate will never be known to a { certainty.##[Footnote | 
>    cer-taintv.##A: | 
>    * } January { 20th.[**,?] | 
>    20th. 1820, | 
>    1820. } a letter written by the Directory at St.#


>    { ====sitka0-015.txt===###[Illustration: Copyright by E. W. Merrill, | 

>    [10]####Sitka. | 
>    ====sitka0-015.txt===###Mount | 
>    O#Edgecumbe.] | 
>    H####} ====sitka0-016.txt===#


>    He named the mountain { <i>San | 
>    San Jacinthus</i>, | 
>    Jacinthus,}#


>    toward the sea, Cape { <i>del | 
>    del Engano</i>. | 
>    Engano. } No one who#


>    2,000 skins of the { <i>Morski | 
>    Morski bobrov</i>, | 
>    'bo'brov, } as they#


>    or { <i>Kolosh | 
>    Kolosh Ryeku</i>. | 
>    Ryeku.##} On the morning of September 28th the#


>    { ====sitka0-030.txt===##[Illustration: Sitka in | 
>    [24]####1805--From | 
>    ====sitka0-030.txt===##Lisianski's | 
>    [Blank Voyage.] | 
>    Page]####} ====sitka0-031.txt===#


>    rose the town of New Archangel { (<i>Novo | 
>    (Novo#Arkangelsk</i>,) | 
>    Arkangelsk,) } and on the kekoor was built a#


>    valued at 450,000 { rubles.[B]##[Footnote | 
>    rubles.f##A: | 
>    * } The livestock taken to Sitka in 1804 consisted of "Four#


>    p. { 218.)]##[Footnote | 
>    218.)##B: | 
>    t } Lisianski made the surveys and named the islands of the#


>    { <i>yourts</i>, | 
>    yourts, } in which live the { <i>kayours</i> | 
>    kayours } and the#


>    { [Footnote A: | 
>    * } The Russian sazhen is { 7 | 
>    I feet.] | 
>    feet,#


>    { [Footnote A: | 
>    * } These books and letters were brought by Resanof { in | 
>    In } the#


>    theft in the years when there was no custodian of such { 
property.]###====sitka0-043.txt===###[Illustration: The Bakery and Shops of the 
Russians--Later the | 
>    property.##Sitka | 
>    [36]####Trading | 
>    ====sitka0-043.txt===##Co.'s | 
>    [Blank Building.] | 
>    Page]####} ====sitka0-044.txt===#


>    the { <i>dushnoi dereva</i> | 
>    dushnoi or | 
>    dereva scented | 
>    or., at scented } wood of the#


>    Place of Islands { <i>(Chasti | 
>    (Chasti Ostrova)</i> | 
>    Ostrova) } is reputed#


>    { [Footnote | 
>    103.##A: | 
>    * } Golofnin, Voyage of the Sloop "Kamchatka," in Mat. { Pt. | 
>    Ft. } 4, p.#


>    { Wrangel's | 
>    Wraiigel's } daughter--Mary." There { is | 
>    Is } also { to | 
>    t-> } be found: "Died,#


>    the church by a partition called the { <i>Ikonastas</i>, | 
>    Ikonastas,#} which is ornamented with twelve { <i>ikons</i>, | 
>    ikons, } or#


>    { <i>repousse</i> | 
>    repousse } work in the true { Russian | 
>    Eussian } style of#


>    { ====sitka0-066.txt=== | 
>    [56]####[Illustration: | 
>    ====sitka0-066.txt===###} The { Madonna.] | 
>    Madonna.#


>    { ====sitka0-071.txt===###[Illustration: | 
>    [60]####/* | 
>    ====sitka0-071.txt===###} The Baranof Castle.#


>    The { U. | 
>    IT. S[**.] | 
>    S } Agricultural Department { building occupies | 
>    building-occupies } the site at the#


>    { [Footnote A: | 
>    * Narative[**Narrative?] | 
>    Narative } of a Voyage Round the { World, | 
>    World. } 1836-1842, by Captain#


>    Sir Edward Belcher, Vol. { 1,[**I?] | 
>    I, } pages { 95 | 
>    05 } et { sen.#


>    { ober off | 
>    <i>ober offitzer</i> | 
>    User } who sought her hand in marriage.#


>    dead in one of the small drawing { rooms."[A]##[Footnote | 
>    rooms."*##A: | 
>    * } Frederick { Schwatka, | 
>    Sohwatka, } the explorer, seems to have been one of#


>    24th, 1896, and the time is fixed as being in the administration { 
of]* | 
>    of##====sitka0-074.txt=== | 
>    [62]####[Illustration: | 
>    ====sitka0-074.txt===###} The Grave of the Princess { Maksoutoff.] | 
>    Maisoutoff.#


>    martin from the Yukon, others { <i>en | 
>    en route</i> | 
>    route } to#


>    reason for their living on this distant { shore.[A]##[Footnote | 
>    shore.*##A: | 
>    * } Between 1821 and 1862 there were shipped by the Russian#


>    (Washington, Government Printing { 
Office).]###====sitka0-079.txt===##[Illustration: Sitka in 1860, Near the Close | 
>    Office).##of | 
>    [66]####the | 
>    ====sitka0-079.txt===###Russian | 
>    CD#Administration.] | 
>    CO####} ====sitka0-080.txt===#


>    for calico and beads, blankets and { ammunition.[A] | 
>    ammunition.*#} This market was closed by a { portcullised | 
>    portcul-lised#


>    quids; fish priced according to { size[** | 
>    size ;?] | 
>    ? } all according to price list established#


>    B: | 
>    t } Golobokoe Lake was sounded to a depth { of | 
>    cf } 190 fathoms#


>    Ivan { Vasilivich | 
>    Vasiiivich Furuhelm, | 
>    Funihelm, } June 22, 1859, to Dec. 2, 1863.#


>    { ====sitka0-090.txt===###[Illustration: Sitka in 1869--During the 
Time of the Military | 
>    [76]####Occupation.] | 
>    ====sitka0-090.txt===#####} ====sitka0-091.txt===#


>    in the land that had so long been their { home.[C] | 
>    home.t#} Among those who remained are the { Kashavaroffs, | 
>    Kashavar-offs,#


>    { [Footnote A: | 
>    * } The Russian soldiery were dressed { in | 
>    In } a dark uniform, trimmed#


>    it down on the bayonets of the Russian { soldiery.]##[Footnote | 
>    soldiery.##C: | 
>    t } On December 14, { 1807, | 
>    1807. } the Russian ship "Czaritza," sailed for#


>    Russia, via London, with { 168 | 
>    368 } passengers. January { 1, | 
>    I, } 1868, the#


>    Ex. Doc. H. R. 41st Cong. 2nd { Ses., | 
>    Ses.. } p. 1030; Seattle { Intelligencer, | 
>    intelligencer,#


>    123; citizens by treaty, 229. { Total, | 
>    Total,.444. 444. | 
>    ? } Beardslee's Report, 47th#


>    Erussard, Ed. Doyle, George E. Pilz, Nicholas Haley, John { McKenna, | 

>    MCKenna,#Reub | 
>    Keub } Albertson, John Olds and { others.#


>    One of the traders of the town, { Caplin, said: | 
>    Caplin,-said:#} "De Captain may go { to ---- | 
>    to---wid wid | 
>    his his | 
>    tarn#


>    { ====sitka0-107.txt===##[**The CP failed to rotate this page 
correctly.][**Seems to be fixed now :-)]#[Illustration: Sitka--East on Lincoln 
Street--the Governor's Walk | 
>    [92]####of | 
>    ====sitka0-107.txt===##the | 
>    [Blank Russians.] | 
>    Page]####} ====sitka0-108.txt===#


>    { ====sitka0-110.txt===##[Illustration: Interior of Cathedral | 
>    [94]####of | 
>    ====sitka0-110.txt===##St. | 
>    [Blank Michael] | 
>    Page]####} ====sitka0-111.txt===#


>    { [Footnote A: | 
>    ? } The first church in Alaska was built at { Kodiak | 
>    Kodlak } (Paulovski) in#


>    towers above the bay to the height of { 3,216 | 
>    3,21.6#} feet. Along the river, known as the { <i>Kolosh | 
>    Kolosh#


>    is prominent the Devil's Club { (<i>panax | 
>    (panax horridus</i>), | 
>    horrid-us},#


>    { ====sitka0-117.txt=== | 
>    [100]####[Illustration: | 
>    ====sitka0-117.txt===###} Russian { Blockhouse.] | 
>    Blockhouse.#


>    drew their stores of { <i>krasnia | 
>    Jcrasnia ruiba</i> | 
>    ruiba } (the red#


>    the trough of the watering place of { the | 
>    the" "Jamestown," | 
>    Jamestown,"#} came to the beach. This place may be#
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/1b2b6c25/attachment-0001.html>

From jimad at msn.com  Thu Mar 18 12:30:32 2010
From: jimad at msn.com (Jim Adcock)
Date: Thu, 18 Mar 2010 12:30:32 -0700
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com>
References: <36c3a.27b91b5.38c98aa2@aol.com>
	<4B9FFD1A.8000409@verizon.net>	<SNT120-DS185834508139F999ED1D34AE2B0@phx.gbl>
	<6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com>
Message-ID: <SNT120-DS161456A8C3597325A9F007AE2B0@phx.gbl>

>Unless I've missed something, you've never provided an example of
such. You've certainly never shown that they exist in significant
numbers at DP.

Unless I've missed something, PG doesn't publish download numbers on
anything other than the most popular books.  

However, TIA does publish download numbers which one can use as proxy:

2,583,382 Downloads of the Most Popular PG Book
        8 Downloads of the Least Popular PG Book

Bang-for-the-Effort Ratio of Over 300,000 to 1.

You can query this yourself using the TIA "Advanced Search" option on
"collection:gutenberg" 

fields to return = downloads + title

HTML table

Sort Results by: either downloads desc or downloads acs

But one should be forewarned that it does not appear to me that patterns of
downloads from TIA is identical to pattern of downloads directly from PG --
TIA users are more sophisticated users aka nerdy than PG direct users. 

Personally I would rather work on a book that is towards the 2,500,000
download end of the spectrum than on the 10 downloads end of the spectrum!
Again, there are literally about 1,000 more books out there that can be
saved than we have the time and effort to save.  The question then becomes,
which books do we save?  If one is doing the entire job oneself then the
answer is easy: That book which you are willing to work on.  If one is
picking a book and imposing the work on other volunteers then the question
becomes who should have the right to make that decision and how?  "First
come first serve" I suggest is a horrible way to make this choice because it
encourages the most greedy and inconsiderate submitters to get there first
rather than to take a thoughtful approach to picking which books to save and
then doing a really really good job of digitizing and OCR'ing them.


From Bowerbird at aol.com  Thu Mar 18 12:44:55 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 15:44:55 EDT
Subject: [gutvol-d] Re:
 =?iso-8859-1?q?=40!=40!=40!=40!=40!=40!=40!_Re=3A=A0_=5BSPAM?=
 =?iso-8859-1?q?=5D_re=3A_New_Tool_=22pgdiff=22?=
Message-ID: <9eacc.665e7552.38d3dcb7@aol.com>

greg said:
>   The pglaf mailer includes a few spam filters, one
>    of which adds the [SPAM] string to the subject header.

why?

and for what purpose?

which particular "filter" is it that is doing this?

can it be deactivated?

if not, can you tell us how to avoid setting it off?

because this "filter" is doing nothing but emitting
false alarms.   and it's not stopping any spam, is it?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/074aabff/attachment.html>

From jimad at msn.com  Thu Mar 18 15:16:42 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 18 Mar 2010 15:16:42 -0700
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
Message-ID: <SNT120-DS15877230ABCA04706CB5DDAE2B0@phx.gbl>

>jim, here are 55 cases where your tool seems to give us more than just the
2 choices that i would expect to see...

See my discussion below of what the Levenshtein Distance is and how pgdiff
implements it.

>in some cases, such as the second one listed (hot springs), it's because
one of the proofer's notes contained a "|" in it.
you'll want to screen the input for your significant characters, i.e., any
"{" or "}" or "|", and eliminate them to avoid confusion.

Agreed that this would be a problem if my tool is used as input to another
"smart editor" tool that wants to present "Choose A" vs. "Choose B" type
choices.

Since instead the tool was targeting a regex editor being driven by a real
human being who can recognize from context whether the "{|}" chars are being
used to highlight differences vs. being used as part of the input text it
hasn't been a problem for me re the intended problem domain.

======

Levenshtein Distance is the measure of the number of changes needed to
transform one string of tokens into a different string of tokens, where the
allowable edits are "insert", "delete" or "substitute."  Different
implementations of the algorithm would have different interpretation of what
constitutes a "token" and what constitutes a "string".  One obvious
interpretation would be that a "token" is an ascii char and a string is a
line of text (dictionary lookups of miss-spelled words) Another obvious
interpretation is a "token" is a line of text and the string is the list of
lines of text within a file (diff)  pgdiff implements neither of these but
rather a "token" to be a "word" where a "word" is a non-white sequence of
chars followed by a white sequence of chars, where the white sequence of
chars is considered not-significant for the purposes of the Levenshtein
Distance, but IS significant for the display of output.  pgdiff considers
the "string" to be the entire list of words in the input file. The typical
importance of the white part is whether words are separated by a space or by
a linebreak.  Pgdiff doesn't care about the white part in terms of the
Levenshtein Distance, so that the two input files can have different line
lengths and different linebreak locations, and still be comparable.  This
also means that typically including page break information in the input
files such as the "====== filename.101 ====" type stuff would NOT be a good
idea, since typically the input files may have their page breaks in
different locations re their word content -- unless the two input files are
from the same identical edition. 

So here's some answers to some implied questions or assumptions:

Does pgdiff look for word differences within a line of text?  No.

Does pgdiff look for single word changes? No.

OK, what does pgdiff do?

What pgdiff does is to calculate a best match of words across two entire
files.  Assuming you set the input options large enough, for example, one
input file could contain an entire chapter that the other input file doesn't
contain and the algorithm would sync up just fine.  Or in the case of a book
I've worked on previously the US version had paragraphs removed by a censor,
whereas the European version of the text had them intact.  When the words do
not match exactly, the mismatches are categorized three ways 1) Insert this
missing word. 2) Delete this extraneous word. Or 3) Substitute this one word
for a different word.  Now by reversing the input order options 1) and 2)
obviously become symmetrical -- an insertion in one case becomes a deletion
in the other case.  So in either case an isolated word difference is
displayed like { this } or if a bunch of words in a row are delete or insert
like { this is in one text but not the other } In case 3) if only one word
is different in a row it displays the output choice like { this | that } But
in case three if a bunch of words are different in a row how to display
them?  If the differences are due to scannos it is probably best to display
the words next to each other { this | th*s is | iz a | u test | tost }
whereas if the differences are due to human editing it would probably be
best to display them as "sentences" { THIS IS A TEST | _this is a test_ }
If you are implementing a "smart editor" then clearly you can choose to
display them which way you want.  In practice what one normally sees is some
weird mixture of the two possible situations, and it isn't clear to me which
display technique is best, so so far I have chosen the easiest approach to
implement -- which is the first pattern of display { this | th*s is | iz a |
u test | tost }  

>From the BBoutput.txt file, for example, consider:

{ <i>Seattle, | Seattle, Washington</i> | Washington

}

Which is of the first pattern.  The ending } is on a newline since the two
tokens differing in whitespace, space vs. linebreak.  Taking that diff back
out one gets:

{ <i>Seattle, | Seattle, Washington</i> | Washington }

Which one can read as:

Choose one of: <i>Seattle, OR Seattle,

Followed by:

Choose one of: Washington</i> OR Washington

In this case if one KNEW the differences are due to humans rather than
scannos , then it is "obvious" that the better display pattern would be the
second one:

{ <i>Seattle, Washington</i> | Seattle, Washington }

IE Choose one of: "<i>Seattle, Washington</i>" OR "Seattle, Washington"

But in general the tool doesn't know if differences are due to human edits
or scannos, and in general what one sees is a mixture of both problems
happening at the same time.

PS: OK pgdiff doesn't REALLY match across ENTIRE files since if the files
are huge Levenshtein is an n^2 algorithm in space and time.  What it does do
is break a file into large overlapping chunks of text and calculate the
measure across the chunks, where the size of the chunks can be specified as
in input parm if you prefer, and where the chunks get sewn back together
using an invariant of choosing places in the match where words DO match, and
checking the sanity of that match to make sure we haven't lost sync.  What
this means in practice is that if you specify a parm of -10000 as an input
setting then the algorithm can "ONLY" handle about 10000 word mismatches
adjacent to each other in a row without erroring out. This parm in practice
is important for versioning where two editions of a book have large chunks
of text which don't match each other.  IE a chapter is edited out or
edited in or a censor has taken their knife to the text. Common problems are
that two texts from different editions have entire book prefixes
(introductions) or entire book suffixes (postscripts or indexes) which don't
match -- which one is better to explicitly remove and deal with separately,
but which the algorithm will try to handle if you set the input parm large
enough.


From Bowerbird at aol.com  Thu Mar 18 15:22:31 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 18:22:31 EDT
Subject: [gutvol-d] i'm a busy little bee
Message-ID: <8a6e1.155e416c.38d401a7@aol.com>

i'm a busy little bee these days...

let's see...

***

i'm waiting on a response from jim so i can continue to
analyze his tool, and work on my support-application...

***

i grabbed the "frankenstein" content from lee passey's site,
and have mounted my own version of the book over here:
>    http://z-m-l.com/go/fpass/dofpass.pl

the perl script lets you step through the pages of the book.
the script shows lee's .html files as they were when grabbed.

i don't wanna mess with the complications of an .html editor,
so i won't be bothering to offer an edit capability on the text,
at least not quite yet...   i mention this because editing is one
of the thorny aspects of using .html as your saved-text format.
lee is using the "kupu" html-editor.   too much trouble for me,
and likely far too many cross-browser inconsistencies as well.
but those are lee's problems to solve, not mine.   good luck, lee.

***

here's a little script that tells you if a word you enter is present
in the dictionary that i use.   i'm not sure why you'd want to know
such information, but the script was developed in support of my
spiffy spellchecking feature, so it's there if you _do_ have a need:
>    http://z-m-l.com/go/dict17577.pl

***

i'm finishing up a long post on intelligent filenaming.   yes, again.
but there's a good angle, important enough to warrant exposition.

the angle, to steal my own thunder, is that you can easily use
the pagenumber information that's contained in the o.c.r. files
to rename your .txt and .png files in a more-intelligent manner.

if the pagenumber in file "011.txt" says that it's pagenumber 7,
you'll rename "011.txt" and "011.png" to "007.txt" and "007.png".

easy as pie!

***

and finally, i'm still coding my online proofing system.   it's grand.
i love my tools...

anyway, i'll probably unveil the thing next week.

yeah, yeah, i know, you can hardly wait, you're so excited...      ;+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/7aef9107/attachment.html>

From Bowerbird at aol.com  Thu Mar 18 15:50:25 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 18:50:25 EDT
Subject: [gutvol-d]  Re: jim, i have some questions about pgdiff output
Message-ID: <8c234.435b34c3.38d40831@aol.com>

jim said:
>   In practice what one normally sees is some weird mixture of 
>    the two possible situations, and it isn't clear to me 
>    which display technique is best, so so far I have 
>    chosen the easiest approach to implement

in view of the frank admission, let me make some suggestions.

i believe these would make your tool's output more workable
for the end-user who has to resolve the diffs, no matter _what_
method they use, including the reg-ex editor you use yourself.

***

let me pull out the last 2 of the 55 anomalies i posted for you...

***

here's the first:

>    drew their stores of { <i>krasnia | 
>    Jcrasnia ruiba</i> | 
>    ruiba } (the red

some people might prefer the version as it was in your file:

>    stores of { <i>krasnia |   Jcrasnia ruiba</i> |   ruiba } (the red

rather than showing this as a single diff, i'd present it as two...

the first would be:
>    { <i>krasnia |   Jcrasnia }.

the second would be
>    { ruiba</i> |   ruiba }

***

here's the second example:

>    the trough of the watering place of { the | 
>    the" "Jamestown," | 
>    Jamestown,"} came to the beach. This place may be

or, more in keeping with how it's displayed in your output:

>    place of { the | the" "Jamestown," | Jamestown,"} came to 

again, i would present this as two diffs...

the first would be:
>    { the | the" }

the second would be:
>    { "Jamestown," | Jamestown,"}

***

in both of these examples, i think combining the 2 diffs into
one bracket-bound item confuses the item unnecessarily, and
confuses the end-user in the process, making the resolution
much more difficult than it needs to be...

in many of these "multiple diff" brackets, i could have my tool
pull apart the various diffs, and display them appropriately...

so, you know, if you think the output you are showing now is
done the way you _want_ to have it done, that's your decision.

but i think it will be more clear if you did it slightly differently.

***

another confusion i had with your output was that there were
several bracketed items that contained some separator-lines...

since all of those separator-lines were standaridized by you
before you ran your pgdiff, it seems to me that none of them
should've been included in any of the brackets.   should they?

if you start bringing non-diff material into the edit process,
you're asking for problems, it would seem to me, so i would
rework that code to try to avoid such problems if i were you.

anyway, just a few suggestions, hopefully helpful ones...      :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/e9271505/attachment-0001.html>

From vze3rknp at verizon.net  Thu Mar 18 17:04:01 2010
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Thu, 18 Mar 2010 20:04:01 -0400
Subject: [gutvol-d] Re: i'm a busy little bee
In-Reply-To: <8a6e1.155e416c.38d401a7@aol.com>
References: <8a6e1.155e416c.38d401a7@aol.com>
Message-ID: <4BA2BF71.7060101@verizon.net>


On 3/18/2010 6:22 PM, Bowerbird at aol.com wrote:
> the angle, to steal my own thunder, is that you can easily use
> the pagenumber information that's contained in the o.c.r. files
> to rename your .txt and .png files in a more-intelligent manner.
>
> if the pagenumber in file "011.txt" says that it's pagenumber 7,
> you'll rename "011.txt" and "011.png" to "007.txt" and "007.png".
>
> easy as pie!
Only as easy as pie if the OCR got the page number correct. In my 
experience, page numbers are very frequently misread.

JulietS
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/a41018ae/attachment.html>

From Bowerbird at aol.com  Thu Mar 18 18:00:29 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 21:00:29 EDT
Subject: [gutvol-d] Re: i'm a busy little bee
Message-ID: <7a7aa.5a7091d9.38d426ad@aol.com>

juliet said:
>   Only as easy as pie if the OCR got the page number correct. 
>    In my experience, page numbers are very frequently misread.

i'm getting ahead of myself, but yes, you must check them first.

in the sample book which i'll be talking about, 4 pagenumbers
were misrecognized in 108 pages, and 1 was entirely missing.
and actually, the misrecognitions were on the left-bracket that
preceded the pagenumber, rather than the pagenumber per se.
but yes, it is true this book had atypically accurate recognition,
and that pagenumbers are not infrequently misrecognized...

however, it's also true there is a _huge_ amount of redundancy
in pagenumbers -- they march on in a predictable sequence --
so routines can be written (and i have written a few of them) to
"fill in" the missing numbers in an astonishingly accurate way...

even the unnumbered pages stick out in a fairly distinctive way,
since the numbering-sequence politely "steps around" them...

but why don't we hold off further dialog until i do my full post?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/4dd700c5/attachment.html>

From Bowerbird at aol.com  Thu Mar 18 18:46:31 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 21:46:31 EDT
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
Message-ID: <7d330.675f41a8.38d43177@aol.com>

for the sake of comparison, here's how i display the sitka diffs:

>    http://z-m-l.com/go/jimad/sitka-175diffs.html

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/e0ed4c6a/attachment.html>

From Bowerbird at aol.com  Thu Mar 18 20:08:13 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Mar 2010 23:08:13 EDT
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
Message-ID: <818ff.2c290d3d.38d4449d@aol.com>

and jim, it doesn't look like 5 people want to see my comparison tool,
so you'll have to settle for a screenshot (with some fancy stuff deleted):

>    http://z-m-l.com/go/jimad/comparison-screenshot.png

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100318/f358e952/attachment.html>

From prosfilaes at gmail.com  Thu Mar 18 21:01:50 2010
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 19 Mar 2010 00:01:50 -0400
Subject: [gutvol-d] Re: any arguments against "free-range" proofing?
In-Reply-To: <SNT120-DS161456A8C3597325A9F007AE2B0@phx.gbl>
References: <36c3a.27b91b5.38c98aa2@aol.com> <4B9FFD1A.8000409@verizon.net>
	<SNT120-DS185834508139F999ED1D34AE2B0@phx.gbl>
	<6d99d1fd1003171853m3293107ck90c3a4172ee76979@mail.gmail.com>
	<SNT120-DS161456A8C3597325A9F007AE2B0@phx.gbl>
Message-ID: <6d99d1fd1003182101l5883e092wf3b8e5584f818862@mail.gmail.com>

On Thu, Mar 18, 2010 at 3:30 PM, Jim Adcock <jimad at msn.com> wrote:
> Personally I would rather work on a book that is towards the 2,500,000
> download end of the spectrum than on the 10 downloads end of the spectrum!

Not something I really see from what you've uploaded to PG, but okay.
I'm not sure I agree though; getting something unique online or
something higher-quality then can be found elsewhere, is more
important to me then something there's a dozen copies of on the web.

>?"First
> come first serve" I suggest is a horrible way to make this choice because it
> encourages the most greedy and inconsiderate submitters to get there first
> rather than to take a thoughtful approach to picking which books to save and
> then doing a really really good job of digitizing and OCR'ing them.

I'm sure we could have told all the Slashdotters to hold on while we
were preparing material for them. We might have actually done 40 or 50
books by now that way. I'm sure it also would have helped to criticize
our submitters as "greedy and inconsiderate". I'm sure most people who
scanned books for DP never thought about the value of the book they
were scanning.

-- 
Kie ekzistas vivo, ekzistas espero.

From schultzk at uni-trier.de  Fri Mar 19 01:54:46 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Fri, 19 Mar 2010 09:54:46 +0100
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
In-Reply-To: <SNT120-DS15877230ABCA04706CB5DDAE2B0@phx.gbl>
References: <SNT120-DS15877230ABCA04706CB5DDAE2B0@phx.gbl>
Message-ID: <72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de>

Hi,

	The more i think about the tools discussed here and the use of 
	"diff"s I get the feeling that the use of diff is actually overkill.

	diff is basically n^2. It was developed when text/string processing
	was not efficient. Designed for revisioning and compression.
	It works best for frequent and large differences. Furthermore,
	it aides in analysis.

	Proofing is per se linear,  has relatively few differences, and is aided by
	humans, and a new version is to be created and not a merge.
	The process is simple compare text A and B as long as they are equal
	and then gather the information as long as the differ, present the difference,
	offer possible changes, continue.
	Without much analysis one can see that this process is linear.
	
	So maybe a more direct approach could be viable. Of course, other problems
	of the collaboration have to dealt with elsewhere.

	O.K. this approach may seem simplistic and primitive, yet it solves a few problems.
		1) equality and proofing are done in one pass
		2) works with files of any size
		3) works with text divided among several files
		4) can be easily integrated into different editor modals
		5) presentation of the two versions is part of the tool and not
	            dependent on other EXTERNAL representations
		6) the processing of metadata and formatting is controlled
	             by the proofing/editor tool. No more worry about
	             pollution for the external diff-tool 
	
	Cavets:
		a) you would need a logging system for changes
		b) higher storage requirements for the entire system
		c) would have to be programmed from start
		d) highly adjustable.

	regards
		Keith.
	 

Am 18.03.2010 um 23:16 schrieb James Adcock:

>> jim, here are 55 cases where your tool seems to give us more than just the
> 2 choices that i would expect to see...
> 
> See my discussion below of what the Levenshtein Distance is and how pgdiff
> implements it.
> 
>> in some cases, such as the second one listed (hot springs), it's because
> one of the proofer's notes contained a "|" in it.
> you'll want to screen the input for your significant characters, i.e., any
> "{" or "}" or "|", and eliminate them to avoid confusion.
> 
> Agreed that this would be a problem if my tool is used as input to another
> "smart editor" tool that wants to present "Choose A" vs. "Choose B" type
> choices.
> 
> Since instead the tool was targeting a regex editor being driven by a real
> human being who can recognize from context whether the "{|}" chars are being
> used to highlight differences vs. being used as part of the input text it
> hasn't been a problem for me re the intended problem domain.
> 
> ======
> 
> Levenshtein Distance is the measure of the number of changes needed to
> transform one string of tokens into a different string of tokens, where the
> allowable edits are "insert", "delete" or "substitute."  Different
> implementations of the algorithm would have different interpretation of what
> constitutes a "token" and what constitutes a "string".  One obvious
> interpretation would be that a "token" is an ascii char and a string is a
> line of text (dictionary lookups of miss-spelled words) Another obvious
> interpretation is a "token" is a line of text and the string is the list of
> lines of text within a file (diff)  pgdiff implements neither of these but
> rather a "token" to be a "word" where a "word" is a non-white sequence of
> chars followed by a white sequence of chars, where the white sequence of
> chars is considered not-significant for the purposes of the Levenshtein
> Distance, but IS significant for the display of output.  pgdiff considers
> the "string" to be the entire list of words in the input file. The typical
> importance of the white part is whether words are separated by a space or by
> a linebreak.  Pgdiff doesn't care about the white part in terms of the
> Levenshtein Distance, so that the two input files can have different line
> lengths and different linebreak locations, and still be comparable.  This
> also means that typically including page break information in the input
> files such as the "====== filename.101 ====" type stuff would NOT be a good
> idea, since typically the input files may have their page breaks in
> different locations re their word content -- unless the two input files are
> from the same identical edition. 
> 
> So here's some answers to some implied questions or assumptions:
> 
> Does pgdiff look for word differences within a line of text?  No.
> 
> Does pgdiff look for single word changes? No.
> 
> OK, what does pgdiff do?
> 
> What pgdiff does is to calculate a best match of words across two entire
> files.  Assuming you set the input options large enough, for example, one
> input file could contain an entire chapter that the other input file doesn't
> contain and the algorithm would sync up just fine.  Or in the case of a book
> I've worked on previously the US version had paragraphs removed by a censor,
> whereas the European version of the text had them intact.  When the words do
> not match exactly, the mismatches are categorized three ways 1) Insert this
> missing word. 2) Delete this extraneous word. Or 3) Substitute this one word
> for a different word.  Now by reversing the input order options 1) and 2)
> obviously become symmetrical -- an insertion in one case becomes a deletion
> in the other case.  So in either case an isolated word difference is
> displayed like { this } or if a bunch of words in a row are delete or insert
> like { this is in one text but not the other } In case 3) if only one word
> is different in a row it displays the output choice like { this | that } But
> in case three if a bunch of words are different in a row how to display
> them?  If the differences are due to scannos it is probably best to display
> the words next to each other { this | th*s is | iz a | u test | tost }
> whereas if the differences are due to human editing it would probably be
> best to display them as "sentences" { THIS IS A TEST | _this is a test_ }
> If you are implementing a "smart editor" then clearly you can choose to
> display them which way you want.  In practice what one normally sees is some
> weird mixture of the two possible situations, and it isn't clear to me which
> display technique is best, so so far I have chosen the easiest approach to
> implement -- which is the first pattern of display { this | th*s is | iz a |
> u test | tost }  
> 
>> From the BBoutput.txt file, for example, consider:
> 
> { <i>Seattle, | Seattle, Washington</i> | Washington
> 
> }
> 
> Which is of the first pattern.  The ending } is on a newline since the two
> tokens differing in whitespace, space vs. linebreak.  Taking that diff back
> out one gets:
> 
> { <i>Seattle, | Seattle, Washington</i> | Washington }
> 
> Which one can read as:
> 
> Choose one of: <i>Seattle, OR Seattle,
> 
> Followed by:
> 
> Choose one of: Washington</i> OR Washington
> 
> In this case if one KNEW the differences are due to humans rather than
> scannos , then it is "obvious" that the better display pattern would be the
> second one:
> 
> { <i>Seattle, Washington</i> | Seattle, Washington }
> 
> IE Choose one of: "<i>Seattle, Washington</i>" OR "Seattle, Washington"
> 
> But in general the tool doesn't know if differences are due to human edits
> or scannos, and in general what one sees is a mixture of both problems
> happening at the same time.
> 
> PS: OK pgdiff doesn't REALLY match across ENTIRE files since if the files
> are huge Levenshtein is an n^2 algorithm in space and time.  What it does do
> is break a file into large overlapping chunks of text and calculate the
> measure across the chunks, where the size of the chunks can be specified as
> in input parm if you prefer, and where the chunks get sewn back together
> using an invariant of choosing places in the match where words DO match, and
> checking the sanity of that match to make sure we haven't lost sync.  What
> this means in practice is that if you specify a parm of -10000 as an input
> setting then the algorithm can "ONLY" handle about 10000 word mismatches
> adjacent to each other in a row without erroring out. This parm in practice
> is important for versioning where two editions of a book have large chunks
> of text which don't match each other.  IE a chapter is edited out or
> edited in or a censor has taken their knife to the text. Common problems are
> that two texts from different editions have entire book prefixes
> (introductions) or entire book suffixes (postscripts or indexes) which don't
> match -- which one is better to explicitly remove and deal with separately,
> but which the algorithm will try to handle if you set the input parm large
> enough.
> 
> 
> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From Bowerbird at aol.com  Fri Mar 19 05:57:32 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Mar 2010 08:57:32 EDT
Subject: [gutvol-d] save those pagenumber references
Message-ID: <6840.ffe3dd8.38d4cebc@aol.com>

ok, on the "good news" front, it appears that rfrank has
finally decided to start naming his files more wisely, so
big respect to the people who steered in that direction.

there seemed to be some uncertainty from roger about
how to go about coding apps with those new filenames,
so i'll talk a little bit about that and hope it filters back...

but the initial info can be used by other people as well!

sure, if you're scanning your own books, you can name
the files intelligently from the get-go, and never worry.

(but, um, if you _are_ scanning your own books, please
ask me for advice on filenaming, and don't just do what
d.p. did when they tried to implement smart filenames,
because they got some of the "details" badly mangled.)

but sometimes, from other people, you might get files
which were named badly, and you'll have to rename 'em.

even some of the big scanning projects -- umichigan and
the internet archive and google (well, not so much google,
not any more, they wised up pretty quickly) -- have been
known to adopt some fairly stupid filenaming conventions,
so if you use their stuff, you'll have to clean up their mess.

so it behooves you to know how.

first things first: get yourself "twisted", the dkretz program.
>    http://code.google.com/p/dp50/downloads/list

the initial impetus for this program was precisely this task
of renaming files intelligently, and it works very well for it.

so that's really all you need.

but i'll tell you a bit more...

let's say you're doing preprocessing.   one of the things that
d.p. does is it strips the pagenumbers out of the .txt files...
that is just asinine!   do not do that, folks.   that is the info
that you _need_, so -- obviously -- do _not_ throw it away!

rfrank discards the pagenumber info from his .txt files too.

sometimes, though, for some books, the pagenumber info
sidesteps deletion.   one such book was the "sitka" one that
jim and i have been working on.   you can find the file here:
>    http://z-m-l.com/go/jimad/sitka0-ocr.txt

you can see, at the bottom of each page, the pagenumber,
enclosed in brackets.   and oh what a lovely sight they are!
because they tell exactly what the file _should_ be named!

for instance, go down to the start of chapter 1.

you will see that it occurs in the file rfrank named "011.txt".

but, as shown by the pagenumber at the bottom, it's page 7,
and _should_ be named "007.txt" or (better) "sitkap007.txt".

(in case you're wondering why chapter 1 starts on page 7,
it's because the _foreword_ starts on page 1, and runs to
page 5.   page 6 is a blank verso that is opposite chapter 1.)

so we know the file "011.txt" should be "sitkap007.txt".   great!

but remember the another wrinkle too -- the pagescan filename.

so if we know that "011.txt" should be named "sitkap007.txt",
we also know that "011.png" should be named "sitkap011.png".

now we're cooking...

***

so, to find out the pagenumbers in each of the text-files,
you can run a little perl program i've put up on the site:
>    http://z-m-l.com/go/jimad/doglobal.pl

that program is a simple "find" program that pulls out any line
with the string ".txt" in it, or a right-bracket (i.e., "]"), as shown:

sitka0-ocr-001.txt -- [Illustration][**fine print verified by CP]
sitka0-ocr-002.txt -- 
sitka0-ocr-003.txt -- 
sitka0-ocr-004.txt -- [Illustration: Lovers' Lane, Sitka.]
sitka0-ocr-005.txt -- 
sitka0-ocr-006.txt -- 
sitka0-ocr-007.txt -- [3]
sitka0-ocr-008.txt -- [4]
sitka0-ocr-009.txt -- [5]
sitka0-ocr-010.txt -- [Blank Page]
sitka0-ocr-011.txt -- [7]
sitka0-ocr-012.txt -- [8]
sitka0-ocr-013.txt -- [9]
sitka0-ocr-014.txt -- [10]
sitka0-ocr-015.txt -- 
sitka0-ocr-016.txt -- [11]
sitka0-ocr-017.txt -- 112]
sitka0-ocr-018.txt -- [13]
sitka0-ocr-019.txt -- [14]
sitka0-ocr-020.txt -- [15]
sitka0-ocr-021.txt -- [16]
sitka0-ocr-022.txt -- [17]
sitka0-ocr-023.txt -- [18]
sitka0-ocr-024.txt -- [19]
sitka0-ocr-025.txt -- 120]
sitka0-ocr-026.txt -- [21]
sitka0-ocr-027.txt -- [22]
sitka0-ocr-028.txt -- [23]
sitka0-ocr-029.txt -- [24]
sitka0-ocr-030.txt -- [Blank Page]
sitka0-ocr-031.txt -- [25]
sitka0-ocr-032.txt -- [26]
sitka0-ocr-033.txt -- [27]
sitka0-ocr-034.txt -- [28]
sitka0-ocr-035.txt -- [29]
sitka0-ocr-036.txt -- [30]
sitka0-ocr-037.txt -- [31]
sitka0-ocr-038.txt -- [32]
sitka0-ocr-039.txt -- [33]
sitka0-ocr-040.txt -- [34]
sitka0-ocr-041.txt -- [35]
sitka0-ocr-042.txt -- [36]
sitka0-ocr-043.txt -- [Blank Page]
sitka0-ocr-044.txt -- [37]
sitka0-ocr-045.txt -- [38]
sitka0-ocr-046.txt -- [39]
sitka0-ocr-047.txt -- [40]
sitka0-ocr-048.txt -- [41]
sitka0-ocr-049.txt -- [42]
sitka0-ocr-050.txt -- [43]
sitka0-ocr-051.txt -- [44]
sitka0-ocr-052.txt -- [45]
sitka0-ocr-053.txt -- [46]
sitka0-ocr-054.txt -- [Blank Page]
sitka0-ocr-055.txt -- [47]
sitka0-ocr-056.txt -- [48]
sitka0-ocr-057.txt -- [49]
sitka0-ocr-058.txt -- [50]
sitka0-ocr-059.txt -- [51]
sitka0-ocr-060.txt -- [52]
sitka0-ocr-061.txt -- [53]
sitka0-ocr-062.txt -- [54]
sitka0-ocr-063.txt -- [Blank Page]
sitka0-ocr-064.txt -- [55]
sitka0-ocr-065.txt -- [56]
sitka0-ocr-066.txt -- 
sitka0-ocr-067.txt -- [57]
sitka0-ocr-068.txt -- 
sitka0-ocr-069.txt -- [59]
sitka0-ocr-070.txt -- [60]
sitka0-ocr-071.txt -- 
sitka0-ocr-072.txt -- [61]
sitka0-ocr-073.txt -- [62]
sitka0-ocr-074.txt -- 
sitka0-ocr-075.txt -- [63]
sitka0-ocr-076.txt -- [64]
sitka0-ocr-077.txt -- [65]
sitka0-ocr-078.txt -- [66]
sitka0-ocr-079.txt -- 
sitka0-ocr-080.txt -- [67]
sitka0-ocr-081.txt -- [68]
sitka0-ocr-082.txt -- [69]
sitka0-ocr-083.txt -- [70]
sitka0-ocr-084.txt -- [71]
sitka0-ocr-085.txt -- [72]
sitka0-ocr-086.txt -- [73]
sitka0-ocr-087.txt -- [74]
sitka0-ocr-088.txt -- [75]
sitka0-ocr-089.txt -- [76]
sitka0-ocr-090.txt -- 
sitka0-ocr-091.txt -- [77]
sitka0-ocr-092.txt -- [78]
sitka0-ocr-093.txt -- [79]
sitka0-ocr-094.txt -- [80]
sitka0-ocr-095.txt -- [81]
sitka0-ocr-096.txt -- [82]
sitka0-ocr-097.txt -- [83]
sitka0-ocr-098.txt -- 184]
sitka0-ocr-099.txt -- [85]
sitka0-ocr-100.txt -- [86]
sitka0-ocr-101.txt -- [87]
sitka0-ocr-102.txt -- [88]
sitka0-ocr-103.txt -- [89]
sitka0-ocr-104.txt -- [90]
sitka0-ocr-105.txt -- [91]
sitka0-ocr-106.txt -- [92]
sitka0-ocr-107.txt -- [Blank Page]
sitka0-ocr-108.txt -- [93]
sitka0-ocr-109.txt -- [94]
sitka0-ocr-110.txt -- [Blank Page]
sitka0-ocr-111.txt -- [95]
sitka0-ocr-112.txt -- [96]
sitka0-ocr-113.txt -- [97]
sitka0-ocr-114.txt -- [98]
sitka0-ocr-115.txt -- [99]
sitka0-ocr-116.txt -- [100]
sitka0-ocr-117.txt -- 
sitka0-ocr-118.txt -- [101]
sitka0-ocr-119.txt -- [102]
sitka0-ocr-120.txt -- [103]
sitka0-ocr-121.txt -- [104]
sitka0-ocr-122.txt -- [105]
sitka0-ocr-123.txt -- [106]
sitka0-ocr-124.txt -- [107]
sitka0-ocr-125.txt -- [108]
sitka0-ocr-126.txt -- 

***

i will do a detailed look at that list, and explain everything in it,
but you might wanna take a gander first, to see what _you_ see.
since it might be more fun for you to figure it out for yourself,
rather than plow through my pedantic bullshit...

***

now, we need to do a little repair on some pages, as follows:

the left-bracket was misrecognized on 3 files, so fix that:
sitka017.txt -- 112]
sitka025.txt -- 120]
sitka098.txt -- 184]

the first 4 pages are front-matter, so add some "f" pagenumbers:
sitka001.txt -- add [f001]
sitka002.txt -- add [f002]
sitka003.txt -- add [f003]
sitka004.txt -- add [f004]

the first 2 pagenumbers were deleted by early proofers, so add back:
sitka005.txt -- add [1]
sitka006.txt -- add [2]

page 6 really is a blank page, so let's add a pagenumber to it:
sitka010.txt -- add [6]

the pagenumber on 1 file wasn't picked up by scanner, so we'll add it:
sitka068.txt -- add [58]

the pagenumber on the last page, a map, wasn't there, so we'll add it:
sitka126.txt -- add [109]

the rest are illustration pages (even though some claim to be "blank"),
which we can tell because they exist outside of the page-sequencing,
so we'll add the "a" filenaming convention to slide them into place...

append "a" to these unnumbered pages, which had no pagenumber:
sitka015.txt -- add [10a}
sitka066.txt -- add [56a}
sitka074.txt -- add [62a}
sitka079.txt -- add [66a}
sitka090.txt -- add [76a}
sitka117.txt -- add [100a}

sitka030.txt -- change [blank page] to [24a]
sitka043.txt -- change [blank page] to [36a]
sitka054.txt -- change [blank page] to [46a]
sitka063.txt -- change [blank page] to [54a]
sitka107.txt -- change [blank page] to [92a]
sitka110.txt -- change [blank page] to [94a]

as i said in a short response to juliet yesterday, many of these
missing and misrecognized pagenumbers _could_ have been
"filled in" automatically, because of pagenumber redundancy.
but editing them wasn't too difficult for this particular book...

(i did the editing using my new editor interface, which i will be
revealing to all you excited fans out there next week.   oh boy!)

***

once all of the pagenumbers in the files have been corrected,
output from the above doglobal.pl script would look like this:

sitka0-ocr-001.txt -- [f001]
sitka0-ocr-002.txt -- [f002]
sitka0-ocr-003.txt -- [f003]
sitka0-ocr-004.txt -- [f004]
sitka0-ocr-005.txt -- [1]
sitka0-ocr-006.txt -- [2]
sitka0-ocr-007.txt -- [3]
sitka0-ocr-008.txt -- [4]
sitka0-ocr-009.txt -- [5]
sitka0-ocr-010.txt -- [6]
sitka0-ocr-011.txt -- [7]
sitka0-ocr-012.txt -- [8]
...
sitka0-ocr-024.txt -- [19]
sitka0-ocr-025.txt -- [20]
sitka0-ocr-026.txt -- [21]
sitka0-ocr-027.txt -- [22]
sitka0-ocr-028.txt -- [23]
sitka0-ocr-029.txt -- [24]
sitka0-ocr-030.txt -- [24a]
sitka0-ocr-031.txt -- [25]
...
sitka0-ocr-126.txt -- [109]

***

then we can do a variant of that output, to do the renaming for us:

rename sitka0-ocr-001.txt as sitkaf001.txt 
rename sitka0-ocr-002.txt as sitkaf002.txt 
rename sitka0-ocr-003.txt as sitkaf003.txt 
rename sitka0-ocr-004.txt as sitkaf004.txt 
rename sitka0-ocr-005.txt as sitkap001.txt 
rename sitka0-ocr-006.txt as sitkap002.txt
rename sitka0-ocr-007.txt as sitkap003.txt
rename sitka0-ocr-008.txt as sitkap004.txt
rename sitka0-ocr-009.txt as sitkap005.txt
rename sitka0-ocr-010.txt as sitkap006.txt
rename sitka0-ocr-011.txt as sitkap007.txt
rename sitka0-ocr-012.txt as sitkap008.txt
...
rename sitka0-ocr-024.txt as sitkap019.txt
rename sitka0-ocr-025.txt as sitkap020.txt
rename sitka0-ocr-026.txt as sitkap021.txt
rename sitka0-ocr-027.txt as sitkap022.txt
rename sitka0-ocr-028.txt as sitkap023.txt
rename sitka0-ocr-029.txt as sitkap024.txt
rename sitka0-ocr-030.txt as sitkap024a.txt
rename sitka0-ocr-031.txt as sitkap025.txt
...
rename sitka0-ocr-126.txt as sitkap109.txt

***

remember that we have to do the scan files as well.
(we'll just do a global change from ".txt" to ".png".)

rename sitka0-ocr-001.png as sitkaf001.png 
rename sitka0-ocr-002.png as sitkaf002.png 
rename sitka0-ocr-003.png as sitkaf003.png 
rename sitka0-ocr-004.png as sitkaf004.png 
rename sitka0-ocr-005.png as sitkap001.png 
rename sitka0-ocr-006.png as sitkap002.png
rename sitka0-ocr-007.png as sitkap003.png
rename sitka0-ocr-008.png as sitkap004.png
rename sitka0-ocr-009.png as sitkap005.png
rename sitka0-ocr-010.png as sitkap006.png
rename sitka0-ocr-011.png as sitkap007.png
rename sitka0-ocr-012.png as sitkap008.png
...
rename sitka0-ocr-024.png as sitkap019.png
rename sitka0-ocr-025.png as sitkap020.png
rename sitka0-ocr-026.png as sitkap021.png
rename sitka0-ocr-027.png as sitkap022.png
rename sitka0-ocr-028.png as sitkap023.png
rename sitka0-ocr-029.png as sitkap024.png
rename sitka0-ocr-030.png as sitkap024a.png
rename sitka0-ocr-031.png as sitkap025.png
...
rename sitka0-ocr-126.png as sitkap109.png

***

this example makes it pretty clear that -- if you only
leave the pagenumbers in the o.c.r., just leave 'em! --
it's pretty easy to use them to name your files wisely...

pagenumbers in the runhead are easy to grab as well.
they're either at the right side of the runhead (if odd)
or at the left side of the runhead (on the even pages).

(the runhead is usually the first line in the file, right?,
but sometimes the pagenumber drops to the second.
still it's usually the first _number_ you find in the file,
so it's easy enough to code your script to look for that.)

again, you have to check them!, to make sure they were
recognized correctly, so you can fix 'em if they weren't.

but once you've got them all in place, you are golden...

and the beauty is that now your files are named wisely!
you'll always know page 23 is in the file named "p023",
and page 46 is in "p046", and page 123 is in "p123"...

moreover, when you want to go to page 46, you will
actually _end_up_ on page 46, not some other page
that is kinda close, depending on what the "offset" is!

***

and here's another nice thing.   you'll notice that we
had some unnumbered pages that were named with
an appended "a"?   well, we need to keep the recto and
the verso straight, if we want to make good e-books,
so we can't just add an "a" without a backside "b" too.

but hey, that's no problem at all!   after each "a" page,
we just slide in a "b" name underneath it, and presto!,
our recto/verso is right again.   and we didn't have to
_readjust_ all filenames that followed each "insertion",
because those files were wisely-named to begin with.

***

there's one more thing to talk about: coding apps...

(if you don't do coding, you can leave now if you want;
but it probably won't hurt you to read the rest of this.
you made it _this_ far, so you must be a glutton for it.)

first let's get the necessary admission out of the way...

it's very easy to do your coding when you name your
files in a stupid 001.txt-999.txt way, because you can
simply code the number as a shortcut for the filename.

you use an integer for your pagenumber, and it's easy.

your _files_ go from 1 to 999, and so do your _names_.
it's easy to keep track of things; you just go up or down.

because of this ease, i can understand why you _might_
want to keep using those stupid filenames.   but don't...

still, at first, it may not be immediately obvious to you
how to depart from this method.   but it really is simple.

instead of thinking of each filename as a _number_ 
(i.e., an integer), think of it as a "name" (i.e., a string).

yes, the filename has a number _in_ it, and the number
is the _important_part_ (to your end-user), but do not
_think_ of it in this way, at least not for the time being.

think of the filename as a string, nothing but a string...

however, you will _load_ those strings into an _array_...

you'll have as many items in the array as you have files,
and the value of each item will be the _name_ of the file.

then you think of the _index_ for that array as an integer
-- because that's what it is! -- and you use _that_ in the
exact same way you used your pagenumber integer before.

so see, you didn't have to give up the easy convenience of
a number to keep on-track like you thought you'd have to.
your index array goes up and down, just like it did before.

in other words, you can still think of your _files_ as going
from 1 to 999, and increment your array index as before.

but whenever you want to know the _filename_ of a page,
you look-up the value of the array at that index-number.

so let's look at how this would work for our "sitka" book.

the string value of item array #1 would be "sitkaf001".

the string value of item array #2 would be "sitkaf002".

the string value of item array #5 would be "sitkap001",
because it's page 1, and that's where the foreword starts.

the string value of item array #11 would be "sitkap007",
because that's page 7, and that's where chapter 1 starts...

and the string value of item array #126 would be "sitkap109",
and that's the map that's on the last (recto) page in the book.

(of course, there will be a blank verso that'll be "sitkap110",
since a book cannot have an odd number of pages, can it?)

so the last question is "how do i populate the filename array?"

there are various ways you can do it, but two good ones are:
1.   read the book's subdirectory to glean the graphic filenames.
2.   create a "map" file intended to provide the graphic filenames.

you can also combine these 2 methods as "belt and suspenders";
you create a map file, but your viewer-app confirms the map by
reading the subdirectory to ensure all the graphic files are there.

it's not nearly as difficult to create a "map" file as you might think.

for instance, look closely at the sitka file we're working on:
>    http://z-m-l.com/go/jimad/sitka0-ocr.txt

just pull out the separator lines, and you've got your map file.

of course, the current version of that file is using the current
stupid filenames, but you can generate a new concatenated file
after you've renamed your .txt files, and your map will be fine.

you can also just view your subdirectory structure in a browser,
and copy out the filenames, and save them in a file, and bingo!,
there's your map file.

myself, with z.m.l., i use the separator-line method, as you can
see if you look at any paginated z.m.l. file.   the lines which have
double-braces enclosing a graphic filename constitute the map.

***

all in all, if you start naming your files intelligently, you'll find
that the benefits far outweigh the costs of doing any rename...

still, i've tried here, in this post, to show you how to do a rename
in the easiest possible way.   just remember not to do like d.p. --
_and_keep_the_darn_filename_information_in_your_o.c.r._files_...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/84fd5293/attachment-0001.html>

From Bowerbird at aol.com  Fri Mar 19 06:03:54 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Mar 2010 09:03:54 EDT
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
Message-ID: <6dee.30b7d3b4.38d4d03a@aol.com>

keith, you have no pudding.

you have a lot of cards, which purport to have recipes on them,
but i cannot make heads or tails of them, and they certainly have
no taste, nor can they be eaten, so -- not to be mean or anything,
but -- what good are they?

or maybe it's just me.   if someone else can explain to us
just exactly what it is that keith is talking about, please do.

thank you.   have a nice day.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/7ab2c2f7/attachment.html>

From jimad at msn.com  Fri Mar 19 10:59:21 2010
From: jimad at msn.com (James Adcock)
Date: Fri, 19 Mar 2010 10:59:21 -0700
Subject: [gutvol-d] [SPAM] RE:   Re: jim,
	i have some questions about pgdiff output
In-Reply-To: <8c234.435b34c3.38d40831@aol.com>
References: <8c234.435b34c3.38d40831@aol.com>
Message-ID: <SNT120-DS355C3F7193E8A720D0192AE2A0@phx.gbl>

I've put up a new copy of the tool pgdiff that contains an option "-smarted"
which outputs the text in a form similar to what I think you want BB for
your "smart editor" tool.  It is similar to what pgdiff originally output
but that output I had found too tedious and verbose for my taste when I am
editing the output using a regex editor.  Your suggestions work in simple
cases but  I think you will find that they fail relatively spectacularly on
difficult cases, such as when performing versioning across different
editions.

 
I also updated the example output file "BBoutput.txt" to show the new
output. 

 
"Non-diff" material will show up in the output if the "Non-diff" material is
in a mixed order.  For example if the two files have:

 
The quick dog jumps...

 
And the other file has:

 
The dog quick jumps.

 
Then dog and/or quick will show up in the edits because there is no way you
can do a Levenshtein edit that doesn't include both "dog" and "quick"
because the Levenshtein measure doesn't include a notion of "reverse the
order of these two tokens."  Also you may THINK two tokens are identical but
they aren't identical unless they ARE identical - the measure also doesn't
have a notion of "these two tokens look really similar so I want them to
match up."  Either tokens match or they don't.  So in the case of:

 
The quick dogs jumps..

 
Vs.

 
The dog quick jumps..

 
The algorithm isn't going to try to match up "dog" and "dogs" because it has
no notion of token "similarity" - "dog" and "dogs" are simply two different
tokens and they don't match.  Further, even if they do match they still may
not compare to each other if there are nearby edits that also don't match,
such that the total number of "insert" "delete" and "substitute" edits is
minimized by NOT making the two identical tokens match up.  If you look
carefully at the output of diff you will see it has the same problem (where
a "token" is a line of text not a word) - diff DOES NOT always
"successfully" match up two lines of identical text - because like pgdiff
diff isn't trying to maximize the number of token matches, rather it is
trying to minimize the number of Levenshtein edits.

 
Again, the problem is basically the domain you are interested in working on
and the domain I am interested in working on is very different.  You want a
tool that catches small changes within a line of text, and I want a tool
that catches large changes within a file.  It is easy to hypothesize what
the "answer" is if you are not the one doing the work.  But if you are the
one doing the work you rapidly find "oops that idea doesn't work after all!"
The real goal of the tool is to find places in the text where a human bean
needs to step in to fix the problem, and that it does extremely well when
the human bean is driving a regex editor and looking at a copy of the
original bitmap page. If one wants to try to do a "smart editor" sometimes
its going to work and other times its going to fail spectacularly - other
than identifying there IS a problem - and then again the human bean is going
to have to sort out and fix the problem.  In the worse case this involves
deleting the text being questioned and typing in the text seen on the bitmap
page - which again is not typically a terrible situation - if you have a
tool that will point you to the problem in the first place which certainly
pgdiff does.

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/c6ef3a16/attachment.html>

From jimad at msn.com  Fri Mar 19 11:29:21 2010
From: jimad at msn.com (James Adcock)
Date: Fri, 19 Mar 2010 11:29:21 -0700
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
In-Reply-To: <72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de>
References: <SNT120-DS15877230ABCA04706CB5DDAE2B0@phx.gbl>
	<72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de>
Message-ID: <SNT120-DS9C3ABA1C4C141C7559697AE2A0@phx.gbl>

>	Proofing is per se linear,  has relatively few differences, and is
aided by
	humans, and a new version is to be created and not a merge.
	The process is simple compare text A and B as long as they are equal
	and then gather the information as long as the differ, present the
difference,
	offer possible changes, continue.
	Without much analysis one can see that this process is linear.

Agreed -- although again you run into problems when your assumptions break
down.  Pgdiff wasn't intended for these simply "change a couple letters
within a line of text" problems.  It was intended for problems of the nature
of "I have two different editions of the text from two different continents
one using English spellings and one using American spellings and having
different linebreaks and different pagebreak and different intros and
censorship and different indexes and I want to use one to help find scannos
in the other."  Yes it can be used for simpler tasks but if you have a
simpler task you might be better off to figure out exactly what that task is
and write a tool to match that task.  Human edits within line tend to be
char-by-char and you might be better off using a Levenshtein measure with
the "token" set to be a char and the "string" set to be a line of text -- to
give an obvious example -- since its not obvious to me how someone uses a
mouse and a keyboard to make changes other than "insert a char" "delete a
char" or "substitute a char" -- unless one uses cut and paste, in which case
all assumptions are off again....


From jimad at msn.com  Fri Mar 19 11:35:21 2010
From: jimad at msn.com (James Adcock)
Date: Fri, 19 Mar 2010 11:35:21 -0700
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <6840.ffe3dd8.38d4cebc@aol.com>
References: <6840.ffe3dd8.38d4cebc@aol.com>
Message-ID: <SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>

>ok, on the "good news" front, it appears that rfrank has
finally decided to start naming his files more wisely, so
big respect to the people who steered in that direction.

How do you propose to deal with texts that have a large number of "prefix"
pages numbered something like "iii" for example?

 
How do you propose to deal with texts that have a large number of "prefix"
pages which are not numbered at all?

 
How do you propose to deal with texts where the numbering scheme was screwed
up in the original text?

 
How do you propose to deal with texts which do not count illustration pages
in their numbering scheme? 

 
Etc.

 
Again, it's great to have a simple system that works except when it doesn't
work in which case it's not so simple anymore.

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/834556df/attachment.html>

From Bowerbird at aol.com  Fri Mar 19 13:11:48 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Mar 2010 16:11:48 EDT
Subject: [gutvol-d] Re: "that doesn't work"
Message-ID: <23e82.33ce9749.38d53484@aol.com>

jim said:
>    I?ve put up a new copy of the tool pgdiff
>    that contains an option ?-smarted? which 
>    outputs the text in a form similar to what 
>    I think you want BB for your ?smart editor? tool.? 

i'm sure some of your users will enjoy that option, jim...
as you might expect, i'll likely stick with my own tools...
but yes, this new option might allow me to perfect the
tool that i've built in support of your pgdiff tool, i hope.


>    Your suggestions work in simple cases but?I think 
>    you will find that they fail relatively spectacularly 
>    on difficult cases, such as when performing versioning
>    across different editions.

well, if i'm gonna fail, please let me fail "spectacularly".

i compare different editions using a different technique;
essentially i do a _paragraph-level_ comparison for that.
it's easy enough to unwrap texts to the paragraph level.

indeed, i do paragraph-level analyses in my comparisons
all the time.   that's how i catch the paragraphing glitches.

(it's also necessary to work at the paragraph level when
you're fixing spacey-quotes, as i have mentioned before.)


>    I also updated the example output file ?BBoutput.txt? 
>    to show the new output.

great.   i'll go get it this afternoon...


>    Again, the problem is basically the domain 
>    you are interested in working on and the domain 
>    I am interested in working on is very different.

actually, they're not.   but that's another question for
another day.   here today's issue is finding and fixing
errors by comparing two versions which are similar...


>    You want a tool that catches small changes 
>    within a line of text, and I want a tool that catches 
>    large changes within a file.

two rejoinders.

first, my tools are capable of finding "large differences"
if they are what exist.   but, like i just said, that arena is
not of much particular interest here on the p.g. listserve.

second, i have -- without knowing it at first -- worked
on doing comparisons between what turned out to be
different editions of a book.   and most of the changes
were not "large" ones, but rather "small" ones, notably
punctuation variations reflecting different house "styles".

i discussed this particular comparison at _great_ length
over on the d.p. forums, under a thread with a title like
"a revolutionary method of proofing", if you're interested.


>    It is easy to hypothesize what the ?answer? is 
>    if you are not the one doing the work.

i agree.   that's why i suggested we work on actual data.

i find it best if i don't bias my research by selecting the
data that i work on, so i work on other people's stuff,
which is why i choose that book from rfrank.   however,
if you want to share some data on a book of your own,
one you're working on, i would be happy to look at it...


>    But if you are the one doing the work you rapidly find 
>    ?oops that idea doesn?t work after all!?? 

you know, i hear a lot of people saying "that doesn't work".

but usually, they're being bamboozled by some _small_
issue that can be overcome quite easily if they just try...

a good example of that was yesterday, when juliet said
"your renaming solution won't work because pagenumbers
are often misrecognized."   well, yeah, that happens, but
that particular "obstacle" can be hurdled with little effort.

so i invite you to bring any "doesn't work" problems to me...
i like the challenge of seeing if i can make it work regardless.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/5ab0e199/attachment-0001.html>

From Bowerbird at aol.com  Fri Mar 19 13:33:30 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Mar 2010 16:33:30 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <254bd.5074f290.38d5399a@aol.com>

jim said:
>    How do you propose to deal with texts that have a large number 
>    of ?prefix? pages numbered something like ?iii? for example?
>
>    How do you propose to deal with texts that have a large number
>    of ?prefix? pages which are not numbered at all?
>
>    How do you propose to deal with texts where the numbering 
>    scheme was screwed up in the original text?
>
>    How do you propose to deal with texts which do not count 
>    illustration pages in their numbering scheme?
>
>    Etc.
>
>    Again, it?s great to have a simple system that works except 
>    when it doesn?t work in which case it?s not so simple anymore.

gee, jim.

i just talked about how people get bamboozled by small issues,
which can be hurdled quite easily if you just set your mind to it...

and here you make a reply with a whole handful of small issues.
not even "small", really...   more like _tiny_...   even _teeny-tiny_...

indeed, if you really look at the example i discussed, you'll see
that several of your questions were answered there _already_...

so i'm not even going to go through the exercise of answering.

if you really want answers, you can generate them yourself, or
go back and look where i have been discussing this issue for
_many_years_, and review any one of those exhausting threads.

there _is_ such a thing as a stupid question.   i've asked them
myself, as have all of us.   and jim, you just asked a _handful._

but you know, jim, the thing i'm wondering is this...

i've held this position on intelligent filenaming conventions
for _years_ now.   and that's just counting on _this_listserve._
i've been practicing what i preach for about two decades now.

if there was really some problem with my system, don't you
think i would have discovered it by now?   do you really think
that you can come up with a reaction in your first 5 minutes
that i haven't experienced in the years and years and years
i've been doing intensely close analysis of book digitization?

i mean, _seriously_...   did you really think i just happened to
"overlook" that books generally have forward-matter pages,
and that those pages have a different pagenumber sequence?

and do you really think i just hadn't ever noticed that some
of the illustration-pages in books are unnumbered pages?

really?

so let me say this _again_, jim...   if you want to have dialog
with me, you _cannot_ say stupid things.   you simply cannot.

because i won't continue to talk with you if you do.   capiche?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/f8da5531/attachment.html>

From ajhaines at shaw.ca  Fri Mar 19 17:09:19 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Fri, 19 Mar 2010 17:09:19 -0700
Subject: [gutvol-d] Re: save those pagenumber references
References: <6840.ffe3dd8.38d4cebc@aol.com>
	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>
Message-ID: <B8A89652BE3F41F28761735AFF3551E7@alp2400>

Jim, the material below describes the scanset naming standard used by PG. 
For an example, go to http://www.gutenberg.org/etext/25896, click on the 
Base Directory link at the bottom of the book's catalog info, above the 
actual files.  You should see a 25896-page-images folder.  Click on it to 
see the actual files.

Al


Basic format:

The prefix for the cover pages is: "c".
The prefix for the roman pages is: "f".
The prefix for the arabic pages is: "p".

***

For blank pages there should be no file and the page number should be
skipped. Optionally an image saying: "This page is blank in the
original." may be inserted.

***

Example of file naming:

  front cover      c0001.png
  back cover       c0002.png
  spine            c0003.png

  i title page     f0001.png
  ii title verso   f0002.png
  iii dedication   f0003.png
  iv is blank
  v contents       f0005.png

  page 1           p0001.png
  page 2           p0002.png
  image on page 2  p0002-image1.png
  image on page 2  p0002-image2.png
  page 3           p0003.png
  page 4 is blank
  page 5           p0005.png
  ...              ...
  page 9999        p9999.png


----- Original Message ----- 
From: James Adcock
To: 'Project Gutenberg Volunteer Discussion'
Sent: Friday, March 19, 2010 11:35 AM
Subject: [gutvol-d] Re: save those pagenumber references


>ok, on the "good news" front, it appears that rfrank has
finally decided to start naming his files more wisely, so
big respect to the people who steered in that direction.

How do you propose to deal with texts that have a large number of "prefix" 
pages numbered something like "iii" for example?

How do you propose to deal with texts that have a large number of "prefix" 
pages which are not numbered at all?

How do you propose to deal with texts where the numbering scheme was screwed 
up in the original text?

How do you propose to deal with texts which do not count illustration pages 
in their numbering scheme?

Etc.

Again, it's great to have a simple system that works except when it doesn't 
work in which case it's not so simple anymore.


_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d 


From dakretz at gmail.com  Fri Mar 19 17:27:56 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 19 Mar 2010 17:27:56 -0700
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <B8A89652BE3F41F28761735AFF3551E7@alp2400>
References: <6840.ffe3dd8.38d4cebc@aol.com>
	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>
	<B8A89652BE3F41F28761735AFF3551E7@alp2400>
Message-ID: <627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com>

So far this "spec" seems to be primarily a legend.

Is it documented anywhere?

On Fri, Mar 19, 2010 at 5:09 PM, Al Haines (shaw) <ajhaines at shaw.ca> wrote:

> Jim, the material below describes the scanset naming standard used by PG.
> For an example, go to http://www.gutenberg.org/etext/25896, click on the
> Base Directory link at the bottom of the book's catalog info, above the
> actual files.  You should see a 25896-page-images folder.  Click on it to
> see the actual files.
>
> Al
>
>
>
> Basic format:
>
> The prefix for the cover pages is: "c".
> The prefix for the roman pages is: "f".
> The prefix for the arabic pages is: "p".
>
> ***
>
> For blank pages there should be no file and the page number should be
> skipped. Optionally an image saying: "This page is blank in the
> original." may be inserted.
>
> ***
>
> Example of file naming:
>
>  front cover      c0001.png
>  back cover       c0002.png
>  spine            c0003.png
>
>  i title page     f0001.png
>  ii title verso   f0002.png
>  iii dedication   f0003.png
>  iv is blank
>  v contents       f0005.png
>
>  page 1           p0001.png
>  page 2           p0002.png
>  image on page 2  p0002-image1.png
>  image on page 2  p0002-image2.png
>  page 3           p0003.png
>  page 4 is blank
>  page 5           p0005.png
>  ...              ...
>  page 9999        p9999.png
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/c5e6c7bc/attachment.html>

From ajhaines at shaw.ca  Fri Mar 19 17:46:12 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Fri, 19 Mar 2010 17:46:12 -0700
Subject: [gutvol-d] Re: save those pagenumber references
References: <6840.ffe3dd8.38d4cebc@aol.com>
	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>
	<B8A89652BE3F41F28761735AFF3551E7@alp2400>
	<627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com>
Message-ID: <14A40D8B90F144F7937A4DDF3839F373@alp2400>

No.  It was developed and used by Joshua Hutchinson, when he used to post DP scansets to PG.  I got the material from his emails to the WWers a few months ago.

So far as I know, only Joshua has done this with scans.  Some submitters (usually from DP) incorporate scansets into their HTML files, with the file's page numbers linked to the scans.  Offhand, I don't have any examples of this.


  ----- Original Message ----- 
  From: don kretz 
  To: Project Gutenberg Volunteer Discussion 
  Sent: Friday, March 19, 2010 5:27 PM
  Subject: [gutvol-d] Re: save those pagenumber references


  So far this "spec" seems to be primarily a legend.

  Is it documented anywhere?


  On Fri, Mar 19, 2010 at 5:09 PM, Al Haines (shaw) <ajhaines at shaw.ca> wrote:

    Jim, the material below describes the scanset naming standard used by PG. For an example, go to http://www.gutenberg.org/etext/25896, click on the Base Directory link at the bottom of the book's catalog info, above the actual files.  You should see a 25896-page-images folder.  Click on it to see the actual files.

    Al


    Basic format:

    The prefix for the cover pages is: "c".
    The prefix for the roman pages is: "f".
    The prefix for the arabic pages is: "p".

    ***

    For blank pages there should be no file and the page number should be
    skipped. Optionally an image saying: "This page is blank in the
    original." may be inserted.

    ***

    Example of file naming:

     front cover      c0001.png
     back cover       c0002.png
     spine            c0003.png

     i title page     f0001.png
     ii title verso   f0002.png
     iii dedication   f0003.png
     iv is blank
     v contents       f0005.png

     page 1           p0001.png
     page 2           p0002.png
     image on page 2  p0002-image1.png
     image on page 2  p0002-image2.png
     page 3           p0003.png
     page 4 is blank
     page 5           p0005.png
     ...              ...
     page 9999        p9999.png


------------------------------------------------------------------------------


  _______________________________________________
  gutvol-d mailing list
  gutvol-d at lists.pglaf.org
  http://lists.pglaf.org/mailman/listinfo/gutvol-d
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/af333a16/attachment.html>

From dakretz at gmail.com  Fri Mar 19 18:58:51 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 19 Mar 2010 18:58:51 -0700
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <14A40D8B90F144F7937A4DDF3839F373@alp2400>
References: <6840.ffe3dd8.38d4cebc@aol.com>
	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>
	<B8A89652BE3F41F28761735AFF3551E7@alp2400>
	<627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com>
	<14A40D8B90F144F7937A4DDF3839F373@alp2400>
Message-ID: <627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com>

OK - I got those from Joshua and they formed the requirements for Version 1
of Twister.

Here's an extensive forum thread on DP
<http://www.pgdp.net/phpBB2/viewtopic.php?t=32038>where we hashed this all
out.


On Fri, Mar 19, 2010 at 5:46 PM, Al Haines (shaw) <ajhaines at shaw.ca> wrote:

>  No.  It was developed and used by Joshua Hutchinson, when he used to post
> DP scansets to PG.  I got the material from his emails to the WWers a few
> months ago.
>
> So far as I know, only Joshua has done this with scans.  Some submitters
> (usually from DP) incorporate scansets into their HTML files, with the
> file's page numbers linked to the scans.  Offhand, I don't have any examples
> of this.
>
>
>
> ----- Original Message -----
> *From:* don kretz <dakretz at gmail.com>
> *To:* Project Gutenberg Volunteer Discussion <gutvol-d at lists.pglaf.org>
> *Sent:* Friday, March 19, 2010 5:27 PM
> *Subject:* [gutvol-d] Re: save those pagenumber references
>
> So far this "spec" seems to be primarily a legend.
>
> Is it documented anywhere?
>
> On Fri, Mar 19, 2010 at 5:09 PM, Al Haines (shaw) <ajhaines at shaw.ca>wrote:
>
>> Jim, the material below describes the scanset naming standard used by PG.
>> For an example, go to http://www.gutenberg.org/etext/25896, click on the
>> Base Directory link at the bottom of the book's catalog info, above the
>> actual files.  You should see a 25896-page-images folder.  Click on it to
>> see the actual files.
>>
>> Al
>>
>>
>>
>> Basic format:
>>
>> The prefix for the cover pages is: "c".
>> The prefix for the roman pages is: "f".
>> The prefix for the arabic pages is: "p".
>>
>> ***
>>
>> For blank pages there should be no file and the page number should be
>> skipped. Optionally an image saying: "This page is blank in the
>> original." may be inserted.
>>
>> ***
>>
>> Example of file naming:
>>
>>  front cover      c0001.png
>>  back cover       c0002.png
>>  spine            c0003.png
>>
>>  i title page     f0001.png
>>  ii title verso   f0002.png
>>  iii dedication   f0003.png
>>  iv is blank
>>  v contents       f0005.png
>>
>>  page 1           p0001.png
>>  page 2           p0002.png
>>  image on page 2  p0002-image1.png
>>  image on page 2  p0002-image2.png
>>  page 3           p0003.png
>>  page 4 is blank
>>  page 5           p0005.png
>>  ...              ...
>>  page 9999        p9999.png
>>
>>
>>  ------------------------------
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100319/260b1f73/attachment-0001.html>

From Bowerbird at aol.com  Sat Mar 20 09:43:42 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Mar 2010 12:43:42 EDT
Subject: [gutvol-d] Re: save those pagenumbers!
Message-ID: <4e8ac.4d6c005e.38d6553e@aol.com>

i have a reply in the works, but i won't post it until monday...
so keep your minds open until then, and have a nice weekend!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100320/a29e5416/attachment.html>

From Bowerbird at aol.com  Sat Mar 20 18:01:13 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Mar 2010 21:01:13 EDT
Subject: [gutvol-d] a sitka smoothreading glitch
Message-ID: <62f1f.ad509d4.38d6c9d9@aol.com>

rfrank's roundless site -- fadedpage.com -- now has
their "sitka" book available for _smooth-reading_, at:
>    http://www.fadedpage.com/s/sitka/sitka.htm

i'll have a lot of nice things to say about rfrank's work,
because his e-books really do look quite clean and nice,
but -- since this book is in-process and all -- for now,
i'll just report a little glitch, on (waitforit) pagenumbers.

seems a stray page-indicator found its way onto page 96,
so that page 96 is now incorrectly short, with some of its
text now being shown on what's called page 97, and with
every page after page 96 having its pagenumber off by 1,
so the last words ("may be made") are incorrectly reported
as page 109, when in actuality they occurred on page 108.

that kind of thing can happen when the pagenumbers are
something that you recompute at the end of the process,
rather than something integral to your entire workflow...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100320/764ba787/attachment.html>

From joshua at hutchinson.net  Sun Mar 21 07:22:53 2010
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Sun, 21 Mar 2010 14:22:53 +0000 (GMT)
Subject: [gutvol-d] Re: save those pagenumber references
References: <6840.ffe3dd8.38d4cebc@aol.com>
	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>
	<B8A89652BE3F41F28761735AFF3551E7@alp2400>
	<627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com>
	<14A40D8B90F144F7937A4DDF3839F373@alp2400>
	<627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com>
Message-ID: <102322535.23878.1269181373985.JavaMail.mail@webmail04>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100321/947c7817/attachment.html>

From gbuchana at teksavvy.com  Sun Mar 21 09:45:19 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Sun, 21 Mar 2010 12:45:19 -0400
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <102322535.23878.1269181373985.JavaMail.mail@webmail04>
References: <6840.ffe3dd8.38d4cebc@aol.com>	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>	<B8A89652BE3F41F28761735AFF3551E7@alp2400>	<627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com>	<14A40D8B90F144F7937A4DDF3839F373@alp2400>	<627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com>
	<102322535.23878.1269181373985.JavaMail.mail@webmail04>
Message-ID: <4BA64D1F.5030008@teksavvy.com>

Is there any suggestion what a formatted text-only book that
retains page numbers should look like?  Is it reasonable to just
sprinkle them into the text, maybe something like this:

---------
Captain Headley, musingly pressing his hand to his brow, "and how
unfortunate. Had Winnebeg brought General Hull's despatch one day
sooner, all this would not have happened, for they never could have
obtained [35] permission to leave the fort, much less to visit so
dangerous a vicinity as Hardscrabble. Our march from this would
have changed the whole current of events."

"Even so," returned Mrs. Headley; "but here is a packet, left with
Serjeant Nixon, which he has just handed to me, and which may throw
some light on the subject. I will first glance over it myself."
-----------
"God bless you, Ronayne! Alas, you are not alone in, your trials--much
of moment awaits us all. Good night!"

And, assuming her disguise, she speedily regained her home.


[44]


CHAPTER X.

    "Ne'er may he live to see a sunshine day that cries--Retire,
    when Warwick bids him stay."
       --_Henry IV._

On the western bank of the south side of the Chicago River, and
-----------


============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From ajhaines at shaw.ca  Sun Mar 21 10:34:05 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Sun, 21 Mar 2010 10:34:05 -0700
Subject: [gutvol-d] Re: save those pagenumber references
References: <6840.ffe3dd8.38d4cebc@aol.com>
	<SNT120-DS19D4FAC9C7BD9E9F22A0AAAE2A0@phx.gbl>
	<B8A89652BE3F41F28761735AFF3551E7@alp2400>
	<627d59b81003191727i19c63e85t8f3897fb05089f2@mail.gmail.com>
	<14A40D8B90F144F7937A4DDF3839F373@alp2400>
	<627d59b81003191858q7c3bd468v2869a8185d875653@mail.gmail.com>
	<102322535.23878.1269181373985.JavaMail.mail@webmail04>
	<4BA64D1F.5030008@teksavvy.com>
Message-ID: <5248C1F0A99C4852BE91C37F933C482D@alp2400>

There are two articles in the PG Volunteers' FAQ about page numbers:

http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.98._Should_I_keep_page_numbers_in_the_e-text.3F

and

http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.99._In_the_exceptional_cases_where_I_keep_page_numbers.2C_how_should_I_format_them.3F


My personal practice is to use curly braces for page numbers and square 
brackets for footnote numbers.  I include page numbers in an etext only if 
the book has internal references of some kind, e.g. footnotes that refer to 
specific pages, an index, or a table of contents that's sufficiently 
detailed as to function as an index.  I number only the first page of an 
index, since I've never seen one with references to elsewhere in itself. 
Two-column indexes are rendered as single-column.

For examples, see http://www.gutenberg.org/etext/19765 or 
http://www.gutenberg.org/etext/30610.

Al


----- Original Message ----- 
From: "Gardner Buchanan" <gbuchana at teksavvy.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
Sent: Sunday, March 21, 2010 9:45 AM
Subject: [gutvol-d] Re: save those pagenumber references


> Is there any suggestion what a formatted text-only book that
> retains page numbers should look like?  Is it reasonable to just
> sprinkle them into the text, maybe something like this:
>
> ---------
> Captain Headley, musingly pressing his hand to his brow, "and how
> unfortunate. Had Winnebeg brought General Hull's despatch one day
> sooner, all this would not have happened, for they never could have
> obtained [35] permission to leave the fort, much less to visit so
> dangerous a vicinity as Hardscrabble. Our march from this would
> have changed the whole current of events."
>
> "Even so," returned Mrs. Headley; "but here is a packet, left with
> Serjeant Nixon, which he has just handed to me, and which may throw
> some light on the subject. I will first glance over it myself."
> -----------
> "God bless you, Ronayne! Alas, you are not alone in, your trials--much
> of moment awaits us all. Good night!"
>
> And, assuming her disguise, she speedily regained her home.
>
>
> [44]
>
>
> CHAPTER X.
>
>    "Ne'er may he live to see a sunshine day that cries--Retire,
>    when Warwick bids him stay."
>       --_Henry IV._
>
> On the western bank of the south side of the Chicago River, and
> -----------
>
>
> ============================================================
> Gardner Buchanan                     <gbuchana at teksavvy.com>
> Ottawa, ON             FreeBSD: Where you want to go. Today.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From Bowerbird at aol.com  Sun Mar 21 11:18:52 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 21 Mar 2010 14:18:52 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <7d0a7.5b337a92.38d7bd0c@aol.com>

al said:
>   My personal practice is

and therein lies the rub.

the p.g. e-texts are rife with "personal practice".
and the d.p. e-texts are soaking in it right now...

one thing you have to know about pagenumbers is
some people need 'em and other people hate 'em...

which means you have to have 'em, and you have to
give people a way to shut them off...   _totally_ off...

the only way to do that is to establish a convention,
so viewer-app developers can make everyone happy.

most "personal practice" implementations try to walk
the tightrope between the two sides, and fail _both_,
in the sense they don't do the _full_ job that the "pro"
people want pagenumbers to do, but yet aren't nearly
as non-invasive as the "anti" people reasonably want.

if a hundred different digitizers do it a hundred ways
-- or a thousand digitizers do it a thousand ways --
nobody is gonna end up happy; we'll all be miserable.

and face it, if both sides are going to end up unhappy,
you might as well flip a coin and make one side happy.

the only way to make it work is to do it _one_way_...
so developers can target the convention successfully.

michael isn't going to prescribe this remedy for p.g.
even if he tried, he probably would not succeed, and
he has made it clear that he doesn't even want to try
and do things like that, as per his basic philosophy...
nobody else has a remote chance of success with p.g.

so alas, it is not to be.

but perhaps it doesn't matter.

because it's becoming increasingly clear that the only
cyberlibrary that's going to matter is the google one,
and -- after a few missteps at the very beginning --
google has gotten pretty smart about pagenumbers...

so whatever conventions they establish will stick.

***

but, to answer the question...

gardner said: 
>    Is there any suggestion what 
>    a formatted text-only book that
>    retains page numbers should look like?
>    Is it reasonable to just sprinkle them 
>    into the text, maybe something like this:

there's nothing difficult about the issue, technically.

you wouldn't want to "sprinkle them" thoughtlessly,
but any number of _well-specified_conventions_ can
handle the tiny number of wrinkles that do crop up...

(tell me if you want me to dredge my memory-bank
to catalog them, but there seriously aren't too many.)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100321/581b310b/attachment-0001.html>

From ajhaines at shaw.ca  Sun Mar 21 11:33:44 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Sun, 21 Mar 2010 11:33:44 -0700
Subject: [gutvol-d] Re: save those pagenumber references
References: <7d0a7.5b337a92.38d7bd0c@aol.com>
Message-ID: <C123B3E396BA4B649A0474AB2B9AE458@alp2400>

What bowerbird failed (or didn't bother) to mention was that using curly braces for page numbers and square brackets for footnotes are practices that are documented in PG's Volunteers' FAQ (V.98, V.99, V.103).  As such, my "personal practice" is not an invention of my own, but are PG-standard, documented, practices that I've adopted for my projects.

Al

  ----- Original Message ----- 
  From: Bowerbird at aol.com 
  To: gutvol-d at lists.pglaf.org ; bowerbird at aol.com 
  Sent: Sunday, March 21, 2010 11:18 AM
  Subject: [gutvol-d] Re: save those pagenumber references


  al said:
  >   My personal practice is

  and therein lies the rub.

  the p.g. e-texts are rife with "personal practice".
  and the d.p. e-texts are soaking in it right now...

  one thing you have to know about pagenumbers is
  some people need 'em and other people hate 'em...

  which means you have to have 'em, and you have to
  give people a way to shut them off...  _totally_ off...

  the only way to do that is to establish a convention,
  so viewer-app developers can make everyone happy.

  most "personal practice" implementations try to walk
  the tightrope between the two sides, and fail _both_,
  in the sense they don't do the _full_ job that the "pro"
  people want pagenumbers to do, but yet aren't nearly
  as non-invasive as the "anti" people reasonably want.

  if a hundred different digitizers do it a hundred ways
  -- or a thousand digitizers do it a thousand ways --
  nobody is gonna end up happy; we'll all be miserable.

  and face it, if both sides are going to end up unhappy,
  you might as well flip a coin and make one side happy.

  the only way to make it work is to do it _one_way_...
  so developers can target the convention successfully.

  michael isn't going to prescribe this remedy for p.g.
  even if he tried, he probably would not succeed, and
  he has made it clear that he doesn't even want to try
  and do things like that, as per his basic philosophy...
  nobody else has a remote chance of success with p.g.

  so alas, it is not to be.

  but perhaps it doesn't matter.

  because it's becoming increasingly clear that the only
  cyberlibrary that's going to matter is the google one,
  and -- after a few missteps at the very beginning --
  google has gotten pretty smart about pagenumbers...

  so whatever conventions they establish will stick.

  ***

  but, to answer the question...

  gardner said: 
  >   Is there any suggestion what 
  >   a formatted text-only book that
  >   retains page numbers should look like?
  >   Is it reasonable to just sprinkle them 
  >   into the text, maybe something like this:

  there's nothing difficult about the issue, technically.

  you wouldn't want to "sprinkle them" thoughtlessly,
  but any number of _well-specified_conventions_ can
  handle the tiny number of wrinkles that do crop up...

  (tell me if you want me to dredge my memory-bank
  to catalog them, but there seriously aren't too many.)

  -bowerbird


------------------------------------------------------------------------------


  _______________________________________________
  gutvol-d mailing list
  gutvol-d at lists.pglaf.org
  http://lists.pglaf.org/mailman/listinfo/gutvol-d
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100321/4a580747/attachment.html>

From dakretz at gmail.com  Sun Mar 21 12:16:31 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 21 Mar 2010 12:16:31 -0700
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <C123B3E396BA4B649A0474AB2B9AE458@alp2400>
References: <7d0a7.5b337a92.38d7bd0c@aol.com>
	<C123B3E396BA4B649A0474AB2B9AE458@alp2400>
Message-ID: <627d59b81003211216l1e020e97p99fe300aa2bf4d38@mail.gmail.com>

PG needs for age numbers need to be there somewhere because without them
there's no future hope for controlled/moderated text refinement. We need
them
to match up the canonical text with the canonical image and quickly verify
that
a proposed correction is legitimate.

Whether the page number needs to be included in the downloaded "plain text"
version, or whether the "plain text" version should be the canonical version
are
separate matters.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100321/6cd568a1/attachment.html>

From schultzk at uni-trier.de  Mon Mar 22 02:02:25 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 22 Mar 2010 10:02:25 +0100
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
In-Reply-To: <6dee.30b7d3b4.38d4d03a@aol.com>
References: <6dee.30b7d3b4.38d4d03a@aol.com>
Message-ID: <3775A2FB-3BD0-499C-BF63-CB0DC894DEE2@uni-trier.de>

BB, 

	What should I say to you.

	To take one of my favorite quotes from Lotfih Zadeh
	"We are still confused, but on a higher level" !

	Though I tell you this much I am at least thinking about
	a proof of concept. I also, am looking for the pieces I need.
	The biggest one will be the OCR engine that has the features I 
	will need. What I can not promise is that I will keep up interest in 
	it. 

	regards
		Keith
		
Am 19.03.2010 um 14:03 schrieb Bowerbird at aol.com:

> keith, you have no pudding.
> 
> you have a lot of cards, which purport to have recipes on them,
> but i cannot make heads or tails of them, and they certainly have
> no taste, nor can they be eaten, so -- not to be mean or anything,
> but -- what good are they?
> 
> or maybe it's just me.  if someone else can explain to us
> just exactly what it is that keith is talking about, please do.
> 
> thank you.  have a nice day.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100322/5e50de73/attachment.html>

From schultzk at uni-trier.de  Mon Mar 22 02:19:54 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 22 Mar 2010 10:19:54 +0100
Subject: [gutvol-d] Re: jim, i have some questions about pgdiff output
In-Reply-To: <SNT120-DS9C3ABA1C4C141C7559697AE2A0@phx.gbl>
References: <SNT120-DS15877230ABCA04706CB5DDAE2B0@phx.gbl>
	<72F0393F-49CB-428D-9E64-6E752997D720@uni-trier.de>
	<SNT120-DS9C3ABA1C4C141C7559697AE2A0@phx.gbl>
Message-ID: <3DEDF2F6-D95C-4820-83D4-2B4A4C1F3EAE@uni-trier.de>

Hi James,

	I do understand the the levenstein measure and actually
	do not think we need to discuss it caveats far as precision
	and sucessfulness.

	An interesting approach by using English and American
	versions. Yet, that makes pgdiff specific to one set of languages.
	
	On the other side. If you out the problem of the forwards, tocs, and
	indices et. al. you could simply try adding in a component that rewrites
	with the others spelling conventions. That I know is no trivial
	task.

	As far as my considering not using diff and just a simple comparison
	method which is linear, the problem of alignment does remain.
	I admit I have done the math or have an exact algorithm but it does
	seem to me that it would be polynominal and still far better than
	n^2. 

	regards
		Keith.

Am 19.03.2010 um 19:29 schrieb James Adcock:

>> 	Proofing is per se linear,  has relatively few differences, and is
> aided by
> 	humans, and a new version is to be created and not a merge.
> 	The process is simple compare text A and B as long as they are equal
> 	and then gather the information as long as the differ, present the
> difference,
> 	offer possible changes, continue.
> 	Without much analysis one can see that this process is linear.
> 
> Agreed -- although again you run into problems when your assumptions break
> down.  Pgdiff wasn't intended for these simply "change a couple letters
> within a line of text" problems.  It was intended for problems of the nature
> of "I have two different editions of the text from two different continents
> one using English spellings and one using American spellings and having
> different linebreaks and different pagebreak and different intros and
> censorship and different indexes and I want to use one to help find scannos
> in the other."  Yes it can be used for simpler tasks but if you have a
> simpler task you might be better off to figure out exactly what that task is
> and write a tool to match that task.  Human edits within line tend to be
> char-by-char and you might be better off using a Levenshtein measure with
> the "token" set to be a char and the "string" set to be a line of text -- to
> give an obvious example -- since its not obvious to me how someone uses a
> mouse and a keyboard to make changes other than "insert a char" "delete a
> char" or "substitute a char" -- unless one uses cut and paste, in which case
> all assumptions are off again....
> 
> 
> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From jimad at msn.com  Mon Mar 22 07:32:44 2010
From: jimad at msn.com (Jim Adcock)
Date: Mon, 22 Mar 2010 07:32:44 -0700
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <254bd.5074f290.38d5399a@aol.com>
References: <254bd.5074f290.38d5399a@aol.com>
Message-ID: <SNT120-DS84641AE06B208C1ACC63CAE270@phx.gbl>

>there _is_ such a thing as a stupid question.  i've asked them myself, as have all of us.  and jim, you just asked a _handful._

Perhaps there is such a thing as a "stupid answer" since the answer you gave recently addresses none of the issues I raised.  


From Bowerbird at aol.com  Mon Mar 22 10:51:55 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Mar 2010 13:51:55 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <bf9cc.abff9ee.38d9083b@aol.com>

jim said:
>    Perhaps there is such a thing as a "stupid answer" since 
>    the answer you gave recently addresses none of the issues I raised.

i was quite clear that i was deliberately not answering your questions,
precisely because they were stupid, and -- even further -- because
their answers were contained in the example i'd given just previously.

i'll work very hard, and go far out of my way, to have a good dialog.
because i value that.   which is the same reason i won't countenance 
someone polluting that dialog with posts that regress the progress.

i've spent years refining my filenaming conventions.   you spent
5 minutes and came up with some kindergarten-level questions.
use another 5 minutes, and you can answer your own questions.

maybe then you'll also know why i would rather spend 10 minutes
of my own time writing _this_ post instead of 5 minutes writing
a post that answered your stupid questions.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100322/54224a61/attachment-0001.html>

From Bowerbird at aol.com  Mon Mar 22 13:02:52 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Mar 2010 16:02:52 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <1aa54.6e0339c6.38d926ec@aol.com>

al said:
>    What bowerbird failed (or didn't bother)?
>    to mention was that 
>    using curly braces for page numbers 
>    and square brackets for footnotes \
>    are practices that are documented in 
>    PG's Volunteers' FAQ (V.98, V.99, V.103).? 
>    As such,?my "personal practice"?is not 
>    an invention of my own, but are?PG-standard, 
>    documented, practices that I've adopted for my projects.

discussions here are often so pointless it's not worth bothering.

and yet i persist.   sometimes i think _i_ must be the stupid one.

but then, no, i realize, no, it's not me that's the stupid one at all.

***

so...

what al failed (or didn't bother) to mention is that
whether or not any contributor _follows_ the f.a.q.
is entirely a personal matter up to them...

and that's why the f.a.q. don't really matter much...

not unless the whitewashers enforce a particular aspect.

and this one has not been enforced.

so the vast majority of the .txt files have no pagenumbers.

(well, actually, it _is_ enforced.   because v.98 actually instructs
producers that they should _not_ keep pagenumbers, except in
"exceptional" cases.   al tried to slip a fast one by us there, eh?)

so p.g. has failed to create a convention about how it is done,
even inside of its own cyberlibrary, let alone _outside_ of itself.

and let me tell you that i respect michael hart's _principled_
decision not to enforce a standard much more than i respect
a naive belief that -- just because it's in the f.a.q. -- you have
established a convention.   i don't respect that naivety at all...

on the other hand, michael's unwillingness to take a stand
_has_ meant that the producers have overruled the f.a.q.

d.p. postprocessors have taken to including pagenumber info
in their .html versions over the course of the last few years...
many now include the pagenumbers as a matter of _routine._

that's the good news.

the bad news is that the laissez faire attitude is paramount
in d.p. postprocessors.   they do things however they want.
and they change how they do things whenever they want to.

so, over the course of those last few years, they've treated
pagenumber info in countless ways, with zero consistency.

so it will be difficult or impossible to construct a "standard"
from the d.p. practices, especially since the information is
buried in the source .html, and not evident on the surface.

it's also the case that there continue to be major problems
with _all_ of their implementations, for reasons that might
well be unavoidable, such as browsers that do not support
the kind of functionalities that might be necessary to walk
that tightrope i talked about between "pro" and "anti" forces.

but, for people who like to view the glass as being 2/10 full
instead of 8/10 empty, please enjoy the fact that the people
who finish off the e-texts at d.p. now value pagenumbers...

yet al still remains clueless...

and his cluelessness moves up to a higher level as well.

because remember that the _reason_ we want a convention
is so that the developers of viewer-programs will support it,
by programming the necessary capabilities into their apps...

does anyone know any app developers who have done that?
i mean, besides _me_ with _my_ apps?   yeah, i thought not...
the convention, even if obtained, is just a means to an end.

and the pointlessness continues...

one of the useful aspects of pagenumbers, as don points out,
is they allow us to refer back to the page-scans of the book...

but the f.a.q. betrays no knowledge of this beneficial purpose,
and thus fails to enlighten the e-text producers of this linkage.

if it _was_ based on this broad goal, the f.a.q. would also show
awareness that pagenumbers per se are but a small part of the
overall needs, along with things like _the_original_linebreaks_
and _the_original_end-line-hyphenates_.   without those other
vital aspects, it's similar to "baking a cake" with sugar as your
only ingredient; the thing you get out at the end won't be cake.

i talk more about this in the reply i drafted over the weekend,
which i still intend to send today, so i won't belabor it now...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100322/1d94aa2f/attachment.html>

From Bowerbird at aol.com  Mon Mar 22 14:55:11 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Mar 2010 17:55:11 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <23a5c.75e8cc64.38d9413f@aol.com>

al said:
>    Basic format:
>    
>    The prefix for the cover pages is: "c".
>    The prefix for the roman pages is: "f".
>    The prefix for the arabic pages is: "p".
>    
>    ***
>    
>    For blank pages there should be no file and 
>    the page number should be skipped. 
>    Optionally an image saying: 
>    "This page is blank in the original."
>    may be inserted.
>    
>    ***
>    
>    Example of file naming:
>    
>     front cover       c0001.png
>     back cover        c0002.png
>     spine             c0003.png
>    
>     i title page      f0001.png
>     ii title verso    f0002.png
>     iii dedication    f0003.png
>     iv is blank
>     v contents        f0005.png
>    
>     page 1            p0001.png
>     page 2            p0002.png
>     image on page 2   p0002-image1.png
>     image on page 2   p0002-image2.png
>     page 3            p0003.png
>     page 4 is blank
>     page 5            p0005.png
>     ...               ...
>     page 9999         p9999.png


dkretz said:
>    So far this "spec" seems to be primarily a legend.
>    Is it documented anywhere?


al said:
>    No.   It was developed and used by Joshua Hutchinson


dkretz said:
>    Here's an extensive forum thread on DP 
>    where we hashed this all out.

oh lord.

***

where do i begin?

seriously, this is such a mess.

where do i begin?

***

well, to start with the last comment first,
this wasn't "hashed out" at all.   it was just
messed up, because josh and marcello are
too stubborn to take good advice from me.

and on a more general level, this all shows
that d.p. and p.g. can mess things up even
when they actually try to do the right thing.

***

so let's go back and examine the problems...

***

we'll need to start with a short history lesson.

years back, there was a push to get the scans
hosted at p.g. with the text, and p.g. said ok.

but when people started posting their scans,
i noticed they had been named very stupidly.

most stupid was that the filenames contained
_numbers_ that were _not_ the _pagenumbers_.
thus the file for page 123 might be "0128.png".

this didn't surprise me, because d.p. has been
naming their scans stupidly for many years...
i'd tried to wise them up, but they didn't listen.

but it's one thing to name _your_ files stupidly,
since you're the only one who works with 'em,
so you're the only one who pays the penalties
of the big costs that stupid filenames impose.

it is quite _another_ thing to name files that
you post in public using a stupid convention,
because the _public_ works with those files...

luckily, the most insane position did not prevail.

p.g. required that all scans must be named
using the same number as the pagenumber.

for a while, anyway, some d.p. people would
rename all the scan-files so they could then
be posted to p.g.   yes, it's stupid to work with
stupidly-named files, because you pay all the
penalties of working with stupidly-named files,
only to rename them to smarter names _after_
you're done working with them, but that's what
d.p. was doing.   for a little while.   until it fizzed.

the good news is that most scans at p.g. are
named with a number that's the pagenumber.

the bad news is that the renaming requirement
essentially means not many scans get posted...

the ugly news is that the names are _still_not_
really intelligent.   they're not _moronic_, but
they're not very intelligent either, not at all...
on an i.q. scale, they'd weigh in at about 87.

thus ends our history lesson to set context...

***

ok, what comes next?

first of all, let's remember the philosophy
that should be a fundamental cornerstone
of _any_ intelligent filenaming convention...

one important principle (the first?) which should
be at work here is that every filename is _unique._

that is, _each_and_every_ file should have a name
that identifies _that_file_ separate from all others.

now, there might be some cases where the same
file might have different names in different places.
(some would argue that; let's put that off for now.)

but an _iron-clad_rule,_ with _no_ exception, is
different files must always have different names.

to say it another way, different files must _never_
have the same name.   _never_, _never_, _never_.

so right at the _very_outset_, the dp/pg model
has failed us...   all of their files are named with
the same p0001.png-p9999.png convention and
thus fail to meet the imperative to be _unique._

how can we tell one file named p0001.png from
_every_other_ file named p0001.png?   we cannot.
and since every book has a p0001.png file, _bad_.

this isn't rocket-science.   it's common sense.
_different_files_should_have_different names!_

we're back in the same old boat where we need
to pay heed to the subdirectory name to know
with certainty which book each file represents.

if the filenames were unique, we could place
every one of our files in a single subdirectory,
and we would have no filename crashes and
we could identify each file as a unique entity,
just from its name, without looking inside it.

i mean, it's great that we know that p0001.png
is a scan of a page that was numbered as page 1
in the book in which it appeared, but the filename
doesn't tell us _which_ book that was, so we are
left out in the cold on the very first step we take.

how sad...   how utterly and thoroughly pathetic...

***

to make my filenames _unique_ to a particular book,
i give each scan in a book a 5-letter unique prefix...

so, for the "sitka" book we've been analyzing lately,
the 5-letter prefix for all the filenames is "sitka"...

in case you're wondering, a 5-letter prefix gives us
26**5 possibilities for unique ones, which computes
to 11 million possibilities.   11.8 million, to be exact,
but some of those might be voided as unusable...

if you feel a need to be able to label more books,
a 6-letter prefix gives 308,915,776. (308+ million.)

a 7-letter prefix gives 8 billion.   8-letter, 208 billion.
let me know when you've got 208 billion documents.
til then, an 8-letter prefix will work just fine, thanks.
indeed, i'm happy with a 5-letter prefix at the moment.

***

ok, so let's go on...

jim said:
>    The prefix for the cover pages is: "c".
>    The prefix for the roman pages is: "f".
>    The prefix for the arabic pages is: "p".

the "c", "f", and "p" convention is one i created...

thankfully, this model was adopted by dp/pg.

but there was a _reason_ i picked those letters,
a good reason, and -- when it came to details --
dp/pg again screwed up with its implementation.

the "p" stands for "page", and that's obvious.

and "c" for "cover" is the obvious choice too.

but some people suggested the front-matter
should have an "r" prefix, for "roman numbers".

know why i rejected "r" in favor of "f", do you?

think about it for a minute, and see if you know.

if you said i chose "f" to stand for "front-matter"
or "forward-matter", you got an "f" on this quiz.

it's a nice mnemonic, sure, but the real reason
why i chose "f" is a much more pragmatic one...
(know any other words that start with "mne"
besides "mnemonic"?   so what is its origin?)

so, did you think of the answer why i used "f"?

to explain why, think back to when i said that
-- in coding your app and getting a "map" of
the files within any specific book by reading
the directory to see what files were there --
a vital component of that strategy will be that
the filenames _sort_in_the_order_they_appear._

that is, we need to know not just the files that
comprise the book, but their appearance order.

so i choose "f" for front-matter pages because
those pages appear between "c" and "p" pages
-- the cover and the arabic-numbered pages --
so the prefix needed to fall between "c" and "p".

and "f" worked just fine.

you should also keep in mind that the letters
"d" and "e" can be used between "c" and "f",
if the idiosyncrasies of a certain book need it.

likewise, there are lots of letters that can be
used between "f" and "p", if a book needs 'em.

and similarly, there are lots of letters _after_
"p" that can be used, for material that might
come _after_ regular arabic-sequence "pages".

but yeah, that's why i chose "f" instead of "r"...

it was so the filenames would _sort_ correctly.

***

and speaking along these lines, it's just plain silly
that dp/pg pads their pagenumbers to 4 places...

the vast majority of books are under 1000 pages,
so padding the pagenumber to 3 places works well.
that fourth padding place just causes more work.

in those rare cases where you have pagenumbers
that run in 4 digits, one can summon the "r" prefix
to signify those pages, so "r000.png" is page 1000,
"r001.png" would be 1001, "r002.png" 1002, etc.

(yes, you could use "q" too.   but as a general rule,
you will leave yourself more flexibility if you do not
choose to use prefixes that are directly adjoining.)

***

the insanity continues...

al says this:
>    For blank pages there should be no file 
>    and the page number should be skipped. 

that's just crazy talk.   include a blank image-file
and name it appropriately, so the world doesn't
suspect that you screwed up and dropped a file.

because that's _exactly_ what they will suspect...

(and with good reason.   skipped pages happen,
a lot, as the world learned from google's work.)

***

...and it goes on and on...

al said:
>     front cover       c0001.png
>     back cover        c0002.png
>     spine             c0003.png

um, no.   bad idea.   very bad idea.   you know how
i said that the sort-order of the filenames should
be identical to their order of appearance, right?

so hopefully you understand that the back-cover
-- i.e., the last thing in the book -- should have
a filename that sorts to the end.   not position #2.
that's assuming that you even need a back-cover.

and the spine?   i suppose if you _must_ have it,
you will be determined to include it, but please
give it a name that sorts it to the end, too, since
for most people it will just be a cute little gesture.
consider it as the mint as you leave the restaurant.

you might also remember that i insisted the files
must reflect the recto/verso aspects of the book.
for every recto file and filename, there _must_ be
a verso file and filename.   once again, if you fail to
maintain this nicety, the world will suspect that you
have lost a file, or that you just do not understand
one of the basic structural aspects of the p-book,
specifically that every piece of paper has two sides.

that's why you always include a blank-page file...

...and why, if you have a file named "c0003.png",
you must also have "c0004.png".   don't forget it.

***

...and on and on...

al said:
>     page 1            p0001.png
>     page 2            p0002.png
>     image on page 2   p0002-image1.png
>     image on page 2   p0002-image2.png
>     page 3            p0003.png
>     page 4 is blank
>     page 5            p0005.png
>     ...               ...
>     page 9999         p9999.png

first off, you can tell this originated from me,
because of the all-lower-case look of it, _but_
i've always padded my numbers to just 3 digits.
i believe it was marcello who added that 4th one.
(and, as i just explained above, it's unnecessary.)

and gee.   you know, like jim said, what i propose is
really -- at the very heart of it -- a simple system...

so it's honestly quite _amazing_ that dp/pg could
screw it up in so many different ways.   _amazing_.

look at the lines there pointing to "image on page 2".

either marcello or josh must have added those too.

this is something of a nightmare happening here.
up to now, the files we've been talking about are
_page-scans_.   that is, they represent a full page.
we all know why that's the case; it's because we
are doing _proofing_, so we need the page-scan.

now all of a sudden something different pops in,
namely "images" contained on the same page as
the page-scan (which, of course, is also an image).

ok, i won't pretend i don't know what these are.
they're higher-resolution versions of _pictures_
that were contained on that page in the p-book.

which is all well and good, but let's not mix them
in with the page-scans, which is what happens if
you name the hi-res files using the same model.

give those files names that are _quite_different_,
and which sort them completely out of our range.

it'd be good if you even stored them in a separate
directory.   (luckily, this is exactly what p.g. does,
storing them in a subdirectory of the .html file,
as these "subimages" are used by .html versions;
but we certainly don't need 'em to do proofing.)

better yet, examine if you need those files at all.

if a particular page had a picture on it that needs
to be scanned at a higher-resolution, then make
the actual page-scan at that higher-resolution...
there's no sense having a low-res version of it,
especially if it's just going to cause us problems.

then, in your e-book file, give instructions for the
viewer-program about the coordinates of the scan
that represent the picture that you want it to "clip".
the viewer-app will then load in the high-res scan,
clip out the picture, and then display it accordingly.
(ok, this is a little futuristic, since no viewer-apps
will do this currently, not even mine.   but soon...)

***

al said:
>     page 2            p0002.png
>     image on page 2   p0002-image1.png
>     image on page 2   p0002-image2.png

one more thing about this.   even though, as i
mentioned above, these "subimage" filenames
have no ill effects, as they're stored elsewhere,
there is yet another problem presented here,
one which _does_ manifest in the posted scans.

you might get the idea, from that list there,
that dashes are an ok thing in your filenames.

the problem comes with unnumbered pages.

let's say we have an unnumbered illustration
facing page 36 in our "sitka" book, as we do.

so our names would run like this:

>    sitkap035.png
>    sitkap036.png
>    sitkap036a.png
>    sitkap036b.png
>    sitkap037.png

at least that's how _i_ do it...


but if you looked at the policy as al wrote it,
you might well conclude the names should be:

>    sitkap035.png
>    sitkap036.png
>    sitkap036-a.png
>    sitkap036-b.png
>    sitkap037.png


or maybe you'd even think they could be:

>    sitkap035.png
>    sitkap036.png
>    sitkap036-1.png
>    sitkap036-2.png
>    sitkap037.png


either way, the problem becomes clear if you
once again recall that we want the filenames to
_sort_ correctly...   al's names will sort this way:

>    sitkap035.png
>    sitkap036-a.png
>    sitkap036-b.png
>    sitkap036.png
>    sitkap037.png

this would cause the viewer-program to believe
that it should place that unnumbered illustration
between pages 35 and 36 -- a recto and a verso!

this illustration either goes between 34 and 35,
or it goes between 36 and 37, but that is unclear,
and computer programs need things to be clear.

***

if you are now asking "why do we need to be
concerned with how computer programs will
interpret these files?", then you're making the
same mistake that the dp/pg people have made.

you are failing to grasp the _larger_context_
in which these files will be used.   and it is this
larger context that is necessary to help us hone
the conventions that we adopt in making e-texts.

the pagenumber f.a.q. failed to consider the
necessary linkage with the names of the scans,
and the scanfile-naming rules failed to consider
how those scans would be used by developers.

this inability -- and unwillingness sometimes --
to see the big picture is why dp/pg isn't creating
coherent policies on such matters, even when it
actually _tries_ to do so (which is relatively rare).

so there implementations will be short-sighted.

when you add in the stubborn way that people
like al and juliet and marcello and josh _refuse_
to take any advice from me, no matter how good,
the situation can look bleak.   however,   i remain
focused on the long-term, where i am confident
-- supremely confident -- that my ideas will win.

and in the short-term, i just remind myself, on
the infrequent occasions when the question will
present itself to me, that i am not the stupid one.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100322/c7e50c11/attachment-0001.html>

From Bowerbird at aol.com  Mon Mar 22 16:12:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Mar 2010 19:12:58 EDT
Subject: [gutvol-d] what do distributed book digitizers want?
Message-ID: <243b.53d5c8bd.38d9537a@aol.com>

over on his fadedpage forums, rfrank asks "what motivates users?"

it's an excellent question, one that deserves a thoughtful answer...

so here's my take on it, which i give partly because i have some
disagreements with _some_ conclusions that roger has come to.

i phrased these all in an affirmative way, so this could be used as
a mission statement (e.g., for my own site, or by others), although
the reverse-phrasing will often have made more sense to people...
(for instance, "i do not want to be asked to do _unnecessary_ work.")

i'd guess that most people would approve of most of my items, so
i'd be interested to hear if anybody would challenge any of them...

***

what i want as a distributed book digitizer...

i want to proof, yes.
i want to format too.
i want to finish pages.
i want to finish books.
i want to smooth-read.
i want to do a great job.
i want to do necessary work.
i want to select what i work on.
i want to have unambiguous rules.
i want to know if i am doing a great job.
i want to know when i am making mistakes.
i want to see solid proof if i've made a mistake.
i want to receive fair credit for work i have done.
i want to work in a system that's very transparent.
i want to work with others who're doing great work.
i want to know my energy is being used productively.
i want to know how to improve the quality of my work.
i want to know exactly what data the system has on me.
i want to be able to challenge the system if acts unfairly.
i want to let the world know when i have done a great job.
i want to let the world know when i have done a lot of work.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100322/f461828c/attachment.html>

From Bowerbird at aol.com  Mon Mar 22 18:33:25 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Mar 2010 21:33:25 EDT
Subject: [gutvol-d] more sitka smoothreading glitches
Message-ID: <4bcd0.498fe527.38d97465@aol.com>

i see rfrank is making improvements to his smoothreading version
of the "sitka" book.   i wasn't sure if he would do that as they came in,
or if he would simply wait and do all the fixes one time at the end...

>    http://www.fadedpage.com/s/sitka/sitka.htm

an incremental approach is just fine, but it means that no one has
yet reported this next glitch, which is a rather amazing one, since
it has survived the system through preprocessing and proofing and
postprocessing, although it doesn't even pass a simple spellcheck:

>    Some looked with extreme disfavor upon the establishment, 
>    while others wrere friendly.

it's also unclear whether anyone has reported the inconsistencies
in the spelling of the baron's name -- is it wrangel or wrangell? --
but perhaps rfrank decided to leave 'em as they are in the p-book.

of course, if _that_ were the case, he wouldn't have changed the
two cases of the baron's name on page 43, since they are clearly
printed as "wrangel".   but also there, two alaskan places which
-- as the book directly states there -- "today perpetuate his name"
are clearly printed as "wrangell", which is the cause for confusion,
compounded by the fact that the name is spelled as "wrangell" on
pages 54, 61, 63 (twice), and 102, but as "wrangel" on page 75...

aside from the inconsistent-with-the-printed-page instances on
page 43, rfrank was also inconsistent with the ink-on-paper on
page 63 (the second instance), where he was not just inconsistent
with the printed book, but with his own version on the same page.
(in other words, the page was consistent itself, but rfrank was not.)

***

all of this is not to criticize rfrank.   indeed, i will tell you that
he is an excellent postprocessor.   he has a ton of experience;
he's probably submitted over 500 books to p.g. by this time...

what this _does_ show is that even an excellent postprocessor,
with a ton of experience, can have errors that persist through
preprocessing, proofing, and postprocessing, and maybe even
through smoothreading.   (at least this far, these glitches have.)

so i think this is good evidence that "once and done" is _not_
a good strategy for a roundless system.   that philosophy has
_never_ been a part of the roundless system that _i_ preach...

indeed, i believe any change should be reviewed and approved
by two separate people before it is considered to be "golden"...

it's also important to remind ourselves that we are not "short"
of proofers.   to the contrary, we have a huge _glut_ of proofers.

distributed proofreaders has so many proofers that they are
now actively considering ways to _throttle_ their p1 proofers!

with an _abundance_ of proofers, there is no need to scrimp...

we can have multiple proofers look at every page in every book.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100322/1016c022/attachment.html>

From schultzk at uni-trier.de  Mon Mar 22 18:41:09 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 23 Mar 2010 02:41:09 +0100
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <23a5c.75e8cc64.38d9413f@aol.com>
References: <23a5c.75e8cc64.38d9413f@aol.com>
Message-ID: <B8C106DB-5F60-4AB8-9A75-8EBA0EF3A31A@uni-trier.de>

BB I can not believe you are serious.

1) Your critic fails all logic. Why in Gods name would anybody intermix
     scans from more than one book in the same directory. Their are more than
     enough files just from one book !

2) How is a sequence of five arbitary characters anymore informative.
     Or can you remeber 26^5 titles.

Come On Man! Wake up.

regards
    Keith. 


Am 22.03.2010 um 22:55 schrieb Bowerbird at aol.com:

[snip, snip   hot air deleted]

From Bowerbird at aol.com  Tue Mar 23 00:26:05 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Mar 2010 03:26:05 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <705ad.3f20837b.38d9c70d@aol.com>

keith said:
>    BB I can not believe you are serious.

is that so?   because i find your disbelief to be quite humorous!   :+)


>    1) Your critic fails all logic. 

it fails _all_ logic?   i have a hard time believing that, keith...         
:+)


>    Why in Gods name would anybody intermix scans 
>    from more than one book in the same directory. 
>    Their are more than enough files just from one book !

i wrote that huge post, and _that's_ what you took from it?

talk about missing the point.   you missed it by a mile, keith.
(a mile is about 1.6 kilometers, in case you are wondering.)

for the record, not that i think anyone else missed the point,
it might not be that you'd _want_ to put more than one book
in a directory, it's that you _could_ if you ever _did_ want to,
whereas, when all books are named p001-p999, you cannot.

the more important point is that, given the files for a book,
and for another book, you wanna be able to tell them apart.
all files for a book should be named with a common element.
and the name of every file should be unique from all others,
across your entire system.   this is nothing but common sense.


>    2) How is a sequence of five arbitary characters anymore
>    informative.   Or can you remeber 26^5 titles.

the characters are not informative in and of themselves, but
they become meaningful when all files from a book receive
the same prefix, because then you see, just from the names,
they go together.   and no, there's no need to remember them,
since the catalog will keep all of the information straight and
make the appropriate information available to the end-users.

but i suspect you knew all that.

***

but in order to see how someone might do it another way,
go look at the internet archive and their naming convention.

they went for longer names, hoping for _some_ meaning...
and, to a degree, they attained it, at a cost in convenience.

for instance, here's a subdirectory name:
>    http://www.archive.org/details/adventuresoftoms00twaiiala

that subdirectory maps onto another more-specific one:
>    http://ia331317.us.archive.org/1/items/adventuresoftoms00twaiiala/

so their "name" for this book is "adventuresoftoms00twaiiala".

therefore, you might guess -- correctly -- that this book is
"the adventures of tom sawyer".   but it doesn't inform you
_which_ edition of the book this is, or where it came from,
or if it is one of the several copies from project gutenberg,
or when it was published, or any number of details about it.
to get to that information, you'll have to visit their catalog,
and if you're gonna visit a catalog anyway, you might as well
visit the catalog to find out the 5-letter "prefix" of the book,
a prefix that's much easier than "adventuresoftoms00twaiiala".

and you better believe me, because it has happened to me,
once you get a lot of the archive.org files on your machine,
it starts to become very hard to discriminate names such as:
>    http://www.archive.org/details/adventuresoftoms00twaiiala
>    http://www.archive.org/details/theadventuresoft00074gut
>    http://www.archive.org/details/theadventuresoft07193gut
>    http://www.archive.org/details/theadventuresoft07194gut
>    http://www.archive.org/details/adventurestomsa02twaigoog
>    http://www.archive.org/details/adventurestomsa00twaigoog
>    http://www.archive.org/details/adventurestomsa00willgoog
>    http://www.archive.org/details/adventurestomsa01twaigoog
>    http://www.archive.org/details/adventurestomsa05twaigoog
>    http://www.archive.org/details/tomsawyer00twain
>    http://www.archive.org/details/adventuresoftoms20twai
>    http://www.archive.org/details/adventuresoftoms99twai
>    http://www.archive.org/details/adventuresoftoms00twai2
>    http://www.archive.org/details/tomsawyeradv00twairich
>    http://www.archive.org/details/advtomsawyer00twairich
>    
http://www.archive.org/details/booki-export-the-adventures-of-tom-sawyer

so, for me anyway, a 5-letter prefix seems to do the job just fine.

***

likewise, we can look at the system used by project gutenberg,
where the "prefix" for the book is essentially its 5-digit name.

digits are, in some ways, even more convenient that characters.
the problem is, 5-digit names only work up to 99,999 books...
that's enough for now, for project gutenberg, so that's fine, but
i wanted more breathing room, so i chose 5-character names...

***

or let's take a look at youtube names.   here's a sample u.r.l.:
>    http://www.youtube.com/watch?v=sA_0cvd1EUM
>    http://www.youtube.com/watch?v=qybUFnY7Y8w

first, i'm not sure why they need that "watch" in every u.r.l.
surely "watching" a video would be the default action, not?,
so it seems to me they could have abstracted that out, but...

we find they're using an 11=character name, one that uses
_both_ uppercase and lowercase letters (i only use lowercase),
_and_ numbers, _and_ at least some other characters as well.
that's going to give them _many_trillions_ of possible names,
which i guess is how high you think if you sell for $1.6billion.

***

speaking of google, let's see their book filename convention:
>    http://www.google.com/books?id=3n4hAAAAMAAJ
>    http://www.google.com/books?id=Y7sOAAAAIAAJ

they've got a 12-character name, uppercase and lowercase,
plus numbers.   which, again, will accommodate lots of files.

***

>    Come On Man! Wake up.

well, it's after midnight my time, so i'm about to go to sleep;
but i will wake up tomorrow morning, all ready to post again.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100323/d3cc6253/attachment-0001.html>

From schultzk at uni-trier.de  Tue Mar 23 03:10:46 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 23 Mar 2010 11:10:46 +0100
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <705ad.3f20837b.38d9c70d@aol.com>
References: <705ad.3f20837b.38d9c70d@aol.com>
Message-ID: <8D75E262-82B9-447A-9C3A-6F124822F56F@uni-trier.de>

Hi BB,

	I did get your point and they due have there merit,
	yet no more than any other filenaming convention where
	you overly compress the names.

	I will not either go into how flawed they are.
	If you want telling filenames use them. We are not living
	in a DOS world where we are limited to 8 characters.

	Proog given that your naming convention is flawed and
	so now you can change it !!

	regards
		Keith.

Am 23.03.2010 um 08:26 schrieb Bowerbird at aol.com:

> keith said:
> >   BB I can not believe you are serious.
> 
> is that so?  because i find your disbelief to be quite humorous!  :+)
> 
> 
> >   1) Your critic fails all logic. 
> 
> it fails _all_ logic?  i have a hard time believing that, keith...        :+)
> 
> 
> >   Why in Gods name would anybody intermix scans 
> >   from more than one book in the same directory. 
> >   Their are more than enough files just from one book !
> 
> i wrote that huge post, and _that's_ what you took from it?
> 
> talk about missing the point.  you missed it by a mile, keith.
> (a mile is about 1.6 kilometers, in case you are wondering.)
> 
> for the record, not that i think anyone else missed the point,
> it might not be that you'd _want_ to put more than one book
> in a directory, it's that you _could_ if you ever _did_ want to,
> whereas, when all books are named p001-p999, you cannot.
> 
> the more important point is that, given the files for a book,
> and for another book, you wanna be able to tell them apart.
> all files for a book should be named with a common element.
> and the name of every file should be unique from all others,
> across your entire system.  this is nothing but common sense.
> 
> 
> >   2) How is a sequence of five arbitary characters anymore
> >   informative.  Or can you remeber 26^5 titles.
> 
> the characters are not informative in and of themselves, but
> they become meaningful when all files from a book receive
> the same prefix, because then you see, just from the names,
> they go together.  and no, there's no need to remember them,
> since the catalog will keep all of the information straight and
> make the appropriate information available to the end-users.
> 
> but i suspect you knew all that.
> 
> ***
> 
> but in order to see how someone might do it another way,
> go look at the internet archive and their naming convention.
> 
> they went for longer names, hoping for _some_ meaning...
> and, to a degree, they attained it, at a cost in convenience.
> 
> for instance, here's a subdirectory name:
> >   http://www.archive.org/details/adventuresoftoms00twaiiala
> 
> that subdirectory maps onto another more-specific one:
> >   http://ia331317.us.archive.org/1/items/adventuresoftoms00twaiiala/
> 
> so their "name" for this book is "adventuresoftoms00twaiiala".
> 
> therefore, you might guess -- correctly -- that this book is
> "the adventures of tom sawyer".  but it doesn't inform you
> _which_ edition of the book this is, or where it came from,
> or if it is one of the several copies from project gutenberg,
> or when it was published, or any number of details about it.
> to get to that information, you'll have to visit their catalog,
> and if you're gonna visit a catalog anyway, you might as well
> visit the catalog to find out the 5-letter "prefix" of the book,
> a prefix that's much easier than "adventuresoftoms00twaiiala".
> 
> and you better believe me, because it has happened to me,
> once you get a lot of the archive.org files on your machine,
> it starts to become very hard to discriminate names such as:
> >   http://www.archive.org/details/adventuresoftoms00twaiiala
> >   http://www.archive.org/details/theadventuresoft00074gut
> >   http://www.archive.org/details/theadventuresoft07193gut
> >   http://www.archive.org/details/theadventuresoft07194gut
> >   http://www.archive.org/details/adventurestomsa02twaigoog
> >   http://www.archive.org/details/adventurestomsa00twaigoog
> >   http://www.archive.org/details/adventurestomsa00willgoog
> >   http://www.archive.org/details/adventurestomsa01twaigoog
> >   http://www.archive.org/details/adventurestomsa05twaigoog
> >   http://www.archive.org/details/tomsawyer00twain
> >   http://www.archive.org/details/adventuresoftoms20twai
> >   http://www.archive.org/details/adventuresoftoms99twai
> >   http://www.archive.org/details/adventuresoftoms00twai2
> >   http://www.archive.org/details/tomsawyeradv00twairich
> >   http://www.archive.org/details/advtomsawyer00twairich
> >   http://www.archive.org/details/booki-export-the-adventures-of-tom-sawyer
> 
> so, for me anyway, a 5-letter prefix seems to do the job just fine.
> 
> ***
> 
> likewise, we can look at the system used by project gutenberg,
> where the "prefix" for the book is essentially its 5-digit name.
> 
> digits are, in some ways, even more convenient that characters.
> the problem is, 5-digit names only work up to 99,999 books...
> that's enough for now, for project gutenberg, so that's fine, but
> i wanted more breathing room, so i chose 5-character names...
> 
> ***
> 
> or let's take a look at youtube names.  here's a sample u.r.l.:
> >   http://www.youtube.com/watch?v=sA_0cvd1EUM
> >   http://www.youtube.com/watch?v=qybUFnY7Y8w
> 
> first, i'm not sure why they need that "watch" in every u.r.l.
> surely "watching" a video would be the default action, not?,
> so it seems to me they could have abstracted that out, but...
> 
> we find they're using an 11=character name, one that uses
> _both_ uppercase and lowercase letters (i only use lowercase),
> _and_ numbers, _and_ at least some other characters as well.
> that's going to give them _many_trillions_ of possible names,
> which i guess is how high you think if you sell for $1.6billion.
> 
> ***
> 
> speaking of google, let's see their book filename convention:
> >   http://www.google.com/books?id=3n4hAAAAMAAJ
> >   http://www.google.com/books?id=Y7sOAAAAIAAJ
> 
> they've got a 12-character name, uppercase and lowercase,
> plus numbers.  which, again, will accommodate lots of files.
> 
> ***
> 
> >   Come On Man! Wake up.
> 
> well, it's after midnight my time, so i'm about to go to sleep;
> but i will wake up tomorrow morning, all ready to post again.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100323/2f8bf39d/attachment.html>

From Bowerbird at aol.com  Tue Mar 23 12:09:47 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Mar 2010 15:09:47 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <5bee0.5e4a912e.38da6bfb@aol.com>

keith said:
>    I did get your point and they due have there merit,
>    yet no more than any other filenaming convention 
>    where you overly compress the names.

what do you mean by "overly compress the names"?
what would a "noncompressed" filename look like?


>    I will not either go into how flawed they are.

...because you have no arguments of substance...


>    If you want telling filenames use them.

again, what does this _mean_?


>    We are not living in a DOS world 
>    where we are limited to 8 characters.

the history of the u.r.l. in terms of its length,
is rather interesting.   everybody started with
an ethic that they should be short and punchy.
not just for convenience, but memorability too.

gradually the u.r.l. began accumulating length,
as websites got more extensive and files were
segmented into subdirectories for convenience.

then google started giving juice for content words
in the u.r.l., and the length zoomed ridiculously, as
everyone employed long names for s.e.o. purposes.

things got so ludicrous that we had the emergence
of u.r.l. "shorteners", web services that promised to
end the scourge of a long u.r.l. by providing a much
shorter one they maintained which rerouted people
to the longer original, _plus_ furnished some stats,
so you knew where the clicks were coming from, etc.

what happened then was that twitter hit, and hit big.

all of a sudden, people faced a 140-character limit.
they didn't want to "waste" a substantial percentage
of that limit every time they wanted to send a u.r.l.,
so the demand for shortener services skyrocketed...

so before we could turn around, there were dozens
such services, and not just 2 or 3 (bit.ly and tinyurl),
and things got messy.   first, the shortened u.r.l. is
a pain in the ass for many people, because tweeters
will often provide different shortened versions for
the same long u.r.l., but your browser doesn't show
them as already-visited links (since technically they
_are_ different links, and your browser doesn't know
that they all point to the same eventual destination).

second, shortener services make the u.r.l. "brittle"...
if the shortener service breaks down, so does their
"rerouting" ability which points to the ultimate site,
causing all those links to break for no good reason.

as startups, with very little chance of "making it",
the original shorteners had frequent down-time,
so the problem was readily apparent, even then...

but as more and more of these services started up
-- hoping to hit the lottery by being "blessed" by
twitter or google or anyone who would buy them
for a boatload of money -- it was more and more
clear that most of these services _would_fail_, and
take all their short links with them when they did.

and sure enough, then they did start closing down.

and they continue to have cutbacks, to this very day.
one of them -- http://tr.im/ -- just announced that
it is no longer accepting u.r.l. shortening requests...
luckily, they're still honoring their current redirects;
but what happens when they go completely under?

well, we're lucky once again, because google has
come to the rescue.   they have ensured that they
will support a service designed to honor redirects
for any shortener service that goes out of business.

it makes sense, since they have a large degree of
responsibility for this problem in the first place,
since they give extra google juice to a long u.r.l.

thankfully, though, the shortener services made us
admit to ourselves that the long u.r.l. is a problem,
bringing us to the current stage of u.r.l. history,
where we are once again embracing the short u.r.l.

many people are now voluntarily cutting back on the
use of the long u.r.l.; google could help this effort
by reversing its policy to give juice to the long u.r.l.

because a short and clear u.r.l. is a better u.r.l.

because people _do_ have to occasionally type in
a u.r.l., and can't just do a simple copy-and-paste.

because people often include u.r.l. in listserve posts,
where there is an imposed length on the lines, and
u.r.l. get printed in p-books, with limited line-length.

because people tweet u.r.l.

because people dislike the brittle shortened u.r.l.

so that's why i think my 5-letter prefix works just fine.


>    Proog given that your naming convention is flawed 
>    and so now you can change it !!

huh?   what?   i guess you better run that by me again.

no, on second thought, never mind.   this is a great
example about here discussion here is one big waste.

i don't think you're _trying_ to sidetrack the dialog,
keith, so i'm not going to scold you, but just tell you
that you need to keep things moving _forward_, ok?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100323/0a3f1d85/attachment-0001.html>

From schultzk at uni-trier.de  Tue Mar 23 14:56:17 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 23 Mar 2010 22:56:17 +0100
Subject: [gutvol-d] Re: save those pagenumber references
In-Reply-To: <5bee0.5e4a912e.38da6bfb@aol.com>
References: <5bee0.5e4a912e.38da6bfb@aol.com>
Message-ID: <D5040D7F-CF20-482D-9027-B0E1DACEEC65@uni-trier.de>

Hi BB,

	I can not help you if you do not understand plain english or
	american for that matter.

	Also, since you are unable to stay with any point I say good-bye.

	regards
		Keith.


Am 23.03.2010 um 20:09 schrieb Bowerbird at aol.com:

> keith said:
> >   I did get your point and they due have there merit,
> >   yet no more than any other filenaming convention 
> >   where you overly compress the names.
> 
> what do you mean by "overly compress the names"?
> what would a "noncompressed" filename look like?
> 
> 
> >   I will not either go into how flawed they are.
> 
> ...because you have no arguments of substance...
> 
> 
> >   If you want telling filenames use them.
> 
> again, what does this _mean_?
> 
> 
> >   We are not living in a DOS world 
> >   where we are limited to 8 characters.
> 
> the history of the u.r.l. in terms of its length,
> is rather interesting.  everybody started with
> an ethic that they should be short and punchy.
> not just for convenience, but memorability too.
> 
> gradually the u.r.l. began accumulating length,
> as websites got more extensive and files were
> segmented into subdirectories for convenience.
> 
> then google started giving juice for content words
> in the u.r.l., and the length zoomed ridiculously, as
> everyone employed long names for s.e.o. purposes.
> 
> things got so ludicrous that we had the emergence
> of u.r.l. "shorteners", web services that promised to
> end the scourge of a long u.r.l. by providing a much
> shorter one they maintained which rerouted people
> to the longer original, _plus_ furnished some stats,
> so you knew where the clicks were coming from, etc.
> 
> what happened then was that twitter hit, and hit big.
> 
> all of a sudden, people faced a 140-character limit.
> they didn't want to "waste" a substantial percentage
> of that limit every time they wanted to send a u.r.l.,
> so the demand for shortener services skyrocketed...
> 
> so before we could turn around, there were dozens
> such services, and not just 2 or 3 (bit.ly and tinyurl),
> and things got messy.  first, the shortened u.r.l. is
> a pain in the ass for many people, because tweeters
> will often provide different shortened versions for
> the same long u.r.l., but your browser doesn't show
> them as already-visited links (since technically they
> _are_ different links, and your browser doesn't know
> that they all point to the same eventual destination).
> 
> second, shortener services make the u.r.l. "brittle"...
> if the shortener service breaks down, so does their
> "rerouting" ability which points to the ultimate site,
> causing all those links to break for no good reason.
> 
> as startups, with very little chance of "making it",
> the original shorteners had frequent down-time,
> so the problem was readily apparent, even then...
> 
> but as more and more of these services started up
> -- hoping to hit the lottery by being "blessed" by
> twitter or google or anyone who would buy them
> for a boatload of money -- it was more and more
> clear that most of these services _would_fail_, and
> take all their short links with them when they did.
> 
> and sure enough, then they did start closing down.
> 
> and they continue to have cutbacks, to this very day.
> one of them -- http://tr.im/ -- just announced that
> it is no longer accepting u.r.l. shortening requests...
> luckily, they're still honoring their current redirects;
> but what happens when they go completely under?
> 
> well, we're lucky once again, because google has
> come to the rescue.  they have ensured that they
> will support a service designed to honor redirects
> for any shortener service that goes out of business.
> 
> it makes sense, since they have a large degree of
> responsibility for this problem in the first place,
> since they give extra google juice to a long u.r.l.
> 
> thankfully, though, the shortener services made us
> admit to ourselves that the long u.r.l. is a problem,
> bringing us to the current stage of u.r.l. history,
> where we are once again embracing the short u.r.l.
> 
> many people are now voluntarily cutting back on the
> use of the long u.r.l.; google could help this effort
> by reversing its policy to give juice to the long u.r.l.
> 
> because a short and clear u.r.l. is a better u.r.l.
> 
> because people _do_ have to occasionally type in
> a u.r.l., and can't just do a simple copy-and-paste.
> 
> because people often include u.r.l. in listserve posts,
> where there is an imposed length on the lines, and
> u.r.l. get printed in p-books, with limited line-length.
> 
> because people tweet u.r.l.
> 
> because people dislike the brittle shortened u.r.l.
> 
> so that's why i think my 5-letter prefix works just fine.
> 
> 
> >   Proog given that your naming convention is flawed 
> >   and so now you can change it !!
> 
> huh?  what?  i guess you better run that by me again.
> 
> no, on second thought, never mind.  this is a great
> example about here discussion here is one big waste.
> 
> i don't think you're _trying_ to sidetrack the dialog,
> keith, so i'm not going to scold you, but just tell you
> that you need to keep things moving _forward_, ok?
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100323/d50bc6f1/attachment.html>

From Bowerbird at aol.com  Tue Mar 23 15:02:27 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Mar 2010 18:02:27 EDT
Subject: [gutvol-d] Re: save those pagenumber references
Message-ID: <63e12.3dbb24c5.38da9473@aol.com>

well, i guess i'll never know what an "uncompressed" filename 
would look like, or what a "telling filename" could possibly be...

my loss, apparently...        :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100323/f0fd3503/attachment.html>

From Bowerbird at aol.com  Tue Mar 23 17:14:57 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Mar 2010 20:14:57 EDT
Subject: [gutvol-d] Re: his style grates on many and his egotism seems
	boundless
Message-ID: <6b857.3980f0bc.38dab381@aol.com>

over on his fadedpage forums, rfrank said:
>    For those of you that read gutvol-d, 
>    you should know that I do, also.

ok.   that clears that up, for anyone who was wondering.


>    Bowerbird seems very interested in the work 
>    we are doing here 

you're darn tooting i am.        :+)

rfrank is currently doing the most interesting work in
the whole arena of book distribution, and that's an arena
where i have exhibited keen interest for a very long time,
certainly long before rfrank ever became involved with it.


>    Bowerbird seems very interested in 
>    the work we are doing here and posts his 
>    observations and suggestions of the gutvol-d list.

well, yeah, i've been posting on this listserve since 2003,
except for that short time when the attack-pack got me
"moderated", and i went on strike until it was rescinded.

so again, nothing new there...


>    Though his style grates on many 
>    and his egotism seems boundless,

ok, let's break this down, shall we?


>    his style grates on many 

yes it does.   and this is particularly so for those people
who subscribe to the dale-carnegie school-of-thought
on "how to win friends and influence people", of which
i do firmly believe that rfrank is a very big follower...

(and al haines too, as well as juliet sutherland.)

so let me be perfectly clear on this matter once again.

i _hate_ the dale-carnegie philosophy.   with a passion.

i consider it to be tremendously duplicitous, in that it
zeros in on one of the most pathetic of human traits
-- our insecurity about our worth -- and trades on it.

it encourages one to feed people positive reinforcement,
so as to make them feel good and overcome insecurity,
so they will come to like you and be influenced by you...

i'm not denying that it _works_.   it works all too well!

but it's cynical.   and it's manipulative.   and it's ugly.

it tip-toes around the issue of flagrant dishonesty
by informing its adherents to strive to be honest,
and not to lie outright, but that's largely a cover-up
which denies the fact one can't _always_ be positive,
not if one feels any solid commitment to the truth,
the whole truth, and nothing but the truth, as i do.

so i will often go out of my way to do reverse-carnegie.

dale says "never tell someone they are wrong".

so when i am thoroughly convinced someone is wrong,
when it's true, i say it, and with gusto: "you are wrong."

and then i give all of the reasons _why_ they are wrong,
which is also a reverse-carnegie, because dale says that
you should always give people an out, a way to save face.

those are just two quick examples.   but that's enough,
because we're not really here to talk about dale carnegie.

the thing to remember is, i don't give a crap if anyone
here becomes my "friend" or not.   i have enough friends,
and i don't even _want_ friends who i can't be honest with.

and i'm not here attempting to "influence" anyone either.

a lot of people get confused about this, because i'm often
saying that things _should_be_done_ in a certain way, so
people think i have some kind of personal interest about
actually _having_ them done that way.   i don't really care.
you can do it however you want.   because when i talk about
"how something should be done", i'm talking about _logic_.
i'm talking about the _arguments_ that dictate that decision.
as shown here, frequently, a lot of people here don't seem to
care about "logic" and "reasoned decisions" and stuff like that.
which is fine by me.   please make decisions however you want.

the thing is, it really pisses off the carnegie adherents when
you don't care whether you influence them or not, probably
because they are willing to sell their soul to have influence,
so your apathy (or hostility) about it contradicts their values.

so when you fail to butter them up before you lobby them,
like dale advises, they get all offended, and even _mean_...

(that's right, they forget dale's advice to always be nice,
which just goes to show they didn't absorb it very deeply;
they only use it because it often works on a surface level.)

so, yeah, my style "grates" on some people.   so what?

because a whole lot of other people -- who i actually
like a lot better -- actually _appreciate_ and _respect_
someone who is willing to speak their mind honestly...


>    and his egotism seems boundless

that's just a silly projection.

i'm a humble person.   i am honestly and truly humble.
i'm unimposing, and i'm tremendously kind and gentle.

and it's not just a phony act i put on to "win friends"...

but there is something about truth.   when you have truth
on your side, you become strong.   you become invincible.

i work -- hard! -- to make sure i get to the bottom
of a situation, and consider every angle, because it is
_vitally_ important to me that i have truth on my side.

if i'm on one side of an issue, and the strength of the
argumentation suddenly flips truth to the _other_ side,
i flip right along with it.   because truth is important...

yes, one of my biggest flaws is saying "i told you so."

but one of my biggest assets is that i have absolutely
no reluctance, at all, to say "i was wrong" when i was.

a lot of people think i'm "egotistical" when i'm _really_
just extremely confident that i have truth on my side...

so it actually has nothing to do with _me_, or my _ego_.
instead, it has _everything_ to do with _truth_...


>    Though his style grates on many 
>    and his egotism seems boundless,
>    at times there is something of worth in what he posts.

of course there is.   that's because i have truth on my side.

it's also because i'm enough of a scientist that i'm willing
-- nay, _eager_ -- to listen when someone says they think
that i'm wrong.   because if they're correct that i am wrong,
i _want_ them to show me the light, so i can switch sides...

but again, i don't really care if i "convince" anyone or not.
it's an intellectual exercise for me, not a power struggle...


>    at times there is something of worth in what he posts.

oh yeah, and the _other_ thing is that you can never trust
a carnegie follower when they say anything nice about you,
because they're probably just attempting to butter you up.
so maybe roger doesn't even _believe_ what he said there.


>    He mistakenly reported that the SR version of the book 
>    is being incrementally updated.

well, the file that is now posted on your site, to which i
gave the u.r.l., is _not_ the file that was posted yesterday.
the pagination error i pointed out yesterday was corrected.
so i'm not sure how you can use the term "mistakenly"...


>    He also shows he hasn't come to a complete understanding 
>    of the unusual situation in the text regarding the 
>    inconsistent usage of Wrangel and Wrangell spellings 
>    as it applies to Barons, islands and native population. 
>    It still isn't right and will be bimodally normalized 
>    after smoothreading completes.

i didn't really try to "come to a complete understanding".

the p-book appears to me to be inconsistent in its usage,
and you appear to be inconsistent too, and your usage
does not achieve consistency with the p-book's usage...

i pointed out the inconsistencies to show i'd found them.
but there's no payoff for me to do any more work on that.


>    He did, however, correctly spot the effect of a superfluous 
>    page transition marker after the last illustration 
>    on a numbered page in the book. Since these books are 
>    all generated from one source file, it was a simple fix 
>    and it was regenerated in a heartbeat.

ok.   so the file that's up online was _not_ "updated", but it
_was_ "regenerated".   i'll try to remember this terminology.


>    He also believes that I may have post-processed 
>    over 500 books, and I have not. 

well, i'd rather give rfrank _more_ credit than _less_...
i know he's done _hundreds_and_hundreds_ of books.

he's also programmed a lot of tools, and is now running the
roundless experimental site, plus he's on the board at d.p.,
so it's clear that he's doing a lot, and i give him credit for it.


>    Though I could have a lot to say about his posts, I choose 
>    not to engage him for historical and practical reasons.

the "historical" reason might be that when he did engage me,
he tried to deny reality, so i rubbed his nose in it, just like you
rub a dog's nose in his pee when he urinates in your house...

and the "practical" reason might be that he knows i will do that
again if he tries to deny reality again, carnegie notwithstanding.

but hey, i don't need for us to "engage".   i'm self-motivated.
i will say what i have to say, whether anyone listens or not...

so he can say what he wants on his board, and i'll post here.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100323/e183bc89/attachment-0001.html>