From schultzk at uni-trier.de  Mon Feb  1 02:02:35 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 1 Feb 2010 11:02:35 +0100
Subject: [gutvol-d] Re: More iPad details
In-Reply-To: <SNT120-DS77A1950EB81C6FAF95A3EAE5B0@phx.gbl>
References: <alpine.DEB.2.00.1001271748350.26348@mail.pglaf.org>	<SNT120-DS123D68EB3B258BBEC38A7EAE5C0@phx.gbl>
	<F50B6772-7C0B-4126-AEFF-8CA1D8BCBEC9@uni-trier.de>
	<SNT120-DS77A1950EB81C6FAF95A3EAE5B0@phx.gbl>
Message-ID: <F89A84E8-D781-487E-AE03-503E413DFC0B@uni-trier.de>

Hi Jim,

	You have made my point. 

	The point remains, that text -to-speech
	is a important component, but it does not
	constitute designed for the blind or ...
	
	As you mentioned the blind will mostly,
	get more hardware and software better
	suited to thier needs.

	BTW. Macs have had text-to-speech for
	decades, too. 

	regards
		Keith.

Am 29.01.2010 um 21:24 schrieb Jim Adcock:

>> 	I find your argument mute. As most computers
> 	are not design for the blind or sight impair.
> 	Sure they can be modified for use with the 
> 	blind.
> 
> I don't understand your comments.  Modern computers have many
> "accessibility" features built-in.  HTML has "accessibility" features
> built-in.  Granted a blind user will probably want to buy a 3rd party screen
> reader app to best make use of the accessibility features built into
> computers -- but then again the sighted iPad user will have to download a
> separate Apple app just to be able to read books! Windows 7 comes with a
> basic screen reader.  For an overview of these issues see for example:
> 
> http://www.microsoft.com/enable/
> 
> Blind users have been using text-to-speech with computers since DECtalk
> 1984. A notable user you have probably seen and heard on TV is Stephen
> Hawkings.


From schultzk at uni-trier.de  Mon Feb  1 02:06:15 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 1 Feb 2010 11:06:15 +0100
Subject: [gutvol-d] Re: More iPad details
In-Reply-To: <SNT120-DS2254AF65D02723E0A1C83CAE5B0@phx.gbl>
References: <b832.36bef6e.38947643@aol.com>
	<627d59b81001290940q58464655n51e14df6fbd06939@mail.gmail.com>
	<SNT120-DS2254AF65D02723E0A1C83CAE5B0@phx.gbl>
Message-ID: <2C8597D5-3FC6-4EC0-97C5-AB337BD96815@uni-trier.de>


Am 29.01.2010 um 21:59 schrieb Jim Adcock:

>> I have a sight-impaired friend who would appreciate having one of those
> Kindles drop-kicked in his direction. He figures he can deal with the
> buttons somehow.
> 
> Here is a reference to the National Federation of the Blind lawsuit over
> Kindle use on College Campuses, which was concluded by ending the Kindle
> campus program in progress, and Kindle agreeing to improve accessibility.
> The lawsuit alleged that the Kindles were inaccessible to blind students and
> thus violate federal law.
> 
> http://www.nfb.org/nfb/NewsBot.asp?MODE=VIEW&ID=527
> 
> So hopefully Kindles will someday soon be able to speak the buttons and the
> list of book titles and authors.  Can't find any place that Amazon talks
> about this issue -- not surprisingly! Hopefully Apple and iPad have enough
> experience that they will not step into the same puddle!
	Apple has the technology for text-to-speech. They should be able to port
	it. They managed iWorks, i am sure they can manage text-to speech.
	The question remains if the iPad will still preform well. 
	We have to wait and see.

	regards
		Keith.


From schultzk at uni-trier.de  Mon Feb  1 02:13:07 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 1 Feb 2010 11:13:07 +0100
Subject: [gutvol-d] Re: Transferring PG files from PC to iPod
In-Reply-To: <b715.2d6b69ce.389475d7@aol.com>
References: <b715.2d6b69ce.389475d7@aol.com>
Message-ID: <66B1B3B6-ADE9-46E5-AFFF-43DB5AE8C328@uni-trier.de>

Hi BB,

	I feelt you were unnessarily hard with some
	of your typical comments. My comments were
	meant to hit the nerve they hit.

	I did not mean to get on your back. Just a demonstration 
	of being on the other side of certain comments.

	Far as the problem of formating we do live on the
	same planet. Though often enough in different
	cultures.

	Take care.

	regards
		Keith.

Am 29.01.2010 um 18:33 schrieb Bowerbird at aol.com:

> keith said:
> >   Sorry, BB I think You did not due Walter and Andrew justice. 
> >   They did not attack anyone and just stated their views. 
> >   You could have just have just mention the advantages of 
> >   eucalyptus. But, why be so sarcastic here. 
> 
> hey, back off, keith, now.
> 
> i didn't "attack" anyone, not by any stretch of the imagination.
> 
> i just disagreed with something walter said.  or, more specifically,
> i asked for clarification, and registered a few counter-thoughts...
> 
> and i don't appreciate it when people mistake my motives and
> then mischaracterize them as if they had some handle on them.
> 
> you've made a mistake here, keith, a bad mistake, and if i were
> the whining kind, i'd probably demand some kind of apology,
> but as it is, i'm just warning you to stop making that mistake...
> 
> 
> >   You could have just have just mention the advantages of 
> >   eucalyptus. But, why be so sarcastic here. 
> 
> i _did_ mention the "advantage" of eucalyptus, nice formatting.
> 
> but that just introduces the same question i asked about stanza,
> namely, "what is it that _constitutes_ nice and proper formatting?"
> 
> this is a good question, one that really _needs_ to be asked, so
> that we can then go on and ask more sophisticated questions,
> such as "how do we apply that formatting?", and "what kind of
> rule-set is eucalyptus following in order to apply its formatting?",
> and so on.  as it is, though, as evidenced by the mess of formats
> coming out of d.p., there is a wide range of "formatting" that
> _could_ be considered "proper", so it's rather meaningless when
> someone refers to "proper formatting", and it's good to know that.
> it doesn't mean they are "wrong", but it _does_ mean that we are
> justified in asking them precisely what _they_ mean by the term...
> 
> and further, there is no "sarcasm" here.  i'm plenty capable of being
> sarcastic it, it's something i do often, and fairly well, although there
> probably isn't much "honor" in that performance in most eyes, but
> there's no reason to think that everything that i do is "sarcastic"...
> if you pay any attention at all, it should be quite easy to see when
> i am being sarcastic and when i'm not.  so keith, _pay_attention_,
> at least if you're going to make commentary.
> 
> also, i'm not sure if eucalyptus uses the utf8 version of files or not.
> plain-text doesn't rule out an encoding -- or even utf8 -- you know.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100201/5dac88e0/attachment.html>

From jimad at msn.com  Mon Feb  1 11:02:26 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 11:02:26 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B65299A.7060304@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>
	<4B65299A.7060304@perathoner.de>
Message-ID: <SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>

>> So why do the PG generated mobis do not have a TOC ?

>Better ask mobipocket. We use their official 'mobigen' conversion tool 
for linux.

Mobipocket is Amazon.  The latest version of mobigen is called kindlegen at:

http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621

Not that that seems to help any.  Supposedly it has better support now for
NCX but that didn't cause it to create a TOC for me.  What did "work" for me
was either of the following two approaches:

1) Use Calibre ebook-convert instead of mobigen.  Apply it to either your
epub or your opf and it generates a book including TOC.  You'd need to check
of course that it doesn't introduce other problems for you.

Or:

2) I CAN generate a TOC using kindlegen and using your opf (extracted from
your epub files) when I perform the following changes:

a) in your opf explicitly add a toc.htm file

<manifest>
...
	<item href="toc.htm" id="item8" media-type="application/xhtml+xml"/>
...
</manifest>

And

<guide>
    <reference href="cover.jpg" type="cover" title="Cover Image"/>
    <reference type="toc" title="Table of Contents" href="toc.htm" />
</guide>

Where then the toc.htm contains basically the same information you are
already generating for the toc.ncx except in html format -- which then begs
the question what "support" for NCX actually means?

But in any case taking this approach (which you can see is also the approach
taken in the worked "Sample" book example distributed with kindlegen)
creates a MOBI file with TOC support as users would expect.


From jimad at msn.com  Mon Feb  1 11:06:28 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 11:06:28 -0800
Subject: [gutvol-d] Re: More iPad details
In-Reply-To: <F89A84E8-D781-487E-AE03-503E413DFC0B@uni-trier.de>
References: <alpine.DEB.2.00.1001271748350.26348@mail.pglaf.org>	<SNT120-DS123D68EB3B258BBEC38A7EAE5C0@phx.gbl>	<F50B6772-7C0B-4126-AEFF-8CA1D8BCBEC9@uni-trier.de>	<SNT120-DS77A1950EB81C6FAF95A3EAE5B0@phx.gbl>
	<F89A84E8-D781-487E-AE03-503E413DFC0B@uni-trier.de>
Message-ID: <SNT120-DS168CB5D00C1AFEE6BDDABDAE580@phx.gbl>

>You have made my point. 

Well, I am happy to have made your point -- but I still have no idea what
your point is?


From jimad at msn.com  Mon Feb  1 11:28:49 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 11:28:49 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>
	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
Message-ID: <SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>

Sorry, also just got an Amazon email pointing me to this doc:

http://s3.amazonaws.com/kindlegen/AmazonKindlePublishingGuidelinesV1.3.pdf

where page 11 it says:

TOC guideline #1: the Logical TOC (NCX) is mandatory
The Logical Table Of Contents is very important for our mutual customer's
reading
experience as it allows them to easily navigate between chapters on Kindle
2. So all
Kindle books should have both logical and HTML TOCs. Users expect to see an
HTML
TOC when paging through a book from the beginning, while the logical table
of
contents is an additional way for users to navigate books.

So indeed they want both the toc.ncx and the toc.htm -- still haven't
figured out what they think they are doing with the toc.ncx!

-----Original Message-----
From: gutvol-d-bounces at lists.pglaf.org
[mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of James Adcock
Sent: Monday, February 01, 2010 11:02 AM
To: 'Project Gutenberg Volunteer Discussion'
Subject: [gutvol-d] Re: Formats and gripes

>> So why do the PG generated mobis do not have a TOC ?

>Better ask mobipocket. We use their official 'mobigen' conversion tool 
for linux.

Mobipocket is Amazon.  The latest version of mobigen is called kindlegen at:

http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621

Not that that seems to help any.  Supposedly it has better support now for
NCX but that didn't cause it to create a TOC for me.  What did "work" for me
was either of the following two approaches:

1) Use Calibre ebook-convert instead of mobigen.  Apply it to either your
epub or your opf and it generates a book including TOC.  You'd need to check
of course that it doesn't introduce other problems for you.

Or:

2) I CAN generate a TOC using kindlegen and using your opf (extracted from
your epub files) when I perform the following changes:

a) in your opf explicitly add a toc.htm file

<manifest>
...
	<item href="toc.htm" id="item8" media-type="application/xhtml+xml"/>
...
</manifest>

And

<guide>
    <reference href="cover.jpg" type="cover" title="Cover Image"/>
    <reference type="toc" title="Table of Contents" href="toc.htm" />
</guide>

Where then the toc.htm contains basically the same information you are
already generating for the toc.ncx except in html format -- which then begs
the question what "support" for NCX actually means?

But in any case taking this approach (which you can see is also the approach
taken in the worked "Sample" book example distributed with kindlegen)
creates a MOBI file with TOC support as users would expect.


_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d


From marcello at perathoner.de  Mon Feb  1 11:32:30 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 01 Feb 2010 20:32:30 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>
	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
Message-ID: <4B672C4E.8030600@perathoner.de>

James Adcock wrote:
>>> So why do the PG generated mobis do not have a TOC ?
> 
>> Better ask mobipocket. We use their official 'mobigen' conversion tool 
> for linux.
> 
> Mobipocket is Amazon.  The latest version of mobigen is called kindlegen at:
> 
> http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621

kindlegen tells me that it builds a TOC.

Now if Amazon would release their "Kindle for PC" for Linux too, I could 
actually check the generated files... knowing that the Kindle runs on 
Linux where's the big holdup?


$ ./kindlegen pg31142.epub

***********************************************
* Amazon.com kindlegen(Linux)   V1.0 build 85 *
* A command line e-book compiler              *
* Copyright Amazon.com 2009                   *
***********************************************

opt version: try to minimize (default)
Info(prcgen): Added metadata dc:Title        "On the Nature of Thought / 
or, The act of thinking and its connexion with a perspicuous sentence"
Info(prcgen): Added metadata dc:Date         "2010-01-31"
Info(prcgen): Added metadata dc:Creator      "John Haslam"
Info(prcgen): Added metadata dc:Rights       "Public domain in the USA."
Info(prcgen): Added metadata dc:Source 
"http://www.gutenberg.org/files/31142/31142-h/31142-h.htm"
Info(prcgen): Parsing files  0000001
Info(prcgen): Resolving hyperlinks
Info(prcgen): Building table of content     URL: 
/tmp/fileY6tCul/31142/toc.ncx
Info(prcgen): Computing UNICODE ranges used in the book
Info(prcgen): Found UNICODE range: Basic Latin [20..7E]
Info(prcgen): Found UNICODE range: General Punctuation - Windows 1252 
[2013..2014]
Info(prcgen): Found UNICODE range: Latin-1 Supplement [A0..FF]
Info(prcgen): Building MOBI file, record count:   0000023
Info(prcgen): Final stats - text compressed to (in % of original size): 
  054.13%
Info(prcgen): The document identifier is: "On_the_Natur-cuous_sentence"
Info(prcgen): The file format version is V6
Info(prcgen): Saving MOBI file
Info(prcgen): MOBI File successfully generated!
$


-- 
Marcello Perathoner
webmaster at gutenberg.org

From marcello at perathoner.de  Mon Feb  1 11:56:39 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 01 Feb 2010 20:56:39 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>
Message-ID: <4B6731F7.8050503@perathoner.de>

James Adcock wrote:

> 2. So all
> Kindle books should have both logical and HTML TOCs. Users expect to see an
> HTML
> TOC when paging through a book from the beginning, while the logical table
> of
> contents is an additional way for users to navigate books.

As most PG ebooks already contain a TOC inside the HTML, its pointless 
to generate another one.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From jimad at msn.com  Mon Feb  1 12:23:02 2010
From: jimad at msn.com (James Adcock)
Date: Mon, 1 Feb 2010 12:23:02 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B6731F7.8050503@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>
	<4B6731F7.8050503@perathoner.de>
Message-ID: <SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>


>As most PG ebooks already contain a TOC inside the HTML, its pointless 
to generate another one.

Sigh, you are going around in circles. The issue is that you are generating
MOBI files that do not correctly implement the TOC standard of MOBI files.
The result of this is that when a user of a PG file clicks on the dedicated
"TOC" button on their e-reader device, the MOBI file you generate fails to
take them to the TOC.  This is a file format failure on the part of the file
format YOU are generating.  Yes, the HTML files that the creator of the PG
book often also contain a "TOC" in HTML format.  IF, for example, you were
to generate a toc.htm pointing to the "TOC" already in one of the books'
HTML files and correctly link that toc.htm to your opf file, THEN when the
PG user clicks on the dedicated "TOC" button in their ebook reader then that
TOC button WOULD correctly function and it would take them to the TOC the
user has already generated in one of their html files.  Or alternatively, if
they have already created a file called toc.htm you could just link to that
correctly as required in the opf file and everything would work.  Or, if you
are generating a TOC in NCX format you could with trivial changes also
generate a toc.htm which you could correctly link into the opf file and then
the TOC button would also work.  Or you could use Calibre ebook-convert
software which would do this automatically for you and again everything
would actually work.

But, instead you continue to pimp the resulting MOBI file format because YOU
think YOU should be the one to choose which devices PG users should be
reading on, rather than generating valid files in the file formats that PG
customers need to read on the devices they already own.

I think this is silly.  Let the marketplace decide.  If Amazon acts in an
onerous way to customers then customers will choose to buy from Apple and
read in EPUB format.  If Apple acts in an onerous way to customers then
customers will choose to buy from Amazon and will read in MOBI format.
Having the choice helps drive the e-book vendors into less onerous behavior
-- hopefully! -- So far all that Apple has succeeded in doing is driving up
the price for new releases for all ebook readers from $9.99 to $15.99 --
thanks Jobs, that's quite an accomplishment!

-----Original Message-----
From: gutvol-d-bounces at lists.pglaf.org
[mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Marcello Perathoner
Sent: Monday, February 01, 2010 11:57 AM
To: Project Gutenberg Volunteer Discussion
Subject: [gutvol-d] Re: Formats and gripes

James Adcock wrote:

> 2. So all
> Kindle books should have both logical and HTML TOCs. Users expect to see
an
> HTML
> TOC when paging through a book from the beginning, while the logical table
> of
> contents is an additional way for users to navigate books.

As most PG ebooks already contain a TOC inside the HTML, its pointless 
to generate another one.


-- 
Marcello Perathoner
webmaster at gutenberg.org
_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d


From marcello at perathoner.de  Mon Feb  1 13:05:35 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 01 Feb 2010 22:05:35 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>	<4B6731F7.8050503@perathoner.de>
	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>
Message-ID: <4B67421F.7090500@perathoner.de>

James Adcock wrote:
>> As most PG ebooks already contain a TOC inside the HTML, its pointless 
> to generate another one.
> 
> Sigh, you are going around in circles. The issue is that you are generating
> MOBI files that do not correctly implement the TOC standard of MOBI files.

mobigen is generating MOBI files that ...

> The result of this is that when a user of a PG file clicks on the dedicated
> "TOC" button on their e-reader device, the MOBI file you generate fails to
> take them to the TOC.  This is a file format failure on the part of the file
> format YOU are generating.

Not at all. The epub files I generate validate with epubcheck and the 
TOC displays correctly on a ADE readers.

mobigen then, for whatever reason of its own, fumbles a perfectly valid 
toc.ncx in a perfectly valid epub file. This is Amazon's problem. I 
suggest they download a copy of the epub spec and give it to their 
developers.

> Or you could use Calibre ebook-convert
> software which would do this automatically for you and again everything
> would actually work.

Calibre is slow and converts everything first to an interim format (Sony 
LRF I think) which loses most formatting. But foremost calibre is a 
kitchen sink that has dozens of dependencies some of which I cannot 
install on ibiblio. E.g. it wants cherrypy v2 whereas I use cherrypy v3 
for gutenberg development. (What calibre needs a web application server 
for is beyond me.)


> But, instead you continue to pimp the resulting MOBI file format because YOU
> think YOU should be the one to choose which devices PG users should be
> reading on, rather than generating valid files in the file formats that PG
> customers need to read on the devices they already own.

I use the official kindlegen v 1.0 (as of today) that Amazon says 
publishers should use to generate files for the Kindle.

Save your breath to complain to Amazon, because its their application 
that is broken and not my epub files.

If my files don't pass epubcheck, I will fix them. If Amazon needs some 
non-standard gimmick inserted because they can't be bothered to 
implement the spec, then I will definitely NOT insert it.


> I think this is silly.  Let the marketplace decide.  If Amazon acts in an
> onerous way to customers then customers will choose to buy from Apple and
> read in EPUB format.  If Apple acts in an onerous way to customers then
> customers will choose to buy from Amazon and will read in MOBI format.

Let me know when `the marketplace? has fixed the bugs in their app.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From jimad at msn.com  Mon Feb  1 13:26:36 2010
From: jimad at msn.com (Jim Adcock)
Date: Mon, 1 Feb 2010 13:26:36 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B67421F.7090500@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>	<4B6731F7.8050503@perathoner.de>	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>
	<4B67421F.7090500@perathoner.de>
Message-ID: <SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>

>Not at all. The epub files I generate validate with epubcheck and the 
TOC displays correctly on a ADE readers.

You are leaving out the TOC element in the guide structure of the epub
files.  While it is legal to do so in EPUB [not MOBI], that ADE displays a
"TOC" [actually the NCX guide structure] even when you are leaving out of
the guide can be considered "an extension" at best, a non-conforming
behavior of ADE at worse.

NCX is NOT a "TOC" per se, see:

http://www.openebook.org/2007/opf/OPF_2.0_final_spec.html#Section2.4.1

particularly 2.4.1.1

where in comparison it shows how TOC, List of Illustrations, etc are
SUPPOSED to be implemented at:

http://www.openebook.org/2007/opf/OPF_2.0_final_spec.html#Section2.6


From Bowerbird at aol.com  Mon Feb  1 16:14:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 1 Feb 2010 19:14:58 EST
Subject: [gutvol-d] roundlessness -- 001
Message-ID: <1c73a.392d55ff.3898c882@aol.com>

welcome to february 2010...            :+)

roger frank (known as rfrank over at distributed proofreaders)
is doing an experimental test of a roundless proofing system:
>    http://www.fadedpage.com

i've argued for years and years that d.p. should go roundless,
so i see this experiment as a wonderful thing, and i support it.

i repeat, this is a very very very very very very very good thing...

so worthwhile, in fact, that i will spend some time analyzing it,
and offering up the valuable gift of some constructive criticism.

i'm sure roger will be thrilled to hear it...

in order to get the most out of my posts, you should probably
go over and register at the site and do a little bit of work there.
that way you'll get enough experience to have a feel for the site.
you might wanna read the forums too, so as to grasp the issues.
it won't take much time, and you'll acquaint you with the future.

i'll probably have 28 days worth of material -- so settle in and
make yourselves comfortable as we look at roger's experiment
during the month of february...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100201/b15ce428/attachment.html>

From pterandon at gmail.com  Mon Feb  1 18:38:14 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Mon, 1 Feb 2010 21:38:14 -0500
Subject: [gutvol-d] Psychology of interacting with (Google's) ebooks.
Message-ID: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>

I downloaded two epub's from Google Books and one or both of the book
reading apps on my Android phone didn't even see one of them.

I think that some of these collections are designed with the idea that the
repository should be on the web, and you the reader should go search the web
interface to find a book you want, then download that one book, have perfect
confidence it's going to be cool to read and functioning properly, then
maybe you'll go on to the next one a few days later.

I don't think humans work that way.  First of all web interfaces, especially
on a phone, are inherently slow, and sometimes unavailable either due to
wifi/ 3G coverage or due to embarassment about using "work bandwidth".  The
Google Books interface isn't *bad*, but it's still like being fed at a
gourmet banquet with a baby spoon.    The user may have one bad experience
with a downloaded text, no matter how small, and they want to curate their
own collection first, maybe hoard-up more books than they or their family
could read in a lifetime, cull out the icky or malfunctioning texts, and
then have say 20 on their reader and 2000 on a DVD in a safe in their
basement. At least that's how I respond to having one or two minor
problems.  ;)

I don't think that Google Books at least gets this. I spent so much time at
Google Books, browsing in apparently spider-like fashion, that I got this
warning:
"We're sorry...

... but your computer or network may be sending automated queries. To
protect our users, we can't process your request right now."
I guess they're right. At any moment I was about to try to download a few
hundred epub's.


-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100201/718777b7/attachment.html>

From schultzk at uni-trier.de  Mon Feb  1 23:09:29 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 2 Feb 2010 08:09:29 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
Message-ID: <3766AD4E-D2C0-41DB-BAC4-943A4DB3F2FE@uni-trier.de>

Hi Greg,

	
Am 02.02.2010 um 03:38 schrieb Greg M. Johnson:

> I downloaded two epub's from Google Books and one or both of the book reading apps on my Android phone didn't even see one of them. 
	You my have put the books on your phone. BUT does your Phone/reader know they are their?!!!
	On my Nokias I load music with thier tool from my Mac. But, I have to have the Player scan the phone
	for music to see them. Maybe you have to do that. Or maybe the reader on your Android needs some
	other files to see the books!
	Hope this helps. 
	 
	regards
		Keith.


From marcello at perathoner.de  Mon Feb  1 23:14:59 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Feb 2010 08:14:59 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
Message-ID: <4B67D0F3.1040907@perathoner.de>

Greg M. Johnson wrote:

> I don't think that Google Books at least gets this. I spent so much time at 
> Google Books, browsing in apparently spider-like fashion, that I got this warning:
> 
> 
>   "We're sorry...
> 
> ... but your computer or network may be sending automated queries. To protect 
> our users, we can't process your request right now."

That may not be a quetion of getting `it? but of getting `hit?.

gutenberg.org too gets hit by dozens of spiders a day, some of them 
sitting on big pipes and working with up to a hundred threads.

While one of those spiders is at work, a human user can just about 
forget getting anything out of gutenberg.org because all server cycles 
are used to serve the spider.

This is why gutenberg.org automatically denies access to IPs that make 
more than a certain amount of requests per hour.

I think with Google the problem may be even worse than with gutenberg.org.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From pterandon at gmail.com  Tue Feb  2 05:17:35 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Tue, 2 Feb 2010 08:17:35 -0500
Subject: [gutvol-d] Re: Formats and gripes
Message-ID: <a0bf3e961002020517x46c8de10r9f8e8cbfc9c9a4b0@mail.gmail.com>

From: "James Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Date: Mon, 1 Feb 2010 12:23:02 -0800
Subject: [gutvol-d] Re: Formats and gripes

>>As most PG ebooks already contain a TOC inside the HTML,
>> its pointless to generate another one.
>
> Sigh, you are going around in circles. The issue is that you
> are generating MOBI files that do not correctly implement
> the TOC standard of MOBI files.

TOC is one thing.  PG's epub file for "At the Earth's Core"  (pg123.epub)
shows up as under a list of "Unknown Authors" on my Android phone's FBReader
(software recommended by PG).  There's no title for it either  in the
display in one's Library.   Once you open it, it appears to work well, even
with a TOC!  But is there something different about the way this text was
prepared in comparison to, say the way the epub for "The Three Musketeers"
was prepared?  That shows up correctly with title and author.


-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100202/3ff903de/attachment.html>

From marcello at perathoner.de  Tue Feb  2 06:48:56 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 02 Feb 2010 15:48:56 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>	<4B6731F7.8050503@perathoner.de>	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>	<4B67421F.7090500@perathoner.de>
	<SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>
Message-ID: <4B683B58.90800@perathoner.de>

Jim Adcock wrote:
>> Not at all. The epub files I generate validate with epubcheck and the 
> TOC displays correctly on a ADE readers.
> 
> You are leaving out the TOC element in the guide structure of the epub
> files.  While it is legal to do so in EPUB [not MOBI], that ADE displays a
> "TOC" [actually the NCX guide structure] even when you are leaving out of
> the guide can be considered "an extension" at best, a non-conforming
> behavior of ADE at worse.

 From the epub spec:

> Within the package there may be one guide element, 
> Reading Systems are not required to use the guide element in any way. 

The guide is optional on both sides, the publishing side and the 
consumer side. If Amazon makes it a requirement to have a guide in the 
epub they clearly didn't understand the spec.


-- 
Marcello Perathoner
webmaster at gutenberg.org


From prosfilaes at gmail.com  Tue Feb  2 10:33:05 2010
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 2 Feb 2010 13:33:05 -0500
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B683B58.90800@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>
	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>
	<4B65299A.7060304@perathoner.de>
	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>
	<4B6731F7.8050503@perathoner.de>
	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>
	<4B67421F.7090500@perathoner.de>
	<SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>
	<4B683B58.90800@perathoner.de>
Message-ID: <6d99d1fd1002021033r50294c57j1dd8dd8180b8fd53@mail.gmail.com>

On Tue, Feb 2, 2010 at 9:48 AM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> The guide is optional on both sides, the publishing side and the consumer
> side. If Amazon makes it a requirement to have a guide in the epub they
> clearly didn't understand the spec.

Clearly. You've been around for a while; you know that in practice
there are optional features that are mandatory if you want decent
support for the user.

-- 
Kie ekzistas vivo, ekzistas espero.

From jimad at msn.com  Tue Feb  2 11:50:33 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 2 Feb 2010 11:50:33 -0800
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <4B67D0F3.1040907@perathoner.de>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<4B67D0F3.1040907@perathoner.de>
Message-ID: <SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>

In my experience and opinion, Google Books is designed to be overly paranoid
about the spidering issue.  I can spend 15 minutes there searching for
interesting books without even downloading hardly any of the them, and then
Google goes into paranoid mode, and starts requiring "Captcha" on everything
I do.  Also, the search algorithm, whatever it is, is bizarre.  One day I
can find a particular book, I come back the next day and enter the same
search terms, and suddenly Google Books can't find it any more.  Having said
that, I find I can usually live with a Google Book that I find and am
interested in -- either in the PDF format or the EPUB, it depends --
assuming I can't find a PG version of the book where a real human being has
fixed the scannos! Someday maybe I'll even learn to live with the occasional
thumb that shows up in my books! Certainly it is cool the ancient and
obscure things one can find on Google Books.  Not clear their efforts are
really overall to the long-term benefit of society however. And there is a
general problem that the more residual benefits citizens find in old books,
then the more likely our "representatives" will take away our constitutional
rights to read and share old books, and "sell" those rights back to ebook
retailers like Google -- as has already happened in the millennium copyright
laws, and/or DRM.


From hart at pglaf.org  Tue Feb  2 12:34:11 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 2 Feb 2010 12:34:11 -0800 (PST)
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<4B67D0F3.1040907@perathoner.de>
	<SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>
Message-ID: <alpine.DEB.2.00.1002021233540.16465@mail.pglaf.org>


Well said!!!

mh


On Tue, 2 Feb 2010, James Adcock wrote:

> In my experience and opinion, Google Books is designed to be overly paranoid
> about the spidering issue.  I can spend 15 minutes there searching for
> interesting books without even downloading hardly any of the them, and then
> Google goes into paranoid mode, and starts requiring "Captcha" on everything
> I do.  Also, the search algorithm, whatever it is, is bizarre.  One day I
> can find a particular book, I come back the next day and enter the same
> search terms, and suddenly Google Books can't find it any more.  Having said
> that, I find I can usually live with a Google Book that I find and am
> interested in -- either in the PDF format or the EPUB, it depends --
> assuming I can't find a PG version of the book where a real human being has
> fixed the scannos! Someday maybe I'll even learn to live with the occasional
> thumb that shows up in my books! Certainly it is cool the ancient and
> obscure things one can find on Google Books.  Not clear their efforts are
> really overall to the long-term benefit of society however. And there is a
> general problem that the more residual benefits citizens find in old books,
> then the more likely our "representatives" will take away our constitutional
> rights to read and share old books, and "sell" those rights back to ebook
> retailers like Google -- as has already happened in the millennium copyright
> laws, and/or DRM.
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From hart at pglaf.org  Tue Feb  2 12:35:45 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 2 Feb 2010 12:35:45 -0800 (PST)
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002021235020.16465@mail.pglaf.org>


Well said!!!

I should have posted this earlier. . .and mentioned thatI asked permission
to use this, forward it, etc., in the future. . . .


Michael


On Mon, 1 Feb 2010, Greg M. Johnson wrote:

> I downloaded two epub's from Google Books and one or both of the book reading apps on my
> Android phone didn't even see one of them.
>
> I think that some of these collections are designed with the idea that the repository
> should be on the web, and you the reader should go search the web interface to find a
> book you want, then download that one book, have perfect confidence it's going to be cool
> to read and functioning properly, then maybe you'll go on to the next one a few days
> later.
>
> I don't think humans work that way.? First of all web interfaces, especially on a phone,
> are inherently slow, and sometimes unavailable either due to wifi/ 3G coverage or due to
> embarassment about using "work bandwidth".? The Google Books interface isn't *bad*, but
> it's still like being fed at a gourmet banquet with a baby spoon.? ? The user may have
> one bad experience with a downloaded text, no matter how small, and they want to curate
> their own collection first, maybe hoard-up more books than they or their family could
> read in a lifetime, cull out the icky or malfunctioning texts, and then have say 20 on
> their reader and 2000 on a DVD in a safe in their basement. At least that's how I respond
> to having one or two minor problems.? ;)
>
> I don't think that Google Books at least gets this. I spent so much time at Google Books,
> browsing in apparently spider-like fashion, that I got this warning:
>                                      "We're sorry...
>
> ... but your computer or network may be sending automated queries. To protect our users,
> we can't process your request right now."
>
> I guess they're right. At any moment I was about to try to download a few hundred epub's.
>
>
> --
> Greg M. Johnson
> http://pterandon.blogspot.com
>
>

From Bowerbird at aol.com  Tue Feb  2 15:59:55 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 2 Feb 2010 18:59:55 EST
Subject: [gutvol-d] roundlessness -- 002
Message-ID: <56f4.5ddd8802.389a167b@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

as i said yesterday, this test is a very very very very very good thing,
because distributed proofreaders has been bogged down in a morass
of "rounds" for many years now.   their standard workflow now calls for
_three_ rounds of proofing, followed by _two_ rounds of formatting...

throw in a "preprocessing" round, and their "postprocessing", which
is following by "postprocessing verification", and you've got 8 rounds.

i don't know about you, but to me, that seems like a lot...

but that's not the worst of it.   the worst is the resultant backlogs...

the problem arises because d.p. has thousands of proofers doing p1
(the first round of proofing), but d.p. only has hundreds that do p2
(the second round), and mere _dozens_ doing p3 ("final" proofing)...

needless to say, the large number of proofers doing p1 can proof
more than the smaller number doing p2, or the tiny number in p3.
the backlog created is (understandably) frustrating and demoralizing
for the proofers trying to keep up in p2, and is killing the p3 proofers.

there is also the gnawing feeling that not all pages _need_ 3 rounds.
indeed, _most_ pages in _most_ books are simple enough that they
can be finished in one round, two at the most.   so the _inefficiency_
of the 3-round proofing is rather striking as well.   the thought is that
each page should be proofed only as many times as that page needs;
this has been labeled as a "roundless" system.

aside from the backlogs of partially-done material, the other sign of
a problem with the dp.p. workflow is that production has flattened...
even though d.p. enjoys a constant stream of incoming volunteers,
thanks to all of the good-will that project gutenberg's free e-books
have generated over the years, d.p. output has leveled out at under
250 books per month, which works out to less than 3,000 per year.
against the backdrop of the _millions_ of books google has scanned,
this is a mere drop in the bucket.   a small drop in a very large bucket.

rfrank doesn't go into all of this on his site.   perhaps he didn't need 
to,
since the d.p. people he's recruited are well-acquainted with the issues.

but rfrank is also unclear on many of the details of his little experiment,
which is a more worrying matter.

specifically, i don't see a lot of experimental rigor here.   it seems to 
me
that roger is unfamiliar with the mechanics of the scientific method and
its applicability to human social experiments.   i see no evidence of any
stated hypotheses, nor any way such hypotheses can be disconfirmed...

the reason people developed the scientific method was because we found
that when we just fooled around "to see how things turn out", we often
ended up fooling ourselves about what we had seen, and what it meant.

we learned that we had to actually specify our hypotheses, and devise
tests (experiments) specifically designed to disconfirm our hypotheses.

otherwise, our brains are only too willing to accommodate what we find
as being "supportive" of our initial impressions.   ("experimenter bias"
is the term by which this insidious phenomenon is most well-known.)

if i'm correct, this problem will surface in rfrank's future results, and
surface repeatedly, so there's no need for me to labor the point now.
but i wanted to frame this particular issue, here and now, in advance.

that's enough for today.   see you tomorrow...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100202/7a5de2b5/attachment-0001.html>

From jimad at msn.com  Tue Feb  2 16:02:49 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Feb 2010 16:02:49 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B683B58.90800@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>	<4B6731F7.8050503@perathoner.de>	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>	<4B67421F.7090500@perathoner.de>	<SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>
	<4B683B58.90800@perathoner.de>
Message-ID: <SNT120-DS1CF5E7A50DDDCAAB0BED5AE560@phx.gbl>

>The guide is optional on both sides, the publishing side and the 
consumer side. If Amazon makes it a requirement to have a guide in the 
epub they clearly didn't understand the spec.

Amazon doesn't make it a requirement to have a guide in epub, they make it a
requirement to have a guide in mobi.  Both epub and mobi can be made from
OPF, they just have slightly different requirements on that OPF file set.
You could easily generate the set of files required for epub, generate that
file format, then add the one extra file required for a conforming mobi --
which is just a slightly different syntax than the ngx file, add one link
statement in the opf, and recompile the set of files for a fully conforming
mobi.

But instead you blame Amazon for the fact that YOU are choosing to make
files that will not work correctly on the majority of e-book readers being
sold in the market.  You could easily make them work if you wanted to, but
you don't want them to work. 

Other web sites for books including sites for free books using basically the
same set of tools that you are using, instead of making excuses and
finger-pointing ARE making files that work correctly on the majority of
e-book readers being sold in the market. It's not like this is a whole lot
of work for you one way or another.  It's just that you WANT to pimp the
files you are making for Kindles.


From jimad at msn.com  Tue Feb  2 16:13:02 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Feb 2010 16:13:02 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B672C4E.8030600@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050820o105d3daen19f344ba32e66157@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
	<4B672C4E.8030600@perathoner.de>
Message-ID: <SNT120-DS135A7A28AE00A35E43F188AE560@phx.gbl>

>Now if Amazon would release their "Kindle for PC" for Linux too, I could 
actually check the generated files... knowing that the Kindle runs on 
Linux where's the big holdup?

Don't know other than presumably not enough people in the word run Linux to
make it worth their while.  You CAN however use Linux to install
Mobipocket's free mobile device compatible Reader -- Mobipocket being part
of Amazon -- said reader supports about 50 different popular mobile devices.
The Mobipocket Reader will also allow you to confirm the fact that you are
not adding a conforming TOC to your mobi files.

Read:

http://www.mobipocket.com/en/DownloadSoft/ProductDetailsReader.asp

and look for the little penguin on the right hand side of the page.


From jimad at msn.com  Tue Feb  2 17:33:01 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 2 Feb 2010 17:33:01 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <56f4.5ddd8802.389a167b@aol.com>
References: <56f4.5ddd8802.389a167b@aol.com>
Message-ID: <SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>

>...killing the p3 proofers.

The problem is worse: under the pressure to produce, and having become
"jaded" the p3'ers apparently do not bother to even look at the digitized
images of the author's text but rather assume that they know best and
introduce changes which are other than what the author wrote.  There is also
the problem of "false positives" -- once the errors left in the text become
infrequent-enough the human mind wants to make changes to "show you're
making a positive contribution" even when there was no error there that the
P3'ers ought to be fixing.  But even the p3 problem is nothing compared to
the wait time in post-processing, where things can get hung up for about
literally another year.

If PG were able to easily accept a txt file now and the html version (and
other versions later) not only would readers get some books a year earlier,
but we could probably save some efforts that die and get lost somewhere
between txt complete and html complete. Why does posting have to happen "all
at once" ???


From gbnewby at pglaf.org  Tue Feb  2 17:44:12 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 2 Feb 2010 17:44:12 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
Message-ID: <20100203014412.GA26584@pglaf.org>

On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> ...
> If PG were able to easily accept a txt file now and the html version (and
> other versions later) not only would readers get some books a year earlier,
> but we could probably save some efforts that die and get lost somewhere
> between txt complete and html complete. Why does posting have to happen "all
> at once" ???

It doesn't.  In fact, "extracting" works from DP earlier was a big push
I made a couple of years ago.  At that time, such two stage (or other
great-than-one stage) output was something that didn't fit well with
the workflow.  Maybe that's something that could be revisited.

It's important to not double the effort involved at the final posting
phase (whitewashing) through such a two stage process.  But there are
several good ways of insuring this, which could be incorporated with
the process.

There is definitely flexibility.

  -- Greg

From dakretz at gmail.com  Tue Feb  2 18:00:48 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 2 Feb 2010 18:00:48 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <20100203014412.GA26584@pglaf.org>
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org>
Message-ID: <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>

That's real good news, Greg, especially if you're talking about flexibility
on
the DP side. 100% of the responsibility for evaluating and recommending
changes to the DP process has been apparently relegated to the DP Board
of Directors.

Since you are one of the five directors, you're in the know if anyone is.
Since
you represent 20% of the horsepower responsible for coming up with those
changes, I trust you've been busy.


On Tue, Feb 2, 2010 at 5:44 PM, Greg Newby <gbnewby at pglaf.org> wrote:

> On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> > ...
> > If PG were able to easily accept a txt file now and the html version (and
> > other versions later) not only would readers get some books a year
> earlier,
> > but we could probably save some efforts that die and get lost somewhere
> > between txt complete and html complete. Why does posting have to happen
> "all
> > at once" ???
>
> It doesn't.  In fact, "extracting" works from DP earlier was a big push
> I made a couple of years ago.  At that time, such two stage (or other
> great-than-one stage) output was something that didn't fit well with
> the workflow.  Maybe that's something that could be revisited.
>
> It's important to not double the effort involved at the final posting
> phase (whitewashing) through such a two stage process.  But there are
> several good ways of insuring this, which could be incorporated with
> the process.
>
> There is definitely flexibility.
>
>  -- Greg
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100202/c027297c/attachment.html>

From gbnewby at pglaf.org  Tue Feb  2 18:20:55 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 2 Feb 2010 18:20:55 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org>
	<627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>
Message-ID: <20100203022055.GA28054@pglaf.org>

On Tue, Feb 02, 2010 at 06:00:48PM -0800, don kretz wrote:
> That's real good news, Greg, especially if you're talking about flexibility
> on
> the DP side. 100% of the responsibility for evaluating and recommending
> changes to the DP process has been apparently relegated to the DP Board
> of Directors.

I don't think that was the intention of the (relatively) new Board and
new GM.  The Board has ideas, but isn't trying to manage day to day
activity.

> Since you are one of the five directors, you're in the know if anyone is.
> Since
> you represent 20% of the horsepower responsible for coming up with those
> changes, I trust you've been busy.

Indeed, but actually we have not been looking at this level
of detail for changes in the DP processing chain.  The Board
isn't to micromange, and isn't to get in the way of progress.

That said, if you think there are proposals, ideas for change,
etc. that are not getting the attention they deserve, I would
be happy to bring them to the board (or GM, as appropriate) on
anyone's behalf, anonymously if desired.

  -- Greg

> On Tue, Feb 2, 2010 at 5:44 PM, Greg Newby <gbnewby at pglaf.org> wrote:
> 
> > On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> > > ...
> > > If PG were able to easily accept a txt file now and the html version (and
> > > other versions later) not only would readers get some books a year
> > earlier,
> > > but we could probably save some efforts that die and get lost somewhere
> > > between txt complete and html complete. Why does posting have to happen
> > "all
> > > at once" ???
> >
> > It doesn't.  In fact, "extracting" works from DP earlier was a big push
> > I made a couple of years ago.  At that time, such two stage (or other
> > great-than-one stage) output was something that didn't fit well with
> > the workflow.  Maybe that's something that could be revisited.
> >
> > It's important to not double the effort involved at the final posting
> > phase (whitewashing) through such a two stage process.  But there are
> > several good ways of insuring this, which could be incorporated with
> > the process.
> >
> > There is definitely flexibility.
> >
> >  -- Greg
> > _______________________________________________
> > gutvol-d mailing list
> > gutvol-d at lists.pglaf.org
> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> >

From ke at gnu.franken.de  Tue Feb  2 21:28:32 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Wed, 03 Feb 2010 06:28:32 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl> (James Adcock's
	message of "Tue, 2 Feb 2010 11:50:33 -0800")
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<4B67D0F3.1040907@perathoner.de>
	<SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>
Message-ID: <m2ock6q45b.fsf@gnu.franken.de>

"James Adcock" <jimad at msn.com> writes:

> One day I can find a particular book, I come back the next day and
> enter the same search terms, and suddenly Google Books can't find it
> any more.

So what?  If the environment changes (more books, new reviews, external
linking, etc.), yesterdays assumptions could be different or even "wrong"
today.

Sidenote: It is the same with the idea of the iso-8859-1 (or ASCII for
languages, that require more characters) version of books.  These days
everything should be UTF-8 encoded by default.  The ASCII idea was fine
some twenty years ago, but today it is time for change.

On gutenberg you cannot find most books at all!  They simply do not
exist in our cosmos.  And waht's worse, even the important books are
mostly missing or weakly done.

I'm hapy that google offers all these books.  If one issue has defects,
chances are high that there is another copy in the Google cache that
you can use as a remedy.

-- 
Karl Eichwalder

From dakretz at gmail.com  Tue Feb  2 21:43:07 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 2 Feb 2010 21:43:07 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <20100203022055.GA28054@pglaf.org>
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org>
	<627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>
	<20100203022055.GA28054@pglaf.org>
Message-ID: <627d59b81002022143k3582d0fam473fcd4a01523749@mail.gmail.com>

And on the other end we're hearing the same thing - the GM is there only to
manage,
and initiative for change will come from the Board. I'm absolutely not
suggesting the Board
is or should be micro or macro managing. I think everyone is expecting that
the
Board is about Planning.  You're not? You disagree?

On Tue, Feb 2, 2010 at 6:20 PM, Greg Newby <gbnewby at pglaf.org> wrote:

> On Tue, Feb 02, 2010 at 06:00:48PM -0800, don kretz wrote:
> > That's real good news, Greg, especially if you're talking about
> flexibility
> > on
> > the DP side. 100% of the responsibility for evaluating and recommending
> > changes to the DP process has been apparently relegated to the DP Board
> > of Directors.
>
> I don't think that was the intention of the (relatively) new Board and
> new GM.  The Board has ideas, but isn't trying to manage day to day
> activity.
>
> > Since you are one of the five directors, you're in the know if anyone is.
> > Since
> > you represent 20% of the horsepower responsible for coming up with those
> > changes, I trust you've been busy.
>
> Indeed, but actually we have not been looking at this level
> of detail for changes in the DP processing chain.  The Board
> isn't to micromange, and isn't to get in the way of progress.
>
> That said, if you think there are proposals, ideas for change,
> etc. that are not getting the attention they deserve, I would
> be happy to bring them to the board (or GM, as appropriate) on
> anyone's behalf, anonymously if desired.
>
>  -- Greg
>
> > On Tue, Feb 2, 2010 at 5:44 PM, Greg Newby <gbnewby at pglaf.org> wrote:
> >
> > > On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> > > > ...
> > > > If PG were able to easily accept a txt file now and the html version
> (and
> > > > other versions later) not only would readers get some books a year
> > > earlier,
> > > > but we could probably save some efforts that die and get lost
> somewhere
> > > > between txt complete and html complete. Why does posting have to
> happen
> > > "all
> > > > at once" ???
> > >
> > > It doesn't.  In fact, "extracting" works from DP earlier was a big push
> > > I made a couple of years ago.  At that time, such two stage (or other
> > > great-than-one stage) output was something that didn't fit well with
> > > the workflow.  Maybe that's something that could be revisited.
> > >
> > > It's important to not double the effort involved at the final posting
> > > phase (whitewashing) through such a two stage process.  But there are
> > > several good ways of insuring this, which could be incorporated with
> > > the process.
> > >
> > > There is definitely flexibility.
> > >
> > >  -- Greg
> > > _______________________________________________
> > > gutvol-d mailing list
> > > gutvol-d at lists.pglaf.org
> > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> > >
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100202/a6c99f4e/attachment-0001.html>

From ke at gnu.franken.de  Tue Feb  2 23:01:40 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Wed, 03 Feb 2010 08:01:40 +0100
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <20100203014412.GA26584@pglaf.org> (Greg Newby's message of "Tue, 
	2 Feb 2010 17:44:12 -0800")
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org>
Message-ID: <m2k4uupzu3.fsf@gnu.franken.de>

Greg Newby <gbnewby at pglaf.org> writes:

> On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:

> It doesn't.  In fact, "extracting" works from DP earlier was a big push
> I made a couple of years ago.  At that time, such two stage (or other
> great-than-one stage) output was something that didn't fit well with
> the workflow.  Maybe that's something that could be revisited.

I'm all for it.  In the DP forum, I proposed this several times.

> It's important to not double the effort involved at the final posting
> phase (whitewashing) through such a two stage process.  But there are
> several good ways of insuring this, which could be incorporated with
> the process.

Could we give this a try with manually selected books first?  How can we
make sure that we do not waste the whitewashers' time?

-- 
Karl Eichwalder

From traverso at posso.dm.unipi.it  Tue Feb  2 23:18:27 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Wed,  3 Feb 2010 08:18:27 +0100 (CET)
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <m2k4uupzu3.fsf@gnu.franken.de> (message from Karl Eichwalder on
	Wed, 03 Feb 2010 08:01:40 +0100)
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org> <m2k4uupzu3.fsf@gnu.franken.de>
Message-ID: <20100203071827.50C00FFB2@cardano.dm.unipi.it>


While we are at it, could we consider a revision of the requirements
for the PG txt files? Allowing a bit more of flexibility (for example,
allow to preserve the original line and page breaks) and possibly with
the availability of the page images will improve considerably the
maintenance of the files and the addition of new versions.

Carlo

From marcello at perathoner.de  Tue Feb  2 23:21:14 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 03 Feb 2010 08:21:14 +0100
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <SNT120-DS1CF5E7A50DDDCAAB0BED5AE560@phx.gbl>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>	<4B6731F7.8050503@perathoner.de>	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>	<4B67421F.7090500@perathoner.de>	<SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>	<4B683B58.90800@perathoner.de>
	<SNT120-DS1CF5E7A50DDDCAAB0BED5AE560@phx.gbl>
Message-ID: <4B6923EA.6050700@perathoner.de>

Jim Adcock wrote:

> But instead you blame Amazon for the fact that YOU are choosing to make
> files that will not work correctly on the majority of e-book readers being
> sold in the market.  You could easily make them work if you wanted to, but
> you don't want them to work. 

If I didn't want them to work I wouldn't generate them in the first time.


To recap for the last time:

1. I do generate epub files that pass epubcheck and display correctly on 
ADE mobile readers.

2. Amazon provides a converter "kindlegen" that claims to convert epub 
files into their proprietary mobi format.

3. kindlegen fumbles the perfectly valid toc that is inside my epubs and 
generates a mobi file without toc (your claim).

4. You tell me that I should volunteer more unpaid time to work around a 
bug in Amazon's converter, reverse engineer their closed proprietary 
format for which they provide no documentation maybe and test it on a 
dozen devices that I should buy out of my own pocket.


IMHO you should bugger the people that chose to make the format 
proprietary, to not document it in any way and on top of that release 
buggy converter software.


Remember another textbook example: that Internet Explorer even in its 
8th incarnation still does not follow w3c standards. And why is that 
possible? Because developers all over the world chose to work around 
Microsofts bugs instead of forcing them to fix their software.


I'm not going down that slippery slope: If I would I'd spend more time 
working around other people's bugs than writing new functionality.

But YOU are perfectly free to volunteer your time to save Amazon some 
bucks: Take my epubs, patch them, and convert them to mobis that display 
the toc when you hit the toc button, and redistribute them on your site.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From frank.vandrogen at bc.biol.ethz.ch  Tue Feb  2 23:28:13 2010
From: frank.vandrogen at bc.biol.ethz.ch (van Drogen  Frank)
Date: Wed, 3 Feb 2010 08:28:13 +0100
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <56f4.5ddd8802.389a167b@aol.com>
References: <56f4.5ddd8802.389a167b@aol.com>
Message-ID: <5B4C3A336FC71D4495CB3318A111D285042A23A7@EX2.d.ethz.ch>

> i see no evidence of any

> stated hypotheses, nor any way such hypotheses can be disconfirmed...

> 

> the reason people developed the scientific method was because we found

> that when we just fooled around "to see how things turn out", we often

> ended up fooling ourselves about what we had seen, and what it meant.

 
Not quite familiar with modern advances in sciences, I recon. Now-a-days
it 

seems we're supposed to look at systems as a whole, instead of doing 

hypothesis drivin experiments (at least, granting agencies seem to think
so). 

 
Frank

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100203/9e6713f1/attachment.html>

From Bowerbird at aol.com  Wed Feb  3 02:24:11 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Feb 2010 05:24:11 EST
Subject: [gutvol-d] Re: roundlessness -- 002
Message-ID: <e8a.3da17d11.389aa8cb@aol.com>

get yer own thread!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100203/4ea230ef/attachment.html>

From dakretz at gmail.com  Wed Feb  3 08:53:08 2010
From: dakretz at gmail.com (don kretz)
Date: Wed, 3 Feb 2010 08:53:08 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B6923EA.6050700@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>
	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>
	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>
	<4B6731F7.8050503@perathoner.de>
	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>
	<4B67421F.7090500@perathoner.de>
	<SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>
	<4B683B58.90800@perathoner.de>
	<SNT120-DS1CF5E7A50DDDCAAB0BED5AE560@phx.gbl>
	<4B6923EA.6050700@perathoner.de>
Message-ID: <627d59b81002030853k15255e3bkeb3e43f02b91bed9@mail.gmail.com>

>
>
> I'm not going down that slippery slope: If I would I'd spend more time
> working around other people's bugs than writing new functionality.
>
> But YOU are perfectly free to volunteer your time to save Amazon some
> bucks: Take my epubs, patch them, and convert them to mobis that display the
> toc when you hit the toc button, and redistribute them on your site.
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org


I had a car like this once. The turn signal was on the right side of the
steering column.
The headlight dimmer was on the left side. The window winders worked
backwards.
The inside door locks would lock when you pulled them up, and unlock when
you
pushed them down. An iconoclastic car' which was one reason I liked it. No
concessions.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100203/cf448495/attachment.html>

From Bowerbird at aol.com  Wed Feb  3 10:05:01 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Feb 2010 13:05:01 EST
Subject: [gutvol-d] Re: roundlessness -- 002
Message-ID: <d6dc.16d9161e.389b14cd@aol.com>

frank said:
>    Now-a-days it seems we're supposed to look at systems 
>    as a whole, instead of doing hypothesis drivin experiments 
>    (at least, granting agencies seem to think so).

i believe that as this series continues, my point will become 
crystal-clear.
in the absence of such clarity, or if you think your point continues to 
have
some merit, frank, please do make it in a more specific manner later on...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100203/483f684b/attachment.html>

From Bowerbird at aol.com  Wed Feb  3 10:17:08 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 3 Feb 2010 13:17:08 EST
Subject: [gutvol-d] Re: Formats and gripes
Message-ID: <ddb3.214f2f.389b17a4@aol.com>

dakretz said:
>    I had a car like this once. The turn signal 
>    was on the right side of the steering column.
>    The headlight dimmer was on the left side. 
>    The window winders worked backwards.
>    The inside door locks would lock when you pulled them up, 
>    and unlock when you pushed them down. 
>    An iconoclastic car' which was one reason I liked it.

funny how much we are willing to deviate from "the standard"
when we are making the decision to do so for our own reasons,
and how unwilling we are to do so when someone else asks us...

marcello would jump through all kinds of hoops to make his own
preferred formats work, but he won't do jack shit for anyone else.

if you're not gonna make a mobi version that runs on the kindle,
there isn't much sense in making a mobi version at all, is there?

but like all technocrats, marcello is great at displacing the blame:
"if it doesn't work for you, it must be your fault.   not my problem."

and even if another "project gutenberg volunteer" were to _solve_
this particular problem, i doubt marcello would mount the solution.
i don't know who gave him this power to decide what gets blessed
and what doesn't, but i wish they would now take it away from him.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100203/b011f84a/attachment-0001.html>

From joey at joeysmith.com  Wed Feb  3 12:11:04 2010
From: joey at joeysmith.com (Joey Smith)
Date: Wed, 3 Feb 2010 13:11:04 -0700
Subject: [gutvol-d] Re: Psychology of interacting with (Google's)	ebooks.
In-Reply-To: <m2ock6q45b.fsf@gnu.franken.de>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<4B67D0F3.1040907@perathoner.de>
	<SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>
	<m2ock6q45b.fsf@gnu.franken.de>
Message-ID: <20100203201104.GA957@joeysmith.com>

On Wed, Feb 03, 2010 at 06:28:32AM +0100, Karl Eichwalder wrote:

[snip]

> On gutenberg you cannot find most books at all!  They simply do not
> exist in our cosmos.  And waht's worse, even the important books are
> mostly missing or weakly done.
> 
> I'm hapy that google offers all these books.  If one issue has defects,
> chances are high that there is another copy in the Google cache that
> you can use as a remedy.

Do you have a list of these "important books" that PG is missing but which
are available in Google Books?

From dakretz at gmail.com  Wed Feb  3 12:23:03 2010
From: dakretz at gmail.com (don kretz)
Date: Wed, 3 Feb 2010 12:23:03 -0800
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <20100203201104.GA957@joeysmith.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<4B67D0F3.1040907@perathoner.de>
	<SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>
	<m2ock6q45b.fsf@gnu.franken.de> <20100203201104.GA957@joeysmith.com>
Message-ID: <627d59b81002031223v67963d79u3998df223b8d8856@mail.gmail.com>

In fact, DP recently had an active discussion about trying harder to work
from lists of "important books" not yet on PG. This would be very helpful.

On Wed, Feb 3, 2010 at 12:11 PM, Joey Smith <joey at joeysmith.com> wrote:

> On Wed, Feb 03, 2010 at 06:28:32AM +0100, Karl Eichwalder wrote:
>
> [snip]
>
> > On gutenberg you cannot find most books at all!  They simply do not
> > exist in our cosmos.  And waht's worse, even the important books are
> > mostly missing or weakly done.
> >
> > I'm hapy that google offers all these books.  If one issue has defects,
> > chances are high that there is another copy in the Google cache that
> > you can use as a remedy.
>
> Do you have a list of these "important books" that PG is missing but which
> are available in Google Books?
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100203/bcb696ad/attachment.html>

From marcello at perathoner.de  Wed Feb  3 13:23:59 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 03 Feb 2010 22:23:59 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's)	ebooks.
In-Reply-To: <20100203201104.GA957@joeysmith.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>	<4B67D0F3.1040907@perathoner.de>	<SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>	<m2ock6q45b.fsf@gnu.franken.de>
	<20100203201104.GA957@joeysmith.com>
Message-ID: <4B69E96F.9020508@perathoner.de>

Joey Smith wrote:
> On Wed, Feb 03, 2010 at 06:28:32AM +0100, Karl Eichwalder wrote:
> 
> [snip]
> 
>> On gutenberg you cannot find most books at all!  They simply do not
>> exist in our cosmos.  And waht's worse, even the important books are
>> mostly missing or weakly done.
>>
>> I'm hapy that google offers all these books.  If one issue has defects,
>> chances are high that there is another copy in the Google cache that
>> you can use as a remedy.
> 
> Do you have a list of these "important books" that PG is missing but which
> are available in Google Books?

Marx's Kapital
Freud's Traumdeutung
Russell's Principia Mathematica
Grey's Anatomy

... just a few off the top of my head.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From ke at gnu.franken.de  Wed Feb  3 22:53:08 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Thu, 04 Feb 2010 07:53:08 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's)	ebooks.
In-Reply-To: <4B69E96F.9020508@perathoner.de> (Marcello Perathoner's message
	of "Wed, 03 Feb 2010 22:23:59 +0100")
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<4B67D0F3.1040907@perathoner.de>
	<SNT120-DS23A03BF231B1DF495D2B49AE570@phx.gbl>
	<m2ock6q45b.fsf@gnu.franken.de> <20100203201104.GA957@joeysmith.com>
	<4B69E96F.9020508@perathoner.de>
Message-ID: <m2hbpx331n.fsf@gnu.franken.de>

Marcello Perathoner <marcello at perathoner.de> writes:

>> Do you have a list of these "important books" that PG is missing but which
>> are available in Google Books?
>
> Marx's Kapital
> Freud's Traumdeutung
> Russell's Principia Mathematica
> Grey's Anatomy
>
> ... just a few off the top of my head.

Yes, and not a single text by by Novalis, just two books by Fontane,
ditto by W. Raabe, three by Stifter, 1 text by Jean Paul.  I think there
is nearly a single German edition of poems of the Middle Ages (say,
Walther von der Vogelweide).  And litterature about these topics is also
rather rare--Goggle offers tons of those.  All the German
Journals--there is basically nothing available from gutenberg.org.

I'm not sure whether there are still broken LOTE editions at
gutenberg.org, where you simple replaced umlauts with "machting" letters
(? -> a) for the sake of clean ASCII text...

I do not blame us because of these defiancies, but please treat
competitors respectfully, etc. pp.

-- 
Karl Eichwalder

From traverso at posso.dm.unipi.it  Thu Feb  4 01:49:17 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Thu,  4 Feb 2010 10:49:17 +0100 (CET)
Subject: [gutvol-d] autogenerated HTML
Message-ID: <20100204094917.A23DDFFB5@cardano.dm.unipi.it>


I have just become aware that PG now autogenerates HTML for texts that
don't have it. Unfortunately however sometimes the autogenerated file
is garbage (e.g. poetry rewrapped, see 31079). Would it be possible to
have the autogeneration program to find what is the problem, or at
least to preview the autogenerated file and possibly fix either the
program or the files?

Carlo

From Bowerbird at aol.com  Thu Feb  4 02:39:09 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 4 Feb 2010 05:39:09 EST
Subject: [gutvol-d] Re: autogenerated HTML
Message-ID: <1108.22585ed6.389bfdcd@aol.com>

carlo said:
>   I have just become aware that PG now 
>    autogenerates HTML for texts that don't have it. 

since a number of people are fond of rewriting history here,
let's note for the record that i suggested this some time ago.

indeed, my recommendation was that the .txt version should
be used to autogenerate the .html version for _all_ the books,
that hand-crafted .html be abandoned because it is too hard
to maintain and to upgrade.   i also suggested that conformance
to this strategy would enable p.g. to improve the .txt versions...

and i predicted that sooner or later, you'd all come around to
this workflow.   and how you have.   so i will say "i told you so."


>    Unfortunately however sometimes the autogenerated file
>    is garbage (e.g. poetry rewrapped, see 31079). 

without even looking at those files, i can guess what's wrong...

many of the books that are exclusively poetry are set flush to
the left margin, lacking any of the leading spaces that serve
as a signal to the conversion program not to wrap the lines...

so of course the converter is gonna wrap the lines.

this is an error, a major error, in the processing of these books.

(and it's so easy to change every linebreak to a linebreak+space.)


>    Would it be possible to have the autogeneration program to find 
>    what is the problem, or at least to preview the autogenerated file 
>    and possibly fix either the program or the files?

i've never tried to verify it with a closer analysis, but my impression
is that some of the whitewashers use a slightly different converter...

and then of course there are a number of different ones over at d.p.,
including the one in thundercat's app, and another by david garcia...

without dedication to making the .txt program correct at the outset,
however, it doesn't matter how good the converter might be...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100204/aa29c151/attachment.html>

From marcello at perathoner.de  Thu Feb  4 10:15:53 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 04 Feb 2010 19:15:53 +0100
Subject: [gutvol-d] Re: autogenerated HTML
In-Reply-To: <20100204094917.A23DDFFB5@cardano.dm.unipi.it>
References: <20100204094917.A23DDFFB5@cardano.dm.unipi.it>
Message-ID: <4B6B0ED9.2080304@perathoner.de>

Carlo Traverso wrote:

> I have just become aware that PG now autogenerates HTML for texts that
> don't have it. Unfortunately however sometimes the autogenerated file
> is garbage (e.g. poetry rewrapped, see 31079). Would it be possible to
> have the autogeneration program to find what is the problem, or at
> least to preview the autogenerated file and possibly fix either the
> program or the files?

   http://www.gutenberg.org/tools/epubmaker-0.02-preview-2009-11-26.tgz

Look into parsers/GutenbergTextParser.py


-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Fri Feb  5 12:45:33 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 5 Feb 2010 15:45:33 EST
Subject: [gutvol-d] Re: autogenerated HTML
Message-ID: <126a8.62677071.389ddd6d@aol.com>

i said:
>    without even looking at those files, i can guess what's wrong...
>    
>    many of the books that are exclusively poetry are set flush to
>    the left margin, lacking any of the leading spaces that serve
>    as a signal to the conversion program not to wrap the lines...
>
>    so of course the converter is gonna wrap the lines.
>
>    this is an error, a major error, in the processing of these books.
>
>    (and it's so easy to change every linebreak to a linebreak+space.)

well, gee, i should have looked at those files earlier, because 
when i did get around to looking at them, i got a good laugh.

see, the lines _were_ indented, so shouldn't have been wrapped.

and wouldn't have been wrapped if most of the converters around
would've been used.

but i have since learned that there is an "experimental" converter,
programmed by marcello, and that's what was used for this book.

the irony of this gave me a big hearty laugh.

you see, when i laid out the philosophy of z.m.l. here on this list,
some of you will remember how much crap marcello threw at me.
the guy was relentless.   even though he almost never made _any_
posts to the list otherwise, he would respond negatively to anything
i said.   but never constructively.   he'd just throw out pure bullcrap...
for a long time i responded, just to clearly specify that it was crap;
but after a while i decided to let his crap speak for its crappy self,
and i stopped responding.   after that, he stopped responding to me.
(i guess that thing they say about not feeding the trolls is right on.)
but if you wanna see what an ass he was, you can check the archives.
you can also go look at a website he set up with a lot of quotes from
messages i sent here.   most of them are taken out of context, sure,
but even then i stand behind them.   you'll see that they were correct.
(that's right, he set up a _fan_ page, on his web-site, to ridicule me;
i don't know where the guy is coming from, but i think he should 
get a life.)

anyway, marcello insisted that my approach was bunk, and that one
could not successfully generate a full-on .html file from a .txt version.

yet now he is writing code to do just that.

(he's not doing it _successfully_ yet, so in regard to _himself_ alone,
i guess he was right.   but i have been successful for a long time now.)

what's especially ironic -- and funny to me -- is that marcello is now
suffering through the same complications that i experienced, namely
maddeningly inconsistent .txt files, which he must program around,
just like i did.   further, this will lead him to the same conclusion that
i came to, which is that this would be _much_ simpler if only the rules
that p.g. has already established for .txt files were simply _followed_...

even more irony, and thus even more humor?

marcello is programming in _python_, a language where indentation
is _meaningful_.   this is ironic, and funny, because one of the things
about z.m.l. which marcello once tried to lambaste is that whitespace
is meaningful.   every time he does an indent, i hope he chokes on it.

meanwhile, i just have one thing to say to all you z.m.l. naysayers:
i told you so.   you were wrong.   i was right.   i hope you choke on it.

(wait, is that one thing, or 4 things?   oh well, guess it doesn't matter.)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100205/65da0d14/attachment-0001.html>

From pterandon at gmail.com  Fri Feb  5 17:25:11 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Fri, 5 Feb 2010 20:25:11 -0500
Subject: [gutvol-d] Re: psychology of interactimg with ebooks
In-Reply-To: <a0bf3e961002051723u16d8ab3cm51cf731dce2534f1@mail.gmail.com>
References: <a0bf3e961002051713l36b33776gab69abe27e891f6f@mail.gmail.com>
	<a0bf3e961002051715o3082a95tee586616cf9b9097@mail.gmail.com>
	<a0bf3e961002051721s74b821cage12920fe24688b16@mail.gmail.com>
	<a0bf3e961002051723u16d8ab3cm51cf731dce2534f1@mail.gmail.com>
Message-ID: <a0bf3e961002051725g7ad69a1cnf2569429102371f5@mail.gmail.com>

<<discussion of poetry auto-generated files messed up>>

IMNSHO,  another fruit of making the TXT-80 format the default standard.
The programming will ne a headache if one can't just delete *all*the CR's.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100205/98c9acf8/attachment.html>

From Bowerbird at aol.com  Fri Feb  5 18:04:49 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 5 Feb 2010 21:04:49 EST
Subject: [gutvol-d] Re: psychology of interactimg with ebooks
Message-ID: <1b1f0.5792aa2e.389e2841@aol.com>

greg said:
>   The programming will ne a headache 
>    if one can't just delete *all*the CR's.

nah.   programming an unwrap routine is quite easy...

>    http://z-m-l.com/go/unwrap.pl

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100205/ed476f49/attachment.html>

From prosfilaes at gmail.com  Sat Feb  6 05:35:38 2010
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 6 Feb 2010 08:35:38 -0500
Subject: [gutvol-d] Re: psychology of interactimg with ebooks
In-Reply-To: <a0bf3e961002051725g7ad69a1cnf2569429102371f5@mail.gmail.com>
References: <a0bf3e961002051713l36b33776gab69abe27e891f6f@mail.gmail.com>
	<a0bf3e961002051715o3082a95tee586616cf9b9097@mail.gmail.com>
	<a0bf3e961002051721s74b821cage12920fe24688b16@mail.gmail.com>
	<a0bf3e961002051723u16d8ab3cm51cf731dce2534f1@mail.gmail.com>
	<a0bf3e961002051725g7ad69a1cnf2569429102371f5@mail.gmail.com>
Message-ID: <6d99d1fd1002060535r137a7312y9c22bc357e1ecc09@mail.gmail.com>

On Fri, Feb 5, 2010 at 8:25 PM, Greg M. Johnson <pterandon at gmail.com> wrote:
> <<discussion of poetry auto-generated files messed up>>
>
> IMNSHO,? another fruit of making the TXT-80 format the default standard.
> The programming will ne a headache if one can't just delete *all*the CR's.

But it's done; we can no more change it now than Compuserve can change
the format of GIF files.

-- 
Kie ekzistas vivo, ekzistas espero.

From pterandon at gmail.com  Sat Feb  6 05:50:06 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Sat, 6 Feb 2010 08:50:06 -0500
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
Message-ID: <a0bf3e961002060550y35ac100fhd61a299d9590cf2e@mail.gmail.com>

Marcello's story about big-pipe servers settling on pg   makes me think of a
Borg spaceship passing over a peaceful little village.

At the risk of sounding like Eleanor Clift's response to the Soviets in "The
Watchmen" movie, I'll ask:
So what on earth are these big-pipe servers doing?

Are they generating their own independent collection in case of a collapse
of the internet? Are they engaged in some really inefficient search
algorithm that requires opening every single file?  Are they some Google
wannabe who's indexing your site?  Is it malicious mischief/ DOS?

Or, is it a case of  an "honest" (if cluelessly implemented) demand that
could be met with some more products that could be torrented.  Could that
entity be looking for a MOBI of the top 1000 books, and EPUB of everything
in the German language?


---------- Forwarded message ----------
> From: Marcello Perathoner <marcello at perathoner.de>
> To: Project Gutenberg Volunteer Discussion <gutvol-d at lists.pglaf.org>
> Date: Tue, 02 Feb 2010 08:14:59 +0100
> Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
> Greg M. Johnson wrote:
>
>  I don't think that Google Books at least gets this. I spent so much time
>> at Google Books, browsing in apparently spider-like fashion, that I got this
>> warning:
>>
>>
>>  "We're sorry...
>>
>> ... but your computer or network may be sending automated queries. To
>> protect our users, we can't process your request right now."
>>
>
> That may not be a quetion of getting `it? but of getting `hit?.
>
> gutenberg.org too gets hit by dozens of spiders a day, some of them
> sitting on big pipes and working with up to a hundred threads.
>
> While one of those spiders is at work, a human user can just about forget
> getting anything out of gutenberg.org because all server cycles are used
> to serve the spider.
>
> This is why gutenberg.org automatically denies access to IPs that make
> more than a certain amount of requests per hour.
>
> I think with Google the problem may be even worse than with gutenberg.org.
>  --
> Marcello Perathoner
>


-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100206/fd30389d/attachment.html>

From marcello at perathoner.de  Sat Feb  6 10:00:43 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 06 Feb 2010 19:00:43 +0100
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
In-Reply-To: <a0bf3e961002060550y35ac100fhd61a299d9590cf2e@mail.gmail.com>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<a0bf3e961002060550y35ac100fhd61a299d9590cf2e@mail.gmail.com>
Message-ID: <4B6DAE4B.7000007@perathoner.de>

Greg M. Johnson wrote:
> Marcello's story about big-pipe servers settling on pg   makes me think of a 
> Borg spaceship passing over a peaceful little village.  
> 
> At the risk of sounding like Eleanor Clift's response to the Soviets in "The 
> Watchmen" movie, I'll ask:
> So what on earth are these big-pipe servers doing?

Most of them are collecting innocent-looking phrases to inject into spam 
mails.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Sat Feb  6 14:31:47 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 6 Feb 2010 17:31:47 EST
Subject: [gutvol-d] Re: Psychology of interacting with (Google's) ebooks.
Message-ID: <2dc9e.665ac718.389f47d3@aol.com>

greg said:
>    So what on earth are these big-pipe servers doing?

collecting information, so as to assemble it, and reassemble it,
thereby producing new information that might change the world.
or make money.   lots of money.   lots and lots and lots of money...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100206/e0cdba0a/attachment.html>

From gbnewby at pglaf.org  Sat Feb  6 16:18:50 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Sat, 6 Feb 2010 16:18:50 -0800
Subject: [gutvol-d] Technical measures? (Re: Re: Psychology of interacting
 with (Google's)) ebooks.
In-Reply-To: <4B6DAE4B.7000007@perathoner.de>
References: <a0bf3e961002011838x1f57c7b8q35623f6e2766cf54@mail.gmail.com>
	<a0bf3e961002060550y35ac100fhd61a299d9590cf2e@mail.gmail.com>
	<4B6DAE4B.7000007@perathoner.de>
Message-ID: <20100207001850.GD14117@pglaf.org>

On Sat, Feb 06, 2010 at 07:00:43PM +0100, Marcello Perathoner wrote:
> Greg M. Johnson wrote:
> >Marcello's story about big-pipe servers settling on pg   makes me
> >think of a Borg spaceship passing over a peaceful little village.
> >
> >At the risk of sounding like Eleanor Clift's response to the
> >Soviets in "The Watchmen" movie, I'll ask:
> >So what on earth are these big-pipe servers doing?
> 
> Most of them are collecting innocent-looking phrases to inject into
> spam mails.

Did you ever look into mod_evasive or a similar approach?  It's a good
way of automatically shutting down abusers.  Takes some tuning (a bit
like spam filters).  This is something iBiblio would be happy to help
with, I'm sure.

  http://www.zdziarski.com/projects/mod_evasive/

  -- Greg

From gbnewby at pglaf.org  Sun Feb  7 10:41:51 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Sun, 7 Feb 2010 10:41:51 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <627d59b81002022143k3582d0fam473fcd4a01523749@mail.gmail.com>
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org>
	<627d59b81002021800x472c11e3n634eedd90a840bb6@mail.gmail.com>
	<20100203022055.GA28054@pglaf.org>
	<627d59b81002022143k3582d0fam473fcd4a01523749@mail.gmail.com>
Message-ID: <20100207184151.GA6083@pglaf.org>

On Tue, Feb 02, 2010 at 09:43:07PM -0800, don kretz wrote:
> And on the other end we're hearing the same thing - the GM is there only to
> manage,
> and initiative for change will come from the Board. I'm absolutely not
> suggesting the Board
> is or should be micro or macro managing. I think everyone is expecting that
> the
> Board is about Planning.  You're not? You disagree?

Planning is exactly right.  (Sorry for not responding sooner)
  -- Greg

> On Tue, Feb 2, 2010 at 6:20 PM, Greg Newby <gbnewby at pglaf.org> wrote:
> 
> > On Tue, Feb 02, 2010 at 06:00:48PM -0800, don kretz wrote:
> > > That's real good news, Greg, especially if you're talking about
> > flexibility
> > > on
> > > the DP side. 100% of the responsibility for evaluating and recommending
> > > changes to the DP process has been apparently relegated to the DP Board
> > > of Directors.
> >
> > I don't think that was the intention of the (relatively) new Board and
> > new GM.  The Board has ideas, but isn't trying to manage day to day
> > activity.
> >
> > > Since you are one of the five directors, you're in the know if anyone is.
> > > Since
> > > you represent 20% of the horsepower responsible for coming up with those
> > > changes, I trust you've been busy.
> >
> > Indeed, but actually we have not been looking at this level
> > of detail for changes in the DP processing chain.  The Board
> > isn't to micromange, and isn't to get in the way of progress.
> >
> > That said, if you think there are proposals, ideas for change,
> > etc. that are not getting the attention they deserve, I would
> > be happy to bring them to the board (or GM, as appropriate) on
> > anyone's behalf, anonymously if desired.
> >
> >  -- Greg
> >
> > > On Tue, Feb 2, 2010 at 5:44 PM, Greg Newby <gbnewby at pglaf.org> wrote:
> > >
> > > > On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> > > > > ...
> > > > > If PG were able to easily accept a txt file now and the html version
> > (and
> > > > > other versions later) not only would readers get some books a year
> > > > earlier,
> > > > > but we could probably save some efforts that die and get lost
> > somewhere
> > > > > between txt complete and html complete. Why does posting have to
> > happen
> > > > "all
> > > > > at once" ???
> > > >
> > > > It doesn't.  In fact, "extracting" works from DP earlier was a big push
> > > > I made a couple of years ago.  At that time, such two stage (or other
> > > > great-than-one stage) output was something that didn't fit well with
> > > > the workflow.  Maybe that's something that could be revisited.
> > > >
> > > > It's important to not double the effort involved at the final posting
> > > > phase (whitewashing) through such a two stage process.  But there are
> > > > several good ways of insuring this, which could be incorporated with
> > > > the process.
> > > >
> > > > There is definitely flexibility.
> > > >
> > > >  -- Greg
> > > > _______________________________________________
> > > > gutvol-d mailing list
> > > > gutvol-d at lists.pglaf.org
> > > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> > > >
> > _______________________________________________
> > gutvol-d mailing list
> > gutvol-d at lists.pglaf.org
> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> >

From gbnewby at pglaf.org  Sun Feb  7 10:46:25 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Sun, 7 Feb 2010 10:46:25 -0800
Subject: [gutvol-d] Re: roundlessness -- 002
In-Reply-To: <m2k4uupzu3.fsf@gnu.franken.de>
References: <56f4.5ddd8802.389a167b@aol.com>
	<SNT120-DS2459AD258B41D0F637AB17AE560@phx.gbl>
	<20100203014412.GA26584@pglaf.org> <m2k4uupzu3.fsf@gnu.franken.de>
Message-ID: <20100207184625.GB6083@pglaf.org>

On Wed, Feb 03, 2010 at 08:01:40AM +0100, Karl Eichwalder wrote:
> Greg Newby <gbnewby at pglaf.org> writes:
> 
> > On Tue, Feb 02, 2010 at 05:33:01PM -0800, Jim Adcock wrote:
> 
> > It doesn't.  In fact, "extracting" works from DP earlier was a big push
> > I made a couple of years ago.  At that time, such two stage (or other
> > great-than-one stage) output was something that didn't fit well with
> > the workflow.  Maybe that's something that could be revisited.
> 
> I'm all for it.  In the DP forum, I proposed this several times.
> 
> > It's important to not double the effort involved at the final posting
> > phase (whitewashing) through such a two stage process.  But there are
> > several good ways of insuring this, which could be incorporated with
> > the process.
> 
> Could we give this a try with manually selected books first?  How can we
> make sure that we do not waste the whitewashers' time?

Definitely.  On a trial basis, the extra (or different) workload
isn't such a big concern...we don't need to streamline while we're
trying to experiment.

>From the ww'er side, all you really need is a note with the 
upload that mentions "HTML will be forthcoming later," and then
reference the .txt eBook # when the HTML is finally uploaded.

>From the DP side, it seems that all this takes is an early 
extraction of formatted, proofread text, prior to going to HTML.

I'm sure it's somewhat more complicated than that, due to
various cascading effects and perhaps some hard-coded policy
on workflow, but I hope we all could accommodate some minor
upheaval in the interest of exploration.
  -- Greg

From Bowerbird at aol.com  Sun Feb  7 10:50:29 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 7 Feb 2010 13:50:29 EST
Subject: [gutvol-d] rfrank reports in
Message-ID: <8005.73d837a3.38a06575@aol.com>

ok, rfrank has made a report over in the d.p, forums on 
the latest set of results from his "roundless" experiment.

so let's see what he says, and what i think in reaction...

***

rfrank said:
>    So far in testing the roundless system as it stands, 
>    I've left it to the proofer to say when they thought a page 
>    was done. Turns out, that is reliable for only a very few proofers. 
>    Those who wish to say "I told you so" can chime in now, rightfully.

ok.


>    the post-processing clearable errors were 
>    caused mostly by four proofers, and
>    each of those made several different kinds of mistakes. 
>    These mistakes were found almost entirely 
>    on pages that were one-and-done, that is, proofed once. 
>    So what is to be done?

inform those proofers they are making mistakes, and how,
and that they not doing nearly as well as they think they are.

and to put this into perspective, there are about a dozen
committed proofreaders taking part in this experiment,
with another 5 dozen people contributing fewer pages...

so four "bad" proofers constitutes about 1/3 of the lot...

in other words, even though "4 proofers" sounds _rare_,
the actuality is the percentage of "bad" proofers is high.

this fact should _not_ be surprising.   when you fail to give
people any feedback on their performance, many will think
they're doing a fine job, even if they're doing a terrible job.

(this is a big problem over at the d.p. mothership, but we
probably shouldn't be getting into that can of worms now.)

after all, if they didn't think they were doing fine, they would
change what they were doing, so they _could_ be doing fine.
so you absolutely need to give them good and fast feedback.


>    One solution is to have every page 
>    looked at by at least two proofers. 

that seems straightforward, but it has some gotchas.


>    That seems straightforward but it has some gotchas. 

right.      :+)


>    If every proofer knows that every page 
>    is going to be looked at by someone else, 
>    will they proof that page differently 
>    than if they intended it to be one-and-done? 

it's likely.   so you'd need to assume it, and work from there.


>    I think they might. Knowing the underlying mechanism 
>    can undermine the process. 

well, you must assume people "know the underlying mechanism",
because you want to be open and transparent about it with them.
there's really no other option when you're working with volunteers.


>    Also, what if the second proofer is one of the four mentioned earlier? 


or what if they both were?


>    There is a good chance that many of the errors would slip through. 

right.


>    It's easy for me to change the site code to force two looks at every 
page, 
>    and I'll probably do that, perhaps even with a project in progress. 

doesn't matter.   even after two forced looks, some errors will remain.


>    A down side to the "every page looked at by at least two proofers" 
>    approach is specific to fadedpage: that there are only a dozen or so 
>    active proofers of the 60 or so registered users. The double-look 
>    algorithm adds about 35% to the number of page looks on a project. 

doesn't matter.   there's no need for any haste on the books coming out...


>   A better solution that just a double-look is 
>    to actually instantiate Confidence in Proofer (CiP).

i was afraid you were gonna say that.   and it's absolutely the wrong 
approach.


>   For these four proofers, the system could schedule a second look at 
>    their pages even if they check the "this is done" box when done 
proofing. 
>    It would give them plenty of diffs to look at, and they would be 
>    expected to look at those diffs that show some correction was made. 

well, it'd be better just to inform them and educate them in the first 
place,
rather than impose an "expectation" on them that informs them (indirectly)
and forces them to educate themselves (again, in a very indirect 
fashion)...


>    If diffs were not checked, then their access to new pages 
>    would be reduced. The kind of proofer who checks diffs, 
>    learns, and continues to contribute is exactly what is needed.

well, yeah, maybe...   but you're assuming a real luxury of 
an overabundance of volunteers, and a willingness to throw
a good number of them away as "not exactly what is needed".

it's better to figure out how to find a use for _all_ volunteers.


>   I believe for a roundless system to work, there has to be 
>    a reliable mechanism for stopping a page as done.

d'oh.   there has been complete agreement that that is the issue from day 
1.


>   I also believe that to have a reliable way to make that determination, 
>    some form of Confidence in Proofer needs to be in place.

some people have held that belief, yes.

i think it's unobtainable, and wrongheaded, and basically a dead end.
even if you get a rudimentary version, it won't turn out to be useful...


>    Therefore, CiP, which is important, and page tweets, which are 
>    useful and fun, are currently my main coding efforts at fadedpage.

yeah, well, you'll be coming back sometime down the line and saying
"those who wish to say 'i told you so' can chime in now, rightfully"...

the thread has more, on confidence-in-proofer, but i'm not gonna
waste any more of my time dealing with that flawed concept...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100207/f3db25cc/attachment.html>

From Bowerbird at aol.com  Sun Feb  7 11:12:17 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 7 Feb 2010 14:12:17 EST
Subject: [gutvol-d] how to make roundlessness work, in one brief post
Message-ID: <88b8.2d7d96f9.38a06a91@aol.com>

here's how to make a roundless system work, in a nutshell...

1.   do aggressive preprocessing.
2.   use nonintrusive zen markup.
3.   submit the page to a p1 proofer.
4.   repeat #3 until no change is made.
5.   submit the page to a p1 proofer again.
6.   if a change is made, go back to step #3.
7.   if there is no change in #5, page is done.

if you want it even briefer, do aggressive preprocessing,
and then repeat processing through p1, until you obtain
2 consecutive rounds of no change, and the page is done.

for greater accuracy, or if you have proofers in abundance,
repeat until you get 3 consecutive rounds without change.
(but the increased accuracy isn't worth the increased work.)
for lesser accuracy, stop after 1 round that sees no change,
but the decreased accuracy here is too high a price to pay...

aggressive preprocessing is the secret, because most errors
can be located automatically, so the pages are clean before
they even get to "proofers", who are really "smoothreaders".

this, of course, is the same formula i've suggested for years.
(once you've hit upon the right answer, no reason to change.)

you can easily assure yourselves that this is the right answer;
track how many errors persist through 2 rounds of no-change.
(versus how many persist through 1 round, 3 rounds, 4, etc.)

no need to collect any messy stats.   just 2 rounds of no-diff.
the time you spend exploring other stuff is just wasted time.

just watch.   this is the formula that will prove to be the best.
and when you get around to admitting it, i'll say "i told you so".

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100207/0ff7cdd9/attachment.html>

From Bowerbird at aol.com  Sun Feb  7 14:13:45 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 7 Feb 2010 17:13:45 EST
Subject: [gutvol-d] roundlessness -- 003
Message-ID: <cb62.4e38a06d.38a09519@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

***

ok, it's been a while since i went through this little drill, so 
i'm gonna give you a little refresher-course on where i stand
on a workflow that distribution proofreaders could be using,
particularly from the viewpoint of rfrank's roundless research.

***

first off, i would substantially re-ramp the "preprocessing"...
this is the first step, where you're doing the scanning and o.c.r.,
or fetching the results from one of the major scanning projects.

we'll start at the very beginning...

***

i've advocated -- strongly -- a filenaming system that indicates,
for each file, the contents of the file, so the name is meaningful.
specifically, the _page-number_ of the file should be included
in a consistent way in the name of a file.   thus, for instance:
>    http://z-m-l.com/go/myant/myantp123.jpg

the image-file for page #123 in the book "my antonia" is named
myantp123.jpg.   the "myant" prefix is common to files for the book,
and the "p123" part refers to page 123.   this is so straightforward
that you couldn't be faulted for assuming that it's taken for granted.

but it's not.   many of the content-providers over at d.p. name their
files differently, so the filename is _not_ unambiguously tied to the
page-numbered contents inside; and a price is paid for this unclarity.

yes, rfrank is one of these ill-advised content-providers, and he has
carried over this bad habit to this experiment on roundless proofing.

this shortcoming is even worse in a roundless system, because there
is a need to refer to specific pages in such a system, and the absence
of a sensible naming-scheme therefore becomes a bigger problem...
(in a round-based system, where all the pages are treated as a "batch",
the problem isn't as bad, but it's still something that should be fixed.)

***

next, there are a number of things that are done in "preprocessing",
at distributed proofreaders and by rfrank in his roundless experiment.

some of these things should _not_ be done, in my considered opinion.
on the other hand, there are other things that _should_ be being done.

these are some things that are done which should _not_ be done:
1.   run-heads and page-numbers are eliminated.
2.   end-line hyphenates are being joined.
3.   end-line em-dashes are being "clothed".

these are some things that _should_ be done, which are not:
1.   obvious and easily-located problems should be fixed.
2.   spacey-quotes should be fixed.
3.   ellipses should be standardized on a 3-dot ellipse.

we can engage in debate on all of these suggestions, but it will be
instructive for us to see some of the o.c.r. errors i'm talking about.

appended to this post is a list of bad words or lines that were pulled
from a book currently being proofed at rfrank's roundless experiment.
these are almost all errors that will certainly need to be fixed.

aggressive preprocessing can find these errors without looking for them.
that's very important, in a roundless system, because if you can find and
fix all the errors _before_ a page is subjected to a word-by-word review,
then that word-by-word review can become the first "no-diff" in a chain of 
"no-diff" reviews, and the shorter that chain is, the more efficient you 
are.

in the other scenario, if the page is dirty, you might have to have 1 or 2
(or even 3) proofings before the page is clean enough to receive a no-diff,
meaning your efficiency has plummeted.

there's no reason to make a human _search_ for errors when those errors
can be located quickly and efficiently by a computerized search routine...

-bowerbird

p.s.   sorry i've been sluggish with this series.   i thought i'd been away
from this stuff long enough that it wouldn't bore me to do it all again,
but so far it has been a drag...   i can only repeat this so many times...
i'll try to get the motivation back again, but no promises if i cannot...

p.p.s.   here's that list of "probable errors" pulled out of rfrank's book:

of'more
Fd say
pver
t'other
Enew
a'slipped
pn
curtiss-robih
somethin'to
ght
asighin'
buncft
whick
mother'Ship
ground^igood
hefteaw
iny
wonderf
^Devolutions
punkins
outen
tpreviously
bustl
gun'ls
thaft
ag^ny
sud-v
denly
ij
J>ack
jumpin'his
blame'em
Jiack
apture
twise
Weil
numba
chorteled
pvounded
Wowl
oaly
^{w}Oh!
haulded
uae
vre
knovt
apeak
gink
givin*
stretch.'*
J'ack
wheen
clost
you'l
althought
eyefull
weuns
I"<cr>
oa
pHot
etchin'
jest'magine
iframe
ha.nd
you'fe
valk
a'been
outen
morc'n
MCGrath
tc
Unc'
wuss'n
pizen
fresfi
hirn
orr
hinsel|


But you never can tell just what may hap]7^{en
sion of weird noises springing up from the g<9^{a}^
74-*
/'    161
17S
"Working over a bird with red feathers,'^{1
fall for such a decent game 93 taxidecentry or
18?
was the work of a few second3. Hardly had the
THE COMEBACK     227
HUNT OF THE S-18


belonged to the Hun pHot, Oscar Gleeb.
must be pesU, at Jn' you like all get-out, so I made
cruel to keep me a'guessing any longCi.
so anxious to get started oA their way
<i> Porter Press </i>   disappears from an airpnQp


.Watch my smoke, that's all."
.until finally it died away completely. This gave
.keep your eyes fastened on him. Whatever
. A fortune hangs in the balance when young Dan Tierney, press

, "Gripes! that was worth somthin'to glimpse,

; for Perk valued a few words of praise
;whooped the delighted Perk as he squatted


pn you to hold my own. That's j^t how it should
say,^{r} she bust out o' that little fog cloud right
such as would tell ^ business being put through
rendezvous and^ it's our game to chase after them,
light, whether <i>^{r</i>}'and or gulf, the chances were
holding ground^igood--a heavily laden sailing
path, and going through the most wonderf ^Devolutions.
and be ready to pounce down on their inten<J^{e}d
prey after the fashion of a hungry eagle striki <i>^{n}
a fat</i> duck that had been selected out of the flc^{c}k

One thing he did do was to cut his intend^
wide circle short and again head toward ^{ne
}scene of action, a move that certainly afford^
the eager Perk more or less satisfaction, he bel^{n}
thrilled with the expectation of breaking into <i>th^{e
}game</i> without much more loss of time.

But you never can tell just what may hap]7^{en
}when rival forces are striving against one ^^{n}'
other. The best laid plans often go wrong a^{n}^
there was always a chance of the unexpected
happening.

Hardly had the airship whipped around ag^^{n
}so as to head into the north than Perk beca<"^{e
}aware of the fact that there was a sudden acc^{es}'
sion of weird noises springing up from the g<9^{a}^
toward which they were now aiming. Jack, t^
must have caught the increased volume, for ^{ne
}sheered off as if to hold back a bit so as to gr^^{s}P
the meaning of the new racket.

Men were no longer simply talking or laug^h"
ing as they so cheerfully labored in transferring
some of the contraband from the sloop to ^{ne
}deck of the speedboat--their voices were rai^^{e}^
to shouts in which surprise, even the element ^


the frenzied sufferers in their ag^ny had been
^{w}Oh! That can be put through without muck
said that name exactly three tin^s, like it meant
operator as Oswald Kearns pick oui^an ordinary
would rather have Jack praise him than ^ny one
^{r}fully five feet long and as thick through the body
him, no matter where he goes-^-sorter dude, I'd
will y^{u}> boy--two--three fellers jest swarmed
"Working over a bird with red feathers,'^{1??
M^ans our gent has a raft o' ships comin' an'
through fire or some similar means of destruc^
our man ditto. Mebbe now I'll soon^{x}get a chance


"Gosh, amighty, we're flyin'\Mgh, buddy!"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100207/7fa1a90d/attachment-0001.html>

From jimad at msn.com  Mon Feb  8 14:30:18 2010
From: jimad at msn.com (Jim Adcock)
Date: Mon, 8 Feb 2010 14:30:18 -0800
Subject: [gutvol-d] Re: Formats and gripes
In-Reply-To: <4B6923EA.6050700@perathoner.de>
References: <a0bf3e960912050813n481ef547uea27caec910d186b@mail.gmail.com>	<a0bf3e960912050826kfff7dfatd8af522a6d79d90c@mail.gmail.com>	<a0bf3e960912050846t48b2e462x7dba2bf5ec546174@mail.gmail.com>	<a0bf3e960912050901k661817f6ie8b415826e6d151d@mail.gmail.com>	<20100124200541.GH27785@pglaf.org>	<SNT120-DS20D464376EA57A5412A1A3AE5D0@phx.gbl>	<4B60A4D4.8050701@perathoner.de>	<SNT120-DS1804AAB8140071074F5058AE5C0@phx.gbl>	<4B613AB1.4020302@perathoner.de>	<SNT120-DS21BABAB523D5DA00B3F462AE5C0@phx.gbl>	<4B61D55B.8010405@perathoner.de>	<SNT120-DS18A6E549EA0695AEC6EE0BAE5A0@phx.gbl>	<4B65299A.7060304@perathoner.de>	<SNT120-DS5ECCE5CBDC3F823133677AE580@phx.gbl>	<SNT120-DS212FD69C5A8451715FB5A0AE580@phx.gbl>	<4B6731F7.8050503@perathoner.de>	<SNT120-DS8C66C8C8D717F5D4029B6AE580@phx.gbl>	<4B67421F.7090500@perathoner.de>	<SNT120-DS22E9FF38587673C9FDAC82AE580@phx.gbl>	<4B683B58.90800@perathoner.de>	<SNT120-DS1CF5E7A50DDDCAAB0BED5AE560@phx.gbl>
	<4B6923EA.6050700@perathoner.de>
Message-ID: <SNT120-DS12AE1275DA7D1557EFC2D5AE510@phx.gbl>


>But YOU are perfectly free to volunteer your time to save Amazon some 
bucks: Take my epubs, patch them, and convert them to mobis that display 
the toc when you hit the toc button, and redistribute them on your site.

Again, the OPF file format specifies that TOC and NGX are separate things,
intended for separate purposes.  ADE screws the pooch on this one, providing
one interface for both the TOC and NGX.  Given that ADE screws the pooch,
you and many other epub implementers make a pragmatic decision to target ADE
for your generated epubs, generate a NGX, and leave out the TOC.  Kindle
expects to see both TOC and NGX and uses them for distinct purposes -- as
OPF specifies! If Kindle were to emulate ADE's mistakes, then there could
also not be a separate TOC and NGX on Kindle, fulfilling their separate
purposes as specified in the OPF specification.

It is not a question of wasting your time or wasting my time but rather
wasting PG "customers" time because what PG is providing today to customers
is broken.  It would be less broken if there weren't a lot of extra PG
verbiage at the start of books making it harder for customers to find the
embedded HTML TOC which would allow the customers to navigate to that which
they want to read. But it would be even better if customers could push the
"TOC" button on their machine and have a TOC actually displayed -- as
happens with "real" e-books.

Further, the whitewashers' typically require an additional "real" TOC to be
implemented in the HTML, which then is also not actually being used,
resulting in additional wasted time and energy on the part of the
volunteers. PG for example, could simply adopt a convention that a TOC.html
be shipped with a submitted HTML, linked into that doc, and then you could
link to that TOC.html, and the time and effort that whitewashers are asking
of submitters then would not be wasted.

It would be fine if you don't want to do it right if PG would allow
submission of generated EPUBs and MOBIs so submitters could choose to do it
right and not have their time and efforts wasted.  But, you don't allow that
to happen either!

Again, all that these policies today accomplish is that customers get
frustrated with PG, take PG books apart and rebuild them properly -- taking
off the PG name and verbiage in the process, and redistribute them on other
sites.  I just think it would be nice if PG volunteers get recognized for
the time and effort they contribute by having people actually read "PG"
books, as opposed to rebranded ex-PG books where the name has been taken off
so customers don't even realize they are reading something which is 99% PG
efforts.


From jimad at msn.com  Tue Feb  9 11:39:39 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 9 Feb 2010 11:39:39 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <8005.73d837a3.38a06575@aol.com>
References: <8005.73d837a3.38a06575@aol.com>
Message-ID: <SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>

>   If every proofer knows that every page 
>   is going to be looked at by someone else, 
>   will they proof that page differently 
>   than if they intended it to be one-and-done? 

Under the *current* DP system everyone knows that everything being done is
also going to be worked on by about six other people. The hard part then is
getting anyone to feel "ownership" about anything -- particularly about
getting something *done*.

Automatic scoring of proofing efforts and automatic reporting back of
scannos that slip by that other people find -- without making a "big deal
value judgement" about those that slip by might make a positive
contribution.

Getting more people who care to read the finished or almost finished product
and providing an easy and convenient way to give feedback on bugs found, or
god forbid to be able to actually fix those bugs directly might also make a
contribution.


From jimad at msn.com  Tue Feb  9 11:48:15 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 9 Feb 2010 11:48:15 -0800
Subject: [gutvol-d] Re: how to make roundlessness work, in one brief post
In-Reply-To: <88b8.2d7d96f9.38a06a91@aol.com>
References: <88b8.2d7d96f9.38a06a91@aol.com>
Message-ID: <SNT120-DS1E5101D000FBA6EB5787EAE500@phx.gbl>

>aggressive preprocessing is the secret, because most errors can be located
automatically, so the pages are clean before they even get to "proofers",
who are really "smoothreaders".

Agreed with this part at least -- many motivated "early readers" love a
particular author, and would be happy to get early access to the text via
some kind of tool that allowed them to fix or at least mark the bugs they
find as a part of their reading.  "Marking" bugs as a part of reading could
be as simple as asking them to read on a notepad or what have you and put a
Q-mark in the text where they think they see a bug.  Then diff their back
submission to find the bugs that need to be fixed. Readers of e-books could
even back-submit a "bookmarks" file that tags where errors were seen
allowing proofing to be done on any e-book reader.


From klofstrom at gmail.com  Tue Feb  9 11:56:59 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Tue, 9 Feb 2010 09:56:59 -1000
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
Message-ID: <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com>

On Tue, Feb 9, 2010 at 9:39 AM, Jim Adcock <jimad at msn.com> wrote:

> Under the *current* DP system everyone knows that everything being done is also going to be worked on by about six other people. The hard part then is getting anyone to feel "ownership" about anything -- particularly about getting something *done*.

Jim, this is unfair to DP and to those of us who work there. I'm a
high-count proofer in P3. I do care about finishing off books ...
indeed, I'm a member of P3 Archers, a team that works to "shoot down"
books that are almost-but-not-quite finished (we completed 27 projects
last week). I did my share of slogging on the Baburnama, a nightmare
project with lots of diacritic-spattered Turki, as well as other
mouldie oldies. I also care about the quality of my work. I can't be
sure that a formatter or a PPer is going to catch an error if I miss
it in P3. I spellcheck and if I'm not sure of a word, look it up in
OneLook online dictionary.

I'm not sure that the current system at DP is the best possible, but I
also know that various groups are experimenting with other workflows.
It's a Rube Goldberg contraption in some ways, but it does keep
putting out the books: more than 17,000 at last count.

--
Karen Lofstrom

From jimad at msn.com  Tue Feb  9 13:27:23 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 9 Feb 2010 13:27:23 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com>
Message-ID: <SNT120-DS111F8A4A13321CC8EF3D4EAE500@phx.gbl>

>> Under the *current* DP system everyone knows that everything being done
is also going to be worked on by about six other people. The hard part then
is getting anyone to feel "ownership" about anything -- particularly about
getting something *done*.

>Jim, this is unfair to DP and to those of us who work there. I'm a
high-count proofer in P3. I do care about finishing off books ...

I have two books, highly requested, in DP that I spent about 40 hours each
getting them into DP and where they have been moldering for almost a year
now. They are "stuck" and there is no way to get them unstuck and the txt
has been "ready to go" from almost the beginning.  Again, the txt part,
including P1, P2, P3 is the easy part of the problem, and is working
relatively well compared to the rest of the DP process. This compares, for
example, that I can personally crank out a book -- perhaps not quite as good
as DP -- taking about the same 40 hours *total*, and can get it done
including HTML in less than a month elapsed time including god knows how
many family emergencies intruding on my efforts.  I *try* to take ownership
of these books at DP but am prevented in doing so by the system and the
management -- god knows if I were allowed to do so I would personally have
finished them off a half a year ago! A fundamental part of the DP problem is
that the "design" (if you want to call it that) of the queuing system
doesn't work.  Another part of the problem, frankly, is the disproportionate
amount of time spent on books that are very complicated, poorly scanned, and
not very good choices to begin with -- meaning simply that they are books
when all is said and done that not that many people are going to want to
read. Under the current system bad ideas are allowed to consume a
disproportionate amount of everyone's time and effort -- but isn't that true
of life in general!


From ajhaines at shaw.ca  Tue Feb  9 13:40:30 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 9 Feb 2010 13:40:30 -0800
Subject: [gutvol-d] Re: rfrank reports in
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
Message-ID: <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>

Re:

>Getting more people who care to read the finished or almost finished 
>product
>and providing an easy and convenient way to give feedback on bugs found, or

DP-US and DP-Canada both have a smooth-read facility, with instructions on 
how report problems.


>god forbid to be able to actually fix those bugs directly might also make a
>"contribution.

Allowing the hoi polloi, as it were, to "fix bugs" is a sure-fire way of 
introducing errors.  I occasionally have to disallow an errata-reported 
error because the reporter wasn't aware that a word was, in fact, valid. 
For example, "ancle" is a valid, albeit archaic, variant of "ankle", and is 
not an error.  But, if it's a typo/scanno for "uncle", it is.  I've also 
handled reported errors where the error was real, but the suggested 
correction was incorrect.


Al


----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Tuesday, February 09, 2010 11:39 AM
Subject: [gutvol-d] Re: rfrank reports in


>>   If every proofer knows that every page
>>   is going to be looked at by someone else,
>>   will they proof that page differently
>>   than if they intended it to be one-and-done?
>
> Under the *current* DP system everyone knows that everything being done is
> also going to be worked on by about six other people. The hard part then 
> is
> getting anyone to feel "ownership" about anything -- particularly about
> getting something *done*.
>
> Automatic scoring of proofing efforts and automatic reporting back of
> scannos that slip by that other people find -- without making a "big deal
> value judgement" about those that slip by might make a positive
> contribution.
>
> Getting more people who care to read the finished or almost finished 
> product
> and providing an easy and convenient way to give feedback on bugs found, 
> or
> god forbid to be able to actually fix those bugs directly might also make 
> a
> contribution.
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From jimad at msn.com  Tue Feb  9 14:04:06 2010
From: jimad at msn.com (James Adcock)
Date: Tue, 9 Feb 2010 14:04:06 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
Message-ID: <SNT120-DS24A947851758312487F600AE500@phx.gbl>

>DP-US and DP-Canada both have a smooth-read facility, with instructions on 
how report problems.

Below describes how one might get started doing smooth reading if anyone
cares to see in part what the problem might be:

http://www.pgdp.net/wiki/Smooth-reading_FAQ

If there were a tie-in between PG and DP to let people in general know when
SR is happening DP might get more SRs. A list of books "on deck" if you will
and how to get them.

>Allowing the hoi polloi, as it were, to "fix bugs" is a sure-fire way of 
introducing errors.  I occasionally have to disallow an errata-reported 
error because the reporter wasn't aware that a word was, in fact, valid. 
For example, "ancle" is a valid, albeit archaic, variant of "ankle", and is 
not an error.  But, if it's a typo/scanno for "uncle", it is.  I've also 
handled reported errors where the error was real, but the suggested 
correction was incorrect.

This would be a problem that DP already has because in my experience many a
P3 "knows" so well how to do their job that they never bother to
double-check what it is the author actually wrote or that which the
publisher actually published -- which in practice turns them into gold
plated SRs.

Don't get me wrong, DP has many excellent dedicated people at all levels,
including all levels of P1, P2, P3 -- its just that moving up the ranks
doesn't necessarily mean people are actually getting any better at what they
are doing. And the queuing system guarantees that the upper level "experts"
are going to be overloaded.


From klofstrom at gmail.com  Tue Feb  9 14:57:25 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Tue, 9 Feb 2010 12:57:25 -1000
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS24A947851758312487F600AE500@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
Message-ID: <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>

On Tue, Feb 9, 2010 at 12:04 PM, James Adcock <jimad at msn.com> wrote:

> This would be a problem that DP already has because in my experience many a P3 "knows" so well how to do their job that they never bother to double-check what it is the author actually wrote or that which the publisher actually published -- which in practice turns them into gold plated SRs.

And how would you know this? Long experience as a formatter or PPer?

--
Karen Lofstrom

From prosfilaes at gmail.com  Tue Feb  9 15:01:44 2010
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 9 Feb 2010 18:01:44 -0500
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS111F8A4A13321CC8EF3D4EAE500@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com>
	<SNT120-DS111F8A4A13321CC8EF3D4EAE500@phx.gbl>
Message-ID: <6d99d1fd1002091501v33bfdcc5q699d1af4ea3590fc@mail.gmail.com>

On Tue, Feb 9, 2010 at 4:27 PM, James Adcock <jimad at msn.com> wrote:
>?Again, the txt part,
> including P1, P2, P3 is the easy part of the problem, and is working
> relatively well compared to the rest of the DP process.

Since when has it been okay to toss out italics, indentation of poetry
and proper footnotes in the text file?

-- 
Kie ekzistas vivo, ekzistas espero.

From Bowerbird at aol.com  Tue Feb  9 15:56:44 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 9 Feb 2010 18:56:44 EST
Subject: [gutvol-d] Re: rfrank reports in
Message-ID: <19a40.fc4738.38a3503c@aol.com>

jim said:
>    Under the *current* DP system 

i'm steering clear of discussing the d.p. system right now;
there's so much cruft over there it's not worth the trouble.

i'm either discussing rfrank's experiment, or my own system.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100209/078c213f/attachment.html>

From Bowerbird at aol.com  Tue Feb  9 16:08:33 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 9 Feb 2010 19:08:33 EST
Subject: [gutvol-d] Re: how to make roundlessness work, in one brief post
Message-ID: <19fdb.2d582660.38a35301@aol.com>

jim said:
>    many motivated "early readers" love a particular author, 
>    and would be happy to get early access to the text via
>    some kind of tool that allowed them to fix or at least mark 
>    the bugs they find as a part of their reading.? "Marking" bugs 
>    as a part of reading could be as simple as asking them to 
>    read on a notepad or what have you and put a
>    Q-mark in the text where they think they see a bug.

while you seem to be talking about smoothreading here,
the text you quoted from me was about preprocessing...

preprocessing happens before the text goes to any proofer
-- it's scheduled immediately after o.c.r. has been done --
and it doesn't require reading of _any_ kind at all, which is 
why it is about fourteen times more efficient than proofing.

a preprocessing tool finds glitches that are almost certainly
errors, and takes you to them directly in the text-file while
displaying the appropriate scan for referral, and often even
gives you buttons that will perform the desired correction...
some glitches (like spacey quotes) can even be auto-fixed.

i have demonstrated here that, for several books i tested,
this preprocessing would take the errors down to a rate of
less-than-one-error-every-10-pages, which makes the
word-by-word proofing rounds almost like smoothreading.

check out dkretz's "twisted" tool to see an example of this...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100209/228b4467/attachment.html>

From ke at gnu.franken.de  Tue Feb  9 19:59:20 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Wed, 10 Feb 2010 04:59:20 +0100
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS111F8A4A13321CC8EF3D4EAE500@phx.gbl> (James Adcock's
	message of "Tue, 9 Feb 2010 13:27:23 -0800")
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com>
	<SNT120-DS111F8A4A13321CC8EF3D4EAE500@phx.gbl>
Message-ID: <m2hbpp91wn.fsf@gnu.franken.de>

"James Adcock" <jimad at msn.com> writes:

> many family emergencies intruding on my efforts.  I *try* to take ownership
> of these books at DP but am prevented in doing so by the system and the
> management -- god knows if I were allowed to do so I would personally have
> finished them off a half a year ago! A fundamental part of the DP problem is
> that the "design" (if you want to call it that) of the queuing system
> doesn't work.

I also consider this a serious defect.  IMO, it must be possible, if
someone want to work on a book, to "activate" it (= unlock it
from a waiting state).

> Another part of the problem, frankly, is the disproportionate amount
> of time spent on books that are very complicated, poorly scanned, and
> not very good choices to begin with -- meaning simply that they are
> books when all is said and done that not that many people are going to
> want to read. Under the current system bad ideas are allowed to
> consume a disproportionate amount of everyone's time and effort --

I'm always wondering why people work on books they are not interested
in...

> but isn't that true of life in general!

Probably ;)

-- 
Karl Eichwalder

From jimad at msn.com  Tue Feb  9 20:59:30 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 9 Feb 2010 20:59:30 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <m2hbpp91wn.fsf@gnu.franken.de>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>	<1e8e65081002091156p2bc4c1eco7d229047823a8e46@mail.gmail.com>	<SNT120-DS111F8A4A13321CC8EF3D4EAE500@phx.gbl>
	<m2hbpp91wn.fsf@gnu.franken.de>
Message-ID: <SNT120-DS3E80B5C7681D8F56C3FD5AE4F0@phx.gbl>

>I'm always wondering why people work on books they are not interested
in...

Because the "good books" don't get released from the queue until these other
ones get finished, and because many people who volunteer for DP are
incredibly open hearted.


From jimad at msn.com  Tue Feb  9 21:12:22 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 9 Feb 2010 21:12:22 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
Message-ID: <SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>

>And how would you know this? Long experience as a formatter or PPer?

It's not hard to see, actually, when a P3 or others make changes which don't
match the page images.  You just have to actually look at one and then the
other.  I've seen many great proofers at all of the P1, P2, and P3 levels. I
have also seen "well reputed" P3's who turn out results that don't match the
page images. The text they create scans perfectly fine, it's just that it's
not what the author wrote -- particularly when it comes to punctuation. Best
way to get great results is to have people working on a text and an author
they absolutely love, not just cranking out the numbers.


From jimad at msn.com  Tue Feb  9 22:59:46 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 9 Feb 2010 22:59:46 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>	<SNT120-DS24A947851758312487F600AE500@phx.gbl>	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
Message-ID: <SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>

>>And how would you know this? Long experience as a formatter or PPer?

>It's not hard to see, actually, when a P3 or others make changes which
don't match the page images. 

I just checked my previous claims about the problems with P3 (for example)
against a pretty straight-forward text.  I reviewed 200 pages from that
text.  P3's made 38 changes on those pages.  Of these changes 7 changes
represented a positive contribution towards making the txt correct. Of those
positive changes about a half could easily be found by a simple tool like
guiguts. 10 of the changes introduced by the P3's were negative changes --
changes that moved the text to a less perfect state. The remaining 21
changes were basically "null changes" relating to established DP procedure,
which neither really made the txt any better nor any worse. Most of the
negative changes were relating to punctuation, as I previously claimed.
Again, it's really hard for the human mind to accept that the best thing to
do when things aren't broken is to leave them alone -- people really want to
make a "positive contribution" by changing things.

By my calculation DP is cranking out an average of 194 books a month --
which is impressive.  But consider some of the upper level queue times:

2000 books stuck in P3 = 10.3 Months stuck in P3

2840 books stuck in F2 = 14.6 Months stuck in F2

2562 books stuck in PP = 13.2 Months stuck in PP

Total 38 Months, about 3 years waiting on these higher level queues, which
means it takes about three and a half years in total for a book to get
through DP nowadays? -- And getting longer every day.

Seems "pretty obvious" to me looking at the DP "red bar" graph at
http://www.pgdp.net/c/activity_hub.php that the P3, F2 and PP efforts are
"out of control."  Which doesn't mean that one should admonish the troops to
do better.  Rather, it means that the process needs to be redesigned to fit
the resources actually available -- somehow you have to move more people
into the roles currently labeled "P3, F2, and PP" or you have to redesign
things to make their jobs MUCH faster and easier, or you have to redesign
the process, or redesign the goals of the organization.  I'm not saying this
is good or this is bad -- I'm just saying that this is obvious! You cannot
indefinitely run an organization that takes more orders in the front door
than you ship out the back door -- no matter how big hearted you are.


From klofstrom at gmail.com  Tue Feb  9 23:25:55 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Tue, 9 Feb 2010 21:25:55 -1000
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
Message-ID: <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>

On Tue, Feb 9, 2010 at 8:59 PM, Jim Adcock <jimad at msn.com> wrote:

> I just checked my previous claims about the problems with P3 (for example) against a pretty straight-forward text. ?I reviewed 200 pages from that
text. ?P3's made 38 changes on those pages. ?Of these changes 7 change
represented a positive contribution towards making the txt correct. Of
those
positive changes about a half could easily be found by a simple tool
like guiguts. 10 of the changes introduced by the P3's were negative
changes --
changes that moved the text to a less perfect state.

I'd have to look at them before trusting you on this, as you seem to
have an extremely negative, fault-finding attitude towards DP. I
wonder if you'd count my occasional bracketed comments, such as
[**P3--seems to be a mistake in the original; s/b ;], as errors.

> The remaining 21  changes were basically "null changes" relating to established DP procedure,
which neither really made the txt any better nor any worse.

Nonetheless, they were useful to the formatters and PPers as making
the text predictable.

> ... which means it takes about three and a half years in total for a book to get through DP nowadays? -- And getting longer every day

None of us likes that! Yes, the current round system is broken.

It produces better texts than the old 2-round system. Some of the
second-round proofers in those days wanted page count and didn't give
a #$%@$#% about accuracy. The results were as dismal as you would
expect.

However, we've now producing very good texts at an enormous cost.
We're discussing further changes. It doesn't particularly help, when
one is drowning and flailing about for a handhold, to have a bystander
jumping up and down, shouting, "You're drowning, you idiot!"

--
Karen Lofstrom

From Bowerbird at aol.com  Wed Feb 10 14:02:30 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 10 Feb 2010 17:02:30 EST
Subject: [gutvol-d] Re: rfrank reports in
Message-ID: <134f7.6b712356.38a486f6@aol.com>

jim said:
>   I just checked my previous claims about the problems with P3 
>    (for example) against a pretty straight-forward text. ?
>    I reviewed 200 pages from that text. ?
>    P3's made 38 changes on those pages. ?
>    Of these changes 7 change represented a positive 
>    contribution towards making the txt correct. 
>    Of those positive changes about a half could 
>    easily be found by a simple tool like guiguts. 
>    10 of the changes introduced by the P3's were negative changes 
>    -- changes that moved the text to a less perfect state.

jim, and others, if you're going to continue to discuss the
problems in the current system at distributed proofreaders,
please do it in your own thread, and not in my threads, ok?

as for your findings, jim, it helps to report the actual lines.

but it's probably not necessary.

in the past, i have documented the same things you report,
in great detail, in book after book after book, with evidence.

in comparison, your anecdotal reports are relatively flimsy.

i'm not saying you're wrong.   indeed, you're absolutely correct.
i'm just saying your reports are not going to convince anyone.
heck, there are people here who refused to believe what i said,
in spite of the fact i piled up enough evidence to choke a horse.

(nor did i _create_ the evidence; i used data taken directly from
various experiments, performed over at d.p. by other people...
the truth is out there, and easy to find, if you just care to look.
this is what is so silly about all these "experiments".   i've told
everyone here the simple correct answers, so all that's needed
is to test these simple hypotheses and see they _are_ correct.
but instead people are testing overly complicated stuff in ways
that are not definitive, leading them to become more confused.)

***

perhaps the most impressive findings of my results were these:

1.   the best way to know a page is "finished" is when proofers
stop making changes to it...   up to that point, it's not finished!
it's the "best" way to know because a no-diff is easy to measure.

2.   3 rounds of p1 were as effective as a series of p1-p2-p3.
(simple solution to queue problems?   run the text through
p1 until every page comes out repeatedly as a no-diff page.)

3.   the third round of p1 found as many additional errors
as the p3 round, but _neither_ route found _all_ the errors.
the p1(3) proofers found errors that the p3 proofers missed,
and the p3 proofers found errors the p1(3) proofers missed,
plus there were other errors that both p1(3) and p3 missed.
(takeaway: the p1 proofers are not inferior to the p3 proofers,
and a "make the page better" philosophy will eventually work
to "create a perfect page", without all the attendant pressure.)

4.   "parallel" p1 was _not_ useful at turning up any more errors,
but it might have value to determine a page is "done", although
more research would need to be done to test that hypothesis...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100210/15a5ab41/attachment.html>

From jimad at msn.com  Wed Feb 10 17:57:31 2010
From: jimad at msn.com (Jim Adcock)
Date: Wed, 10 Feb 2010 17:57:31 -0800
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>	<SNT120-DS24A947851758312487F600AE500@phx.gbl>	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
Message-ID: <SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>

>However, we've now producing very good texts at an enormous cost.
We're discussing further changes. It doesn't particularly help, when
one is drowning and flailing about for a handhold, to have a bystander
jumping up and down, shouting, "You're drowning, you idiot!"

If I'm a bystander it is because the texts that I have submitted to DP which
I thought I would be working on have been frozen in the queues by DP for the
last year.  What you suggest instead is that I also jump into the pool and
start flailing around with you.  Been there done that, got tired of it,
climbed back out of the pool. Flailing harder or faster or throwing more
people into the pool really isn't going to help.

I have made what I consider many positive suggestions, any of which simply
invoke anger and defensiveness on the part of DP'ers:

One of which is post the text after P3 rather than waiting to finish PP.
This would make about an additional 4,000 texts available on PG.  If one
counts volunteer hours worth $10/hr this represents an "unfinished
inventory" of about $2,000,000. If you value PG downloads similar to
Amazon's minimum cost of $1 a book, then these 4,000 would generate about
$150,000 a year in additional value to society.

Other obvious suggestions would be to adjust your "experience" thresholds
and testing methods for admittance to P3, F2, PP in order to allow a bit
more people into these areas and see how much it *really* hurts your quality
and productivity -- or not! Fundamentally it is the unbalanced number of
people allowed into the upper rounds (or rather not allowed into the higher
rounds) which is killing you.  Further, any tools that you can offer P3, F2,
or PP to make their lives easier would help you greatly.

Another suggestion I have made is to do what many other commercial
digitizers of text using human beings do: Run two humans in parallel on the
same text and then diff the results. If you get a diff on some page run a
third person and vote the results. If you were to double up on the P1 and P2
efforts like this that would help the P3 queue. If you doubled up the F1
efforts that would help the F2 queue.  Don't know how to help the PP queue
except I don't understand why you allow almost finished texts to be stuck
moldering in the hard-drives of one PP'er so long. If a PP'er just can't get
it done -- take it away and assign it to someone else. Doesn't matter how
good or experienced a PP'er is if they just can't get it done.

Another suggestion is to auto score proofers and formatters efforts and
automatically assign them to the place in your process where their level of
abilities will do the most good -- or at least the least damage. It is easy
to auto score the P1, P2, and F1 efforts -- it is basically the ratio of the
number of fixes that they make divided by the number of fixes made on the
same pages by the successive round.  Have the P3s and F2s "retest" on a P2
or F1 round occasionally so that you can autoscore whether they still know
what they are doing or not.

Another suggestion would be to update the toolset being used to make them
more fun, less time-wasting, and less tweaky. Simple common tasks ought to
be simple, unpainful, and fast.  

Allowing higher rez page scans for the people with the bandwidth to handle
them would make all the rounds easier.

Another suggestion would be to get PG to allow one to query on how many
downloads various texts are getting, so that people who are submitting texts
to DP which aren't getting read might get some feedback about what their
actions is really accomplishing, or not.

Modifying bowerbird's suggestions slightly, there *are* at least some texts
that fit pretty well into template forms, such as some simple novels.
Perhaps an automated or semi-automated tool for turning these simpler texts
into HTML quickly?

Another obvious suggestion is that there are too many texts in the world to
take them all on.  Are the readers of PG really interested in "Annals of the
Annual Proctology Meeting of 1847" ? Is there at least some way to try to
discourage really bad ideas? Looking at the actual text of the English
language submissions in P1 right now it looks to me that about half of them
have a reasonable chance of being read.  Is there any way to more actively
promote the acquisition and prioritizing of texts that are generally
recognized as being "better than average" aka "famous" or at least "well
known"?

Another obvious suggestion would be to empower PM's to have at least one
active project where if that project gets stuck they are allowed to take
whatever actions necessary to get it unstuck....


From prosfilaes at gmail.com  Wed Feb 10 19:14:05 2010
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 10 Feb 2010 22:14:05 -0500
Subject: [gutvol-d] Re: rfrank reports in
In-Reply-To: <SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
Message-ID: <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>

On Wed, Feb 10, 2010 at 8:57 PM, Jim Adcock <jimad at msn.com> wrote:
> One of which is post the text after P3 rather than waiting to finish PP.

To which I pointed out that this would in many cases result in the
posting of severely deficient texts. Formatting is important.

>?Don't know how to help the PP queue
> except I don't understand why you allow almost finished texts to be stuck
> moldering in the hard-drives of one PP'er so long. If a PP'er just can't get
> it done -- take it away and assign it to someone else. Doesn't matter how
> good or experienced a PP'er is if they just can't get it done.

Because sometimes it may be worth letting a text molder rather then
preemptorially ripping it out of someone's hands and annoying the hell
out of them.

> Perhaps an automated or semi-automated tool for turning these simpler texts
> into HTML quickly?

Is guiguts not quick enough for you? This is a fairly simple tool problem.

> Another obvious suggestion is that there are too many texts in the world to
> take them all on. ?Are the readers of PG really interested in "Annals of the
> Annual Proctology Meeting of 1847" ?

It's easy to come up with a rhetorically stupid title. But if you
pulled a real title, then we could actually discuss the audience and
why someone would upload that.

> Is there any way to more actively
> promote the acquisition and prioritizing of texts that are generally
> recognized as being "better than average" aka "famous" or at least "well
> known"?

That presumes that that should be our goal. Some of the works I'm
proudest of are works where the PG edition is the best in the world.
Sure, more people may read the Canterbury Tales, but every who reads
our edition of Stephen Hawes's "A Joyful Meditation of the Coronation
of King Henry the Eighth" is thrilled that we have it, because the
alternative was deciphering the blackletter originals and trying to
figure out the lost parts yourself. Augustan Reprint Society works are
a large class of works I've done where they have some scholarly
interest, but the reader will only find facsimiles outside of PG.

On the other hand is stuff like "1931: A Glance at the Twentieth
Century" by Henry Hartshorne. It is none of those things; it's just a
fun work to read, even if that fun comes at its own expense. I don't
think anyone who worked on it is the least bit unhappy about that.

-- 
Kie ekzistas vivo, ekzistas espero.

From schultzk at uni-trier.de  Thu Feb 11 01:10:26 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 11 Feb 2010 10:10:26 +0100
Subject: [gutvol-d] Revisting DP
In-Reply-To: <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
Message-ID: <975FD7DD-C4F0-49B9-86FB-CC23F727A912@uni-trier.de>

Hi Everybody,

	I have been following the thread "rfrank reports in".
	[Yes, BB its been hijacked]
	It seems obvious to all that the DP system has severe
	deficiencies. The question is how to help.
	Which leads to the question of what is flawed.
	It is obvious that the system after P3 is evidently
	to complex. Furthermore, the method of creating the
	perfect ebook. This has resulted in that there are evidently
	to few persons that can be trusted with this complexity.
	The method in general is not the problem, but the rules that 
	have to be abided by!!

	I have suggest in the past other alternatives which are by 
	far simpler and would produce the required results. 
	I would implement a system if i had the time, furthermore 
	it would be a one person operation. 

	DP has a hugh amount of person-power which the could use 
	more efficiently. As we all have noted. 

	But, until the ones who have the say over at DP are willing
	to simplify their system the problems will persist.  

	The formating/transcription rules required by DP have been developed
	over time, yet the where evidently added in in an ad-hoc manner.
	Any system over time should be revamped and streamlined.
	Optimized if you will. Sure a few tools need to be rewritten, but the
	basic frame should already be there so that should not propose a great 
	ordeal.

	The other questions that remain are what is a perfect book?
	Or, What is a predictable book?

	regards
		Keith.


From Bowerbird at aol.com  Thu Feb 11 10:29:41 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 11 Feb 2010 13:29:41 EST
Subject: [gutvol-d] Re: Revisting DP
Message-ID: <c8cc.6657e688.38a5a695@aol.com>

keith said:
>    It seems obvious to all that 
>    the DP system has severe deficiencies. 
>    The question is how to help.

you need to discuss that over at d.p.

they don't listen to anything over here.

the only reason i discuss things here
is because they banned me from there


-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100211/5b2d8d69/attachment.html>

From schultzk at uni-trier.de  Thu Feb 11 23:48:56 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Fri, 12 Feb 2010 08:48:56 +0100
Subject: [gutvol-d] Re: Revisting DP
In-Reply-To: <c8cc.6657e688.38a5a695@aol.com>
References: <c8cc.6657e688.38a5a695@aol.com>
Message-ID: <9A521268-9BC7-4B06-8F38-4AF58E087547@uni-trier.de>

Hi BB,

	I think we know the problems DP has with help.
	Just more or less rounding things up that were 
	discussed here.
	
	Yet, as you said ranting here will not help thing overthere,
	and ranting there get you put on the unwanted list,
	no matter how polite you are.

	regards
		Keith. 


Am 11.02.2010 um 19:29 schrieb Bowerbird at aol.com:

> keith said:
> >   It seems obvious to all that 
> >   the DP system has severe deficiencies. 
> >   The question is how to help.
> 
> you need to discuss that over at d.p.
> 
> they don't listen to anything over here.
> 
> the only reason i discuss things here
> is because they banned me from there
> 
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100212/e1782252/attachment-0001.html>

From jimad at msn.com  Fri Feb 12 10:47:08 2010
From: jimad at msn.com (Jim Adcock)
Date: Fri, 12 Feb 2010 10:47:08 -0800
Subject: [gutvol-d] DP: was rfrank reports in
In-Reply-To: <6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>	<SNT120-DS24A947851758312487F600AE500@phx.gbl>	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
Message-ID: <SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>

>To which I pointed out that this would in many cases result in the
posting of severely deficient texts. Formatting is important.

OK, but I can also point to texts that were almost "good to go" before they went into DP, only to molder indefinitely there.  Is there some way to make a decision on this one way or another. How about letting the PM make the decision whether or not to post a "preliminary version" to PG?

>Because sometimes it may be worth letting a text molder rather then
preemptorially ripping it out of someone's hands and annoying the hell
out of them.

OK, but how and when do you decide that the PP has actually moved on in life and is not really willing to finish up the book to which others have in good faith contributed their blood sweat and tears in the hopes of getting an honest to god book?  Not to mention the possibility of a PP not working in good faith?

>Is guiguts not quick enough for you? This is a fairly simple tool problem.

Tried it previously and didn't find any value in it.  I will take a look at it again.

>It's easy to come up with a rhetorically stupid title. But if you
pulled a real title, then we could actually discuss the audience and
why someone would upload that.

Pick any title active in the rounds right now.  Based on the best statistics I can find on PG usage, which is actually from IA, the most popular books from PG get read literally 100,000 times more often than the least read books.  Now, it is hard to find a book that is going to be that popular.  But it is easy to find a good book which will get read literally 40x more often than the books in DP right now, as well as being at least several times faster and a easier to create.

>> Is there any way to more actively
>> promote the acquisition and prioritizing of texts that are generally
>> recognized as being "better than average" aka "famous" or at least "well
>> known"?
>
>That presumes that that should be our goal. Some of the works I'm
>proudest of are works where the PG edition is the best in the world.
>Sure, more people may read the Canterbury Tales, but every who reads
>our edition of Stephen Hawes's "A Joyful Meditation....

Is it possible to split the queues and the efforts into "esoterica" vs. "books that will be actively read?"  Right now the "books that will be actively read" I am afraid are stuck in the queue behind "books that no one is actually willing to work on."  I went there recently to try to help and it looked like "the powers that be" were trying to force through books that really no one wants to work on -- books that were really hard and not very interesting even to the people who volunteer their time to DP. You can't force people to work on things they don't want to work on.  Either they work on texts that they want to work on, or if DP is not willing to present any of those, they they go on with their lives, or maybe, like in my case, they "route around damage" and work on books outside of DP.

The problem is NOT that there is "esoterica" vs. "books that will be actively read" -- the problem is that the "esoterica" takes so much time and effort compared to "books that will be actively read" that "esoterica" ends up swamping the other categories.

Are you really saving a book if you pickle it for posterity without it getting read?  Isn't that like locking up a ballerina's shoes in order to preserve ballet? Or locking up an artists paint and brushes in order to preserve art? To my taste books exist while they are being read.  Otherwise they fail to exist -- beyond little magnetic domains stuck somewhere on the internet.

A simple answer would be to put in separate queues for the differing levels of difficulty and/or categories of books. Then people who want to work on esoterica can do so without impacting people who don't.


From prosfilaes at gmail.com  Fri Feb 12 11:22:07 2010
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 12 Feb 2010 14:22:07 -0500
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
Message-ID: <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>

On Fri, Feb 12, 2010 at 1:47 PM, Jim Adcock <jimad at msn.com> wrote:
> OK, but how and when do you decide that the PP has actually moved on in life and is not really willing to finish up the book to which others have in good faith contributed their blood sweat and tears in the hopes of getting an honest to god book? ?Not to mention the possibility of a PP not working in good faith?

That's not a problem to be solved ranting; it's a problem to be solved
by study of the statistics and talking to the PPers.

> Is it possible to split the queues and the efforts into "esoterica" vs. "books that will be actively read?" ?Right now the "books that will be actively read" I am afraid are stuck in the queue behind "books that no one is actually willing to work on." ?I went there recently to try to help and it looked like "the powers that be" were trying to force through books that really no one wants to work on -- books that were really hard and not very interesting even to the people who volunteer their time to DP.

This is the funny thing; there's no connection between books that will
be actively read, and books people want to work on. What books would
be actively read: Euclid, Newton's Principia, the Oxford English
Dictionary. We've had scans of the OED for years; no one has been
willing to attack it. We can probably come up with a dozen usable
scans of Euclid; no one is currently working on getting PG a complete
copy of Euclid, because it's a total pain to work on. But you take
some moldy old historical fiction or better yet some sci-fi story that
hasn't been reprinted since it was first published, and they will
rocket through DP.

> The problem is NOT that there is "esoterica" vs. "books that will be actively read" -- the problem is that the "esoterica" takes so much time and effort compared to "books that will be actively read" that "esoterica" ends up swamping the other categories.

Bullshit. How long do you think the OED would take? That's a book that
will be actively read. Why did "Dryden's Works (13 of 18):
Translations; Pastorals" take two months to go through P2? If you're
classifying the complete works of Dryden as esoterica, then what on
Earth are you classifying as books that will be actively read?
Certainly not the historical trash fiction that does blow through DP.

-- 
Kie ekzistas vivo, ekzistas espero.

From klofstrom at gmail.com  Fri Feb 12 11:27:04 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Fri, 12 Feb 2010 09:27:04 -1000
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
Message-ID: <1e8e65081002121127r77da6414y32a6f4b00f35d6fc@mail.gmail.com>

On Fri, Feb 12, 2010 at 8:47 AM, Jim Adcock <jimad at msn.com> wrote:

> OK, but I can also point to texts that were almost "good to go" before they went into DP, only to molder indefinitely there. ?Is there some way to make a decision on this one way or another. How about letting the PM make the decision whether or not to post a "preliminary version" to PG?

The poll that's up at DP right now has the respondents just about
evenly split on this issue. I would be OK with doing it, but I also
understand those who feel that the "preliminary" posting might hang
around for years, displacing the final, polished, ACCURATE product.

> Is it possible to split the queues and the efforts into "esoterica" vs. "books that will be actively read?"

No. That's like recommending that publishers solve their financial
problems by only printing best-sellers. Some books that YOU think are
esoterica might actually be of great interest to a small but
appreciative community, such as scholars the world over. Take, for
example, the Baburnama, the memoirs of Babur, the Turki conqueror of
northern South Asia and founder of the Mughal dynasty, as translated
by Beveridge. Fiendishly difficult text, took a year to get through
P3, will probably take a lot of time in F1 and F2 and PP, a real slog
... but it's an essential work in South Asian history and I'm sure
that it will be of great use to students and scholars once finished. I
don't regret the time I spent on it.

>?I went there recently to try to help and it looked like "the powers that be" were trying to force through books that really no one wants to work on -- books that were really hard and not very interesting even to the people who volunteer their time to DP.

There's no forcing going on. The policy from Day One has been that we
work on what the content providers submit. Sometimes works that look
enticing or valuable to them aren't appealing to the proofers, and
then take a long time to wend their way through the system. (Some
texts, like Greg Week's science fiction stories, zip through in days.)

The problem is that the mouldie oldies clog the queues. There have
been quite a few proposals for changing the queue system and the round
system, and some experiments are running right now. We'll see what
happens. DP made a HUGE change when it moved to five rounds rather
than two, and I think it will be able to change again.

--
Karen Lofstrom


You can't force people to work on things they don't want to work on.
Either they work on texts that they want to work on, or if DP is not
willing to present any of those, they they go on with their lives, or
maybe, like in my case, they "route around damage" and work on books
outside of DP.
>
> The problem is NOT that there is "esoterica" vs. "books that will be actively read" -- the problem is that the "esoterica" takes so much time and effort compared to "books that will be actively read" that "esoterica" ends up swamping the other categories.
>
> Are you really saving a book if you pickle it for posterity without it getting read? ?Isn't that like locking up a ballerina's shoes in order to preserve ballet? Or locking up an artists paint and brushes in order to preserve art? To my taste books exist while they are being read. ?Otherwise they fail to exist -- beyond little magnetic domains stuck somewhere on the internet.
>
> A simple answer would be to put in separate queues for the differing levels of difficulty and/or categories of books. Then people who want to work on esoterica can do so without impacting people who don't.
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From grythumn at gmail.com  Fri Feb 12 11:45:39 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Fri, 12 Feb 2010 14:45:39 -0500
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
Message-ID: <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>

On Fri, Feb 12, 2010 at 2:22 PM, David Starner <prosfilaes at gmail.com> wrote:
> Dictionary. We've had scans of the OED for years; no one has been
> willing to attack it. We can probably come up with a dozen usable

Not exactly true. I have a clearance on it, and have a fascicle
prepped and at DP. The holdup is that I have yet to come up with a
good markup for proofing that can be machine transformed into various
dictionary formats. Straight TEI is too big, and likely to lead to
inconsistencies. I refuse to start something this big without a decent
plan for the final output.

Granted, once started, it will probably take decades to work through DP...

-R C
(Who is somewhat easily distracted, and has been working on other projects.)

From ajhaines at shaw.ca  Fri Feb 12 12:05:20 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Fri, 12 Feb 2010 12:05:20 -0800
Subject: [gutvol-d] Re: DP: was rfrank reports in
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
Message-ID: <18CC2C23FCF249DEA672595196E236B2@alp2400>

Speaking as a Whitewasher (and probably for the other WWers, too), I have 
absolutely no interest in posting a "preliminary" version of something if a 
"revised" version is going to appear in a few days/weeks/months, requiring 
me to re-do the posting process.  Ditto for posting a text-only version if 
an HTML version is in the works.

My PG priorities are my own productions first, followed by WWing, then 
Errata and Reposts.  My own productions are not going to be allowed to 
suffer just because someone is in a rush to get a preliminary version out 
the door.  I can always create another priority--"No Rush Whatsover".

In short, it's MY time I volunteer to PG, and it's not yours to waste.

Al


----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Friday, February 12, 2010 10:47 AM
Subject: [gutvol-d] DP: was rfrank reports in


> >To which I pointed out that this would in many cases result in the
> posting of severely deficient texts. Formatting is important.
>
> OK, but I can also point to texts that were almost "good to go" before 
> they went into DP, only to molder indefinitely there.  Is there some way 
> to make a decision on this one way or another. How about letting the PM 
> make the decision whether or not to post a "preliminary version" to PG?
>
>>Because sometimes it may be worth letting a text molder rather then
> preemptorially ripping it out of someone's hands and annoying the hell
> out of them.
>
> OK, but how and when do you decide that the PP has actually moved on in 
> life and is not really willing to finish up the book to which others have 
> in good faith contributed their blood sweat and tears in the hopes of 
> getting an honest to god book?  Not to mention the possibility of a PP not 
> working in good faith?
>
>>Is guiguts not quick enough for you? This is a fairly simple tool problem.
>
> Tried it previously and didn't find any value in it.  I will take a look 
> at it again.
>
>>It's easy to come up with a rhetorically stupid title. But if you
> pulled a real title, then we could actually discuss the audience and
> why someone would upload that.
>
> Pick any title active in the rounds right now.  Based on the best 
> statistics I can find on PG usage, which is actually from IA, the most 
> popular books from PG get read literally 100,000 times more often than the 
> least read books.  Now, it is hard to find a book that is going to be that 
> popular.  But it is easy to find a good book which will get read literally 
> 40x more often than the books in DP right now, as well as being at least 
> several times faster and a easier to create.
>
>>> Is there any way to more actively
>>> promote the acquisition and prioritizing of texts that are generally
>>> recognized as being "better than average" aka "famous" or at least "well
>>> known"?
>>
>>That presumes that that should be our goal. Some of the works I'm
>>proudest of are works where the PG edition is the best in the world.
>>Sure, more people may read the Canterbury Tales, but every who reads
>>our edition of Stephen Hawes's "A Joyful Meditation....
>
> Is it possible to split the queues and the efforts into "esoterica" vs. 
> "books that will be actively read?"  Right now the "books that will be 
> actively read" I am afraid are stuck in the queue behind "books that no 
> one is actually willing to work on."  I went there recently to try to help 
> and it looked like "the powers that be" were trying to force through books 
> that really no one wants to work on -- books that were really hard and not 
> very interesting even to the people who volunteer their time to DP. You 
> can't force people to work on things they don't want to work on.  Either 
> they work on texts that they want to work on, or if DP is not willing to 
> present any of those, they they go on with their lives, or maybe, like in 
> my case, they "route around damage" and work on books outside of DP.
>
> The problem is NOT that there is "esoterica" vs. "books that will be 
> actively read" -- the problem is that the "esoterica" takes so much time 
> and effort compared to "books that will be actively read" that "esoterica" 
> ends up swamping the other categories.
>
> Are you really saving a book if you pickle it for posterity without it 
> getting read?  Isn't that like locking up a ballerina's shoes in order to 
> preserve ballet? Or locking up an artists paint and brushes in order to 
> preserve art? To my taste books exist while they are being read. 
> Otherwise they fail to exist -- beyond little magnetic domains stuck 
> somewhere on the internet.
>
> A simple answer would be to put in separate queues for the differing 
> levels of difficulty and/or categories of books. Then people who want to 
> work on esoterica can do so without impacting people who don't.
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From greg at durendal.org  Fri Feb 12 13:45:15 2010
From: greg at durendal.org (Greg Weeks)
Date: Fri, 12 Feb 2010 16:45:15 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re:  Re: DP: was rfrank reports in
In-Reply-To: <18CC2C23FCF249DEA672595196E236B2@alp2400>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
Message-ID: <alpine.DEB.2.00.1002121643370.28065@durendal.durendal.org>

On Fri, 12 Feb 2010, Al Haines (shaw) wrote:

> Speaking as a Whitewasher (and probably for the other WWers, too), I have 
> absolutely no interest in posting a "preliminary" version of something if a 
> "revised" version is going to appear in a few days/weeks/months, requiring me 
> to re-do the posting process.  Ditto for posting a text-only version if an 
> HTML version is in the works.

The proposal isn't to "post" to PG at all, but to something like 
preprints.readingroo.ms but entirely automated.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From sly at victoria.tc.ca  Fri Feb 12 15:06:24 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Fri, 12 Feb 2010 15:06:24 -0800 (PST)
Subject: [gutvol-d] Re: [SPAM] Re:  Re: DP: was rfrank reports in
In-Reply-To: <alpine.DEB.2.00.1002121643370.28065@durendal.durendal.org>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
	<alpine.DEB.2.00.1002121643370.28065@durendal.durendal.org>
Message-ID: <Pine.GSO.4.58.1002121459480.9647@vtn1.victoria.tc.ca>

I agree that's what has been said in discussions on the DP forums.
I would argue that intent was not clear from what's been been posted
on this list.

>From what I've seen, it's hard to stay focused on one concept, because
everyone starts dragging in their own concerns on marginally related
topics and making those the main focus.

--Andrew

On Fri, 12 Feb 2010, Greg Weeks wrote:

> On Fri, 12 Feb 2010, Al Haines (shaw) wrote:
>
> > Speaking as a Whitewasher (and probably for the other WWers, too), I have
> > absolutely no interest in posting a "preliminary" version of something if a
> > "revised" version is going to appear in a few days/weeks/months, requiring me
> > to re-do the posting process.  Ditto for posting a text-only version if an
> > HTML version is in the works.
>
> The proposal isn't to "post" to PG at all, but to something like
> preprints.readingroo.ms but entirely automated.
>
>

From dakretz at gmail.com  Fri Feb 12 15:33:20 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 12 Feb 2010 15:33:20 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: DP: was rfrank reports in
In-Reply-To: <Pine.GSO.4.58.1002121459480.9647@vtn1.victoria.tc.ca>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
	<alpine.DEB.2.00.1002121643370.28065@durendal.durendal.org>
	<Pine.GSO.4.58.1002121459480.9647@vtn1.victoria.tc.ca>
Message-ID: <627d59b81002121533y422adbf4uecf18d0f234922f4@mail.gmail.com>

This is very interesting viewed from both sites.

We have one spokesman for PG suggesting that, for the purposes of
increasing the rate of increasing the stock, using text at some level of
markup sophistication (which I seem to remember strongly featured
text but not HTML), there was some possibility of room for flexibility
(or something like that.)

There immediately erupted on DP two substantially differing
(mis)interpretations
of what this might mean.

Both of them have received responses I'd characterize as mostly
indifference to revulsion, with a few outliers on both sides.

Now we have a second PG spokesman, and the score here seems
to be one vote for "maybe" and another for "hell no".

Not much basis left for discussion, but we get a lot of productive
venting done on both sides.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100212/d307ca09/attachment.html>

From Bowerbird at aol.com  Fri Feb 12 16:58:11 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 12 Feb 2010 19:58:11 EST
Subject: [gutvol-d] Re: rfrank reports in
Message-ID: <1830d.5e4b7dd4.38a75323@aol.com>


what a convoy of clowns...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100212/d13d8847/attachment.html>

From greg at durendal.org  Fri Feb 12 17:26:29 2010
From: greg at durendal.org (Greg Weeks)
Date: Fri, 12 Feb 2010 20:26:29 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: DP: was rfrank reports in
In-Reply-To: <Pine.GSO.4.58.1002121459480.9647@vtn1.victoria.tc.ca>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
	<alpine.DEB.2.00.1002121643370.28065@durendal.durendal.org>
	<Pine.GSO.4.58.1002121459480.9647@vtn1.victoria.tc.ca>
Message-ID: <alpine.DEB.2.00.1002122015450.28877@durendal.durendal.org>


Yes, after the initial proposal, there were 5 or six other proposals made 
in the same thread to drag it off in odd courses. As far as I can see 
there's only one person actually doing anything other than argue. That's 
hanne_dk and I hope to see an automated script to process DPs intermediate 
files into something that doesn't look too bad for most texts. It appears 
it'll never get "official"  approval as there's too many people adamantly 
against doing to to "their" texts. Oh well.

Greg Weeks

On Fri, 12 Feb 2010, Andrew Sly wrote:

> I agree that's what has been said in discussions on the DP forums.
> I would argue that intent was not clear from what's been been posted
> on this list.
>
>> From what I've seen, it's hard to stay focused on one concept, because
> everyone starts dragging in their own concerns on marginally related
> topics and making those the main focus.
>
> --Andrew
>
> On Fri, 12 Feb 2010, Greg Weeks wrote:
>
>> On Fri, 12 Feb 2010, Al Haines (shaw) wrote:
>>
>>> Speaking as a Whitewasher (and probably for the other WWers, too), I have
>>> absolutely no interest in posting a "preliminary" version of something if a
>>> "revised" version is going to appear in a few days/weeks/months, requiring me
>>> to re-do the posting process.  Ditto for posting a text-only version if an
>>> HTML version is in the works.
>>
>> The proposal isn't to "post" to PG at all, but to something like
>> preprints.readingroo.ms but entirely automated.
>>
>>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

-- 
Greg Weeks
http://durendal.org:8080/greg/


From dakretz at gmail.com  Fri Feb 12 17:49:41 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 12 Feb 2010 17:49:41 -0800
Subject: [gutvol-d] {Disarmed} Re: [SPAM] Re: Re: [SPAM] Re: Re: DP: was
	rfrank reports in
In-Reply-To: <alpine.DEB.2.00.1002122015450.28877@durendal.durendal.org>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
	<alpine.DEB.2.00.1002121643370.28065@durendal.durendal.org>
	<Pine.GSO.4.58.1002121459480.9647@vtn1.victoria.tc.ca>
	<alpine.DEB.2.00.1002122015450.28877@durendal.durendal.org>
Message-ID: <627d59b81002121749y55c6d30bp2f889a7878bb9abb@mail.gmail.com>

Well, to close the circle, if they were posted, they would be (I say this
advisedly) *perfect* fodder for, say, an offline utility program to run
automated checks and do basic formatting. I bet in most cases one person
could whip up a high-quality post-PP equivalent in, say, a day or two.

On Fri, Feb 12, 2010 at 5:26 PM, Greg Weeks <greg at durendal.org> wrote:

>
> Yes, after the initial proposal, there were 5 or six other proposals made
> in the same thread to drag it off in odd courses. As far as I can see
> there's only one person actually doing anything other than argue. That's
> hanne_dk and I hope to see an automated script to process DPs intermediate
> files into something that doesn't look too bad for most texts. It appears
> it'll never get "official"  approval as there's too many people adamantly
> against doing to to "their" texts. Oh well.
>
> Greg Weeks
>
>
> On Fri, 12 Feb 2010, Andrew Sly wrote:
>
>  I agree that's what has been said in discussions on the DP forums.
>> I would argue that intent was not clear from what's been been posted
>> on this list.
>>
>>  From what I've seen, it's hard to stay focused on one concept, because
>>>
>> everyone starts dragging in their own concerns on marginally related
>> topics and making those the main focus.
>>
>> --Andrew
>>
>> On Fri, 12 Feb 2010, Greg Weeks wrote:
>>
>>  On Fri, 12 Feb 2010, Al Haines (shaw) wrote:
>>>
>>>  Speaking as a Whitewasher (and probably for the other WWers, too), I
>>>> have
>>>> absolutely no interest in posting a "preliminary" version of something
>>>> if a
>>>> "revised" version is going to appear in a few days/weeks/months,
>>>> requiring me
>>>> to re-do the posting process.  Ditto for posting a text-only version if
>>>> an
>>>> HTML version is in the works.
>>>>
>>>
>>> The proposal isn't to "post" to PG at all, but to something like
>>> preprints.readingroo.ms but entirely automated.
>>>
>>>
>>>  _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
>>
> --
> Greg Weeks
> http://durendal.org:8080/greg/
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100212/c7907d41/attachment.html>

From ke at gnu.franken.de  Fri Feb 12 18:26:24 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Sat, 13 Feb 2010 03:26:24 +0100
Subject: [gutvol-d] Using SVN or git/bazar (Re: Re: DP: was rfrank reports
	in)
In-Reply-To: <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	(Robert Cicconetti's message of "Fri, 12 Feb 2010 14:45:39 -0500")
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
Message-ID: <m2mxzdc1m7.fsf_-_@gnu.franken.de>

Robert Cicconetti <grythumn at gmail.com> writes:

> Not exactly true. I have a clearance on it, and have a fascicle
> prepped and at DP. The holdup is that I have yet to come up with a
> good markup for proofing that can be machine transformed into various
> dictionary formats.

Lame excuse ;)  The proofing rounds are easy (and you only see the
difficulties, once you actually let the crowd work on it).

> Straight TEI is too big, and likely to lead to inconsistencies. I
> refuse to start something this big without a decent plan for the final
> output.

I'd recommend to do all "formatting" (= XML tagging) off-site.  It would
probably the best to use SVN or git/bazar for collaboration.  Any idea
where we could host such a repository?

-- 
Karl Eichwalder

From dakretz at gmail.com  Fri Feb 12 18:34:48 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 12 Feb 2010 18:34:48 -0800
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <m2mxzdc1m7.fsf_-_@gnu.franken.de>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
Message-ID: <627d59b81002121834i2edda5a4n9e4efdc5d162afb9@mail.gmail.com>

Google Code <http://code.google.com/intl/en/>

On Fri, Feb 12, 2010 at 6:26 PM, Karl Eichwalder <ke at gnu.franken.de> wrote:

> Robert Cicconetti <grythumn at gmail.com> writes:
>
> > Not exactly true. I have a clearance on it, and have a fascicle
> > prepped and at DP. The holdup is that I have yet to come up with a
> > good markup for proofing that can be machine transformed into various
> > dictionary formats.
>
> Lame excuse ;)  The proofing rounds are easy (and you only see the
> difficulties, once you actually let the crowd work on it).
>
> > Straight TEI is too big, and likely to lead to inconsistencies. I
> > refuse to start something this big without a decent plan for the final
> > output.
>
> I'd recommend to do all "formatting" (= XML tagging) off-site.  It would
> probably the best to use SVN or git/bazar for collaboration.  Any idea
> where we could host such a repository?
>
> --
> Karl Eichwalder
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100212/aca118ff/attachment.html>

From grythumn at gmail.com  Fri Feb 12 19:16:21 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Fri, 12 Feb 2010 22:16:21 -0500
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <m2mxzdc1m7.fsf_-_@gnu.franken.de>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
Message-ID: <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>

On Fri, Feb 12, 2010 at 9:26 PM, Karl Eichwalder <ke at gnu.franken.de> wrote:
> Robert Cicconetti <grythumn at gmail.com> writes:
>
>> Not exactly true. I have a clearance on it, and have a fascicle
>> prepped and at DP. The holdup is that I have yet to come up with a
>> good markup for proofing that can be machine transformed into various
>> dictionary formats.
>
> Lame excuse ;) ?The proofing rounds are easy (and you only see the
> difficulties, once you actually let the crowd work on it).

Not really. The OED uses a predecessor of IPA with some oddball
symbols.. at the least I have to come up with a table for those or
they'll be all over the place. I started one, need to finish it.

>> Straight TEI is too big, and likely to lead to inconsistencies. I
>> refuse to start something this big without a decent plan for the final
>> output.
>
> I'd recommend to do all "formatting" (= XML tagging) off-site. ?It would
> probably the best to use SVN or git/bazar for collaboration. ?Any idea
> where we could host such a repository?

I'm not prepared to abandon the DP workflow, especially for a project
of this scale, and considering the amount of markup that will be
required. At DP I reasonably assume it'll keep moving, even if I drop
off the grid or get hit by a bus.

-R C

From dakretz at gmail.com  Fri Feb 12 20:04:18 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 12 Feb 2010 20:04:18 -0800
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
Message-ID: <627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com>

Here's a point of reference for you.

The current Encyclop?dia Britannica project in F2 has been there since
September.

Number of pages:                 232
Pages remaining:                  24
Pages I've done:                 203
Pages other people have done:      5

Some rounds get cherry-picked pretty badly; and OED is not a cherry.

Stay away from buses.


On Fri, Feb 12, 2010 at 7:16 PM, Robert Cicconetti <grythumn at gmail.com>wrote:

> On Fri, Feb 12, 2010 at 9:26 PM, Karl Eichwalder <ke at gnu.franken.de>
> wrote:
> > Robert Cicconetti <grythumn at gmail.com> writes:
>
> I'm not prepared to abandon the DP workflow, especially for a project
> of this scale, and considering the amount of markup that will be
> required. At DP I reasonably assume it'll keep moving, even if I drop
> off the grid or get hit by a bus.
>
> -R C
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100212/7717b6b2/attachment.html>

From ke at gnu.franken.de  Sat Feb 13 00:17:14 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Sat, 13 Feb 2010 09:17:14 +0100
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
	(Robert Cicconetti's message of "Fri, 12 Feb 2010 22:16:21 -0500")
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
Message-ID: <m2hbplbldh.fsf@gnu.franken.de>

Robert Cicconetti <grythumn at gmail.com> writes:

> Not really. The OED uses a predecessor of IPA with some oddball
> symbols.. at the least I have to come up with a table for those or
> they'll be all over the place. I started one, need to finish it.

You could consider processing it at dp-canada or dp-int--both are UTF-8
enabled.

>> I'd recommend to do all "formatting" (= XML tagging) off-site. ?It would
>> probably the best to use SVN or git/bazar for collaboration. ?Any idea
>> where we could host such a repository?
>
> I'm not prepared to abandon the DP workflow, especially for a project
> of this scale, and considering the amount of markup that will be
> required. At DP I reasonably assume it'll keep moving, even if I drop
> off the grid or get hit by a bus.

That's why I propose to use a public repository.  Of course, you would
leave a appropriate comment on the project page.

Doing TEI tagging page-wise is cumbersome.  Doing TEI tagging off-site
using your XML editor is much better.

-- 
Karl Eichwalder

From grythumn at gmail.com  Sat Feb 13 06:39:05 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Sat, 13 Feb 2010 09:39:05 -0500
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
	<627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com>
Message-ID: <15cfa2a51002130639l7fd30080k948fc7998cc3ad72@mail.gmail.com>

On Fri, Feb 12, 2010 at 11:04 PM, don kretz <dakretz at gmail.com> wrote:
> Here's a point of reference for you.
> The current Encyclop?dia Britannica project in F2 has been there since
> September.
> Number of pages: ? ? ? ? ? ? ? ? 232
> Pages remaining: ? ? ? ? ? ? ? ? ?24
> Pages I've done: ? ? ? ? ? ? ? ? 203
> Pages other people have done: ? ? ?5
> Some rounds get cherry-picked pretty badly; and OED is not a cherry.
> Stay away from buses.

I might be willing to do a parallel F1 / merge, and automated markup
check for F2 skip if I don't have to find a PP in advance.

Let me be blunt... I'm easily distracted; doing this kind of markup
would drive me nuts quickly and result in orphaned projects. I'll prep
the images, run OCR, answer questions, write the code to do the
automated checks. But I don't PP or format.

-Bob

From ke at gnu.franken.de  Sat Feb 13 09:12:36 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Sat, 13 Feb 2010 18:12:36 +0100
Subject: [gutvol-d] Re: Using SVN or git/bazar
In-Reply-To: <627d59b81002121834i2edda5a4n9e4efdc5d162afb9@mail.gmail.com>
	(don kretz's message of "Fri, 12 Feb 2010 18:34:48 -0800")
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<627d59b81002121834i2edda5a4n9e4efdc5d162afb9@mail.gmail.com>
Message-ID: <m26361awl7.fsf_-_@gnu.franken.de>

don kretz <dakretz at gmail.com> writes:

> Google Code <http://code.google.com/intl/en/>

Why not? ;)  I just created http://code.google.com/p/tieck-texts/ and
seeded it with 'Briefe an Ludwig Tieck (1 of 4) {fraktur} {type-in}'.
I'll update the project comments later.

Wondering whether Google will accept this project...

-- 
Karl Eichwalder

From schultzk at uni-trier.de  Sat Feb 13 10:03:24 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Sat, 13 Feb 2010 19:03:24 +0100
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
Message-ID: <C44AF757-2A33-4178-9C01-BD53E4D08B9A@uni-trier.de>

Hi Robert,

	As far as a markup is concerned I would suggest
	using TeX or XeTeX.

	For one you can encode all the information we want as you want.
	Such as \entry, \pronunciation, \meaning, \synonym, etc, you name
	it.

	Then either write comands for formating or a TeX script to produce the desired
	output, or use any other language to process the data.

	Another way to go is use XML to encode the data and take it from there.

	Eitherway you have full control of the input data and output.

	regards
		Keith

	
Am 12.02.2010 um 20:45 schrieb Robert Cicconetti:

> On Fri, Feb 12, 2010 at 2:22 PM, David Starner <prosfilaes at gmail.com> wrote:
>> Dictionary. We've had scans of the OED for years; no one has been
>> willing to attack it. We can probably come up with a dozen usable
> 
> Not exactly true. I have a clearance on it, and have a fascicle
> prepped and at DP. The holdup is that I have yet to come up with a
> good markup for proofing that can be machine transformed into various
> dictionary formats. Straight TEI is too big, and likely to lead to
> inconsistencies. I refuse to start something this big without a decent
> plan for the final output.
> 
> Granted, once started, it will probably take decades to work through DP...


From dakretz at gmail.com  Sat Feb 13 10:23:06 2010
From: dakretz at gmail.com (don kretz)
Date: Sat, 13 Feb 2010 10:23:06 -0800
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <15cfa2a51002130639l7fd30080k948fc7998cc3ad72@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
	<627d59b81002122004j376a99fdofbd0f70b0e2df8df@mail.gmail.com>
	<15cfa2a51002130639l7fd30080k948fc7998cc3ad72@mail.gmail.com>
Message-ID: <627d59b81002131023h6d2ec44r842d268d941d79e0@mail.gmail.com>

A little more information.

That same project (which is 232 pages, about 1/8 of one volume out of 29
volumes.)

It was being P3 proofread from April to November of 2007 (about 7 months).

Then it sat in queues with no one working on it from Nov. 2007 to Sept. 2009
(almost two years) except for a brief spell (3 months) when it was in F1.

And that was pretty speedy. A new project (such as I'm preparing now, which
will
be 300+ pages,) will not be quite so fortunate, because now the queues are
much longer; and more significantly, there will be many more EB volumes
ahead of it when it gets to each queue.

So I'd be prepared to spend some time proofing at least (if you don't prefer
formatting and PP) so help it along in those brief windows of opportunity
(roughly 9-12 months) when it's available to anyone at all. (But given
well-established trends, it will  probably be much longer.)

Fortunately, you'll have lots of time to scan and OCR each project. In
fact, I bet you'll be so fortunate as to have a new generation of scanning
technology available every couple of projects or so.

It may easily take longer to proof, format, and publish the ebook than it
took for the original - an acknowledged epic in itself.

For sure, it could be re-typeset in a small fraction of the time.


On Sat, Feb 13, 2010 at 6:39 AM, Robert Cicconetti <grythumn at gmail.com>wrote:

> On Fri, Feb 12, 2010 at 11:04 PM, don kretz <dakretz at gmail.com> wrote:
> > Here's a point of reference for you.
> > The current Encyclop?dia Britannica project in F2 has been there since
> > September.
> > Number of pages:                 232
> > Pages remaining:                  24
> > Pages I've done:                 203
> > Pages other people have done:      5
> > Some rounds get cherry-picked pretty badly; and OED is not a cherry.
> > Stay away from buses.
>
> I might be willing to do a parallel F1 / merge, and automated markup
> check for F2 skip if I don't have to find a PP in advance.
>
> Let me be blunt... I'm easily distracted; doing this kind of markup
> would drive me nuts quickly and result in orphaned projects. I'll prep
> the images, run OCR, answer questions, write the code to do the
> automated checks. But I don't PP or format.
>
> -Bob
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100213/64fd160f/attachment.html>

From dakretz at gmail.com  Sat Feb 13 10:32:28 2010
From: dakretz at gmail.com (don kretz)
Date: Sat, 13 Feb 2010 10:32:28 -0800
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <C44AF757-2A33-4178-9C01-BD53E4D08B9A@uni-trier.de>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<C44AF757-2A33-4178-9C01-BD53E4D08B9A@uni-trier.de>
Message-ID: <627d59b81002131032q629937aeo21ce4ca1cca02693@mail.gmail.com>

You might want to work something out with these
guys<http://www.longnow.org/clock/> to
keep track
of your project logs after you're gone.


>
> Am 12.02.2010 um 20:45 schrieb Robert Cicconetti:
> >
> > Granted, once started, it will probably take decades to work through
> DP...
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100213/f8ad312f/attachment.html>

From gbuchana at teksavvy.com  Sat Feb 13 10:51:50 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Sat, 13 Feb 2010 13:51:50 -0500
Subject: [gutvol-d] Many solo projects out there in gutvol-d land?
Message-ID: <4B76F4C6.3030006@teksavvy.com>

I've done a few books for PG.  I've used DP -- back in the day,
but mostly I've been doing solo projects.  I don't hear a lot
about folks doing projects solo these days. Are there many of
us out there?

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From sly at victoria.tc.ca  Sat Feb 13 11:49:14 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Sat, 13 Feb 2010 11:49:14 -0800 (PST)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <4B76F4C6.3030006@teksavvy.com>
References: <4B76F4C6.3030006@teksavvy.com>
Message-ID: <Pine.GSO.4.58.1002131125590.14703@vtn1.victoria.tc.ca>

I suspect there are. You just don't see a lot of communication
between them.

I'm often checking newly posted texts for the catalog records,
and I do notice credits sometimes that do not mention dp.

I do projects on my own sometimes. I know Al Haines
does many. I recall seeing a few religious texts lately
from an individual contributor.

--Andrew

On Sat, 13 Feb 2010, Gardner Buchanan wrote:

> I've done a few books for PG.  I've used DP -- back in the day,
> but mostly I've been doing solo projects.  I don't hear a lot
> about folks doing projects solo these days. Are there many of
> us out there?
>

From ajhaines at shaw.ca  Sat Feb 13 12:25:48 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Sat, 13 Feb 2010 12:25:48 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
References: <4B76F4C6.3030006@teksavvy.com>
Message-ID: <DD5174F06F724F3C8CCEAADEE9390D34@alp2400>

I've produced many books single-handed, from scan to post, for both PG-US 
and PG-Canada.  As a Whitewasher, I've encountered maybe a dozen solo 
producers.

Many of the first-timers I deal with aren't prepared for the work involved 
in producing a book, and abandon their projects.  Very few become 
multi-project submitters.

Abandoned projects are not lost.  My practice is to wait a year, then decide 
if I want to do the book myself.  If I do, and I can find a scanset, I get a 
clearance, and produce the book.

Many of the early producers, who did books when etext numbers were less than 
about 5000, no longer produce.  I can think of only a few who do.  Gardner, 
you're one, and David Price and David Widger are others.  I didn't start 
until the very early 10000's--my first book was #10750, released January 
2004.


Al


----- Original Message ----- 
From: "Gardner Buchanan" <gbuchana at teksavvy.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
Sent: Saturday, February 13, 2010 10:51 AM
Subject: [gutvol-d] Many solo projects out there in gutvol-d land?


> I've done a few books for PG.  I've used DP -- back in the day,
> but mostly I've been doing solo projects.  I don't hear a lot
> about folks doing projects solo these days. Are there many of
> us out there?
>
> ============================================================
> Gardner Buchanan                     <gbuchana at teksavvy.com>
> Ottawa, ON             FreeBSD: Where you want to go. Today.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From dakretz at gmail.com  Sat Feb 13 12:53:05 2010
From: dakretz at gmail.com (don kretz)
Date: Sat, 13 Feb 2010 12:53:05 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
Message-ID: <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>

Interesting. I hadn't realize the two organizations were so closely
interdependent.

So effectively, PG's release volume is almost directly dependent on DP's
posting volume.

And whatever validation requirements PG might have don't have much relevance
if
they differ from DP's requirements, as long as the WWers don't reject them.

DP is the publisher, and PG is the distributor (roughly speaking).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100213/2c568add/attachment.html>

From prosfilaes at gmail.com  Sat Feb 13 20:47:08 2010
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 13 Feb 2010 23:47:08 -0500
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <m2hbplbldh.fsf@gnu.franken.de>
References: <8005.73d837a3.38a06575@aol.com>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
	<m2hbplbldh.fsf@gnu.franken.de>
Message-ID: <6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com>

On Sat, Feb 13, 2010 at 3:17 AM, Karl Eichwalder <ke at gnu.franken.de> wrote:
> Robert Cicconetti <grythumn at gmail.com> writes:
>
>> Not really. The OED uses a predecessor of IPA with some oddball
>> symbols.. at the least I have to come up with a table for those or
>> they'll be all over the place. I started one, need to finish it.
>
> You could consider processing it at dp-canada or dp-int--both are UTF-8
> enabled.

I have two problems with that. One, I'm not sure all the symbols are
in Unicode. Two, just making Unicode available doesn't overcome the
problems that these characters are not on any physical keyboards and
only the most esoteric software keyboards. Even with Unicode
available, if it were pure IPA, I'd go with SAMPA.

-- 
Kie ekzistas vivo, ekzistas espero.

From sly at victoria.tc.ca  Sat Feb 13 23:20:25 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Sat, 13 Feb 2010 23:20:25 -0800 (PST)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
Message-ID: <Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>


On Sat, 13 Feb 2010, don kretz wrote:

> Interesting. I hadn't realize the two organizations were so closely
> interdependent.

Well, yes. There is a lot of interplay and adaption between the two.
But I would not say that either is dependant upon the other for its
existence. If PG were to somehow disappear or close down, I'm sure
that DP would continue, finding another repository for its finished
texts--or creating one if needed. And if DP were to disappear, PG would
go on just as it always has, only with a much lower volume of texts
being posted.

> So effectively, PG's release volume is almost directly dependent on DP's
> posting volume.

The majority of new PG texts for many years have come frome DP, yes.
For a quick comparison, I see that DP's 15,000th text was posted on
May 12, 2009. They will have done many more since then, and have by now
done more than half of the 31,000 odd items in PG.

A while ago, I added this to the Wikipedia article on Project Gutenberg,
to try to clarify what effect DP had had on it:
"This effort greatly increased the number and variety of texts being added
to Project Gutenberg, as well as making it easier for new volunteers to
start contributing."

I could go on describing the hows and wherefores of that in more detail,
but this is getting too long already.

> And whatever validation requirements PG might have don't have much relevance
> if
> they differ from DP's requirements, as long as the WWers don't reject them.

Well, that has been part of the balancing act, if you will. PG has
always adapted (albeit, sometimes slowly) according to its contributors.
And DP contributors, after conversations back and forth, have helped
to shape what direction PG is going in.

One example that comes to mind is dropping the requirement that
a text be of a certain length, in order to accomodate all the
sci-fi short stories.

In my own opinion, this can be difficult, becuase there are many
parts that make up this process of DP-PG. Sometimes people make
suggestions that seem good from their point of view, but very
few seem to have an accurate over-all picture, to know how one
action can affect other parts of the process.

> DP is the publisher, and PG is the distributor (roughly speaking).

I don't know if that metaphor fits perfectly. Project Gutenberg
itself seems to fill more of the publishers role, as well as
distributor and archiver. DP does what might be compared to
the traditional roles of type-setter, proofreader, fact-checker,
etc.

And don't underestimate the role of the post-processor.
It still comes down to one person who has to do a lot of work
on the text, and often make descisions about how to deal with
many various things, before it is ready for submitting to
PG.


--Andrew

From dakretz at gmail.com  Sun Feb 14 00:09:44 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 14 Feb 2010 00:09:44 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
Message-ID: <627d59b81002140009u4a42a463hbab6742d65c7b310@mail.gmail.com>

Not to worry - the last thing any of us do is undervalue the post-processor.
The job just seems to become more complex, and the amount of value-add
the provide beyond what the rest of us do keeps increasing. I don't think
anyone is particularly happy about that, least of all the PPers.They're
the smallest piece of pipe everything has to fit through, and they aren't
getting much help in the way of tool support.


On Sat, Feb 13, 2010 at 11:20 PM, Andrew Sly <sly at victoria.tc.ca> wrote:

>
> And don't underestimate the role of the post-processor.
> It still comes down to one person who has to do a lot of work
> on the text, and often make descisions about how to deal with
> many various things, before it is ready for submitting to
> PG.
>
>
> --Andrew
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100214/95c5acd2/attachment.html>

From ke at gnu.franken.de  Sun Feb 14 00:42:23 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Sun, 14 Feb 2010 09:42:23 +0100
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca> (Andrew
	Sly's message of "Sat, 13 Feb 2010 23:20:25 -0800 (PST)")
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
Message-ID: <m21vgob440.fsf@gnu.franken.de>

Andrew Sly <sly at victoria.tc.ca> writes:

> And don't underestimate the role of the post-processor.
> It still comes down to one person who has to do a lot of work
> on the text, and often make descisions about how to deal with
> many various things, before it is ready for submitting to
> PG.

I think we can change this.  It would be much better to do this
mysterious PP'ing in a collaborative manner.  To experience this, I
created an SVN repository and started with TEI tagging.  I'll add more
of the PGTEI framework soon:

http://code.google.com/p/tieck-texts/

ATM, there is just one book and one contributor.  More to come--thus far
I did not announce it widely.  pgdp seems to be down right now...

-- 
Karl Eichwalder

From traverso at posso.dm.unipi.it  Sun Feb 14 01:02:52 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Sun, 14 Feb 2010 10:02:52 +0100 (CET)
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was
	rfrank	reports in)
In-Reply-To: <6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com>
	(message from David Starner on Sat, 13 Feb 2010 23:47:08 -0500)
References: <8005.73d837a3.38a06575@aol.com>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
	<m2hbplbldh.fsf@gnu.franken.de>
	<6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com>
Message-ID: <20100214090252.504D6FFB4@cardano.dm.unipi.it>

>>>>> "David" == David Starner <prosfilaes at gmail.com> writes:

    David> On Sat, Feb 13, 2010 at 3:17 AM, Karl Eichwalder
    David> <ke at gnu.franken.de> wrote:
    >> Robert Cicconetti <grythumn at gmail.com> writes:
    >> 
    >>> Not really. The OED uses a predecessor of IPA with some
    >>> oddball symbols.. at the least I have to come up with a table
    >>> for those or they'll be all over the place. I started one,
    >>> need to finish it.
    >>  You could consider processing it at dp-canada or dp-int--both
    >> are UTF-8 enabled.

    David> I have two problems with that. One, I'm not sure all the
    David> symbols are in Unicode. 

This could be managed with replacements of the few (are they few?)
missing characters.

           Two, just making Unicode available
    David> doesn't overcome the problems that these characters are not
    David> on any physical keyboards and only the most esoteric
    David> software keyboards. 

This could be managed with a character picker, like the greek and 
hieroglyph popups in the proofing interface. Or some of the tools in
some of Don's project comments.

                     Even with Unicode available, if it were
    David> pure IPA, I'd go with SAMPA.

SAMPA might be OK for publcation, and probably for entering too, but
for checking (rounds after the first) it requires to know the
correspondence OED/SAMPA. Impossible, except for experts.

One might however easily build converters from SAMPA and IPA to OED
using the conversion software that is running at DP-EU (convert button
in the standard interface). Undocumented, but I know it, and I have
both the software and part at least of the conversion tables, and can
build in minutes any further table needed. Probably it is something
that might be experimented at DP-EU: apparently Nikola is maintaining
the converter, and adding a table to it is straightforward (it is an
ASCII table). If you want, I can start there a project with a few
pages.

There is however another worse problem: I am not sure that the OED is
free from copyright in Canada or Serbia. 

Carlo

From Bowerbird at aol.com  Sun Feb 14 09:17:05 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 14 Feb 2010 12:17:05 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <1c256.7508109.38a98a11@aol.com>

karl said:
>    I think we can change this.? It would be much better 
>    to do this mysterious PP'ing in a collaborative manner.? 
>    To experience this, I created an SVN repository and 
>    started with TEI tagging.? I'll add more of 
>    the PGTEI framework soon:

that's right.   to simplify the job of postprocessing,
throw in a dose of s.v.n. and then add some t.e.i.,
and all the complexities will waft away on a breeze.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100214/75b6aa8a/attachment.html>

From donovan at abs.net  Sun Feb 14 09:27:28 2010
From: donovan at abs.net (D Garcia)
Date: Sun, 14 Feb 2010 12:27:28 -0500
Subject: [gutvol-d] DP Outage [WAS: Re: ... solo projects ...]
In-Reply-To: <m21vgob440.fsf@gnu.franken.de>
References: <4B76F4C6.3030006@teksavvy.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<m21vgob440.fsf@gnu.franken.de>
Message-ID: <201002141227.28318.donovan@abs.net>

Karl Eichwalder <ke at gnu.franken.de> wrote:

> I did not announce it widely.  pgdp seems to be down right now...

The server is up, the network is down.

Unfortunately, our colocation provider is one of many in the NJ/NYC region 
that has been affected by fiber cuts related to the underground transformer 
explosion in NYC. Both upstream providers are working at this time to put 
temporary solutions in place to restore connectivity to these facilities until 
permanent repairs can be made. We did just obtain an ETA of "a couple more 
hours" from them via our coloc contact, but that would appear at the moment to 
be a somewhat optimistic educated guess. Hopefully service will be restored by 
this evening (Sunday US EST).

David (donovan)

From dakretz at gmail.com  Sun Feb 14 09:27:28 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 14 Feb 2010 09:27:28 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <1c256.7508109.38a98a11@aol.com>
References: <1c256.7508109.38a98a11@aol.com>
Message-ID: <627d59b81002140927i43ab562bi4ce26ac3de6dcbe6@mail.gmail.com>

Yes, it's down again.

I'm not sure I see how this would fix anything. You still have the
PPers at the crossroads with the (somewhat doubtful) requirement
they will accept the opportunity to become intimately familiar with
XML.

At DP, the cost of doing something is measured in volunteer
inconvenience.As a consequence, change is not embraced with
much enthusiasm, nor is measurement nor personal responsibility,
and responsibility tends to be provided by software coercion.


On Sun, Feb 14, 2010 at 9:17 AM, <Bowerbird at aol.com> wrote:

> karl said:
> >   I think we can change this.  It would be much better
> >   to do this mysterious PP'ing in a collaborative manner.
> >   To experience this, I created an SVN repository and
> >   started with TEI tagging.  I'll add more of
> >   the PGTEI framework soon:
>
> that's right.  to simplify the job of postprocessing,
> throw in a dose of s.v.n. and then add some t.e.i.,
> and all the complexities will waft away on a breeze.
>
> -bowerbird
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100214/41b25458/attachment-0001.html>

From Bowerbird at aol.com  Sun Feb 14 09:46:35 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 14 Feb 2010 12:46:35 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <1cd29.10b0e871.38a990fb@aol.com>

postprocessing at distributed proofreaders is difficult
only because the proofers are instructed to throw out
meaningful data, which later then needs to be replaced,
and the formatters insert obtrusive pseudo-markup,
much of which later needs to be reworked or deleted.

if the proofers used nonobtrusive zen markup instead,
it wouldn't interfere with their proofing task, and there
wouldn't need to be a separate formatting task, even if
(in reality) people decided to specialize on that aspect.

also, the conversion of proofed/formatted pages into a
full-on electronic-book should be an automatic process.
i've already demonstrated this many times, but would be
happy to do it once again, on any book of your choice...

this point is particularly important in a roundless system,
where the object is to move a page to a "finished" status
as quickly as possible.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100214/b184d776/attachment.html>

From marcello at perathoner.de  Sun Feb 14 09:58:22 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun, 14 Feb 2010 18:58:22 +0100
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <627d59b81002140927i43ab562bi4ce26ac3de6dcbe6@mail.gmail.com>
References: <1c256.7508109.38a98a11@aol.com>
	<627d59b81002140927i43ab562bi4ce26ac3de6dcbe6@mail.gmail.com>
Message-ID: <4B7839BE.3080206@perathoner.de>

don kretz wrote:

> At DP, the cost of doing something is measured in volunteer
> inconvenience.As a consequence, change is not embraced with 
> much enthusiasm, nor is measurement nor personal responsibility,
> and responsibility tends to be provided by software coercion.

That is very short-sighted. The inconvenience for the volunteer should 
be balanced against the usefulness for the reader. The mindset at PG and 
DP is that everybody being volunteers they don't have to account for the 
quality of their work.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From walter.van.holst at xs4all.nl  Sun Feb 14 12:25:36 2010
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Sun, 14 Feb 2010 21:25:36 +0100
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
References: <4B76F4C6.3030006@teksavvy.com>	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
Message-ID: <4B785C40.5000304@xs4all.nl>

On 2/14/10 8:20 AM, Andrew Sly wrote:

> And don't underestimate the role of the post-processor.
> It still comes down to one person who has to do a lot of work
> on the text, and often make descisions about how to deal with
> many various things, before it is ready for submitting to
> PG.

What is it they actually do?

Regards,

  Walter

From dakretz at gmail.com  Sun Feb 14 12:47:02 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 14 Feb 2010 12:47:02 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <4B785C40.5000304@xs4all.nl>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
Message-ID: <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>

On Sun, Feb 14, 2010 at 12:25 PM, Walter van Holst <
walter.van.holst at xs4all.nl> wrote:

> On 2/14/10 8:20 AM, Andrew Sly wrote:
>
>  And don't underestimate the role of the post-processor.
>> It still comes down to one person who has to do a lot of work
>> on the text, and often make descisions about how to deal with
>> many various things, before it is ready for submitting to
>> PG.
>>
>
> What is it they actually do?
>
> Regards,
>
>  Walter
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>


That's a simple question with a complicated answer.


Here is an explanation <http://www.pgdp.net/c/faq/post_proof.php>that is
apparently as concise as anyone has been able to come up with.


As you can see, it's some mixture of:

a.) validating all the work done by about 6 Rounds of work on each page in
the project;
b.) running a bunch of other semi-manual checks on the project;
c.) filling the gap caused by the fact that the text markup and layout
produced by the Rounds isn't the same as the text format and layout required
by PG;
d.) producing a complete HTML version of the project based on the format and
markup that was originally considered appropriate for the text-only version
that was all  that PG offered at the time it was designed.

So you can see that it's by far the majority of the individual tasks
required to produce an e-book (text and html), only a small few of which
have been distributed.

In some cases the PPer also reproofs the entire project.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100214/e218c3c4/attachment.html>

From grythumn at gmail.com  Sun Feb 14 21:49:05 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Mon, 15 Feb 2010 00:49:05 -0500
Subject: [gutvol-d] Re: Using SVN or git/bazar (Re: Re: DP: was rfrank
	reports in)
In-Reply-To: <20100214090252.504D6FFB4@cardano.dm.unipi.it>
References: <8005.73d837a3.38a06575@aol.com>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<6d99d1fd1002121122t2f74ad81vff1c0e500ceafb37@mail.gmail.com>
	<15cfa2a51002121145g15f09b24x4c242b8b59bed4b9@mail.gmail.com>
	<m2mxzdc1m7.fsf_-_@gnu.franken.de>
	<15cfa2a51002121916y1e34c1ecqabe2160723872854@mail.gmail.com>
	<m2hbplbldh.fsf@gnu.franken.de>
	<6d99d1fd1002132047q3547ec29x8a82ba42efba6dd9@mail.gmail.com>
	<20100214090252.504D6FFB4@cardano.dm.unipi.it>
Message-ID: <15cfa2a51002142149q79761674m4796b4870893aa27@mail.gmail.com>

On Sun, Feb 14, 2010 at 4:02 AM, Carlo Traverso
<traverso at posso.dm.unipi.it> wrote:
> ? ?David> I have two problems with that. One, I'm not sure all the
> ? ?David> symbols are in Unicode.
>
> This could be managed with replacements of the few (are they few?)
> missing characters.

The OED phonetic alphabet, and an incomplete match to various unicode symbols:

http://home.comcast.net/~grythumn/oed/

> There is however another worse problem: I am not sure that the OED is
> free from copyright in Canada or Serbia.

Better hope there is some sort of corporate work exception... there
were several editors, dozens of subeditors, and hundreds of volunteer
readers. Not all of whom appear on the title page, but many are
listed.

-Bob

From schultzk at uni-trier.de  Mon Feb 15 01:00:32 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 15 Feb 2010 10:00:32 +0100
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
Message-ID: <A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>

Hi All,

	Let me see if I understand this right.

	6 Rounds of work is done just to be worked over so that text
	and HTML versions can be created and the final result
	is published.

	Why, in goods name is it not done the other way around !!
	Get a clean text and HTML version and then and all the
	googly goop after words. Sure would save alot of time.

	I know DP knows above markup, but have they ever
	heard about pseudo-code/markup.

	regards
		Keith.

Am 14.02.2010 um 21:47 schrieb don kretz:

> On Sun, Feb 14, 2010 at 12:25 PM, Walter van Holst <walter.van.holst at xs4all.nl> wrote:
> On 2/14/10 8:20 AM, Andrew Sly wrote:
> 
> And don't underestimate the role of the post-processor.
> It still comes down to one person who has to do a lot of work
> on the text, and often make descisions about how to deal with
> many various things, before it is ready for submitting to
> PG.
> 
> What is it they actually do?
> 
> 
> That's a simple question with a complicated answer.
> 
> 
> Here is an explanation that is apparently as concise as anyone has been able to come up with.
> 
> 
> As you can see, it's some mixture of:
> 
> a.) validating all the work done by about 6 Rounds of work on each page in the project;
> b.) running a bunch of other semi-manual checks on the project;
> c.) filling the gap caused by the fact that the text markup and layout produced by the Rounds isn't the same as the text format and layout required by PG;
> d.) producing a complete HTML version of the project based on the format and markup that was originally considered appropriate for the text-only version that was all  that PG offered at the time it was designed.
> 
> So you can see that it's by far the majority of the individual tasks required to produce an e-book (text and html), only a small few of which have been distributed.
> 
> In some cases the PPer also reproofs the entire project.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/87e7faf1/attachment.html>

From dakretz at gmail.com  Mon Feb 15 09:43:01 2010
From: dakretz at gmail.com (don kretz)
Date: Mon, 15 Feb 2010 09:43:01 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
Message-ID: <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>

On Mon, Feb 15, 2010 at 1:00 AM, Keith J. Schultz <schultzk at uni-trier.de>wrote:

> Hi All,
>
> Let me see if I understand this right.
>
> 6 Rounds of work is done just to be worked over so that text
> and HTML versions can be created and the final result
> is published.
>
> Why, in goods name is it not done the other way around !!
> Get a clean text and HTML version and then and all the
> googly goop after words. Sure would save alot of time.
>
> I know DP knows above markup, but have they ever
> heard about pseudo-code/markup.
>
> regards
> Keith.
>
> Am 14.02.2010 um 21:47 schrieb don kretz:
>
>
You'd think it would be obvious, wouldn't you?

When DP started, here was the basic process as far as the
participants were concerned.

1.) A person takes a page of text and a picture of the text,
plus a mediocre online text editor and some guidelines for
follow, and tries to get the text to match the picture.

2.) A second person takes their work and the same picture
and guidelines, and tries to make it better.

3.) The system strings the text files together and hands
them off to PG to publish.

Clean, simple, and most importantly it provides each person
with the immediate and obvous positive gratification of
seeing their work self-evidently closing the gap between
the text and the picture.

Now, almost all the process has been so completely
decomposed and constrained that almost all the oppportunity
for gratification shows up for a little bit to the first proofer (who
still must not do *too much* to make it look like the picture,
i.e. format it); maybe the first formatter (if there's even much
left to do), and supremely and finally, gloriously, the Post
Processor (whose name is associated semi-eternally posted
with their work.)

There's a whole lot more that can be said (is is said, in the
DP forums, loudly, into the vastness of space), about how
it got to be this way, and how happy people are about it,
and what might be done. These are not dumb people, even
though the work seems to have become dumb work.
But there's the picture in a nutshell.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/1106c43d/attachment-0001.html>

From Bowerbird at aol.com  Mon Feb 15 10:13:17 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 13:13:17 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <c709.452845e6.38aae8bd@aol.com>

why do you guys insist on hijacking threads?   it's rude.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/6db7ccd9/attachment.html>

From Bowerbird at aol.com  Mon Feb 15 10:32:51 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 13:32:51 EST
Subject: [gutvol-d] Re: When DP started, here was the basic process
Message-ID: <d159.782f2713.38aaed53@aol.com>

see how easy it is to change the subject-header?

***

don said:
>    When DP started, here was the basic process

the irony was, back in those olden days, it was
actually much more difficult to digitize a text,
because the o.c.r. was horrific, and thus it was
a pure pain to proof.

nowadays, even though o.c.r. is vastly improved,
it seems to take forever for a book to transit d.p.

here's an illustrative datapoint i just churned...

in one of the books that rfrank is using for his
roundless experiment, even tepid preprocessing
(which is what he practices) combined with o.c.r.
to produce 20% of the book's 240 pages perfectly.

another 30% of the pages had only 1 error on 'em.

and most of the errors failed spellcheck, meaning
they could've been isolated and fixed immediately,
without need of a word-by-word proofing modality.

d.p. uses dozens of volunteers, taking hours of time,
to do something that one person can do in one hour.

which would, you know, ordinarily be a very sad thing.
except what makes it funny, in this particular case, is
that the people at d.p. think they're being "efficient"...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/f9385eaf/attachment.html>

From sly at victoria.tc.ca  Mon Feb 15 11:11:03 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Mon, 15 Feb 2010 11:11:03 -0800 (PST)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
Message-ID: <Pine.GSO.4.58.1002151035180.27234@vtn1.victoria.tc.ca>


On Mon, 15 Feb 2010, don kretz wrote:

> When DP started, here was the basic process as far as the
> participants were concerned.
>
> 1.) A person takes a page of text and a picture of the text,
> plus a mediocre online text editor and some guidelines for
> follow, and tries to get the text to match the picture.
>
> 2.) A second person takes their work and the same picture
> and guidelines, and tries to make it better.
>
> 3.) The system strings the text files together and hands
> them off to PG to publish.


Are you sure you have phrased that in the way you wanted?

At no point in the history of DP was the output of the rounds
"strung together and handed directly off to PG". I cannot
recall if the name of "post-processor" has always been
used--but there has always been someone in that role.

Anyone who has worked on PP would know that the output
on the rounds at DP is _not_ ready to be posted as a
finished text without a good deal more work.

But this is ok--this is as intended. The purpose of DP
(as I understand it) has always been to distribute much
of the work, and make things easier for the person preparing
the text for submission to PG.

To put this in context, let's compare with pre-DP times,
when everything was done on an individual basis.
An easy text that has come through DP can be prepared and
submitted in one day; a more difficult one can take a
week or two; a really hard one might take months
working on it on and off.

Now take those same texts without the DP preparation, where
an individual starts working himself from the ocr output.
The easy text could take perhaps three to six weeks; the
more difficult one five to eight months or longer; and the
hardest texts that have been done through DP could never
have been attempted by an individual.

One other very significant aspect is that DP has been
set up to encourage a sense of community. And you have
ready access to people with specialized knowledge about
many languages, musical notation, obscure unicode characters,
obselete typesetting conventions, etc.

In the time before DP it was quite common for somone to
put much effort into working on a text, and the burn out
and abandon the project. Having DP gives many people a
chance to do their bit, and have a much more manageable
learning curve.

--Andrew


From traverso at posso.dm.unipi.it  Mon Feb 15 11:27:29 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Mon, 15 Feb 2010 20:27:29 +0100 (CET)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <Pine.GSO.4.58.1002151035180.27234@vtn1.victoria.tc.ca> (message
	from Andrew Sly on Mon, 15 Feb 2010 11:11:03 -0800 (PST))
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<Pine.GSO.4.58.1002151035180.27234@vtn1.victoria.tc.ca>
Message-ID: <20100215192729.AFB1DFFB5@cardano.dm.unipi.it>

>>>>> "Andrew" == Andrew Sly <sly at victoria.tc.ca> writes:


    Andrew> Are you sure you have phrased that in the way you wanted?

    Andrew> At no point in the history of DP was the output of the
    Andrew> rounds "strung together and handed directly off to PG". I
    Andrew> cannot recall if the name of "post-processor" has always
    Andrew> been used--but there has always been someone in that role.

When I started at DP, in 2002, the work needed to pass from the R2
output to posting to PG was officially estimated in 30 minutes,
without any specialized tool. I think that "strung together and handed
directly off to PG" is a correct metaphor for 30 minutes of
work. Enough to remove the separators, reflow the line ends, and that
was all. No formatting (italics converted to uppercase for ship
names), accents removed, no spell-checking, no gutcheck. This was a
task of the project manager, and handing the task to somebody else was
exceptional. 

Of course, even then, it took to me much longer to complete a book,
since I used to re-read the book to catch a bunch of remaining errors.

Carlo

From dakretz at gmail.com  Mon Feb 15 12:20:30 2010
From: dakretz at gmail.com (don kretz)
Date: Mon, 15 Feb 2010 12:20:30 -0800
Subject: [gutvol-d] Re: When DP started, here was the basic process
In-Reply-To: <d159.782f2713.38aaed53@aol.com>
References: <d159.782f2713.38aaed53@aol.com>
Message-ID: <627d59b81002151220j6816e805m5806a2424011125@mail.gmail.com>

Very good. So it should now be obvious to you why
the thread was not hijacked, it was providing valuable
background for roger's thinly-disguised "experiment".
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/271281d3/attachment.html>

From klofstrom at gmail.com  Mon Feb 15 12:47:16 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Mon, 15 Feb 2010 10:47:16 -1000
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
Message-ID: <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>

On Mon, Feb 15, 2010 at 7:43 AM, don kretz <dakretz at gmail.com> wrote:

Re the two-round system:

> Clean, simple, and most importantly it provides each person
> with the immediate and obvous positive gratification of
> seeing their work self-evidently closing the gap between
> the text and the picture.

Yes, and it often produced godawful results. If the R2 proofrer was
sloppy, a sloppy text went to the PPer. Some PPers exhausted
themselves reproofing the text to fix the mistakes that R2 had left.
Others just processed the text and sent it off to PG, warts and all.

One R2 proofer had proofed an astonishing number of pages ... but he
did so by smoothreading them hurriedly, without checking against the
image. He missed many errors.

PPers complained. Readers of PG texts complained. The current workflow
at DP is a *reaction* to the previous lack of quality control. That's
why P3ers have to pass a test. That's why proofing and formatting were
separated. OK, our quality control is strangling us. I don't think the
answer is to go back to the good old days of two rounds and
error-ridden texts.

--
Karen Lofstrom

From dakretz at gmail.com  Mon Feb 15 13:15:26 2010
From: dakretz at gmail.com (don kretz)
Date: Mon, 15 Feb 2010 13:15:26 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
Message-ID: <627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com>

Nor is anyone suggesting going back. I was describing the progression
and how it has affected the relationship between the users and the work.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/b520a4f1/attachment.html>

From dakretz at gmail.com  Mon Feb 15 13:26:09 2010
From: dakretz at gmail.com (don kretz)
Date: Mon, 15 Feb 2010 13:26:09 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com>
Message-ID: <627d59b81002151326x54098df5x7abd558241729844@mail.gmail.com>

There have at each step been a number of alternatives for dealing with
quality issues. We (or someone, it was hardly "we") made choices which
had consequences. One of the consequences was improved quality.
Another was a change in the user's work experience (always a greater
constraint, notice, seldom if ever improved user tools.) We are where
we are. We can I suppose say it was done the best way possible, and
what we have is the inevitable cost of the improvements." I think that's
a difficult position to defend. Which is exactly what roger is,
intentionally
or not, making quite clear. We can't recast the decisions made in the
past, but we need to do a better job of learning from them and dong
better. Sooner would be nicer than later. Hence rfrank's project.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/496d2bb5/attachment-0001.html>

From ajhaines at shaw.ca  Mon Feb 15 14:03:16 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Mon, 15 Feb 2010 14:03:16 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
Message-ID: <9F120957CF48439F9C63FD74DE1B25F7@alp2400>

As a Whitewasher who's dealt with old DP productions as well as new ones, 
over the last couple of years, I second (and third and fourth) everything 
Karen says.

Others may hold DP's current system to be inefficient/slow/etc,, but it does 
one thing that makes it all worth while--it can produce error-free texts.

Example: I'm currently dealing with an errata report for an old DP 
production.  I haven't looked into the problem in detail yet, but from what 
I've seen, at least several pages are missing, followed by a repeat of 
material that precedes the missing material.  I'm going to have to go 
through the problem area of the posted text, compare it to a scanset, figure 
out which material is missing/redundant, OCR and proof whatever's missing, 
knit it into the text, then run Gutcheck/Jeebies/Gutspell on the repaired 
text, which will undoubtedly unearth a raft of other errors, all followed by 
a reformat and a repost.  Also undoubtedly, many other errors will remain.

Is it worth it?  Personally speaking, no.  It's going to take hours to fix 
this text, time that I'd far rather spend on my own productions, but there's 
currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to 
fix this kind of thing.  (Probably simpler to just re-do this text from 
scratch, which is something *I'm* not about to do.)

In short, DP's current processes produce error-free texts; its old 
processes, from what I've seen of the results, didn't.

Al


----- Original Message ----- 
From: "Karen Lofstrom" <klofstrom at gmail.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
Sent: Monday, February 15, 2010 12:47 PM
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?


> On Mon, Feb 15, 2010 at 7:43 AM, don kretz <dakretz at gmail.com> wrote:
>
> Re the two-round system:
>
>> Clean, simple, and most importantly it provides each person
>> with the immediate and obvous positive gratification of
>> seeing their work self-evidently closing the gap between
>> the text and the picture.
>
> Yes, and it often produced godawful results. If the R2 proofrer was
> sloppy, a sloppy text went to the PPer. Some PPers exhausted
> themselves reproofing the text to fix the mistakes that R2 had left.
> Others just processed the text and sent it off to PG, warts and all.
>
> One R2 proofer had proofed an astonishing number of pages ... but he
> did so by smoothreading them hurriedly, without checking against the
> image. He missed many errors.
>
> PPers complained. Readers of PG texts complained. The current workflow
> at DP is a *reaction* to the previous lack of quality control. That's
> why P3ers have to pass a test. That's why proofing and formatting were
> separated. OK, our quality control is strangling us. I don't think the
> answer is to go back to the good old days of two rounds and
> error-ridden texts.
>
> --
> Karen Lofstrom
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From dakretz at gmail.com  Mon Feb 15 14:12:20 2010
From: dakretz at gmail.com (don kretz)
Date: Mon, 15 Feb 2010 14:12:20 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <9F120957CF48439F9C63FD74DE1B25F7@alp2400>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>
Message-ID: <627d59b81002151412j40dba25bm9718d1c20670c9a@mail.gmail.com>

I can't think of anyone I know who would argue otherwise. That's not an
issue that's open for discussion, I don't think.

On Mon, Feb 15, 2010 at 2:03 PM, Al Haines (shaw) <ajhaines at shaw.ca> wrote:

> As a Whitewasher who's dealt with old DP productions as well as new ones,
> over the last couple of years, I second (and third and fourth) everything
> Karen says.
>
> Others may hold DP's current system to be inefficient/slow/etc,, but it
> does one thing that makes it all worth while--it can produce error-free
> texts.
>
> Example: I'm currently dealing with an errata report for an old DP
> production.  I haven't looked into the problem in detail yet, but from what
> I've seen, at least several pages are missing, followed by a repeat of
> material that precedes the missing material.  I'm going to have to go
> through the problem area of the posted text, compare it to a scanset, figure
> out which material is missing/redundant, OCR and proof whatever's missing,
> knit it into the text, then run Gutcheck/Jeebies/Gutspell on the repaired
> text, which will undoubtedly unearth a raft of other errors, all followed by
> a reformat and a repost.  Also undoubtedly, many other errors will remain.
>
> Is it worth it?  Personally speaking, no.  It's going to take hours to fix
> this text, time that I'd far rather spend on my own productions, but there's
> currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to
> fix this kind of thing.  (Probably simpler to just re-do this text from
> scratch, which is something *I'm* not about to do.)
>
> In short, DP's current processes produce error-free texts; its old
> processes, from what I've seen of the results, didn't.
>
> Al
>
>
> ----- Original Message ----- From: "Karen Lofstrom" <klofstrom at gmail.com>
> To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
> Sent: Monday, February 15, 2010 12:47 PM
> Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
>
>
>
>  On Mon, Feb 15, 2010 at 7:43 AM, don kretz <dakretz at gmail.com> wrote:
>>
>> Re the two-round system:
>>
>>  Clean, simple, and most importantly it provides each person
>>> with the immediate and obvous positive gratification of
>>> seeing their work self-evidently closing the gap between
>>> the text and the picture.
>>>
>>
>> Yes, and it often produced godawful results. If the R2 proofrer was
>> sloppy, a sloppy text went to the PPer. Some PPers exhausted
>> themselves reproofing the text to fix the mistakes that R2 had left.
>> Others just processed the text and sent it off to PG, warts and all.
>>
>> One R2 proofer had proofed an astonishing number of pages ... but he
>> did so by smoothreading them hurriedly, without checking against the
>> image. He missed many errors.
>>
>> PPers complained. Readers of PG texts complained. The current workflow
>> at DP is a *reaction* to the previous lack of quality control. That's
>> why P3ers have to pass a test. That's why proofing and formatting were
>> separated. OK, our quality control is strangling us. I don't think the
>> answer is to go back to the good old days of two rounds and
>> error-ridden texts.
>>
>> --
>> Karen Lofstrom
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
>>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/5143e906/attachment.html>

From Bowerbird at aol.com  Mon Feb 15 15:51:53 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 18:51:53 EST
Subject: [gutvol-d] Re: When DP started, here was the basic process
Message-ID: <17845.277aae89.38ab3819@aol.com>

don said:
>    So it should now be obvious to you 
>    why the thread was not hijacked

um, no.

gardner's thread was most certainly hijacked.

he was looking for other solo producers, and
someone who might be interested in that topic
now has to plow through a bunch of posts that 
talk about something completely different...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/ca60503c/attachment.html>

From Bowerbird at aol.com  Mon Feb 15 16:01:30 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 19:01:30 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <17cc8.47818a6d.38ab3a5a@aol.com>

andrew said:
>   An easy text that has come through DP 
>    can be prepared and submitted in one day; 
>    a more difficult one can take a week or two; 
>    a really hard one might take months
>    working on it on and off.
>
>    Now take those same texts without the DP preparation, where 
>    an individual starts working himself from the ocr output.
>    The easy text could take perhaps three to six weeks; 
>    the more difficult one five to eight months or longer; 
>    and the hardest texts that have been done through DP 
>    could never have been attempted by an individual.

you just made up all of those numbers.

take an easy text, the kind that can be "prepared and
submitted in one day" after having gone through d.p.,
but which -- according to you -- "could take perhaps
three to six weeks" were it to be done by a solo person.

your figures are just ridiculous...

it takes an hour, perhaps two or three, to spellcheck
a typical easy book and get it formatted into shape...

for a more difficult book, the spellchecking time is
dwarfed by the formatting task, which is not really
significantly lessened by having gone through d.p.

and no text is so difficult that it "could never have been
attempted by an individual", so that's just balderdash...
there might not be any individuals who _are_ motivated
to take on big projects, but given the rate at which these
big projects get finished at d.p., the gap isn't all that big.


>    One other very significant aspect is that DP 
>    has been set up to encourage a sense of community.

i'm not sure charlz "set up" d.p. for that specific purpose.
it's true that a sense of community _has_ developed there.
but that can happen just about anywhere.   it's also the case
that the d.p. community indulges itself often in groupthink,
which is one down side of "community".   i'm not arguing that
the down side offsets the good, because i don't think it does,
but if we are going to mention one side, let's mention both...


>    And you have ready access to people with 
>    specialized knowledge about many languages, 
>    musical notation, obscure unicode characters,
>    obselete typesetting conventions, etc.

that's true.   but it's also the case that that "ready access"
_could_ have developed right here, on this listserve, and
been available to everyone, including the "solo" producers.
so having it exist only within the d.p. silo is a bit regrettable.


>    At no point in the history of DP was

well, carlo has already pointed out that andrew's memory is
a bit foggy on this particular point.   and that happens with
individuals as we grow older, so there's no shame in that...

but there's a tendency among d.p. people to rewrite history,
almost always in a way that's favorable to their interpretation,
so it's always refreshing when that tendency gets a fact-check.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/2122f8dd/attachment.html>

From Bowerbird at aol.com  Mon Feb 15 16:31:33 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 15 Feb 2010 19:31:33 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <18a2b.33166dc4.38ab4165@aol.com>

al said:
>    In short, DP's current processes produce error-free texts; 
>    its old processes, from what I've seen of the results, didn't.

oh my, this is just too rich.

a full-on admission that an "old" d.p. e-text was full of bugs.

too rich.

because d.p. cheerleaders -- like karen "zora" lofstrom --
have _always_ maintained that d.p. output was super-clean;
it's the stuff from the _individual_ producers that is shoddy,
_not_ the material from d.p.   you can see this same attitude 
expressed _to_this_very_day_ over on the d.p. forum boards.

of course, it was _easy_ to prove them wrong in the old days;
all you had to do was make a laundry-list of errors in a text.

(i provided such a list of errors to this very listserve for a book
that was postprocessed by zora herself -- #13603 -- and the
_hundreds_ of errors i located have _still_ not been repaired,
even though the book was posted way back in october of 2004.
so much for zora's stance of superiority.   her work is flawed.)

anyway, after enough laundry-lists of errors had been made,
d.p. people finally had to admit their quality-control was faulty.

sadly, they didn't know how to fix their system, so they just
piled on more rounds, and built a flawed "certification system"
to promote some proofers to "final-round" status, which only
had the effect of stagnating their workflow with huge queues,
as a boatload of books (thousands!) plugged up the system...

and they've clung to this hierarchical model in the face of
clear evidence (from their own experiments!) that _proved_
that the p3 proofers aren't any better than the p1 proofers...

and even though it's perfectly clear that you can get good pages
without subjecting every page in every book to 3 proofing rounds
and 2 formatting rounds, following by postprocessing, and then
maybe smoothreading, and then maybe postprocessing verification,
nonetheless that's what their workflow system calls for them to do.

so they experiment with ways to circumvent that workflow system,
instead of just fixing it.   it is a comedy of errors, in slow-motion...

but hey, as long as you get "error-free texts", then who cares if
you're wasting tons of time and energy donated by the volunteers?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100215/b101e509/attachment-0001.html>

From sly at victoria.tc.ca  Mon Feb 15 22:11:28 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Mon, 15 Feb 2010 22:11:28 -0800 (PST)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <20100215192729.AFB1DFFB5@cardano.dm.unipi.it>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<Pine.GSO.4.58.1002151035180.27234@vtn1.victoria.tc.ca>
	<20100215192729.AFB1DFFB5@cardano.dm.unipi.it>
Message-ID: <Pine.GSO.4.58.1002152146210.29501@vtn1.victoria.tc.ca>


On Mon, 15 Feb 2010, Carlo Traverso wrote:

> >>>>> "Andrew" == Andrew Sly <sly at victoria.tc.ca> writes:
>
>     Andrew> At no point in the history of DP was the output of the
>     Andrew> rounds "strung together and handed directly off to PG". I
>     Andrew> cannot recall if the name of "post-processor" has always
>     Andrew> been used--but there has always been someone in that role.
>
> When I started at DP, in 2002, the work needed to pass from the R2
> output to posting to PG was officially estimated in 30 minutes,
> without any specialized tool. I think that "strung together and handed
> directly off to PG" is a correct metaphor for 30 minutes of
> work. Enough to remove the separators, reflow the line ends, and that
> was all. No formatting (italics converted to uppercase for ship
> names), accents removed, no spell-checking, no gutcheck. This was a
> task of the project manager, and handing the task to somebody else was
> exceptional.

Thanks Carlo. Perhaps my memory has become hazy in the intervening years.
:)

But still I questions your list. Why accents removed? It was fairly
routine to post latin-1 texts at that time. (I can find an "8-bit"
text as #1595, with a release date if Jan, 1999.)

The earliest reference to gutcheck that I can find in my old emails
is on Tue, 23 Jul 2002, but I don't think it was in common use yet.
It was actually something that Jim T. had written as an evaluation
tool for submitted texts.

> Of course, even then, it took to me much longer to complete a book,
> since I used to re-read the book to catch a bunch of remaining errors.

I did the same with the project I ran through DP at that time as well.
Perhaps that's why I assumed it was the norm.

--Andrew

From traverso at posso.dm.unipi.it  Tue Feb 16 00:44:00 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Tue, 16 Feb 2010 09:44:00 +0100 (CET)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <Pine.GSO.4.58.1002152146210.29501@vtn1.victoria.tc.ca> (message
	from Andrew Sly on Mon, 15 Feb 2010 22:11:28 -0800 (PST))
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<Pine.GSO.4.58.1002151035180.27234@vtn1.victoria.tc.ca>
	<20100215192729.AFB1DFFB5@cardano.dm.unipi.it>
	<Pine.GSO.4.58.1002152146210.29501@vtn1.victoria.tc.ca>
Message-ID: <20100216084400.933ACFFCA@cardano.dm.unipi.it>

>>>>> "Andrew" == Andrew Sly <sly at victoria.tc.ca> writes:

    Andrew> On Mon, 15 Feb 2010, Carlo Traverso wrote:

    Andrew> Thanks Carlo. Perhaps my memory has become hazy in the
    Andrew> intervening years.  :)

    Andrew> But still I questions your list. Why accents removed? It
    Andrew> was fairly routine to post latin-1 texts at that time. (I
    Andrew> can find an "8-bit" text as #1595, with a release date if
    Andrew> Jan, 1999.)

 These were the DP guidelines, (copied from PG official guidelines), I
 remember Ultima Thule, a book on iceland, with a discussion on what
 to do of the eths in names (that were eventually replaced with th)
 while the accents were routinely dropped. The book eventually was
 redone from scratch, it might have been the last one before DP
 changed officially to preserving accents.

Carlo

From walter.van.holst at xs4all.nl  Tue Feb 16 01:12:42 2010
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Tue, 16 Feb 2010 10:12:42 +0100
Subject: [gutvol-d] Re:
 =?utf-8?q?Many_solo_projects_out_there_in_gutvol-d_land=3F?=
In-Reply-To: <627d59b81002151326x54098df5x7abd558241729844@mail.gmail.com>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com>
	<627d59b81002151326x54098df5x7abd558241729844@mail.gmail.com>
Message-ID: <e6bb4ab439a13c6508cc61dc654fff7c@xs4all.nl>

On Mon, 15 Feb 2010 13:26:09 -0800, don kretz <dakretz at gmail.com> wrote:

> or not, making quite clear. We can't recast the decisions made in the
> past, but we need to do a better job of learning from them and dong
>  better. Sooner would be nicer than later. Hence rfrank's project.

In that vein, how flexible is the DP software? I've been wondering to what
extent parallel P1 rounds might be helpful. I find P2 proofing exceedingly
boring because of the small number of errors that are left to be fixed in
texts that are well-scanned and well-proofed in P1. I can't imagine how
mind-numbing P3 will be if I ever become eligible for that 'status'. I can
imagine that only having to look at the differences between redundant P1
proofed texts might be helpful since it would take two independent P1
proofers to overlook the same error to have it slip through.

Another potential improvement might be to make texts available to the next
round on a per page basis instead of having to wait for all pages to be
finished in the previous round.

Aforementioned suggestions may be silly, feel free to point out their
silliness.

Regards,

 Walter

From traverso at posso.dm.unipi.it  Tue Feb 16 02:30:30 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Tue, 16 Feb 2010 11:30:30 +0100 (CET)
Subject: [gutvol-d] Re:
 =?utf-8?q?Many_solo_projects_out_there_in_gutvol-d_land=3F?=
In-Reply-To: <e6bb4ab439a13c6508cc61dc654fff7c@xs4all.nl> (message from Walter
	van Holst on Tue, 16 Feb 2010 10:12:42 +0100)
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com>
	<627d59b81002151326x54098df5x7abd558241729844@mail.gmail.com>
	<e6bb4ab439a13c6508cc61dc654fff7c@xs4all.nl>
Message-ID: <20100216103030.B767FFFCA@cardano.dm.unipi.it>

>>>>> "Walter" == Walter van Holst <walter.van.holst at xs4all.nl> writes:

    Walter> On Mon, 15 Feb 2010 13:26:09 -0800, don kretz
    Walter> <dakretz at gmail.com> wrote:

    >> or not, making quite clear. We can't recast the decisions made
    >> in the past, but we need to do a better job of learning from
    >> them and dong better. Sooner would be nicer than later. Hence
    >> rfrank's project.

    Walter> In that vein, how flexible is the DP software? I've been
    Walter> wondering to what extent parallel P1 rounds might be
    Walter> helpful. I find P2 proofing exceedingly boring because of
    Walter> the small number of errors that are left to be fixed in
    Walter> texts that are well-scanned and well-proofed in P1. I
    Walter> can't imagine how mind-numbing P3 will be if I ever become
    Walter> eligible for that 'status'. I can imagine that only having
    Walter> to look at the differences between redundant P1 proofed
    Walter> texts might be helpful since it would take two independent
    Walter> P1 proofers to overlook the same error to have it slip
    Walter> through.

This would be simple enough, just allowing a PM to load a set of txt
files and a dummy proofer name in one of the projects columns. The
administrators (having DB access) do this if asked, I suppose with a
script (I have one in the test site). Another improvement would be to
allow a PM to skip a round; this too is reserved to the few, overloaded
administrators, but it is just changing a flag at one point in the code.


    Walter> Another potential improvement might be to make texts
    Walter> available to the next round on a per page basis instead of
    Walter> having to wait for all pages to be finished in the
    Walter> previous round.

This might be trickier, since the whole philosophy of DP code is based
on rounds and per-round permissions. It would require at least to start
a new test DP site in which new changes in the code are made and
extensively experimented in a live environment. The current test site
is used for testing features that are potentially disruptive, and is
inadequate for live testing: it is for alpha testing, a beta testing
site would be necessary, or probably more than one. 

rfrank's test site at fadepage has abandoned the round philosophy, but
is not derived from DP code, it is reimplemented from scratch.

    Walter> Aforementioned suggestions may be silly, feel free to
    Walter> point out their silliness.

Not silly at all; I believe that the main problem of DP is its
rigidity, the "one size fits all" philosophy, that is partly in the
code, but mostly in the procedures, and is necessary in a huge
structure. Smaller DP sites like DP-EU and DP-CAN have shown a more
flexible structure, so I believe that a confederation of different DP
sites, sharing a common aim and a common codebase, but different local
laws and software configurations, and a loose coordination, would be a
better model.

Carlo

From dakretz at gmail.com  Tue Feb 16 07:43:50 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 16 Feb 2010 07:43:50 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <20100216103030.B767FFFCA@cardano.dm.unipi.it>
References: <4B76F4C6.3030006@teksavvy.com> <4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<627d59b81002151315k552be78bw4019e0d65f64da1f@mail.gmail.com>
	<627d59b81002151326x54098df5x7abd558241729844@mail.gmail.com>
	<e6bb4ab439a13c6508cc61dc654fff7c@xs4all.nl>
	<20100216103030.B767FFFCA@cardano.dm.unipi.it>
Message-ID: <627d59b81002160743x6eb4788dld0875f9995a936fc@mail.gmail.com>

I'm biting my tongue, Carlo.

The difficulties aren't primarily with the code, which can be (and
on occasion has been) amended to overcome those types of
problems.

However, none of our volunteers has considered it appropriate
or within the scope of their skills or interests to, for instance,
document it; so it's pretty closely held within a small group.


On Tue, Feb 16, 2010 at 2:30 AM, Carlo Traverso
<traverso at posso.dm.unipi.it>wrote:

> >>>>> "Walter" == Walter van Holst <walter.van.holst at xs4all.nl> writes:
>
>    Walter> On Mon, 15 Feb 2010 13:26:09 -0800, don kretz
>     Walter> <dakretz at gmail.com> wrote:
>
>    >> or not, making quite clear. We can't recast the decisions made
>    >> in the past, but we need to do a better job of learning from
>    >> them and dong better. Sooner would be nicer than later. Hence
>    >> rfrank's project.
>
>     Walter> In that vein, how flexible is the DP software? I've been
>    Walter> wondering to what extent parallel P1 rounds might be
>    Walter> helpful. I find P2 proofing exceedingly boring because of
>    Walter> the small number of errors that are left to be fixed in
>    Walter> texts that are well-scanned and well-proofed in P1. I
>    Walter> can't imagine how mind-numbing P3 will be if I ever become
>    Walter> eligible for that 'status'. I can imagine that only having
>    Walter> to look at the differences between redundant P1 proofed
>    Walter> texts might be helpful since it would take two independent
>    Walter> P1 proofers to overlook the same error to have it slip
>    Walter> through.
>
> This would be simple enough, just allowing a PM to load a set of txt
> files and a dummy proofer name in one of the projects columns. The
> administrators (having DB access) do this if asked, I suppose with a
> script (I have one in the test site). Another improvement would be to
> allow a PM to skip a round; this too is reserved to the few, overloaded
> administrators, but it is just changing a flag at one point in the code.
>
>
>    Walter> Another potential improvement might be to make texts
>    Walter> available to the next round on a per page basis instead of
>    Walter> having to wait for all pages to be finished in the
>    Walter> previous round.
>
> This might be trickier, since the whole philosophy of DP code is based
> on rounds and per-round permissions. It would require at least to start
> a new test DP site in which new changes in the code are made and
> extensively experimented in a live environment. The current test site
> is used for testing features that are potentially disruptive, and is
> inadequate for live testing: it is for alpha testing, a beta testing
> site would be necessary, or probably more than one.
>
> rfrank's test site at fadepage has abandoned the round philosophy, but
> is not derived from DP code, it is reimplemented from scratch.
>
>    Walter> Aforementioned suggestions may be silly, feel free to
>    Walter> point out their silliness.
>
> Not silly at all; I believe that the main problem of DP is its
> rigidity, the "one size fits all" philosophy, that is partly in the
> code, but mostly in the procedures, and is necessary in a huge
> structure. Smaller DP sites like DP-EU and DP-CAN have shown a more
> flexible structure, so I believe that a confederation of different DP
> sites, sharing a common aim and a common codebase, but different local
> laws and software configurations, and a loose coordination, would be a
> better model.
>
> Carlo
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100216/eceb6af2/attachment.html>

From Bowerbird at aol.com  Tue Feb 16 13:47:33 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Feb 2010 16:47:33 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <15691.224faab8.38ac6c75@aol.com>

walter said:
>    In that vein, how flexible is the DP software? 

it depends.   in general it's not all that flexible...
but sometimes a little creativity goes a long way.

even more inflexible than the software, however,
is the willingness of its coders to change anything.

and even when there is someone like dakretz who
is willing to roll up his sleeves and do some work,
the administrators won't let him.   so there you go.


>    I've been wondering to what extent 
>    parallel P1 rounds might be helpful. 
...
>    Aforementioned suggestions may be silly, 
>    feel free to point out their silliness.

it's _good_ to "wonder" about things, walter.
it means your mind is working on a solution.
so that part isn't silly at all.

and the part about parallel p1 rounds is not
silly either.   to the contrary, it's a good idea;
might not _work_, but it's still a good _idea_.
so that's not silly at all.

what _is_ silly, however, is that -- in spite of
the fact that people have had this good idea
for a very long time now -- d.p. has _never_
actually _tested_ it directly to see if it works.

oh, they've run some research, and tried out
some things, but they've never actually done
a full-on _experiment_ to test the hypothesis.

so, for years and years a parallel-proof idea
has been around, but we're still "wondering"
whether it might work or not.   _that_ is silly...

for the record, once again, i've reassembled
some data from various d.p. "experiments"
(i'm using the term extremely loosely here)
and i've even written the software that helps
you reconcile two iterations of parallel proofs,
so i can give you some conclusions on all that,
namely that it doesn't give you better accuracy,
and thus it certainly doesn't outweigh the cost
of doing the reconciliation (which is rather high,
even given a good tool), so i don't recommend it.

however, a focused experiment on this matter
would be good, so as to validate my findings...

having said all that, though, there's a "variant" on
parallel proofing that you might find interesting...

taking o.c.r. results from 2 different sources and
comparing them to find their differences and then
resolving those differences and calling it "finished"
_does_ happen to be an extremely effective strategy,
since it avoids all the word-by-word proofing rounds.

i documented all this on a thread on the d.p. forums,
entitled "a revolutionary methodology for proofing",
or something to that effect...   you could look it up...


>    I find P2 proofing exceedingly boring 
>    because of the small number of errors 
>    that are left to be fixed in texts that are 
>    well-scanned and well-proofed in P1. 

well, there's a lot that could be said about this, walter.
perhaps first and foremost is that proofing _is_ boring.
especially a word-by-word proofing on an accurate text.


>    I can't imagine how mind-numbing P3 will be 
>    if I ever become eligible for that 'status'. 

since most of the o.c.r. errors are gone by the time of p3,
most p3 proofers have resorted to trying to find errors in
the book itself, errors that the publisher/typesetter made.
this lets them leave a comment, so they can do something.

for instance, in the book that i'm now examining which
rfrank used in his "roundless" experiment, there were 50
comments left in a 240-page book, or 20% of the pages.

of course, addressing all these comments is a task that is
done by the postprocessor, which is one of many reasons
why that job has become more taxing in the current era...


>    I can imagine that only having to look at the differences 
>    between redundant P1 proofed texts might be helpful 
>    since it would take two independent P1 proofers
>    to overlook the same error to have it slip through.

well, yes, and that's the main argument for parallel proofing.

but it ends up that yes, indeed, "two independent p1 proofers"
often _do_ "overlook the same error" and it then slips through.

and in the same manner, sometimes an independent p1 and p2
proofer "overlook the same error" and it then slips through to p3.

now it would be great if we had some solid _data_ on the numbers,
so we could decide how much energy we want to spend on catching
these errors that slip through.   we've found that _some_ errors can
go as many as 7 or 8 or 9 rounds without being caught, but no one
is suggesting we spend that many proofing rounds on every page...

so we have to decide how many rounds we will expend our energy,
in order to catch what percentage of errors.   it's really that simple.

and to make that decision, it would be great if we had some data.
and it's silly -- ridiculous! -- that we have not collected that data.


>    Another potential improvement might be to make
>    texts available to the next round on a per page basis 
>    instead of having to wait for all pages to be finished 
>    in the previous round.

well, now you're suggesting a "roundless" system, walter.

which is also not a silly suggestion.   unfortunately, it's not
a _new_ suggestion either, so you're not advancing the art.

what you _are_ doing is showing we have no data on _this_
particular wrinkle either, even though it's a very old idea...

and again, this failure to collect data and test hypotheses is
extremely silly, especially since we debate matters endlessly.
like clara peller bellowing "where's the beef?", we should now
make it a community slogan to demand "where's your data?"

meanwhile, i keep myself busy by collecting what data i can,
and writing the software tools that we need to do these jobs.
and i talk and talk, but most people here are too busy being
silly to listen to me.   which i find to be endlessly amusing.    :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100216/e316964b/attachment-0001.html>

From Bowerbird at aol.com  Tue Feb 16 15:34:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 16 Feb 2010 18:34:03 EST
Subject: [gutvol-d] roundlessness -- 004
Message-ID: <18e16.1867d154.38ac856b@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

***

i'm going over the data for one of the books rfrank used in his test,
and once again the results i observe are striking and unequivocal...

the proofers made hundreds of changes in this 240-page book, but
most of 'em could've been detected and fixed during preprocessing,
which would've made the workflow both more smooth and efficient.

sure, there is the occasional stealth scanno -- "array" for "army", and
"riot" for "not" -- which (one could argue) would seem to require the
word-by-word proofing that is expected at distributed proofreaders.
but they are few and far between, and in almost all cases innocuous.

and certainly one round of such close proofing will be all that would
be needed if the obvious-and-easy-to-automatically-detect errors
were found and fixed in preprocessing.   once these obvious glitches
have been fixed, the proofer is essentially doing _smooth-reading_...

this is of the utmost importance if you really want (as rfrank claims)
to have each page be "one and out", (i.e., be finished by one proofer).
otherwise once is simply not enough, not for a good many pages...

it's also the case that -- with the right tool -- doing preprocessing
is fun and exhilarating.   it's really a kick in the pants to be able to
improve a book so quickly and efficiently, and move it to "the finish".
compared to the boring nature of proofing, there is no comparison...

i have demonstrated this same finding on book after book after book,
with no exceptions, so i am quite confident that it is extremely robust.
all you have to do is look for it, and i assure you that you will find 
it...

i wonder why so many of you are so resistant to learning the truth...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100216/eed4ab27/attachment.html>

From Bowerbird at aol.com  Wed Feb 17 10:43:58 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Feb 2010 13:43:58 EST
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
Message-ID: <eead.6870f59a.38ad92ee@aol.com>


yeah, i know, a call for actual _data_.   what a bummer, man.
ruins _all_ your fun, and brings the dialog to an abrupt stop.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100217/02cab2be/attachment.html>

From ajhaines at shaw.ca  Wed Feb 17 11:20:48 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 17 Feb 2010 11:20:48 -0800
Subject: [gutvol-d] "The Inheritance" by Susan Edmondstone Ferrier
Message-ID: <AAE554C15D4C40AEAEC72EFF648D4671@alp2400>

If anyone's looking for a project, look no further than the above.  Internet 
Archive has assorted editions, none of which are projects in DP.  All 
editions appear to be clearable under Rule1 (pre-1923).

Hmmm...  maybe bowerbird is up for them?

Al


From Bowerbird at aol.com  Wed Feb 17 12:41:20 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Feb 2010 15:41:20 EST
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
Message-ID: <12985.490903c3.38adae70@aol.com>

al said:
>   Hmmm...? maybe bowerbird is up for them?

what's my motivation, al?

the one edition i looked at -- in english, i speak no german -- is
a simple book, quite straightforward, so wouldn't prove anything.

i mean, i'm totally willing to run through the exercise-wheel, but
what do i get when i come out the other side?

how about this, a win-win-win-win situation for everyone...

for a while, the p.g. website has been directing newbies over to
distributed proofreaders if they want to help out with the cause.

and sure enough, d.p. gets a ton of volunteers as a result of that.

unfortunately, d.p. doesn't appreciate the newbies, because they
just contribute more stuff to the plethora of p1-proofed backlog.

so how about, if i were to do this book for you, p.g. would start
sending all the new volunteers to rfrank's roundless site instead?

d.p. happy, rfrank happy, the bird happy, al happy, and p.g. happy.

so, do we have a deal?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100217/a1fd7ad3/attachment.html>

From ajhaines at shaw.ca  Wed Feb 17 14:50:07 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 17 Feb 2010 14:50:07 -0800
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
References: <12985.490903c3.38adae70@aol.com>
Message-ID: <97FD5D5CD0E846AD94214B14886737BA@alp2400>

I'm not aware of any general PG practice to send newcomers to DP.  

Your motivation?  What do I care?  Be altruistic, and do a book.  If you want something challenging, PG's Preprints page (http://preprints.readingroo.ms/) has lots of candidates.


  ----- Original Message ----- 
  From: Bowerbird at aol.com 
  To: gutvol-d at lists.pglaf.org ; bowerbird at aol.com 
  Sent: Wednesday, February 17, 2010 12:41 PM
  Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier


  al said:
  >   Hmmm...  maybe bowerbird is up for them?

  what's my motivation, al?

  the one edition i looked at -- in english, i speak no german -- is
  a simple book, quite straightforward, so wouldn't prove anything.

  i mean, i'm totally willing to run through the exercise-wheel, but
  what do i get when i come out the other side?

  how about this, a win-win-win-win situation for everyone...

  for a while, the p.g. website has been directing newbies over to
  distributed proofreaders if they want to help out with the cause.

  and sure enough, d.p. gets a ton of volunteers as a result of that.

  unfortunately, d.p. doesn't appreciate the newbies, because they
  just contribute more stuff to the plethora of p1-proofed backlog.

  so how about, if i were to do this book for you, p.g. would start
  sending all the new volunteers to rfrank's roundless site instead?

  d.p. happy, rfrank happy, the bird happy, al happy, and p.g. happy.

  so, do we have a deal?

  -bowerbird


------------------------------------------------------------------------------


  _______________________________________________
  gutvol-d mailing list
  gutvol-d at lists.pglaf.org
  http://lists.pglaf.org/mailman/listinfo/gutvol-d
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100217/e6546bf7/attachment.html>

From Bowerbird at aol.com  Wed Feb 17 15:55:59 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 17 Feb 2010 18:55:59 EST
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
Message-ID: <1929b.3e85d22f.38addc0f@aol.com>

al said:
>   I'm not aware of any general PG?practice to send newcomers to DP.

i'm not surprised to learn that you're not paying any attention, al.


>    Your motivation?

yeah, you know, the reason why i would spend time and energy on this.


>    What do I care?

why did you suggest i do this book?


>    Be altruistic, and do a book.

i'm altruistic just by being here, sharing my analyses with people, al.

i'm altruistic when i _do_ the research that _leads_ to those analyses.

i'm altruistic when i design and program tools that do what's required,
because that proves that such tools can indeed be designed and coded,
and it lets me be precise and experienced when i assess their value...

i'm altruistic when i work up my suggestions for an improved workflow.

i'm altruistic in a number of ways, al.   to "do a book" seems unnecessary,
unless you're intentionally suggesting that i should lower my sights a lot.

i clean text for the fun of it, al, just like my girlfriend does her 
sudoku.
so i don't need a _lot_ of motivation...   but i certainly do need 
_some_...

you've given me no good reason to work on the book you've suggested,
al, so i'm not sure why you even bothered to mention my name at all...


>    If you want something challenging

in the exact same way that _i_ will decide how i will be altruistic
(or even _if_ i will be altruistic), i will also be the one who decides
what is "challenging" to me.   but, like, thanks for your suggestion,
and have a nice day, ok?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100217/ced96725/attachment-0001.html>

From gbuchana at teksavvy.com  Wed Feb 17 20:21:57 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Wed, 17 Feb 2010 23:21:57 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <1929b.3e85d22f.38addc0f@aol.com>
References: <1929b.3e85d22f.38addc0f@aol.com>
Message-ID: <4B7CC065.6030004@teksavvy.com>

On 17-Feb-2010 18:55, Bowerbird at aol.com wrote:
>
> i'm altruistic when i design and program tools that do what's required,
> because that proves that such tools can indeed be designed and coded,
>

Where was that Sourceforge project again?  I know you've talked
about tools that do more/better checking than Gutcheck and have
automated fixing and such.  I would like to try them out.  Where
can get get my hands on this stuff?

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From Bowerbird at aol.com  Thu Feb 18 10:47:26 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Feb 2010 13:47:26 EST
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <de3f.69f469e7.38aee53e@aol.com>

gardner said:
>   Where was that Sourceforge project again?? 

there is no sourceforge project.

my source code has never been open-source.
my enemies will need to write their own code.


>    I know you've talked about tools that 
>    do more/better checking than Gutcheck 

actually, i don't think i've ever compared my stuff
with any other software directly, because any tool
is better than no tool.   gutcheck has some charms.
my tools do different things, and do things differently,
but whether they do "more" or "better" is not an issue.


>    and have automated fixing and such.

some have, yes.   it's also important to remember that
i do lots of experimentation, with quick-and-dirty code
that serves to test the usefulness of a particular feature,
but which might never be implemented further, perhaps
because it doesn't prove to be worthy, or because the
generalized code would take more time than i can give,
or simply because that task just hasn't been done yet...


>    I would like to try them out.? 
>    Where can get get my hands on this stuff?

i'll be happy to send you a copy, gardner, since you are
an independent producer -- that's my target sweetspot.

until i release the program generally, which might be
very soon but also might not, you'll have to agree not to
distribute the app any further, since i want to know who
has it so that i can engage them in dialog about it, but
that's the only restriction at this point.

you'll also need to tell me what version you want -- mac
or p.c. or linux.   your signature-block screams out linux,
which is fine, but you'd be one of my first linux users, so
if you want the more-well-tested windows version, say so.

finally, please give a short description -- frontchannel --
of your _current_ workflow.   how do you do your books?
do you use an editor, or some other tool?   use gutcheck?
what kind of preprocessing do you do on the raw o.c.r.?
if you need to view a scan to check the text on some page,
how do you do that?   how do you find errors, with reg-ex?,
or via a word-by-word proof of every page?   anything else?

if anyone else wants to get a copy of my program, say so,
either frontchannel or back.   the same conditions apply...

also, if you're interested, you should check out don's app:
>    http://code.google.com/p/dp50/downloads/list
his tool is similar in many ways, and you might like it too.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100218/ec4d5446/attachment.html>

From joey at joeysmith.com  Thu Feb 18 14:56:25 2010
From: joey at joeysmith.com (Joey Smith)
Date: Thu, 18 Feb 2010 15:56:25 -0700
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <de3f.69f469e7.38aee53e@aol.com>
References: <de3f.69f469e7.38aee53e@aol.com>
Message-ID: <20100218225625.GA29062@joeysmith.com>

On Thu, Feb 18, 2010 at 01:47:26PM -0500, Bowerbird at aol.com wrote:
> if anyone else wants to get a copy of my program, say so,
> either frontchannel or back.   the same conditions apply...

I believe I've already said so, for Linux, at least twice. Each
time I was told I'd have to join some "Yahoo!" listserv, which
is too far to go for a an unproven piece of software...I generally
try pretty hard to keep my information out of the clutches of
Yahoo!

From Bowerbird at aol.com  Thu Feb 18 16:20:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Feb 2010 19:20:03 EST
Subject: [gutvol-d] roundlessness -- 005
Message-ID: <18218.6bab9c25.38af3333@aol.com>

we're looking at rfrank's "roundless" experiment at fadedpage.com...

***

i've been looking at the data for one of the books rfrank used...

the book is titled "eagles of the sky", and it runs about 240 pages.

in total, there were about 500 lines that were changed by proofers.

here's a list of roughly 300 of those lines:
>    http://z-m-l.com/go/frabf/frabf300diffs.html

the list shows the original line, from o.c.r., and the edited line.
there's also a link to each page-scan, if you want to view that...

i could've included all 500 changed lines in this list, except that
the 200 lines i've excluded contained errors that _should_have_
(most definitely) been fixed in preprocessing, and it burns me up
to be a witness to such a tremendous waste of proofer resources.

it's just a crime against the generous contribution of the proofers
to put such shoddy text in front of them and expect them to fix it.

roger frank may think i'm talking shit about him again, but i tell you,
i'll talk shit about _any_ producer who gives shoddy text to proofers,
and, on top of that, i will feel _justified_ and _moral_ about doing it.

i mean, it's not exactly a _mystery_ how to do good preprocessing.
i have spent a lot of time writing posts here documenting _exactly_
how to do good preprocessing, and i did a heckuva lot of research
to learn and test those preprocessing methods to prove their utility.

so when someone just _ignores_ what i've done, and continues to
burn out proofers by wasting their time and energy with material
that should've had the obvious-and-easy-to-detect-automatically
errors fixed before it went in front of 'em, i have a right to be mad.

someone needs to talk some sense into roger's head, and do it fast.

not to say that roger is the only one, or the worst one.   not by far.
i don't even bother looking at what the p.g. content producers are
giving to their proofers these days, to protect my blood pressure.
but since the odds that anything has changed over there are none
and none, someone should talk some sense into _their_ heads too.

it's a _shame_ to be wasting proofer time, and you producers who
fail to do good preprocessing should be _ashamed_ of yourselves.

***

at any rate, if that list of 300 changes is too dense for your brain,
you can also thumb through the pages and see the changes made:
>    http://z-m-l.com/go/frabf/frabfp123.html

all changed lines are marked in red (o.c.r.) and blue (edits), so if
any particular page (like page 123, linked to above) is all-in-black,
it means that none of the lines on that page had any changes made.

you will find this "stepping through the edits" to be very friendly...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100218/75df69af/attachment.html>

From gbuchana at teksavvy.com  Thu Feb 18 16:37:14 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 18 Feb 2010 19:37:14 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <de3f.69f469e7.38aee53e@aol.com>
References: <de3f.69f469e7.38aee53e@aol.com>
Message-ID: <4B7DDD3A.6070801@teksavvy.com>

On 18-Feb-2010 13:47, Bowerbird at aol.com wrote:

>
> actually, i don't think i've ever compared my stuff
> with any other software directly, because any tool

Perhaps not, but over time you have described checks that
your tools can do and fixes that you can automatically
make that sound a little to me like a super-duper gutcheck.

Also the workflow I picture is a little like gutcheck --
I am thinking of text-in text-out command line tools,
not something that needs to look at image scans or makes
me talk to it in a fancy U/I.

This is perhaps an inaccurate impression I have.
If the comparison is totally inappropriate, I'm sorry.

>
> you'll also need to tell me what version you want -- mac
> or p.c. or linux. your signature-block screams out linux,

Probably something that would run in FreeBSD would be
most useful -- a Linux build would, I think.  Windows
would be fine too.

>
> finally, please give a short description -- frontchannel --
> of your _current_ workflow. how do you do your books?

This is still fairly accurate:

http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_Voices#Gardner_Buchanan

...although I have a nicer flatbed scanner now.

I used to always page-by-page scan, OCR and first-proof
books that I was doing from physical copies.  The last
couple of books I've done instead by scanning, bulk OCR
and then proof from the scans and raw OCR text, which I
can do on the road with my laptop or anywhere I can mount
a USB key for a couple of hours.

After OCR I have a few basic things that I do via regular
expressions in vi:  I find and fix spaced punctuation, find
and fix M-dashes.  If there's any obvious consistent scannos
-- the Heavysege item I just finished had Ys that looked to
Finereader more like Vs, for example -- I will have a crack
at finding those.  I have been known to write a one-off
perl script to get at something that bugs me enough.

The thing is that I do not have a specific set of checks
and fixes that I consistently do.  I rely a lot on jeebies
and gutcheck.  I would like something perhaps with a wider
range of things that it can find so I don't have to know
all the things to look for.  Over the years you have mentioned
several automated checks and fixes that sounded sensible
enough to me.  I'm not keen enough to go back through
the archives, find them and implement them -- but I am
nevertheless interested in trying a tool like this out
on a project to see if it adds value for what I do.

Heck, you can grab
http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
and just tell me what you find.  I have no doubt there
is lots to find.

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From Bowerbird at aol.com  Thu Feb 18 18:21:13 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 18 Feb 2010 21:21:13 EST
Subject: [gutvol-d] [SPAM] re:  Re: Bowerbird's software projects
Message-ID: <1b289.d2043e8.38af4f99@aol.com>

gardner said:
>    Perhaps not, but over time you have described 
>    checks that your tools can do and 
>    fixes that you can automatically make 
>    that sound a little to me like a super-duper gutcheck.

yes, except that those checks and fixes are most often
programmed only into one-off versions of programs...

it's usually the case that making those checks and fixes
useful in the general case, against any random book, is
a more difficult matter.   this isn't an apology of any sort;
it's just that my intentions (for the most part) are to show
that a particular check can be accomplished, and is useful.
so far i haven't concentrated on building them into my app,
because nobody's really expressed much interest in the app.
the app has a general spellcheck ability, and that captures
a very high percentage of the errors that occur within a text.


>    Also the workflow I picture is a little like gutcheck --
>    I am thinking of text-in text-out command line tools,
>    not something that needs to look at image scans or 
>    makes me talk to it in a fancy U/I.

i'm not sure what you mean by the workflow you "picture".

i was asking about your _actual_ workflow, the one whereby
you currently digitize books.   are you saying that you now
do your digitizations without ever looking at image scans?
because i have a hard time imagining how you can do that.

you should also know i am a mac person, for good reason.

for me, the interface is prime.   if you're looking for tools
that work on a command-line, in a text-in-text-out way,
i'm the wrong tree for you to be barking up, that's for sure.

i certainly wouldn't call my interface "fancy".   to the contrary,
it's extremely utilitarian, and not very pretty, not pretty at all.

but it _is_ an interface, with buttons and menus and all that
nice stuff that makes the program a lot easier to work with...


>     a Linux build would, I think.? Windows would be fine too.

i'll send you both.


>    This is still fairly accurate:

ok, that was very useful...   my tool assumes that the page-scans
are in the same folder as the app, which is easy enough to satisfy.

the tool also assumes that your text is all in one file, and that the
page-boundary is of a certain type.   i'd assume that your vi skills
will enable you to satisfy this assumption in a fairly simple manner.

other than that, i'd say you'll be good to go.


>    The last couple of books I've done instead by scanning, 
>    bulk OCR and then proof from the scans and raw OCR text
>    which I can do on the road with my laptop or 
>    anywhere I can mount a USB key for a couple of hours.

that's how you'll want to operate with my software, yes.


>    After OCR I have a few basic things that I do 
>    via regular expressions in vi:

you can continue to do those things in vi if you like.

global changes in vi are much quicker than going through
one-by-one changes in the interface.


>    The thing is that I do not have 
>    a specific set of checks and fixes that I consistently do.

that's something you'll want to remedy.

i did a series here a couple years back where i collected
a list of checks that was necessary for the book i tested,
and somebody turned that list into a set of reg-ex tests.

you can find that set on the download page for don's app:
>?? http://code.google.com/p/dp50/downloads/list

indeed, since you are already using reg-ex, you'll probably
find that you prefer don's tool over mine, since his program
lets you actually _build_in_ your own list of reg-ex checks...


>    I rely a lot on jeebies and gutcheck.? 

so, when you get a report from them on the possible errors,
you enter vi and use search to locate each one of the errors?


>    I would like something perhaps with 
>    a wider range of things that it can find 
>    so I don't have to know all the things to look for.

well, yes.   and you can find some very extensive lists of
reg-ex checks, right on d.p.   the problem is that many
of those checks have a low signal-to-noise ratio, in that
they create far too many false-alarms.   this is a problem
even with gutcheck and heebe-jeebe, if i'm not mistaken.

so you really have to fine-tune your list of checks to the
particular corpus on which you are working, to be useful.

this is why don's app is so useful, because you can build in
the list of checks you want to do, and modify it at will, and
even enter in a specific reg-ex to see if it returns any hits.


>    Over the years you have mentioned
>    several automated checks and fixes 
>    that sounded sensible enough to me.

sounds like you really want to use that reg-ex list
that was based on the month-long series that i did.


>    I'm not keen enough to go back through the archives, 
>    find them and implement them -- but I am
>    nevertheless interested in trying a tool like this out
>    on a project to see if it adds value for what I do.

having heard all this, i'd guess don's app is the one for you.
>?? http://code.google.com/p/dp50/downloads/list

i'll send mine to you too, but his is based on reg-ex checks...


>    http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
>    and just tell me what you find.? I have no doubt there is lots

if the scans are online too, or can be, i'll certainly take a look at it...
without looking at them, i can't know if something is an error or not.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100218/cdacb238/attachment-0001.html>

From gbuchana at teksavvy.com  Thu Feb 18 19:39:01 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 18 Feb 2010 22:39:01 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <1b289.d2043e8.38af4f99@aol.com>
References: <1b289.d2043e8.38af4f99@aol.com>
Message-ID: <4B7E07D5.1070500@teksavvy.com>

On 18-Feb-2010 21:21, Bowerbird at aol.com wrote:

>
> it's usually the case that making those checks and fixes
> useful in the general case, against any random book, is
> a more difficult matter.

That's kind of my experience, I guess.  Several fixes will
suggest themselves, in the context of a given specific text.
The next one might need different fixes.  But that doesn't mean
a long list of fixups might be tried when there's no cost
to just adding tests/fixes to the list.

>
> for me, the interface is prime. if you're looking for tools
> that work on a command-line, in a text-in-text-out way,
> i'm the wrong tree for you to be barking up, that's for sure.
>

I see.

>
> i'll send you both.
>

Better stick to Windoze, if it's a GUI.

> ok, that was very useful... my tool assumes that the page-scans
> are in the same folder as the app, which is easy enough to satisfy.
>
> the tool also assumes that your text is all in one file, and that the
> page-boundary is of a certain type. i'd assume that your vi skills
> will enable you to satisfy this assumption in a fairly simple manner.
>
> other than that, i'd say you'll be good to go.
>

Text in one file -- check.  I favour marking page boundaries with
"===00123" these days, but a global search/replace can fix that.

>
> i did a series here a couple years back where i collected
> a list of checks that was necessary for the book i tested,
> and somebody turned that list into a set of reg-ex tests.
>
> you can find that set on the download page for don's app:
>  >   http://code.google.com/p/dp50/downloads/list
>

Yes.  Looking at that.  I am not 100% sure I want to mess with
Twister exactly, but the list of regular expressions looks
interesting.  I'm picturing building a perl script that applies
all of these fixes, then creates a patch set based on the
the differences it has introduced.  I could then edit the
patch set as a file, nuking changes that are wrong, and
finally apply the patches for the changes I like.

>
>  > I rely a lot on jeebies and gutcheck.
>
> so, when you get a report from them on the possible errors,
> you enter vi and use search to locate each one of the errors?
>

Kind of.  Jeebies and gutcheck reference specific line numbers.
So I go through the output of these bottom up.  For each hit
I go to the specified line number and see what's up, fix if needed
and then move to the previous hit.  I work bottom to top so that
changes I make don't invalidate the line numbers in the gutcheck
output as I go.  I find it takes a good couple of passes before
I am satisfied I have all the genuine hits covered. Invariably
the WW finds things I've missed anyhow.

>
> sounds like you really want to use that reg-ex list
> that was based on the month-long series that i did.
>

Yeah.  Got those.  Like I say -- I will turn it into
a perl script and see where that takes me.

>
> i'll send mine to you too, but his is based on reg-ex checks...
>

Would be great.  Thanks.

>
>  > http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
>  > and just tell me what you find.  I have no doubt there is lots
>
> if the scans are online too, or can be, i'll certainly take a look at it...

Lots of choices there.
   http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
   http://www.archive.org/details/advocateanovel00heavgoog
   http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8

There are no page numbers in the Gutenberg text though.

See you,

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From dakretz at gmail.com  Thu Feb 18 21:18:30 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 18 Feb 2010 21:18:30 -0800
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <4B7E07D5.1070500@teksavvy.com>
References: <1b289.d2043e8.38af4f99@aol.com> <4B7E07D5.1070500@teksavvy.com>
Message-ID: <627d59b81002182118s4fd527a6q4e4bdb47d649dc4f@mail.gmail.com>

For what it's worth, Twister comes out of pretty much your
approach, Gardner. I worked for a long time from regexes in
vi and am writing Twister to make as much of it  as "batchy"
as I can. For instance, when you load a regex file, you can
click a button to get a count of each of the regexes in your list.
I'm currently adding the ability to choose a regex and get a list
of occurrences with 3 lines of context.

The goal is to make it transparent how it works, and let you
adjust it to make it work the way you do.

But no guarantees - it's still buggy. :( You might want to wait
two or three days for a newer version.)

(I sure don't miss the requirement in vi to add all those
backslashes you don't need in any other regex context
I know of...)

On Thu, Feb 18, 2010 at 7:39 PM, Gardner Buchanan <gbuchana at teksavvy.com>wrote:

>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100218/bc7a5a8b/attachment.html>

From marcello at perathoner.de  Thu Feb 18 23:18:31 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri, 19 Feb 2010 08:18:31 +0100
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <4B7E07D5.1070500@teksavvy.com>
References: <1b289.d2043e8.38af4f99@aol.com> <4B7E07D5.1070500@teksavvy.com>
Message-ID: <4B7E3B47.10408@perathoner.de>

Gardner Buchanan wrote:

> On 18-Feb-2010 21:21, Bowerbird at aol.com wrote:
> 
>> it's usually the case that making those checks and fixes
>> useful in the general case, against any random book, is
>> a more difficult matter.

The tragic bb in a nutshell. He gets one easy text, then builds a 
`program? that finds the bugs in that one easy text and proclaims it the 
ultimate fixing tools. ... Everybody laughs. ... BB waits one year. ... 
Repetitur.


To build a useful tool you have to:

1. get two random samples of scans, say two sets of 100 complete book 
scans, using different scan techniques and different OCR on books of 
different ages and provenience. You could get those out of google or IA.

2. build a bug list of those OCRed texts against proven good copies.

3. build a program using the texts and error lists of the first group. 
You are not allowed to look at the second group texts.

4. run the program against your blind group and record the percentage of 
positives and negatives it finds.

5. run any known tools against the blind group and see if yours performs 
significantly better.

6. If better then
      brag
    else
      shut up.


> That's kind of my experience, I guess.  Several fixes will
> suggest themselves, in the context of a given specific text.
> The next one might need different fixes.  But that doesn't mean
> a long list of fixups might be tried when there's no cost
> to just adding tests/fixes to the list.

If you have to enter the regexes manually you should use any editor that 
supports them.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Fri Feb 19 01:35:34 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Feb 2010 04:35:34 EST
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <4f.5dec8603.38afb566@aol.com>

gardner said:
>    That's kind of my experience, I guess.? 
>    Several fixes will suggest themselves, 
>    in the context of a given specific text.
>    The next one might need different fixes.

right.   a rule i always follow is that when i find an error,
i always search the rest of the text for other occurrences.


>    But that doesn't mean a long list of fixups 
>    might be tried when there's no cost
>    to just adding tests/fixes to the list.

well, except for what i mentioned about false alarms.

it's a rare test that doesn't turn up any false alarms,
but when one turns up too many of them, it becomes
a liability instead of an asset.   the question is always,
"how many false alarms is too many?", and the next
question is always, "how can i weed out false alarms?"


>    Better stick to Windoze, if it's a GUI.

actually, they are both generated from the same code,
so they should act identically.   whether they really do...


>    Text in one file -- check.? 
>     I favour marking page boundaries with "===00123"
>    these days, but a global search/replace can fix that.

my app is looking for separator lines that look like this:

 {{myantp123.png}} || the_runhead ||

(note that there is a space in the first column.)

that .png filename there is the name of the page-scan,
and the program assumes you name your files wisely...

so, for instance, if you want to jump to a certain page,
you simply type the page number and press enter, and
the program automatically jumps to that page.   nifty...


>    Yes.? Looking at that.? I am not 100% sure 
>    I want to mess with Twister exactly, but 
>    the list of regular expressions looks interesting.
>    I'm picturing building a perl script that 
>    applies all of these fixes, then creates a patch set 
>    based on the the differences it has introduced.? 
>    I could then edit the patch set as a file, 
>    nuking changes that are wrong, and
>    finally apply the patches for the changes I like.

i would be very surprised if you can make that workflow
more efficient than simply editing text in the interface...

the beauty of my app, and twister too, is that you can
view the page-scan to help you make the edit decision.

i'm well aware that you don't _need_ to view the scan
in order to resolve the vast majority of questions, but
the inefficiency in handling that thin minority is huge
if the bureaucracy of viewing the scan is too convoluted.


>    Jeebies and gutcheck reference specific line numbers.

try twister.   seriously.   the ability to jump right to the page
where the question occurs, and view the scan in context,
is a major boost to efficiency.   i bet you will be surprised...


>    I find it takes a good couple of passes before
>    I am satisfied I have all the genuine hits covered. 
>    Invariably the WW finds things I've missed anyhow.

that's a sign of an inefficient workflow.

you want to accomplish things in one pass,
and you want to make sure you got all of it.


>    Got those.? Like I say -- I will turn it into
>    a perl script and see where that takes me.

a perl script is operating blind.   get a seeing-eye dog.


>    Lots of choices there.
>    http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
>    http://www.archive.org/details/advocateanovel00heavgoog
>    http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8

none of those options are all that useful to me, however...

what i need is to have the individual scans available online,
each of them individually addressed with their own address.

for instance, that "myantp123.png" file i referenced above,
the one that reflects the scan of page 123 in "my antonia"?
you can find that right here, in sequence with all the rest:
>    http://z-m-l.com/go/myant/myantp123.png

this is the way the library of the future will be organized...
if you want your work in it, mount your files appropriately.

yes, i can download the .zip file of the scans from archive.org,
or pull 'em from the google .pdf, and then mount them myself,
but that's too much work for me to do, when you could have
mounted them correctly in the first place.


>    There are no page numbers in the Gutenberg text though.

then you threw away some very crucial information, didn't you?
probably rewrapped the text too, am i right?   and dehyphenated?
all these actions make any kind of reproofing an impossible task.

which is not to say that your proofing work was a waste of time.

no, in such situations, i'll download the o.c.r. from archive.org,
which _does_ still contain pagebreak info, and unwrapped text,
and end-line hyphenates.   and then i will use your proofed text
to make the corrections to the archive.org o.c.r.   and then i will
throw your text away, and keep the corrected, unwrapped and
page-marked text with the original end-line hyphenates in it...

and when i throw away your text, i throw away your credit-line.

had you kept all that valuable information which i need to have,
instead of tossing it out, i probably would keep your credit-line.

you know, just so you know...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/fc33f3ec/attachment.html>

From Bowerbird at aol.com  Fri Feb 19 12:47:09 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Feb 2010 15:47:09 EST
Subject: [gutvol-d] banana cream for mac and windows is available
Message-ID: <10a64.5e5dc575.38b052cd@aol.com>

banana cream for mac and windows is available.

the linux version seems to have a plug-in conflict,
which i do not feel like debugging right now, so
linux people can use the windows version, or wait.
(note: i also have a version for the classic mac o.s.)

if you want to be a tester for this nice little app,
let me know and i'll tell you how you can get it...

this is the version from the fall of 2008, which is
the last time i worked on it in earnest, and i can't
remember what kind of state i left it in, so there
might well be some rough unfinished edges in it;
but for the most part, it should be pretty smooth.

like i said, though, it ain't pretty.   ain't pretty at all.      ;+)

-bowerbird

p.s.   if people want to use it, i'd be happy to work
on it again, to include incorporating all of those
reg-ex tests that gardner was requesting.   but...
as only gardner and joey (yes joey, i did hear you)
have asked for a copy so far, that seems unlikely...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/cd9f65c7/attachment-0001.html>

From Bowerbird at aol.com  Fri Feb 19 13:01:10 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Feb 2010 16:01:10 EST
Subject: [gutvol-d] banana cream
Message-ID: <1115b.4e4ecfd9.38b05616@aol.com>

hey guys-

here's where you can get a windows version of banana cream:
>    http://z-m-l.com/go/bananacream/banana-cream2008.exe

you'll also want to download this sample text:
>    http://z-m-l.com/go/bananacream/tjbus.zml

put these two files in a folder of their own,
and then rename the program to "tjbus.exe"...
(by default, it loads the .zml file with the same name.)

the program will then download the scans for that book
("the jungle", by upton sinclair) automatically from my site.
(you can control the download method under one of the menus.)

if you have any questions, let me know.   i'd prefer to have the
chance to address any complaints before you go public, but i 
have no desire for you to muzzle your truth however you see it.

as stated earlier, i'd prefer you not distribute the app.   thanks.

-bowerbird

p.s.   the app follows my naming conventions, which require
a 5-letter prefix at the start of each filename (e.g., "tjbus"),
followed by a single letter declaring the type of page (either
"c" for a cover, or "f" for forward matter, or "p" for a page),
followed by the page-number (padded out to three places).
if the page was unnumbered, you use the last page number,
and append an "a", "b", "c", "d", etc., respectively, as needed.

it's also the case that the .zml file needs a certain structure.
i'll write something up for that, and send it along to you later.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/c19ff599/attachment.html>

From Bowerbird at aol.com  Fri Feb 19 13:10:54 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Feb 2010 16:10:54 EST
Subject: [gutvol-d] oops
Message-ID: <115f7.768c9536.38b0585e@aol.com>

crap.

i meant to send that only to gardner and joey, not to the list.

i don't care who gets the app, but i _would_ like to know who.
so if you download it, please send me a backchannel saying so.
(and if you download it more than once,   let me know that too.)

if there are unaccounted downloads, i'll have to delete the file,
because i don't want too many copies out in the wild right now,
beings that this is not meant as a finished copy in any respect...

and since the cat is out of the bag, here's the mac version:
>    http://z-m-l.com/go/bananacream/bc2008f.app.zip

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/88fa3321/attachment.html>

From Bowerbird at aol.com  Fri Feb 19 16:20:54 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Feb 2010 19:20:54 EST
Subject: [gutvol-d] ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <16399.189b0d8d.38b084e6@aol.com>

i said:
>    no, in such situations, i'll download the o.c.r. from archive.org,
>    which _does_ still contain pagebreak info, and unwrapped text,
>    and end-line hyphenates.   and then i will use your proofed text
>    to make the corrections to the archive.org o.c.r.   and then i will
>    throw your text away, and keep the corrected, unwrapped and
>    page-marked text with the original end-line hyphenates in it...

it occurs to me that it would be quite instructive to demo this exercise.

i'll be using gardner's book to show how i'd go through this process.

to prep, i downloaded the scans for his book from canadia.com, and
mounted them on my website, along with a skeleton copy of the text.

here's a sample url:
>    http://z-m-l.com/go/gardn/gardnp123.html

as you can see, the prefix for this book is "gardn", so if you put
a copy of the banana-cream program in a folder of its own, and
name it "gardn.exe", it will download the .zml file and the scans
from the website, and you'll be able to see how to start to use it.

-bowerbird

p.s.   mac users should name the app "gardn.app", of course...
(or, since your .app extensions are likely hidden, just "gardn".)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/3afe8353/attachment.html>

From gbuchana at teksavvy.com  Fri Feb 19 18:33:13 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Fri, 19 Feb 2010 21:33:13 -0500
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <4f.5dec8603.38afb566@aol.com>
References: <4f.5dec8603.38afb566@aol.com>
Message-ID: <4B7F49E9.4090705@teksavvy.com>

On 19-Feb-2010 04:35, Bowerbird at aol.com wrote:

>
> that's a sign of an inefficient workflow.
>

I don't believe that a single pass is feasible, in particular for
mismatched quotes and spaced quotes.  You fix the open quote, or
in my case close more often than not, then that reveals another
quote problem further down/up.  In any event I am not troubled by
multiple passes.

> probably rewrapped the text too, am i right?  and dehyphenated?

Well it *is* a Gutenberg text after all.

> which is not to say that your proofing work was a waste of time.
>

Thanks.

>
> and when i throw away your text, i throw away your credit-line.
>

Sure.  The book *is* public domain after all.  Do what you like.

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From Bowerbird at aol.com  Fri Feb 19 20:57:09 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 19 Feb 2010 23:57:09 EST
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <1bbf9.4e21db05.38b0c5a5@aol.com>

gardner said:
>    http://unixcomputer.net:81/new-photo/cd/Advocate/

i scraped them from canadiana.org myself...          :+)

but hey, do you still have your original o.c.r.?

or a latest version that has the original linebreaks intact?

***

>    I don't believe that a single pass is feasible, 

ok, i should elaborate.

multiple passes, to check different aspects, will be required.

but multiple passes to check the _same_ aspect are inefficient.


>    I don't believe that a single pass is feasible, 
>    in particular for mismatched quotes and spaced quotes.?

i believe you're wrong, and that i can show you.


>    in particular for mismatched quotes and spaced quotes.?
>    You fix the open quote, or in my case close more often than not, 
>    then that reveals another quote problem further down/up.? 

i have already demonstrated that you can fix spacey quotes,
and -- in the vast majority of cases -- fix 'em automatically.

leading and trailing spacey quotes are easy to fix, of course.

from there, it's a simple matter of segmenting the text into
_paragraphs_, and counting quotemarks in each paragraph,
making sure that the odd ones are open, and the even closed.

then when you come upon a spacey quote, fix it to be open
if it is an odd one, and fix it to be close if it is an even one.

if you come up against a case where there is an odd number
of quotes in a paragraph, and the next paragraph does not
start with a quote, then you have a case you need to look at.

similarly, if any of the quotemarks come up as the wrong type
(an odd that's close, or an even that's open), you need to look.

you can test this for yourself.   you'll find that it's very robust.
usually there's no need to spend much time on spacey quotes.


>    In any event I am not troubled by multiple passes.

ok.


>    Well it *is* a Gutenberg text after all.

right.   that point wasn't directed at you, as you correctly realized.


>    Thanks.

well, the fact that you haven't wasted your time is only _part_ of
the equation.   the fact that you won't get much credit down the line
(because _your_ text will be discarded because you threw away info
that people will want) is yet another (bigger) part of the equation.


>    Sure.? The book *is* public domain after all.? Do what you like.

i think you missed the point.

you can mount a version of your work that doesn't throw away
the important information, and then no one will have to re-do it,
in which case they will be happy to continue to give you the credit.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/4bbed502/attachment.html>

From dakretz at gmail.com  Fri Feb 19 21:19:50 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 19 Feb 2010 21:19:50 -0800
Subject: [gutvol-d] Re: Bowerbird's software projects
In-Reply-To: <1bbf9.4e21db05.38b0c5a5@aol.com>
References: <1bbf9.4e21db05.38b0c5a5@aol.com>
Message-ID: <627d59b81002192119g1d7db5eau5c3fe171c424a4d0@mail.gmail.com>

I'll concur on the spacey quotes. Twister  has a tab just for those.
You pick whether to visit all quotes or only anomalies, based on
spacing restarting every paragraph. I never bother to visit all any
more. It just pops from one bad quote-pair to the next, highlights
the whole thing, and offers a button to realign it correctly. If it's
just a spacing problem, not missing one end or the other; or
some other usage (e.g. inches, dittos, etc.) it always gets it
right. This is probably the fastest of all the regex checks that
require visual inspection.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100219/d379c2e1/attachment-0001.html>

From hart at pglaf.org  Fri Feb 19 21:20:48 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Fri, 19 Feb 2010 21:20:48 -0800 (PST)
Subject: [gutvol-d] Roundlessness
In-Reply-To: <16399.189b0d8d.38b084e6@aol.com>
References: <16399.189b0d8d.38b084e6@aol.com>
Message-ID: <alpine.DEB.2.00.1002192115110.15836@mail.pglaf.org>


No matter how much I trust anyone, I always make my own "last pass"
at any eBook I have the final responsibility for, including quotes,
which obvious CAN add up differently after space removal, etc., and
the always seems to be at least one "torn margin," etc.

It's just nice to have a pair of human eyeballs as the last resort,
even when you prepared the entire book all by yourself.  I think it
may be the case that I ALWAYS found at least one more error even if
it is just the most cursory pass.  Sometimes I just insist on quite
literally working on it until I find ONE more error just to prove I
was there doing my two cents worth.


Thanks!!!


Michael


From gbuchana at teksavvy.com  Sat Feb 20 10:52:20 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Sat, 20 Feb 2010 13:52:20 -0500
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <1bbf9.4e21db05.38b0c5a5@aol.com>
References: <1bbf9.4e21db05.38b0c5a5@aol.com>
Message-ID: <4B802F64.3040909@teksavvy.com>

On 19-Feb-2010 23:57, Bowerbird at aol.com wrote:

> (because _your_ text will be discarded because you threw away info
> that people will want) is yet another (bigger) part of the equation.
>

>
> you can mount a version of your work that doesn't throw away
> the important information, and then no one will have to re-do it,
>

I'm going to take this as a jumping off point for a more general
question about whether pagination of a published edition, is worth
saving.  Obviously there is a range of opinion.  I'll give you
mine.

What I believe, philosophically, I am shooting for is to
capture the core content, and reject the details that have
mainly to do with the medium of publication. So at the top
level, I think the text itself and notions like block quotations,
poetry layout, italics and stuff I keep. Stuff that is a
function of the fact it was printed on little rectangles of
paper -- hyphenation page numbering, line ends, I believe I
do not have any use for.

Maybe there are possible future uses of my text that would want
the things that I left out.  I tend to doubt that this could
ever be very important.  If I take for example what scholarly
editions tend to do, they focus on the text, tend to combine
information from different printings and editions, and winnow
out and reject the artefacts of hyphenation and pagination.
The seek out and highlight even small differences in the text,
but go to pains to filter out hyphenation artefacts.

In the grand scheme of things, there were undoubtedly
interesting things in earlier versions of a book than what we
have -- the author's manuscript, editors notes, even the
setter's notes all would be very interesting things to have.
But if I think what value I could get from having the author's
manuscript I do not picture knowing the pagination or line
endings of a longhand manuscript as being of foremost
importance.

Obviously others feel like preserving page numbers is
worthwhile -- I see that most PG-Canada texts have this.
As an individual contributor I do not feel that my time
is best spent capturing and encoding that, and so I don't.
And I am happy that PG finds my efforts acceptable despite
this deficiency.

I haven't done any sort of real research, but a quick look
shows me that not many texts attempt to preserve line
endings in any way.  Preserving line endings seems quite
unpopular.

My question to the pagination-preservers is: what is the
difference? Both hyphenation, line-endings and pagination
are mainly artefacts of the physical medium -- one of width
and the other of height. Bowerbird wants to keep both;
I see no need to keep either. But what is the reasoning
behind keeping one (pagination) and not the other?


============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From klofstrom at gmail.com  Sat Feb 20 11:08:39 2010
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Sat, 20 Feb 2010 09:08:39 -1000
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <4B802F64.3040909@teksavvy.com>
References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com>
Message-ID: <1e8e65081002201108k3d84b248pd50b4548c2b95720@mail.gmail.com>

On Sat, Feb 20, 2010 at 8:52 AM, Gardner Buchanan <gbuchana at teksavvy.com> wrote:

> My question to the pagination-preservers is: what is the difference?

Pagination is crucial if you're talking about the text to someone else
-- whether in a scholarly context, or just referring to a certain
passage when writing a review.

If you say, "Nina is called a gypsy on page .89 of the 1899 edition",
someone else can find the passage and check your assertion. If you
say, "Somewhere in the first third of the book, Nina is called a
gypsy," people won't be able to find it. Even if you are reading on a
device that does search easily, you'd still have to pull up and scour
all mentions of gypsies.

Pagination isn't a perfect reference method. If you're in a class
where they're reading Gaskell's North and South, say, and the teacher
is referring to a modern reprint and you've got an ebook version of
the first edition, with the first edition pagination, you're going to
have to do some searching. You'll probably find what you want within a
range of a few pages, however.

The best method is the one used for religious texts: giving chapter
and verse. That reference is invariant across all versions. Perhaps
we'll adopt that eventually for ALL texts. Until then, pagination is a
next best.

--
Karen Lofstrom

From Bowerbird at aol.com  Sat Feb 20 12:11:31 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Feb 2010 15:11:31 EST
Subject: [gutvol-d] Re: so what is so important about pagination?
Message-ID: <28809.577ebb12.38b19bf3@aol.com>

gardner said:
>   I'm going to take this as a jumping off point
>    for a more general question about 
>    whether pagination of a published edition, 
>    is worth saving.? Obviously there is a range of opinion.? 
>    I'll give you mine.

yes, there is a range of opinion.

i can give arguments -- even good arguments --
on both sides.

which obviously means that _some_ people
have good arguments for retaining pagination.

and you disenfranchise those people entirely
when you throw out the pagination, no matter
how good your intentions might be for doing so.

i'd rather not disenfranchise those people...

so i think it's necessary to include the pagination,
and the original linebreaks, with end-line hyphenates.

now, i think it's _imperative_ that we give people tools
that enable them to discard that pagination, and unwrap
those original linebreaks, and rejoin end-line hyphenates.

to do otherwise would be to disenfranchise _those_ people,
and i'd rather not disenfranchise them either.

so, for me, the answer to the question is extremely simple.


>    What I believe, philosophically, I am shooting for 
>    is to capture the core content, and reject the details 
>    that have mainly to do with the medium of publication.

i can understand that perspective.

i can also understand the other perspective.

and i see no reason that anyone has to be unhappy here.

it's very important to understand that this does _not_
have to be an either/or question.   we _can_ do _both_...


>    Maybe there are possible future uses of my text
>    that would want the things that I left out.? 
>    I tend to doubt that this could ever be very important.?

well, then, your imagination is starting to lag behind...      :+)

because we are now right in the middle of a situation here
where "the things that you left out" _are_ "very important",
namely a reproofing of your book, to test your accuracy...

it's an order of magnitude more difficult to proof a book
when the text has lost all of its linebreaks and pagination.

are you of the opinion that the future will simply _accept_
that you did a perfect job in the digitization of your books?
or do you think they will want to verify the quality of them?
if you make it too difficult for them to undertake that job,
they will just toss out your text and start anew.   your loss...


>    As an individual contributor I do not feel that my time
>    is best spent capturing and encoding that, and so I don't.

except the info was already captured.   then you threw it away.


>    And I am happy that PG finds my efforts acceptable 
>    despite this deficiency.

except the future will throw out all the d.p. works because
your deficiency is shared by the entire d.p. corpus, sadly...

(even the d.p. people who save pagination toss linebreaks.)


>    I haven't done any sort of real research, but 
>    a quick look shows me that 
>    not many texts attempt to preserve line endings in any way.? 
>    Preserving line endings seems quite unpopular.

the future needs to future-proof tens of millions of books...

they can afford to throw out everything done up to this point,
if they feel they need to, and they will, they most certainly will.
(and advances in o.c.r. and o.c.r. correction will make it easy.)

(well, as i pointed out, they might use some of the current texts
to proof the new o.c.r. that they do, but then they'll toss them.)


>    Bowerbird wants to keep both

actually, i don't need to take a personal stand on the issue,
not as an end-user.   and that's a good thing, because often
i don't have a need for the original linebreaks or pagination.
so i definitely want the option of discarding that information.

what i am saying is that, as a "best practice" for digitization,
the discarding of such information is clearly a terrible mistake.

and if you're doing it simply because _you_ don't see a need
for that information, then you're being selfish and shortsighted.

plus you're giving an ultimatum to people who want that info:
they're forced to toss your text as failing to meet their needs.
and i am positive that you're going to lose that bet, gardner...

-bowerbird

p.s.   by the way, i found 23 discrepancies in the paragraphing
between your version of "the advocate" and archive.org's o.c.r.
however, 21 were errors in the o.c.r., and only 2 in your book.
the two errors in your book, both of them missed paragraphs:

>    http://z-m-l.com/go/gardn/gardnp087.html
>    "What, take the bird back to the bush where we

>    http://z-m-l.com/go/gardn/gardnp101.html
>    Yet who shall blame the sun and moon for that?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/c2bd4292/attachment.html>

From Bowerbird at aol.com  Sat Feb 20 13:24:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Feb 2010 16:24:03 EST
Subject: [gutvol-d] Re: Bowerbird's software projects
Message-ID: <29dd8.5294e4e8.38b1acf3@aol.com>

i've uploaded the archive.org o.c.r. for "the advocate"
-- the book that gardner suggested that i look at --
into the skeleton that was previously being used, at:
>    http://z-m-l.com/go/gardn/gardnp123.html

the last page of the book shows the changes i made:
>    http://z-m-l.com/go/gardn/gardnp126.html

the .zml that made these .html pages, as usual, is at:
>    http://z-m-l.com/go/gardn/gardn.zml

if you compare the .zml file to the original o.c.r.,
you can see that it is very similar.   it doesn't take
much work to massage typical o.c.r. output to .zml.

as no proofing has been done yet, the text is raw...

(although this book came from the internet archive,
it is a copy of a google book, which means the o.c.r.
is very shoddy, since google puts out low-res scans.
when archive.org does o.c.r. on its own scan-sets,
the o.c.r. is fairly good, since they're using abbyy.)

ordinarily at this point, in order to clean the o.c.r.,
i'd restore the linebreaks to gardner's p.g. e-text
by using the linebreaks from the o.c.r. as a guide...

however, gardner sent me a copy of the text as it
was _before_ he rewrapped the original linebreaks,
so i won't need to go through that boring exercise.

i decided to post this o.c.r. text anyway, just so you
could see what it would look like as it is "in process".
at this point, the structure of the thing is pretty solid,
in the sense that all of the paragraphing is correct,
and the chapter-heads are in place, and all of that,
it's just the scanning errors make the thing awful...

but if you look past those scanning errors, you can
see why this version is superior to the p.g. e-text:
it is obviously self-validating against the p-book
from which the text claims to have been generated,
since it is easily compared to scans of that p-book.
(or even against an actual hard-copy, if you prefer.)

even if the p.g. text _is_ accurate, it can't be _verified_
as accurate, not quickly and easily, like this text can...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/c7006e1b/attachment-0001.html>

From jimad at msn.com  Sat Feb 20 13:53:23 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 13:53:23 -0800
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <1e8e65081002121127r77da6414y32a6f4b00f35d6fc@mail.gmail.com>
References: <8005.73d837a3.38a06575@aol.com>	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>	<SNT120-DS24A947851758312487F600AE500@phx.gbl>	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<1e8e65081002121127r77da6414y32a6f4b00f35d6fc@mail.gmail.com>
Message-ID: <SNT120-DS1191EB7E9A9B680EF61BBDAE450@phx.gbl>

>There's no forcing going on. The policy from Day One has been that we
work on what the content providers submit. Sometimes works that look
enticing or valuable to them aren't appealing to the proofers, and
then take a long time to wend their way through the system. (Some
texts, like Greg Week's science fiction stories, zip through in days.)

Works that CP'ers submit which are stuck on the queues AREN'T being worked
on. People who volunteer for DP are forced to work on things not stuck on
queues.  That IS the forcing going on.

Work that progresses slowly through the Proofing rounds aren't really the
problem.  The problem is more works that get stuck in the formatting rounds
and the PP rounds.  What I've seen stuck in the proofing rounds has
sections, such as huge sections of publishers ads, or indexes, which most
Proofers get tired of pretty quick -- especially when the work is classified
as "Easy."  I would question the judgment of including publishers' ads when
they aren't even numbered pages nor relate to the subject matter.

Let's try to break this down again in a way that SHOULDN'T be controversial:

0) Premise: DP people ARE acknowledging that having books stuck on queues
3.5 years is not a good thing. If this is NOT a good thing, then SOMETHING
has to change.  If one wants to change the queuing times there is really
ONLY a couple things fundamentally that one can change:

1) You can reduce that rate at which content is placed onto the queues.
That implies SOME kind of principle of selection.  The principle right now
is "First Come First Serve." I suggest this is not a good thing for several
reasons: Books may be put on the queue that people really don't want to work
on.  Books may be put on the queue that people really don't want to read.
And books may be put on the queue that take time and energy disproportional
to the societal benefit to be gained from that book compared to some other
books. Note there are about 50 million books available worldwide that could
be worked on by DP, compared to 2500 roughly a year created by DP, implying
a queuing time for books in general of 20,000 years -- not including those
books that will have risen to the public domain in those 20,000 years!
Another way of saying this is that the selection process used to decide
which books get "rescued" by DP is on the order of 1 book in 10,000 gets
saved.  Now, if only one book in 10,000 gets saved, should this be "at
random" or should there be some kind of selection process -- even if it were
only that the DP volunteers who are going to do the work vote on what gets
put on the queue?

2) You can increase the rate at which content is taken off the queues.  This
requires placing more resources at those places in the queues where things
are getting bogged down, which are P3, F2, and PP. To place more resources
at these places requires at least SOME tweaking of DP's current system of
"technological high priesthood" and would require getting over DP's current
idea that somehow they are creating "perfect books" [which they certainly
are NOT doing!]

3) You can increase productivity by improving tools -- particularly tools
helping P3, F2, and PP.  Producing tools that help P1 is pretty easy, as
many people have suggested, but, it is actually NOT obvious that improving
tools for P1 would prove to be helpful to DP overall!  Making P1 faster and
easier without changing the current rules of "technological high priesthood"
will actually only make the queuing problems more extreme.


From jimad at msn.com  Sat Feb 20 14:00:23 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 14:00:23 -0800
Subject: [gutvol-d] Re: DP: was rfrank reports in
In-Reply-To: <18CC2C23FCF249DEA672595196E236B2@alp2400>
References: <8005.73d837a3.38a06575@aol.com>	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>	<SNT120-DS24A947851758312487F600AE500@phx.gbl>	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
Message-ID: <SNT120-DS6A0D5CB643E1623F7AAC2AE450@phx.gbl>

>"revised" version is going to appear in a few days/weeks/months...

We are not talking about a "revised" version showing up in a couple months.
Rather, we are talking about doing a posting which includes HTML 3.5 years
later.  Verses posting the txt version now rather than later, and thereby
increasing the total collection size of PG by 20%.  One could argue that
this would make the whitewashers job easier rather than harder -- because
then Al wouldn't have to put up with random submissions from people like me
who give up on DP and "route around damage" [thereby introducing "damage" of
our own! :-]


From greg at durendal.org  Sat Feb 20 14:57:24 2010
From: greg at durendal.org (Greg Weeks)
Date: Sat, 20 Feb 2010 17:57:24 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re:  Re: DP: was rfrank reports in
In-Reply-To: <SNT120-DS6A0D5CB643E1623F7AAC2AE450@phx.gbl>
References: <8005.73d837a3.38a06575@aol.com>
	<SNT120-DS1238ECE610EDCA023E7BBAE500@phx.gbl>
	<75EAE59EE9A2439CAF65C1B38795DDEF@alp2400>
	<SNT120-DS24A947851758312487F600AE500@phx.gbl>
	<1e8e65081002091457w4f9fbd71u3f32edfd5eebbf7@mail.gmail.com>
	<SNT120-DS65B55AB31A32F1F72317FAE4F0@phx.gbl>
	<SNT120-DS694A5D54BAD71CD486FE1AE4F0@phx.gbl>
	<1e8e65081002092325l27d0947eq11ec2568e599b1e4@mail.gmail.com>
	<SNT120-DS222FE54DF3AD4E8288D875AE4E0@phx.gbl>
	<6d99d1fd1002101914gaa4a0eq83b6b4529fde1e29@mail.gmail.com>
	<SNT120-DS21D91456CEF74336C05AA3AE4D0@phx.gbl>
	<18CC2C23FCF249DEA672595196E236B2@alp2400>
	<SNT120-DS6A0D5CB643E1623F7AAC2AE450@phx.gbl>
Message-ID: <alpine.DEB.2.00.1002201753550.11265@durendal.durendal.org>


At least one of the discussions going on was exactly the HTML coming a few 
weeks after the text scenario. This was the go through all the rounds and 
the PPer posts the text version as soon as it's done and posts the html 
later. This didn't seem like a terribly useful approach to me as the html 
version of the text is typically NOT where the bottleneck is at DP.

Of course there was at least five other aproaches being discussed in the 
thread.

Greg Weeks

On Sat, 20 Feb 2010, Jim Adcock wrote:

>> "revised" version is going to appear in a few days/weeks/months...
>
> We are not talking about a "revised" version showing up in a couple months.
> Rather, we are talking about doing a posting which includes HTML 3.5 years
> later.  Verses posting the txt version now rather than later, and thereby
> increasing the total collection size of PG by 20%.  One could argue that
> this would make the whitewashers job easier rather than harder -- because
> then Al wouldn't have to put up with random submissions from people like me
> who give up on DP and "route around damage" [thereby introducing "damage" of
> our own! :-]
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

-- 
Greg Weeks
http://durendal.org:8080/greg/


From jimad at msn.com  Sat Feb 20 15:13:20 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 15:13:20 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <4B76F4C6.3030006@teksavvy.com>
References: <4B76F4C6.3030006@teksavvy.com>
Message-ID: <SNT120-DS183C9552CBE2E57ED2490CAE450@phx.gbl>

I do "solos" given my frustration level with DP -- where I've submitted two
really good books but none have made it back out of the system. IMHO setting
up a book to go through the DP system aka Content Providing isn't a whole
lot less work than just doing the whole book for myself in the first place.

Not entirely happy working with myself either -- going it alone is a bit of
slog for me -- but my tolerance level for wasting time is about one month --
which is about how long it takes me to make a book while working around
various family emergencies -- as compared to 40 months for DP.  And with DP
nothing happens for months or years at a time -- and then the people there
are unhappy with you if you happen to be out of town if and when your book
pops off a queue and "goes active".

What I wish is that DP had a "Fast Trackers" division of people interested
in and committed to turning books out quickly, so that one could see a
project from beginning to end.

I still proof at DP occasionally when I have excess energy -- but not enough
to start my own new book project again!


From jimad at msn.com  Sat Feb 20 15:48:17 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 15:48:17 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <9F120957CF48439F9C63FD74DE1B25F7@alp2400>
References: <4B76F4C6.3030006@teksavvy.com>	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>	<4B785C40.5000304@xs4all.nl>	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>
Message-ID: <SNT120-DS3F9D456A3A517C4CA1E21AE450@phx.gbl>

>.... but there's 
currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to 
fix this kind of thing.  (Probably simpler to just re-do this text from 
scratch, which is something *I'm* not about to do.)

OK, HOW ABOUT a mechanism for fixing and/or improving things that were done
in the past that now look old and crufty by today's standards? -- whether
redoing something originally created by DP or by a solo? Certainly WW
shouldn't be the only way to fix old cruft.  If someone wants to take on a
"redo and improve" what does it take? Many of the things that actually get
read at PG are pretty old and crufty! -- I haven't been willing to take on
any of the Ye Olde Cruft for fear of pushback.


From jimad at msn.com  Sat Feb 20 15:50:30 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 15:50:30 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <9F120957CF48439F9C63FD74DE1B25F7@alp2400>
References: <4B76F4C6.3030006@teksavvy.com>	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>	<4B785C40.5000304@xs4all.nl>	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>
Message-ID: <SNT120-DS93F6FB34B82E76AE781E9AE450@phx.gbl>

>In short, DP's current processes produce error-free texts....

I will disagree with this, at least given that DP's current processes
introduce punc errors pretty much by design.


From greg at durendal.org  Sat Feb 20 16:06:07 2010
From: greg at durendal.org (Greg Weeks)
Date: Sat, 20 Feb 2010 19:06:07 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: Many solo projects out there in gutvol-d
	land?
In-Reply-To: <SNT120-DS183C9552CBE2E57ED2490CAE450@phx.gbl>
References: <4B76F4C6.3030006@teksavvy.com>
	<SNT120-DS183C9552CBE2E57ED2490CAE450@phx.gbl>
Message-ID: <alpine.DEB.2.00.1002201903310.11678@durendal.durendal.org>

On Sat, 20 Feb 2010, Jim Adcock wrote:

> What I wish is that DP had a "Fast Trackers" division of people interested
> in and committed to turning books out quickly, so that one could see a
> project from beginning to end.

Don (dkretz) and I and a small team experimented with this a few weeks 
ago. It's entirely possible to do this within the current DP constraints. 
I think we took about two weeks for the short we used. That wasn't the 
main purpose of the experiment, but the short period of time was one of 
the constraints for what we want to test.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From hart at pglaf.org  Sat Feb 20 16:33:08 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Sat, 20 Feb 2010 16:33:08 -0800 (PST)
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <SNT120-DS93F6FB34B82E76AE781E9AE450@phx.gbl>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>
	<SNT120-DS93F6FB34B82E76AE781E9AE450@phx.gbl>
Message-ID: <alpine.DEB.2.00.1002201628580.13867@mail.pglaf.org>


Let's just forget the whole idea of error free texts. . . .

Ever since I started Project Gutenberg I've never seen even
one book I read, even most articles and essays, without big
bluders you would think could never be published.

I would prefer just to get these materials in circulation--
then worry about approaching perfection along with Xeno.

Does anybody have a serious objection to putting the 8,000,
or so, books that were listed earlier as being in limbo, in
something like our "PrePrints" section, where we put eBooks
that are admittedly not ready for prime time???

Please. . . .


Michael


On Sat, 20 Feb 2010, Jim Adcock wrote:

> >In short, DP's current processes produce error-free texts....
>
> I will disagree with this, at least given that DP's current processes
> introduce punc errors pretty much by design.
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From jimad at msn.com  Sat Feb 20 16:43:55 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 16:43:55 -0800
Subject: [gutvol-d] Kindle for Blackberry
In-Reply-To: <1cd29.10b0e871.38a990fb@aol.com>
References: <1cd29.10b0e871.38a990fb@aol.com>
Message-ID: <SNT120-DS87541B0796E2E8C082DFCAE440@phx.gbl>

Amazon has released "Kindle for Blackberry" for free at:

http://www.amazon.com/gp/feature.html/ref=klm_lnd_inst?docId=1000468551

I don't personally own a Blackberry, so I can't report on this one in
specific.  Typically would allow one to read "for pay" Amazon books plus
free public domain books including PG in MOBI format. Why would one care? i)
Yet another "free reader" software for cell phone devices.  ii) Good for
people interested in making MOBI versions of PG books -- or checking out how
their DP efforts look like once translated by PG to MOBI format and from
there to people's cell phones. iii) check out "the competition."


From greg at durendal.org  Sat Feb 20 16:45:39 2010
From: greg at durendal.org (Greg Weeks)
Date: Sat, 20 Feb 2010 19:45:39 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: Many solo projects out there in gutvol-d
	land?
In-Reply-To: <alpine.DEB.2.00.1002201628580.13867@mail.pglaf.org>
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>
	<SNT120-DS93F6FB34B82E76AE781E9AE450@phx.gbl>
	<alpine.DEB.2.00.1002201628580.13867@mail.pglaf.org>
Message-ID: <alpine.DEB.2.00.1002201942370.11947@durendal.durendal.org>

On Sat, 20 Feb 2010, Michael S. Hart wrote:

> Does anybody have a serious objection to putting the 8,000,
> or so, books that were listed earlier as being in limbo, in
> something like our "PrePrints" section, where we put eBooks
> that are admittedly not ready for prime time???

Yea, there are people arguing that it's a horrible thing to do. I'm 100% 
with you on this. Available with a few errors is far more useful than 
unavailable. And it's not that they aren't actually available now, they 
are. DP has always had the concatenated text available for download. It's 
behind a sign on and not indexed by any of the search engines, so if you 
don't know it's there already you can't find it.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From ajhaines at shaw.ca  Sat Feb 20 17:08:18 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Sat, 20 Feb 2010 17:08:18 -0800
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?
References: <4B76F4C6.3030006@teksavvy.com>
	<DD5174F06F724F3C8CCEAADEE9390D34@alp2400>
	<627d59b81002131253k7036ac26td82ef7f67090706a@mail.gmail.com>
	<Pine.GSO.4.58.1002132245560.6690@vtn1.victoria.tc.ca>
	<4B785C40.5000304@xs4all.nl>
	<627d59b81002141247x6782fa5ckc8eb1c863299bb02@mail.gmail.com>
	<A87ABE40-9220-435C-9FBD-7A0C0890A214@uni-trier.de>
	<627d59b81002150943i4a7b6c12ibaa3e76061c4abf0@mail.gmail.com>
	<1e8e65081002151247y187698c7xc7ca8f1326dacefa@mail.gmail.com>
	<9F120957CF48439F9C63FD74DE1B25F7@alp2400>
	<SNT120-DS3F9D456A3A517C4CA1E21AE450@phx.gbl>
Message-ID: <3EBEA039872B469FBC6A629378263ACB@alp2400>

Any "mechanism" is informal, at best, and there's no list of old submissions 
that would benefit from being re-done.

To use as an example, Arizona Sketches, by J. A. Munk, PG#756.  Internet 
Archive has a number of source copies.  In 2008, I cleaned up PG's text 
file, made corrections, and created an HTML version.  It's missing all 
illustrations, any Latin1 characters, and so forth.


If the only intent is to correct a current PG etext, the corrected text and 
HTML files can be sent to PG's Errata system.  Do not reformat the files, so 
that the corrected ones can be compared to the posted ones.  It might take a 
few days for the WWers to deal with such submissions, but they *will* be 
dealt with.

However, if you want to add illustrations, or any other material that may be 
missing from the posted files, you'll have to submit a copyright clearance 
for the source edition, do whatever is needed to add the missing material to 
the posted files, do a thorough check/correction of those files from the 
source, then upload everything as normal, mentioning in the Note to 
Whitewashers field that the submission is intended as an update to an 
existing etext.  The WWers will decide whether to post the new submission as 
a new etext, or to replace (and archive) the existing files.  If the latter 
is chosen, the original submitter's credit will be added to the new 
version's Credit line.


----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Saturday, February 20, 2010 3:48 PM
Subject: [gutvol-d] Re: Many solo projects out there in gutvol-d land?


> >.... but there's
> currently no mechanism except for the Whitewashers, a.k.a. Errata Team, to
> fix this kind of thing.  (Probably simpler to just re-do this text from
> scratch, which is something *I'm* not about to do.)
>
> OK, HOW ABOUT a mechanism for fixing and/or improving things that were 
> done
> in the past that now look old and crufty by today's standards? -- whether
> redoing something originally created by DP or by a solo? Certainly WW
> shouldn't be the only way to fix old cruft.  If someone wants to take on a
> "redo and improve" what does it take? Many of the things that actually get
> read at PG are pretty old and crufty! -- I haven't been willing to take on
> any of the Ye Olde Cruft for fear of pushback.
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From jimad at msn.com  Sat Feb 20 17:35:39 2010
From: jimad at msn.com (Jim Adcock)
Date: Sat, 20 Feb 2010 17:35:39 -0800
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <28809.577ebb12.38b19bf3@aol.com>
References: <28809.577ebb12.38b19bf3@aol.com>
Message-ID: <SNT120-DS12455943132A9D37AC63F9AE440@phx.gbl>

Pagination is not necessarily the same thing as page numbers. I like
retaining some notion of page numbers even if it is just in the form of
invisible or semi-invisible HTML. I also like retaining original linebreaks
info to assist future proofing or reworking passes -- which again is not the
same as displaying original linebreaks.  I dislike anything that prevents
reflow, which I think is necessary for the enjoyment of most users.


From Bowerbird at aol.com  Sat Feb 20 17:52:13 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Feb 2010 20:52:13 EST
Subject: [gutvol-d] [SPAM] re: Many solo projects out there in gutvol-d land?
Message-ID: <2ecd9.15b98434.38b1ebcd@aol.com>

michael said:
>   Let's just forget the whole idea of error free texts. . . .

well, that's going a bit overboard.

better, let's try to achieve perfection, but let's not let that high goal
get in the way of making work available even if it's not yet perfect...


>    I would prefer just to get these materials in circulation--
>    then worry about approaching perfection along with Xeno.

yes, except i don't see any part of project gutenberg that is
doing very much at all in the way of "approaching perfection".

once a text is posted, people seem to forget it completely...

even when the whitewashers "correct" an e-text, they aren't
doing nearly what they could be doing in order to improve it.

it seems like there is a constant mad dash to do new books,
but almost nothing is being done to fix up any older books.

i asked at 10,000 books for a review in terms of quality control,
and again at 15,000, and again at 20,000, and again at 25,000.
i didn't bother to ask again at 30,000, because what's the use?

but at some point, some hard questions will need to be asked...


>    Does anybody have a serious objection to putting the 8,000,
>    or so, books that were listed earlier as being in limbo, in
>    something like our "PrePrints" section, where we put eBooks
>    that are admittedly not ready for prime time???

well, i certainly don't...

but many of the volunteers over at distributed proofreaders do.

indeed, according to a poll (which has now received one of the
highest number of votes on any poll that has been done there),
there are split evenly -- right down the middle -- on this issue.

i don't know what to make of that.   but that's the way it is.

it's also worth mentioning that those 8,000 books are _not_
"almost done".   some of them really aren't even very close...
some are full of typos, still.   most contain pseudo-markup,
which really should be converted to something more useful
before the books are ever put in front of the general public.
lots contain "proofer's notes", which would confuse people.
it should also be noted that many of them are not in english,
which might (or might not) have bearing on the question, but
since i only speak english, i wouldn't have any idea what it is.

considering all this, it would _not_ be a simple procedure to
free up this matter.   it could use up a lot of time and energy,
and for very little benefit in return.   (does any use "preprints"?)

what _would_ be useful is for this material to be put on a wiki,
in order to test notions of public postprocessing collaboration.

instead of saying "here, take this unfinished work", we _should_
instead be saying "here, come help finish this unfinished work".

at one point, i was tempted to build such a postprocessing system.
but then i realized i didn't want to help d.p. get over their backlog;
d.p. deserves to suffer the consequences of their terrible workflow,
or they'll _never_ be motivated to fix it...   so i decided to let it be...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/89ac2bf9/attachment.html>

From Bowerbird at aol.com  Sat Feb 20 17:54:56 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Feb 2010 20:54:56 EST
Subject: [gutvol-d] [SPAM] re: Re: so what is so important about pagination?
Message-ID: <2ed73.402f31ef.38b1ec70@aol.com>

jim said:
>   I dislike anything that prevents reflow, 
>    which I think is necessary for the enjoyment of most users.

there is absolutely nothing about retaining pagination (or linebreaks,
or end-line hyphenates) that "prevents reflow", jim, and i wish you
would stop repeating that nonsense.   you need to pay better attention.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/18d8d33f/attachment.html>

From Bowerbird at aol.com  Sat Feb 20 18:15:02 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 20 Feb 2010 21:15:02 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <2f260.45cb45b8.38b1f126@aol.com>

i've merged gardner's corrections into the archive.org o.c.r.,
and posted the results on my website.  here's a sample url:
>?? http://z-m-l.com/go/gardn/gardnp123.html

gardner dehyphenated end-line hyphenates, so i rejoined
(some of) them.   i'll write another routine to do the rest...

out of a file that contains about 4,000 lines, there are
only 800 (at this point) which differ in the two versions.

so even early in the merge, 80% of the o.c.r. lines were right,
and that number will increase with more aggressive cleaning.

the (presumably incorrect) lines from the o.c.r. are in red,
while the lines from gardner's proofed copy are in blue...

if you prefer to view all of the edits on one web-page, see:
>    http://z-m-l.com/go/gardn/gardn-hybrid.html

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/7b8f3e77/attachment.html>

From jimad at msn.com  Sat Feb 20 19:15:31 2010
From: jimad at msn.com (James Adcock)
Date: Sat, 20 Feb 2010 19:15:31 -0800
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <2ed73.402f31ef.38b1ec70@aol.com>
References: <2ed73.402f31ef.38b1ec70@aol.com>
Message-ID: <SNT120-DS18CC17429357172A50C423AE440@phx.gbl>

>there is absolutely nothing about retaining pagination (or linebreaks,
or end-line hyphenates) that "prevents reflow", jim, and i wish you
would stop repeating that nonsense.

 
As always, we talk past each other - I talk about problems in the real
world, and Bowerbird responds with hypotheticals from bowerbirdworld.
Certainly current PG choice of linebreaks IS causing real world customers
from reading PG books on their choice of hardware.  I know because I have
responded to their complaints about PG brokenness on other forums. Real
world customers just want to read books, they don't want to have to route
around PG damage.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/10b87626/attachment-0001.html>

From dakretz at gmail.com  Sat Feb 20 19:48:05 2010
From: dakretz at gmail.com (don kretz)
Date: Sat, 20 Feb 2010 19:48:05 -0800
Subject: [gutvol-d] Re: [SPAM] re: Many solo projects out there in gutvol-d
	land?
In-Reply-To: <2ecd9.15b98434.38b1ebcd@aol.com>
References: <2ecd9.15b98434.38b1ebcd@aol.com>
Message-ID: <627d59b81002201948n196a7757g620d8c1b8306550d@mail.gmail.com>

I think it would certainly get their attention if they were told that
Michael S. Hart
would prefer that they focus on doing whatever it reasonably takes to remove
the notes, standardize the markup, and get them posted.

In fact, I might just post something to that effect myself. Let's see ...
it's "Ctl-C" here,
switch to DP, long in, "Ctl-V", ....


On Sat, Feb 20, 2010 at 5:52 PM, <Bowerbird at aol.com> wrote:

>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/41c93d8c/attachment.html>

From dakretz at gmail.com  Sat Feb 20 21:14:21 2010
From: dakretz at gmail.com (don kretz)
Date: Sat, 20 Feb 2010 21:14:21 -0800
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
References: <2ed73.402f31ef.38b1ec70@aol.com>
	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
Message-ID: <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>

It's not trivial that it would make shared proofing a lot easier and less
ambiguous.

Just match the image.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100220/91bf74f6/attachment.html>

From schultzk at uni-trier.de  Sun Feb 21 06:35:34 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Sun, 21 Feb 2010 15:35:34 +0100
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>
References: <2ed73.402f31ef.38b1ec70@aol.com>
	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
	<627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>
Message-ID: <B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>

Hi Don,

	You are write about this. But, the people over at DP want more.
	Which is also fine.

	The problem is that over at DP they seem to me to focused
	on output formating. 

	What they do seem to understand that you can use markup as
	pseudo-code or pseudo-mark for that matter.

	The want to keep as much information as possible. That is not hard if
	you use pseudo-code. 

	The first step would be as you said is to match the scanned image.
	So you have a text containing alot of code marking the original
	linebreaks, chapter beginnings, page marks, page numbers, bold, italics,
	indentation, images. 
	This markup will not be easily human-readable, but computers
	do good work of rendering/display in an appropriate fashion.

	Then all you need is a simple tool that parser this format into the output
	format you want. 
	e.g for plain text:
		throw-out page breaks, images, hyphenation
		convert footnotes to PG Style
		convert bold, italics, PG Style
		start output PG Style
			output PG Header
			output text PG Style
				two linebreaks for paragrahs
				    before Chatpters, etc.
				wrap accordingly
	This an other simplification.

	for HTML (everything in one page)
		throw-out hyphenation
		create tags for bold, italic
		create tags for chapter header, with anchors
		create tags for paragraphs repecting indentation for verse and such.
		throw-out linebreaks
		create footnotes. with anchors
		create tags for images
		create TOCs
	You could also have the system produce a more complex HTML-structure,
	directories for chatpters, one file per page, etc.

	 The same procedure can be applied to other output formats. 

	That is the cool thing about pseudo-code it does not produce output
	if you do not want it or need it!!

	regards
		Keith.

				
Am 21.02.2010 um 06:14 schrieb don kretz:

> It's not trivial that it would make shared proofing a lot easier and less ambiguous.
> 
> Just match the image.
> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From dakretz at gmail.com  Sun Feb 21 08:46:31 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 21 Feb 2010 08:46:31 -0800
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>
References: <2ed73.402f31ef.38b1ec70@aol.com>
	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
	<627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>
	<B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>
Message-ID: <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com>

Keith, I agree 100% I've been arguing markdown and textile - even zml - for
years.


Don
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/da7d5fef/attachment.html>

From dakretz at gmail.com  Sun Feb 21 09:04:54 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 21 Feb 2010 09:04:54 -0800
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com>
References: <2ed73.402f31ef.38b1ec70@aol.com>
	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
	<627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>
	<B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>
	<627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com>
Message-ID: <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com>

ReStructuredText
<http://www.freebase.com/view/en/restructuredtext/-/user/sandos/computation>is
a newer one that seems to be particularly extensible
(hence expressive and adaptable.)

On Sun, Feb 21, 2010 at 8:46 AM, don kretz <dakretz at gmail.com> wrote:

> Keith, I agree 100% I've been arguing markdown and textile - even zml - for
> years.
>
>
> Don
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/a2b1fc17/attachment.html>

From schultzk at uni-trier.de  Sun Feb 21 10:32:00 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Sun, 21 Feb 2010 19:32:00 +0100
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com>
References: <2ed73.402f31ef.38b1ec70@aol.com>
	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
	<627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>
	<B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>
	<627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com>
	<627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com>
Message-ID: <736D3990-549A-4D41-9498-0586A4881B85@uni-trier.de>

Hi Don,

	I am not talking markdown or Restructured.

	I am talking about a true markup langauge. as Meta-language
	XML or TeX can be used. 

	The idea is to have tags which contain information that is not
	truly foramtting. E.G say a pagenumber Tag just states that this page
	is pagenumber n it could be intergrateg into page break like
	\page{5}. You could have a footer of an page that look like this
	\footer{\right{\bold{page} \italic{5}}}}
	This footer contains the number of the page but does have anything
	to due with the page or pagenumber tag.

	regards
		Keith.

Am 21.02.2010 um 18:04 schrieb don kretz:

> ReStructuredText is a newer one that seems to be particularly extensible
> (hence expressive and adaptable.)
> 
> On Sun, Feb 21, 2010 at 8:46 AM, don kretz <dakretz at gmail.com> wrote:
> Keith, I agree 100% I've been arguing markdown and textile - even zml - for years.
> 
> 
> Don
> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/b301b7fb/attachment.html>

From marcello at perathoner.de  Sun Feb 21 10:51:36 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun, 21 Feb 2010 19:51:36 +0100
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important
	about	pagination?
In-Reply-To: <627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com>
References: <2ed73.402f31ef.38b1ec70@aol.com>	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>	<627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>	<B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>	<627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com>
	<627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com>
Message-ID: <4B8180B8.4070305@perathoner.de>

don kretz wrote:

> ReStructuredText 
> <http://www.freebase.com/view/en/restructuredtext/-/user/sandos/computation>is
> a newer one that seems to be particularly extensible (hence
> expressive and adaptable.)

This is an example of a RST that EpubMaker (the converter that does all 
PG epubs) can convert to an industrial-strength epub:


.. -*- encoding: utf-8 -*-

.. meta::
    :DC.Creator: Raymond Chandler
    :DC.Title: The Big Sleep
    :DC.Language: English
    :DC.Created: 1939


The Big Sleep by Raymond Chandler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. contents:: Contents
    :backlinks: entry


Chapter 1
=========

It was about eleven o?clock in the morning, mid October, with the sun 
not shining and a look of hard wet rain in the clearness of the 
foothills. I was wearing my powder-blue suit, with dark blue shirt, tie 
and display handkerchief, black brogues, black wool socks with dark blue 
clocks on them. I was neat, clean, shaved and sober, and I didn?t care 
who knew it. I was everything the well-dressed private detective ought 
to be. I was calling on four million dollars.

[...]


Chapter 2
=========

[...]


-- 
Marcello Perathoner
webmaster at gutenberg.org

From gbnewby at pglaf.org  Sun Feb 21 11:33:09 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Sun, 21 Feb 2010 11:33:09 -0800
Subject: [gutvol-d] Mirroring the firehost? Re: Re: Many solo projects out
 there in gutvol-d land?
Message-ID: <20100221193308.GE10824@pglaf.org>

On Sat, 20 Feb 2010, Michael S. Hart wrote:
> 
> > Does anybody have a serious objection to putting the 8,000,
> > or so, books that were listed earlier as being in limbo, in
> > something like our "PrePrints" section, where we put eBooks
> > that are admittedly not ready for prime time???
> 
> Yea, there are people arguing that it's a horrible thing to do. I'm 100% with
> you on this. Available with a few errors is far more useful than unavailable.
> And it's not that they aren't actually available now, they are. DP has always
> had the concatenated text available for download. It's behind a sign on and not
> indexed by any of the search engines, so if you don't know it's there already
> you can't find it.

What's the URL?  I could set up a nightly mirror...

Do they automatically disappear from this area, after they
are finally published?

  -- Greg

From greg at durendal.org  Sun Feb 21 11:48:38 2010
From: greg at durendal.org (Greg Weeks)
Date: Sun, 21 Feb 2010 14:48:38 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Mirroring the firehost? Re: Re: Many solo
 projects out there in gutvol-d land?
In-Reply-To: <20100221193308.GE10824@pglaf.org>
References: <20100221193308.GE10824@pglaf.org>
Message-ID: <alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>

On Sun, 21 Feb 2010, Greg Newby wrote:

>> had the concatenated text available for download. It's behind a sign on and not
>> indexed by any of the search engines, so if you don't know it's there already
>> you can't find it.
>
> What's the URL?  I could set up a nightly mirror...
>
> Do they automatically disappear from this area, after they
> are finally published?

There's not a single place, you have to walk the projects lists using the 
search function. They do eventually disapear, but the status changes to 
posted when they are posted to PG.

Do you have a sign on at DP? If so try:

http://www.pgdp.net/c/tools/project_manager/projectmgr.php?show=search&title=&author=&language[]=&special_day[]=&projectid=&project_manager=&checkedoutby=&pp_er=&ppv_er=&postednum=&state[]=P3.proj_waiting&n_results_per_page=100

That's everything in the P3 waiting queue. If you pick one from that list. 
(I'm going to grab one of mine.)

http://www.pgdp.net/c/project.php?id=projectID4b5e3e5a9b845&detail_level=3

There's a link titled "Download Concatenated Text" with a download button 
that will download a zip with the text from the last proofing round.

The two queues that are most interesting because they are the largest are 
the P3 waiting and F2 waiting.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From dakretz at gmail.com  Sun Feb 21 12:13:14 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 21 Feb 2010 12:13:14 -0800
Subject: [gutvol-d] Re: [SPAM] re: Re: so what is so important about
	pagination?
In-Reply-To: <4B8180B8.4070305@perathoner.de>
References: <2ed73.402f31ef.38b1ec70@aol.com>
	<SNT120-DS18CC17429357172A50C423AE440@phx.gbl>
	<627d59b81002202114g5c0a71een96b5b2c5576a6863@mail.gmail.com>
	<B6F6754B-7CF9-4DC8-BC8C-C53FA55DF426@uni-trier.de>
	<627d59b81002210846l53ecedeeua8ede575c3ad4030@mail.gmail.com>
	<627d59b81002210904q3d657e74i458f4fd44ccd06b7@mail.gmail.com>
	<4B8180B8.4070305@perathoner.de>
Message-ID: <627d59b81002211213m9de8d27p1f45f8836155e9b7@mail.gmail.com>

Right, there must be (and always is, in my experience,) at least one
unambiguous
and comprehensive mapping between the lightweight markup and whatever
XML-style tagging you want to declare. And, in many cases, more than one.
But HTML is usually first. TeX would also qualify. It's permissible to have,
for instance, light-weight markup for syntactic artifacts. There shouldn't
be
anything XML can do that can't map to your lwml. Worst case, just
incorporate
your HTML/XML/TeX directly.

The lwml just uses conventions and smaller tags to make the markup
readable and more easily editable.

If someone thinks this means dumbing-down the markup, then I think
they misunderstand the purpose and the execution.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/25138030/attachment.html>

From dakretz at gmail.com  Sun Feb 21 12:25:36 2010
From: dakretz at gmail.com (don kretz)
Date: Sun, 21 Feb 2010 12:25:36 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Mirroring the firehost? Re: Re: Many solo
	projects out there in gutvol-d land?
In-Reply-To: <alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
References: <20100221193308.GE10824@pglaf.org>
	<alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
Message-ID: <627d59b81002211225y4109839bx90dd3179d3190109@mail.gmail.com>

Here's the html form with the GET variables that comprise the url.

<form method='post'
action='http://www.pgdp.net/c/tools/project_manager/generate_post_files.php'>
<input type='hidden' name='projectid' value='projectID4b7deca7757f8'>
<input type='radio' name='round_id' value='[OCR]'>[OCR]&nbsp;

<input type='radio' name='round_id' value='P1' >P1&nbsp;
<input type='radio' name='round_id' value='P2' CHECKED>P2&nbsp;
<br>For each page, use:<br>
<input type='radio' name='which_text' value='EQ' CHECKED>the text (if
any) saved in the selected round; or<br>
<input type='radio' name='which_text' value='LE'>the latest text saved
in any round up to and including the selected round.<br>
(If every page has been saved in the selected round, then the two
choices are equivalent.)<br>
<input type='hidden' name='include_proofers' value='0' /><input
type='hidden' name='save_files' value='0' /><input type='submit'
value='Download'>
</form>

All you need then is a list of the project codes.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/35fee94d/attachment.html>

From Bowerbird at aol.com  Sun Feb 21 12:47:34 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 21 Feb 2010 15:47:34 EST
Subject: [gutvol-d] Re: so what is so important about pagination?
Message-ID: <bb1d.3e1f466b.38b2f5e6@aol.com>

keith said:
>   I am not talking markdown or Restructured.
>    I am talking about a true markup langauge.

it appears you don't really know what you're talking about, as
both markdown and restructured _are_ "true markup languages".

you want to invent a new one.   fine.   go ahead.   i did it myself...

-bowerbird

p.s.   don, restructured text is older than markdown, and textile too,
as far as i know...   it's a reworking of "structured text", which is the
granddaddy of all the light markup languages.   and -- by the way --
z.m.l. is older than markdown.   indeed, one of the few appearances
of charlz on this listserve was when he came to announce markdown
as a z.m.l. clone...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/dfc27cf6/attachment.html>

From Bowerbird at aol.com  Sun Feb 21 13:00:32 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 21 Feb 2010 16:00:32 EST
Subject: [gutvol-d] let us not be confused
Message-ID: <bfef.7902e780.38b2f8f0@aol.com>

ok, we've got a couple of different topics running around,
so let us take a minute to make sure we are not confused...

first of all, let's talk about my campaign for preprocessing...

i have demonstrated, over and over and over again, that d.p.
(and rfrank) should be doing _much_ better preprocessing...

i've shown how they can use _very_simple_means_ to do that,
and how -- if they did -- they could reduce the error-counts
in their books to a ridiculously small amount, even _before_
their text went in front of proofers.   i have talked about how
it is a huge _waste_ of the generous donations of volunteers
(in both time and energy) not to do aggressive preprocessing,
which automatically locates errors to make them easy to fix...

again, the crux of my argument -- and i have proven it to be
absolutely true, again and again -- is that it's _easy_ to do this.

indeed, when i have shown the steps taken to locate the errors,
it becomes painfully obvious how ridiculously simple they are...

they include obvious checks, like a number embedded in a word,
or a lowercase letter followed by a capital letter, or two commas
in a row, or a period at the beginning of a line.   _obvious_ stuff!
this isn't rocket science.   it's not even _hard_...   it's dirt-simple!
and yet neither d.p. nor rfrank has instituted such preprocessing.

***

let's contrast this with gardner's request, which was to compile
a list of reg-ex tests that will locate all possible errors in any
random book.   this request -- as worthy as it might seem -- is
_much_ more difficult to realize.   in fact, it's almost impossible.

a friend of mine over in england, nick hodson, is a very prolific
digitizer.   all by himself, he has done some 500 books or more.
nick collected an extensive set of checks over the years.   i can't
remember exactly how many there were, but roughly about 200.

however, once nick upgraded his o.c.r. program, he found that
about half of his checks were no longer required.   they had been
necessary essentially as an artifact of an outdated o.c.r. program.

the type of books nick was digitizing hadn't changed, and neither
had the quality of the scans, or the resolution of the scans, or the
digital retouching that he performed on the scans -- none of that.
he was the same person, using the same computer and scanner,
and he was doing the same things exactly as he had done before.

the only thing that changed was the version of his o.c.r. program.

yet he found many checks he formerly needed became unnecessary.

so, for an operation like d.p., who intakes all kinds of scans and
uses a wide variety of o.c.r. programs, operated by users with a
huge range of expertise, their results will be all over the board.
they're _never_ gonna get a definitive list of checks to be made.

it would be _immensely_ difficult, to the point of being impossible.

but that's totally beside our other point, about preprocessing...

because the fact of the matter is that a few dozen _simple_ tests
are all that d.p. needs in order to reduce the number of errors to
a level where they can be handled easily by their human proofers.

they're never gonna get 100%.   but they could find 90% so easily
that it's criminal negligence that they aren't doing that already...

heck, spell-check by itself will locate 50% of the errors for you...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100221/60f525f6/attachment-0001.html>

From traverso at posso.dm.unipi.it  Sun Feb 21 13:07:24 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Sun, 21 Feb 2010 22:07:24 +0100 (CET)
Subject: [gutvol-d] Re: [SPAM] Re: Mirroring the firehost? Re: Re: Many solo
 projects out there in gutvol-d land?
In-Reply-To: <alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
	(message from Greg Weeks on Sun, 21 Feb 2010 14:48:38 -0500 (EST))
References: <20100221193308.GE10824@pglaf.org>
	<alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
Message-ID: <20100221210724.0FA49FFB1@cardano.dm.unipi.it>

>>>>> "Greg" == Greg Weeks <greg at durendal.org> writes:

    Greg> On Sun, 21 Feb 2010, Greg Newby wrote:

    >>> had the concatenated text available for download. It's behind
    >>> a sign on and not indexed by any of the search engines, so if
    >>> you don't know it's there already you can't find it.
    >>  What's the URL?  I could set up a nightly mirror...
    >> 
    >> Do they automatically disappear from this area, after they are
    >> finally published?

    Greg> There's not a single place, you have to walk the projects
    Greg> lists using the search function. They do eventually
    Greg> disapear, but the status changes to posted when they are
    Greg> posted to PG.

    Greg> Do you have a sign on at DP? If so try:

    Greg> http://www.pgdp.net/c/tools/project_manager/projectmgr.php?show=search&title=&author=&language[]=&special_day[]=&projectid=&project_manager=&checkedoutby=&pp_er=&ppv_er=&postednum=&state[]=P3.proj_waiting&n_results_per_page=100

    Greg> That's everything in the P3 waiting queue. If you pick one
    Greg> from that list.  (I'm going to grab one of mine.)

    Greg> http://www.pgdp.net/c/project.php?id=projectID4b5e3e5a9b845&detail_level=3

    Greg> There's a link titled "Download Concatenated Text" with a
    Greg> download button that will download a zip with the text from
    Greg> the last proofing round.

    Greg> The two queues that are most interesting because they are
    Greg> the largest are the P3 waiting and F2 waiting.

    Greg> -- Greg Weeks http://durendal.org:8080/greg/

I have scripts that can download concatenated text scripts without
manual handling, and without a browser, but are quite tricky, and I am
not willing to discuss them in public, but will provide them to Greg
Newby (as DP board member) if he wants. Just send me an email.

Carlo

From schultzk at uni-trier.de  Mon Feb 22 02:01:36 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 22 Feb 2010 11:01:36 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <bb1d.3e1f466b.38b2f5e6@aol.com>
References: <bb1d.3e1f466b.38b2f5e6@aol.com>
Message-ID: <9752B13F-42A6-4469-8BEB-DD3ECEAC06A5@uni-trier.de>


Am 21.02.2010 um 21:47 schrieb Bowerbird at aol.com:

> keith said:
> >   I am not talking markdown or Restructured.
> >   I am talking about a true markup langauge.
> 
> it appears you don't really know what you're talking about, as
> both markdown and restructured _are_ "true markup languages".
	It would be futile to discuss want constitutes a mark language.
	
> 
> you want to invent a new one.  fine.  go ahead.  i did it myself...

	No. Not a new mark up language. an encoding or transcription if you wish.
	
	Creating the "language" is not the problem. the code base for the tools and
	getting them to be used by a broad audience. I do not have the time to
	do so. DP has the infra structure. But what they are missing we have be there
	and back again.

	regards
		Keith.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/80c7eed4/attachment.html>

From Bowerbird at aol.com  Mon Feb 22 11:22:28 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Feb 2010 14:22:28 EST
Subject: [gutvol-d] Re: so what is so important about pagination?
Message-ID: <105c1.42bfe29.38b43374@aol.com>

as i said earlier, there's no real need to provide "justification"
for pagination.   the fact that _some_ people want it is enough
to make us decide that we shouldn't toss out that information.

however, since nobody has mentioned the _best_ justification
for including pagination information, i might as well tell you...

here at the end of the paper-book half-millenium, we have
roughly 10 million different books out there in the world...

(this is according to my memory of recent figures, which may
be off, perhaps even by a large amount, but that's immaterial.)

if we figure there are an average of 1,000 copies of each book,
that means we've got about 10 billion copies of paper-books...

that's a lot of paper-books out there in the world.   a whole lot.

those paper copies are the _originals_, and they always will be.

in the future -- even right now, thanks to google -- we have a 
virtually unlimited number of digital copies of those originals.

but again, those digital versions will _always_ be "the copies"...

and the paper-books will _always_ be "the originals"...   forever.

(even books that're "born digital" often become physical quickly,
and that will continue into the far future with print-on-demand;
and paper-books, due to their _physical_and_material_ nature,
will always be the "real" books, while digital versions will always
be the "copies", especially since they can be manipulated at will,
while physical books have the virtue/liability of being "frozen".)

"real" doesn't mean "more valuable" or "more important", it means
_physical_ and _tangible_ and _visible_ and _made_out_of_atoms_.

you really have to ground yourself in this thinking to understand
-- _physical_ books are the "real" ones; digital books are "copies".

that's our first important factor...

and our second important factor is that e-books are manipulatable.

and just as the frozen nature of p-books is both virtue and liability,
so too is this manipulability.   on the one hand, it's easy to fix errors,
provide updates, and so on and so forth...   but, on the other hand,
it's also easy to alter the book in a way the author did not intend...

and if you don't think people _will_ try to rewrite history, you're nuts.

plus there's just sheer incompetence, which has already resulted in
a number of very shoddy digitizations of books, full of inaccuracies.

just try and find all the copies of "pride and prejudice" out there, and
then do a determination on which ones are "accurate" and which not.
you will find this task to be overwhelming, and nearly impossible, and
that's just one book out of our 10 million books.   that is the problem.

so there's little question that people in the future will be _skeptical_
about each and every e-book which they are handed, and rightly so...
for reasons from accidental to quite intentional, it might be inaccurate.

so we have a state where there are some "known" p-book "originals",
and a ton of digital "copies" that might or might not be "trustworthy".

(i believe jon noring has been absent from here for long enough that
it's once again safe to use that word without all his derogatory spin.)

now, there's only one solution to this state.   any specific digital copy
will have to be able to _prove_ its correspondence to a paper-copy...

the easiest way to provide such proof is to assume the same form as
the paper-copy; that is, it must adopt the linebreaks and pagination,
so that each and every page can be subjected to visual confirmation...

of course, in order to have value as a digital book, the file must be
able to drop the linebreaks/pagination, and assume another form,
one that reflows to the current set of desires of the end-user, _but_
it _must_ be able to mimic the look-and-feel of the paper-book too.

if it cannot, it's simply going to be discarded as being untrustworthy.

your e-book cannot afford to be nothing more than a formless blob.
it _must_ be able to "snap to" a form that exactly imitates the p-book.
and for it to be able to do that, you must keep linebreaks/pagination.

it's really that simple.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/1b0a3e32/attachment.html>

From lee at novomail.net  Mon Feb 22 11:28:21 2010
From: lee at novomail.net (Lee Passey)
Date: Mon, 22 Feb 2010 12:28:21 -0700
Subject: [gutvol-d] Re: "The Inheritance" by Susan Edmondstone Ferrier
In-Reply-To: <97FD5D5CD0E846AD94214B14886737BA@alp2400>
References: <12985.490903c3.38adae70@aol.com>
	<97FD5D5CD0E846AD94214B14886737BA@alp2400>
Message-ID: <4B82DAD5.801@novomail.net>

On 2/17/2010 3:50 PM, Al Haines (shaw) wrote:

> Your motivation? What do I care? Be altruistic, and do a book.

But at the end of the day, if BB produces a book and gives it to PG 
(after, of course, posting a copy to the Internet Archive before it is 
degraded by the whitewashers) you have a book.

But if he participates in Mr. Frank's "roundless" experiment, and you 
both encourage others to do so, at the end of the day you still have 
your book, probably faster than you would have gotten it otherwise 
(because I know quantity is important to PG), and perhaps even less 
error-prone than if a single individual had produced it (even though 
quality really isn't that important to PG), and BB and Mr. Frank have 
valuable data that perhaps can be used to develop a more efficient 
production system.

Either way, you still get your book, but in the latter scenario valuable 
data is produced as well. Be altruistic, Mr. Haines, and support the 
experiment.

From Bowerbird at aol.com  Mon Feb 22 11:48:08 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Feb 2010 14:48:08 EST
Subject: [gutvol-d] hallelujah
Message-ID: <11d7d.11b56de9.38b43978@aol.com>

hallelujah!

some people are finally talking some sense into rfrank's head.

one person suggested some reg-ex tests be shown to proofers.
all by itself, this is an improvement, but not that big, because
these tests should be done _before_ the text goes to proofers.
however, what it did was it jolted roger out of his thinking that
such tests are done in _postprocessing_, a huge ideological shift.
roger admitted as much.

thank you lord!

another person pointed out that a global search-and-replace
would be a real asset. d'uh, who's been saying that for _years_?

one person said:
>    I've noticed "Pem" scanned as "Pern" a few times.

roger responded with:
>    Done. Fifty-seven replacements.

yes!   see how easy this can be?

so roger said, "tell me what kind of global changes you'd make".

so one person came back and said, "how about things like these:"
>    change did n't to didn't
>    change could n't to couldn't

kinda hard to believe that those fixes weren't already being made
in preprocessing, isn't it?   but hey, let's be glad for the progress...

sure enough, roger got the hint, and changed all of the floating
contractions, and then promised he'd do that in _preprocessing_
next time.   hallelujah!   now is the time to give roger that list of
30 tests that i highlighted back in that month-long series i did...

now the _next_ thing would be for someone to _volunteer_ to do
all of this preprocessing for roger, using the tool dkretz coded...

hallelujah!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/99486d23/attachment.html>

From hart at pglaf.org  Mon Feb 22 11:57:26 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Mon, 22 Feb 2010 11:57:26 -0800 (PST)
Subject: [gutvol-d] !@!!@!!@!Re: Re: so what is so important about
 pagination?
In-Reply-To: <105c1.42bfe29.38b43374@aol.com>
References: <105c1.42bfe29.38b43374@aol.com>
Message-ID: <alpine.DEB.2.00.1002221143310.30703@mail.pglaf.org>


bowerbird says:

your e-book cannot afford to be nothing more than a formless blob.
it _must_ be able to "snap to" a form that exactly imitates the p-book.
and for it to be able to do that, you must keep linebreaks/pagination.

///

Making ebooks "a form that exactly imitates the p-book" is a KILLER!!!

While he mentions the various eBooks of Jane Austen, he fails to talk
about the wide variety of Jane Austen's p-books, and that paginations
run rampant among them, not to mention margination, spelling, etc.

THERE IS NO SUCH THING AS /ONE/ eBOOK THAT RULES THEM ALL. . . .

As any of you who have followed this kind of conversation before know
by now, I tried to find just TWO Declaration of Independence copies I
could use to say they agreed with each other when I started the first
entry in Project Gutenberg.  While I do not doubt that somwehere I am
likely to be able to FIND two, I did not find such a pair in research
of half a dozen copies at the time, nor even two copies that agreed a
vast majority of the times there were such issues.

IT WAS A COLOSSAL WASTE OF TIME!!!!!!!

When I think of going through much longer works. . .well, I do not!!!

We went through all of this with Paradise Lost very early on, and the
result was that we silenced our "pearls before swine" critics of some
very highly places Milton scholars, and it was fun doing so, but that
was all there was to it, no real change for the average reader.

I am not about to let one person, or journal, however scholarly, make
the decision for Project Gutenberg as to what editions to use and how
exactly to portray them in whatever format, margination, pagination--
or font, or color, or whatever.

If so. . .we are nothing more than a Xerox machine. . . .

We should, as always, create something "BETTER THAN THE ORIGINAL!!!"

Even if it means ruffling a few feathers. . . .

My own dream is a single file, hardly larger than a plain text file,
that contains all the editions VOLUNTEERS decide we should have.

If the ivory tower is not willing to do that last percent or three--
to create their own "PERFECT" edition--let them whine like swine.


From Bowerbird at aol.com  Mon Feb 22 13:31:10 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Feb 2010 16:31:10 EST
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
	pagination?
Message-ID: <15ba7.770b9009.38b4519e@aol.com>

michael, i wish you would've taken the time to read what i
actually wrote, instead of just giving your kneejerk reaction.

because your response doesn't address the point that i made.

i am loathe to get into this argument, because it won't mean
a thing in the long run...   the future will have its own issues,
and it will need to deal with them, and i laid it all out clearly.

but let me just address a few things, to provide some clarity.

>   he fails to talk about the wide variety of Jane Austen's
>    p-books, and that paginations run rampant among them, 
>    not to mention margination, spelling, etc.

there are different editions of many books, to be sure...

and i count each edition as a separate book.   your e-book
will have to mimic _one_ of the editions in a faithful manner,
or it will be discarded...   notice that when i say "mimic", i do
_not_ mean that it has to match it _exactly_.   so, for instance,
if you wanna close up spacey contractions, or correct spelling,
or make other kinds of changes, they might (or might not) be
totally acceptable to any one specific end-user in the future...

but you _will_ have to make it easy for that specific end-user to
_compare_ your e-book with a p-book, in order to spot changes.

i've shown how this comparison is done, by mounting a web-page
which has the text on one side of the screen, the scan on the other.

but if your e-book is a formless blob, that's not gonna cut it...

and, for the record, i'm most certainly _not_ recommending that
we create some "scholarly" version of our books.   i laugh at that.
the _only_ thing we know about the scholars of the future is that
we do _not_ know what they will want and it'd be foolish to guess.

put yourself in the shoes of the future.   you have a dozen different
e-book files, all purporting to be copies of "sense and sensibility".
you know that some of them have been doctored, and others have
been bowdlerized, and you _hope_ that some of them are accurate.
you can, with some work, find the differences between them, but
you'd prefer not to have to go through that exercise if you could,
because you'd have to then do more work to find the _right_ copy.

so, how do you proceed?

well, i can tell you that if _one_ of those copies made it _simple_
for you to verify its accuracy by assuming the form of the p-book,
that will be your obvious first choice.   think about it.   you'll agree.

so, michael, if you want to respond to this, answer that question.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/5bb95955/attachment.html>

From hart at pglaf.org  Mon Feb 22 14:25:27 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Mon, 22 Feb 2010 14:25:27 -0800 (PST)
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
 pagination?
In-Reply-To: <15ba7.770b9009.38b4519e@aol.com>
References: <15ba7.770b9009.38b4519e@aol.com>
Message-ID: <alpine.DEB.2.00.1002221354330.4930@mail.pglaf.org>


On Mon, 22 Feb 2010, Bowerbird at aol.com wrote:

> michael, i wish you would've taken the time to read what i
> actually wrote, instead of just giving your kneejerk reaction.

And if you weren't so jerkknee you would have realized I had to
have read the whole thing to get to the part I quoted. . .duh!

When you ask people to pay attention, it helps to PAY ATTENTION.


> because your response doesn't address the point that i made.

It addresses EXACTLY the point you made that I quoted. . . .

If that part contradicts your other points. . .sorry. . . .

But having reread all of your comments, I don't see the change
you say is there. . . .


> i am loathe to get into this argument, because it won't mean
> a thing in the long run...? the future will have its own issues,
> and it will need to deal with them, and i laid it all out clearly.
>
> but let me just address a few things, to provide some clarity.


> >?? he fails to talk about the wide variety of Jane Austen's
> >?? p-books, and that paginations run rampant among them,
> >?? not to mention margination, spelling, etc.
>
> there are different editions of many books, to be sure...
>
> and i count each edition as a separate book.? your e-book
> will have to mimic _one_ of the editions in a faithful manner,

Then SAY that!!!  Right up front in plain language!!!

However, that still relegates us to being a Xerox machine, no?


> or it will be discarded...? notice that when i say "mimic", i do
> _not_ mean that it has to match it _exactly_.? so, for instance,
> if you wanna close up spacey contractions, or correct spelling,
> or make other kinds of changes, they might (or might not) be
> totally acceptable to any one specific end-user in the future...

I'm never going to get into any of these semantic arguments!!!!!!!

Mimic means to copy as closely as possible. . . .  Synonym:  copy.


> but you _will_ have to make it easy for that specific end-user to
> _compare_ your e-book with a p-book, in order to spot changes.

As I have said before, if you would listen, I am not AGAINST keeping
a copy with such pagination for such purposes, but I draw the lines,
pun intended, at keeping every character in the same page position
when there is no need for pages, in all available PG editions.

I want our eBooks to be optimally readable:

Minimal end of line hyphenation.

No page headers or footers.

Just plain reading.

Once again, I have no stance AGAINST people who want pagination,
I just don't want for force any such arbitrary formats on anyone
and neither should you or anyone else.

STOP TRYING TO FORCE YOUR OPINIONS ON OTHERS, MAKE THEM OPTIONS!


> i've shown how this comparison is done, by mounting a web-page
> which has the text on one side of the screen, the scan on the other.

As I have always said, I have no objection to this in proofreading,
just in real reading. . .but I am willing for it to be an OPTION!!!


> but if your e-book is a formless blob, that's not gonna cut it...

Tell that to the millions of people who prefer remargination to the
specifications of their own systems.


> and, for the record, i'm most certainly _not_ recommending that
> we create some "scholarly" version of our books.? i laugh at that.
> the _only_ thing we know about the scholars of the future is that
> we do _not_ know what they will want and it'd be foolish to guess.

I  CAN  tell you that most of the paper editions' page numbers will
fade along with the hyphenation.


> put yourself in the shoes of the future.? you have a dozen different
> e-book files, all purporting to be copies of "sense and sensibility".

> you know that some of them have been doctored, and others have
> been bowdlerized, and you _hope_ that some of them are accurate.

Last time I looked there were still pretty ubiquitous programs to
lay out all such differences.

IFF you have such deep interests, you can simply put up two editions
side by side when you look at them. . .I do. . . .

If not, then you aren't really that interested. . .it's all smoke.


> you can, with some work, find the differences between them, but
> you'd prefer not to have to go through that exercise if you could,
> because you'd have to then do more work to find the _right_ copy.

"_RIGHT_" copy???

Now you've contradicted yourself back into the ivory tower. . . .

"_RIGHT_" copy, indeeeeed. . . .


> so, how do you proceed?
>
> well, i can tell you that if _one_ of those copies made it _simple_
> for you to verify its accuracy by assuming the form of the p-book,
> that will be your obvious first choice.? think about it.? you'll agree.

This will  ONLY  do you any good if you manage to find that edition,
out of all the other paper editions in the world.


> so, michael, if you want to respond to this, answer that question.


Sorry, but I anticipated ALL of these questions when I first started,
and have answered, and will continue to answer, at length.


Why can't you just propose your ideas as OPTIONS, not CARVED IN STONE?


Michael

From lee at novomail.net  Mon Feb 22 14:45:43 2010
From: lee at novomail.net (Lee Passey)
Date: Mon, 22 Feb 2010 15:45:43 -0700
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <4B802F64.3040909@teksavvy.com>
References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com>
Message-ID: <4B830917.5010602@novomail.net>

On 2/20/2010 11:52 AM, Gardner Buchanan wrote:

> My question to the pagination-preservers is: what is the
> difference? Both hyphenation, line-endings and pagination
> are mainly artefacts of the physical medium -- one of width
> and the other of height. Bowerbird wants to keep both;
> I see no need to keep either. But what is the reasoning
> behind keeping one (pagination) and not the other?

As with most things, your position depends on your perspective.

As a reader (consumer) of e-books, I want to get /all/ the production 
artifacts out of my way; a line should wrap wherever I would expect it 
to depending on the size of my viewport (screen), hyphenation should 
only occur between syllables at the right edge of the viewport, and page 
should end at the bottom of my viewport--no sooner, no later.

Page numbers, if any, should reflect the number of /virtual/ pages there 
are in the book I'm reading; i.e. the number of viewports to complete 
the book. These page numbers should not be embedded in the text, but 
should be displayed somewhere else in the User Agent where I can refer 
to them if I want to, but otherwise they are inconspicuous. Of course, 
if I change fonts or the viewport size I would expect the page numbers 
to be updated to reflect that chante.

BUT ...

As a producer of e-books, it is my self-appointed task to create an 
e-book whose reading experience matches, a nearly as possible, a 
specific instance of a historical paper book. Clearly this doesn't mean 
that in the final product the page- or line-endings have to match the 
source, as that would in many cases lead to an awkward "ouija" board 
reading experience, but it does mean that I want to maintain markers 
throughout the e-document that can 1.) /create/ a view where the page- 
and line-endings match so I can do a side-by-side comparison of a page 
image with my electronic version, and 2.) lead me efficiently back to a 
particular page scan if there is any question about the correctness of 
the electronic edition.

This apparent conflict between the two perspectives leads to two 
follow-on questions: 1.) where and how broad is the line between 
production and consumption?, and 2.) is it possible to create a single 
electronic document that can satisfy both needs?

In the case of the PG/DP co-dependency, I think the line is clear and 
narrow: Distributed Proofreaders is /only/ a producer of electronic 
documents and its only consumer is Project Gutenberg. Project Gutenberg 
is /only/ a consumer of electronic texts, and while DP is its primary 
producer it is not the only one.

According to Al Haines, one of PG's whitewashers, the PG 'errata' 
mechanism "is informal, at best, and there's no list of old submissions 
that would benefit from being re-done." Errata resolution at PG is 
handled via e-mail messages to a very small handful of whitewashers. 
According to Mr. Haines, "My PG priorities are my own productions first, 
followed by WWing, then Errata and Reposts." In yet another post, after 
detailing multiple problems with an old DP contribution he states:

>> Is it worth it? Personally speaking, no. It's going to take hours to fix
>> this text, time that I'd far rather spend on my own productions, but
>> there's currently no mechanism except for the Whitewashers, a.k.a.
>> Errata Team, to fix this kind of thing. (Probably simpler to just re-do
>> this text from scratch, which is something *I'm* not about to do.)

This is precisely the reason that DP puts such an emphasis on having a 
/completed/ text. Once an electronic document passes over from DP to PG 
there is almost /no/ chance that it will every be improved, revised, 
corrected. This is not to cast aspersions on the hard and dedicated work 
of the whitewashers, simply an acknowledgment of the fact that it is not 
a high priority for them and there is no formal mechanism to help it get 
done.

Because Project Gutenberg is the /only/ consumer of Distributed 
Proofreader's production, preservation of line- and page-breaks should 
be of little importance in the current DP->PG work flow.

/If/, on the other hand, what you're doing lies outside of the DP->PG 
work flow (as it appears your does), then the calculation changes.

For example, what happens to the page scans from whence your text is 
derived? If those scans are not, and will not ever be, publicly 
available then encoding markers in the text that refer back to page 
scans and the original text layout may not be necessary or important. 
Likewise, if PG is your only distribution point then you will probably 
be the only one who will ever make changes, corrections or improvements 
to the text. If you expect that once you have completed a task and 
transferred responsibility to Project Gutenberg you are finished, 
perhaps even deleting the original scans and your intermediate work 
files (please don't do this; I'm sure that the Internet Archive would be 
willing to take them off your hands) then preservation of markers 
referring back to the original text are probably not necessary.

By contrast, if you are preparing files for broader distribution than 
simply via Project Gutenberg, or if you anticipate that someday a work 
flow may develop either inside or outside of the DP->PG chain that will 
support continuous improvement of your original work, then I would think 
that creating and preserving text markers, including original 
page-breaks, line-breaks, and page numbers referencing the original scan 
set would be advisable. This is particularly true as it is always easier 
to preserve data, even that data of dubious value, than it is to try and 
recover data that has been lost or discarded.

This leads us to the second question: is it possible to create a single 
electronic document which can satisfy the needs of both readers and 
producers? I believe that it is, but it requires the use of a markup 
language having at least the capability of marking some text as 
invisible, and a user agent that is capable of recognizing that markup 
and /not/ rendering it as indicated.

I'm sure there are a number of markup languages that could satisfy this 
requirement, but I have chosen to use XHTML (with one small cheat).

When ABBYY FineReader saves its OCR output in HTML format it has the 
option of placing a break (<br>) at the end of each line, and a 
horizontal rule (<hr>) between each page (an alternative is to save each 
scanned page as a separate file, but I find that less convenient). I 
then wrote a short program (could probably be done just as easily with a 
perl script, or even sed) that replaced each <hr> with an anchor tag 
indicating the page number (<a name="page##" />), and replaced each <br> 
with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but 
I know of no user agent that will fail to render an HTML file just 
because it has an invalid element in it.

FineReader is quite good at recognizing when line-ending hyphenation is 
due to splitting long words or when it is required as a part of a 
compound word. In the first case, when it saves using line breaks it 
saves the line-ending hyphenation either as a hard hyphen or a 
soft-hyphen. When soft hyphens are replaced by &shy; (and the following 
white space is removed) you have recorded a line-ending hyphen which 
will not be displayed (although different user agents sometimes do 
things differently).

Originally I was in the "collapse page- and line-break" camp, but 
because I never submit my texts to PG, and I have hope that someday some 
sort of continuous improvement process may evolve (and because 
maintaining the data is cheap and simple) I'm moving into the "preserve 
everything" camp.


From azkar0 at gmail.com  Mon Feb 22 15:02:41 2010
From: azkar0 at gmail.com (Scott Olson)
Date: Mon, 22 Feb 2010 16:02:41 -0700
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <4B830917.5010602@novomail.net>
References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com>
	<4B830917.5010602@novomail.net>
Message-ID: <2362473e1002221502k7a23d04et90c268e3fa3865a0@mail.gmail.com>

On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee at novomail.net> wrote:

> When ABBYY FineReader saves its OCR output in HTML format it has the option
> of placing a break (<br>) at the end of each line, and a horizontal rule
> (<hr>) between each page (an alternative is to save each scanned page as a
> separate file, but I find that less convenient). I then wrote a short
> program (could probably be done just as easily with a perl script, or even
> sed) that replaced each <hr> with an anchor tag indicating the page number
> (<a name="page##" />), and replaced each <br> with <lb />. Now <lb> is not a
> valid HTML element (hence the cheat), but I know of no user agent that will
> fail to render an HTML file just because it has an invalid element in it.


Since the user agent will take care of rewrapping, you could just leave the
linebreaks where they are. If you really want to have them encoded, I'd opt
for some CSS.

br.lb
  {display: none}

in your <style> section

Then <br class="lb" /> wherever you're currently putting <lb />.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/4511c5ad/attachment.html>

From lee at novomail.net  Mon Feb 22 15:55:20 2010
From: lee at novomail.net (Lee Passey)
Date: Mon, 22 Feb 2010 16:55:20 -0700
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <2362473e1002221502k7a23d04et90c268e3fa3865a0@mail.gmail.com>
References: <1bbf9.4e21db05.38b0c5a5@aol.com>
	<4B802F64.3040909@teksavvy.com>	<4B830917.5010602@novomail.net>
	<2362473e1002221502k7a23d04et90c268e3fa3865a0@mail.gmail.com>
Message-ID: <4B831968.8060503@novomail.net>

On 2/22/2010 4:02 PM, Scott Olson wrote:

> On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee at novomail.net
> <mailto:lee at novomail.net>> wrote:
>
>     When ABBYY FineReader saves its OCR output in HTML format it has the
>     option of placing a break (<br>) at the end of each line, and a
>     horizontal rule (<hr>) between each page (an alternative is to save
>     each scanned page as a separate file, but I find that less
>     convenient). I then wrote a short program (could probably be done
>     just as easily with a perl script, or even sed) that replaced each
>     <hr> with an anchor tag indicating the page number (<a name="page##"
>     />), and replaced each <br> with <lb />. Now <lb> is not a valid
>     HTML element (hence the cheat), but I know of no user agent that
>     will fail to render an HTML file just because it has an invalid
>     element in it.
>
> Since the user agent will take care of rewrapping, you could just leave
> the linebreaks where they are.

I considered that, but I'm not in favor of invisible markup. What do I 
mean by invisible? We know that the HTML spec says that multiple white 
space can be collapsed unless it is specifically identified as 
"non-breaking," and we know that spaces, tabs and newlines are all white 
space and sometimes very hard to distinguish from each other. This means 
that my HTML tools might have a tendency to wrap these lines up if I'm 
not extremely diligent. And because it's still white space there's a 
good likelihood that I may not even notice if it gets screwed up.

I like markup that's in my face, and obviously not part of the text. 
Markup rules like "three blank lines indicate a minor header, but four 
blank lines indicate a major header" and "one space at the beginning of 
a line means don't wrap this line, but two spaces means wrap this line 
but do a block indent" just make me shudder. If it's markup it should be 
markup, and if it's not it shouldn't pretend that it is.

This is kind of a specific instance of the general rule that a markup 
element should do one thing, and one thing only.

> If you really want to have them encoded,
> I'd opt for some CSS.
> br.lb <http://br.lb>
>    {display: none}
> in your <style> section
> Then <br class="lb" /> wherever you're currently putting <lb />.

This is an option I have tried, and is not a bad idea. I prefer the 
invalid element idea simply because many user agents I'm familiar with, 
particularly phones and handheld devices simply have not yet figured out 
how to do CSS. At one point Mobipocket claimed CSS support, but on 
closer examination I discovered that their publisher tool simply went 
through and replaced CSS styles with elements that their UA actually 
recognized. Your "display:none" trick simply wouldn't have worked in the 
old Mobipocket reader.

Now I know that the Kindle software was based largely (if not entirely) 
on the old Mobipocket reader. What is the effect of trying to use CSS to 
turn off the display of line-breaks after the file has been converted to 
.mobi? Maybe someone with a Kindle could enlighten us?

From Bowerbird at aol.com  Mon Feb 22 16:23:44 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Feb 2010 19:23:44 EST
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
	pagination?
Message-ID: <1b8ef.3a619508.38b47a10@aol.com>

michael said:
>    I had to have read the whole thing 
>    to get to the part I quoted. . .duh!

you can quote something without reading it.

but i should have said i wish you would have _understood_
what i was saying, because then you would have understood
that your response didn't address the point that i was making.


>    When you ask people to pay attention, it helps to PAY ATTENTION.

i am paying attention, michael.   even though i've heard
what you're saying lots and lots and lots of times before.

and it makes sense when you are addressing the people
who are making _other_ points.   but they left long ago...


>    It addresses EXACTLY the point you made that I quoted. . . .

no, it doesn't.


>    Then SAY that!!!? Right up front in plain language!!!

i _have_ said it, every time i've talked about this issue...

including once when marcello brought up this same point
you brought up (think about that, michael), about editions.
(it was back in september of 2007, if you're curious.)

marcello pointed out that the various different editions of
"pride and prejudice" had different pagination, and he asked
which of the editions should be used to do the pagination...

here's the reply i made to marcello's point:
>    http://z-m-l.com/go/pap/pride%20and%20prejudice(4).txt

if you don't care to read all of it, here are the guts of that reply:
>    the answer to the question as to which set of linebreaks and
>    pagebreaks to use is this: the ones in the edition you digitize.
>
>    plain old common sense.   if you didn't already know the answer,
>    perhaps you might want to exercise your brain a little bit more...
>
>    if you're digitizing the 1844, use its linebreaks and pagebreaks.
>    if you're digitizing the 1853, use its linebreaks and pagebreaks.
>    if you're digitizing the 1870, use its linebreaks and pagebreaks.
>    if you're digitizing the 1892, use its linebreaks and pagebreaks.

and here is a web-page showing the first page of those 4 editions:
>    http://z-m-l.com/go/pap/pride_and_prejudice(4).html

and yes, in case you're wondering, if a p-book was important enough
to go through different editions, we should digitize _every_ edition...

i'm not going to tell you which of those 4 editions you _should_ use,
which one is the "right" one.   whichever one you want to use is "right".
and you should be able to determine if any specific e-book _does_ or
_does_ not match the edition you want to use, or some other edition.

***

by the way, there's another web-page of interest in this directory:
>    http://z-m-l.com/go/pap/jose(4).html

this fascinating page shows some work done by jose menendez.

jose adopted my suggestion that the e-book be able to mimic the
p-book, and he created a series of .pdf books that did just that...

shown on this web-page are some screenshots of his .pdf-books,
compared to the page-scans from those pages.   he did a great job.

of course, since a .pdf-book is unable to reflow its text, jose's work
doesn't fit the more-important criterion of reflowability, but it does
show that the ability to mimic the original can be extremely valuable.


>    However, that still relegates us to being a Xerox machine, no?

no.   because a xerox machine can't do reflow.   or fix typos.
or pull in spacey contractions.   or change the font, or size.

look, i understand the appeal of digital text _extremely_ well.
i've made all of the arguments myself, so there's no need for
you to repeat 'em back at me, you're just wasting your breath.

but there's a problem looming here, a problem that the future
will have to face, and solve, and i'm telling you what you need
to do, so that you can _help_ the future _solve_ that problem,
such that your e-texts will continue to be used, and not tossed.

i'm on _your_ side, michael...   i've got your back, good buddy...

so you need to get that through your skull and start listening...


>    I'm never going to get into any of these semantic arguments!!!!!!!
>    Mimic means to copy as closely as possible. . . .? Synonym:? copy.

it's not a semantic argument, michael.   it's protective coloration.

if your copy isn't capable of _assuming_the_look_and_feel_ of
the thing that it _purports_ to be copying, nobody will trust it.

you seem to be forgetting that you are claiming to _be_ a copy.
perhaps you are an "improved" copy, but you are _still_ a copy.

certainly if you came out and said "we rewrote parts of the middle,
because the original was too boring", you would expect that people
would throw you away.

but what if someone points out a few errors in your work, and says
"see, you can't trust this work, it hasn't been faithfully transcribed,"
then what is your defense?   you can say that "it was just a few errors",
but what if they then point out a few more, and a few more after that?
at what point can you no longer expect the end-user to believe you?


>    As I have said before, if you would listen, 
>    I am not AGAINST keeping a copy with such pagination 
>    for such purposes

well, good, and bully for you, and all that, but the fact of the matter is
that project gutenberg is not, at this point in time, actually doing that.


>    but I draw the lines, pun intended, at keeping every character 
>    in the same page position when there is no need for pages, 
>    in all available PG editions.

and, as i have said before, if you would listen,
i'm not suggesting, in any way, shape, or form,
that pagination and linebreaks need to be kept
"in all available p.g. editions".   that'd be stupid,
absolutely and totally and ridiculously _stupid_,
and i don't usually feel a need to rule out stuff
that is absolutely, totally, ridiculously stupid...


>    I want our eBooks to be optimally readable:
>    Minimal end of line hyphenation.
>    No page headers or footers.
>    Just plain reading.

i'm 100% in favor of that.   and i have demonstrated before, and
will be happy to demonstrate once again, any time that you like,
how p.g, could save its texts in a format that allows verification
of the type that i am talking about, _and_ allows the end-user to
have the text exactly like you specify.


>    Once again, I have no stance AGAINST people who want pagination,
>    I just don't want for force any such arbitrary formats on anyone
>    and neither should you or anyone else.
>    STOP TRYING TO FORCE YOUR OPINIONS ON OTHERS, 
>    MAKE THEM OPTIONS!

see, that's precisely why i said you're not listening to me, because
there's no way in the world i would try to "force" this on end-users.

i've never, ever, said _anything_ even _remotely_ like that, in all the
years i've been on this listserve, or the decades i have done e-books,
so i don't know who you're having this conversation with, michael,
but it's obviously not me.


>    I? CAN? tell you that most of the paper editions' page numbers 
>    will fade along with the hyphenation.

no, they won't.   because our cultural heritage is full of references
to page-numbers, and it'll be several orders of magnitude cheaper
and more efficient to keep track of those page-numbers than to
attempt to re-do all those references using hyperlinks or whatever.


>    Last time I looked there were still pretty ubiquitous programs to
>    lay out all such differences.
>
>    IFF you have such deep interests, you can simply put up two editions
>    side by side when you look at them. . .I do. . . .
>
>    If not, then you aren't really that interested. . .it's all smoke.

this is very amusing.   i do this kind of work, michael.   regularly.

and i can tell you that it's not nearly as simple as you make it sound.


>    "_RIGHT_" copy???
>    Now you've contradicted yourself back into the ivory tower. . . .
>    "_RIGHT_" copy, indeeeeed. . . .

you just can't _wait_ to jump to the wrong conclusion, can you?

by "right" copy, i mean the one that the person _wants_ to see.


>    This will? ONLY? do you any good if you manage to find 
>    that edition, out of all the other paper editions in the world.

again, "that edition" is whatever edition the person wants to use.

i have a paper copy of "catcher in the rye" on my bookshelf now.

let's say, 10 years on, i can find a dozen digital versions online.
let's also say that some analysis shows differences between them.
i haven't compared all, not in full, but i know there's some diffs.

i don't want a dozen different versions.   i want the one that matches
the paper copy that has been sitting on my bookshelf for 4 decades.

how do i determine which one -- _if_any_ -- is the same version
as the one that is sitting on my shelf?   that's the difficult question.


>    Sorry, but I anticipated ALL of these questions when I first started,
>    and have answered, and will continue to answer, at length.

no, you didn't answer the question.   so i just asked it to you again.


>    Why can't you just propose your ideas as OPTIONS, 
>    not CARVED IN STONE?

stop making the thread ridiculous.

no one can carve anything in stone any more.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/12078fcb/attachment.html>

From Bowerbird at aol.com  Mon Feb 22 16:45:34 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 22 Feb 2010 19:45:34 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <1c322.394dbc31.38b47f2e@aol.com>

i did some more work on gardner's book, "the advocate". 

i continued to do preprocessing on the archive.org o.c.r.,
and posted the results on my website.? here's a sample url:
>?? http://z-m-l.com/go/gardn/gardnp123.html

the number of lines that differ between the cleaned o.c.r.
and gardner's file has dropped to under 300, which is great,
since the book contains about 3,900 lines in it.   so it's <8%.
so again we show the stupidity of a word-by-word proofing.

i'm gonna go through one final round of preprocessing,
and then we'll set out to resolve the differences.   already,
however, i can tell that gardner did a very good job on it.
not perfect!   but certainly something to be quite proud of.

the (presumably incorrect) lines from the o.c.r. are in red,
while the lines from gardner's proofed copy are in blue...

if you prefer to view all the edits on one web-page, see:
>?? http://z-m-l.com/go/gardn/gardn-hybrid5.html

i also included the preprocessing edits i made on page 126,
so you can get a feel for how little time i spent doing them...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100222/df871bfc/attachment.html>

From cmiske at ashzfall.com  Mon Feb 22 19:16:52 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Mon, 22 Feb 2010 20:16:52 -0700
Subject: [gutvol-d]
 =?utf-8?q?Markup_and_Pagination_=28was_RE=3A__Re=3A_!=40!!=40!!?=
 =?utf-8?q?=40!Re=3A_Re=3A_so_what_is_so_important_about_pagination=3F=29?=
Message-ID: <20100222201652.0dedd0f3f91314fbc67db20f64e304ca.314d69c56e.wbe@email05.secureserver.net>

The only justification that _I_ can see for the PG 'books' to contain
some form of pagination would be in a markup version with a hidden
reference to the source page (scan image - if one exists). This would
facilitate referring to the original book pages for simple corrections
in the released versions. If it is _your_ intention to preserve the
book, then the retention of pagination and other formatting would make
perfectly good sense as well.
 
I have followed this project (and this list) for around a decade. I have
always been of the understanding that Project Gutenberg was about making
available the contents of public domain books to as many people as
possible. It has never seemed to me that a goal of the project was to
create a re-printable facsimile of an existing edition, but rather to
create the Project Gutenberg edition. I?ve always thought of us as the
ancient scribes who spent their lives copying the contents of aging
editions of books onto new materials so as to preserve the content for
another few generations of time. The wonderful thing about the digital
editions at Project Gutenberg is that they will last ?forever.?

I have wondered for quite some time why the primary format for archiving
(not necessarily submission or release) at PG does not include markup.
It would be amazingly easy to generate the PG plain text version (for
release) to exacting standards from a rather simply marked up original,
but it is rather complex and potentially error prone to produce any
format other than plain text from the plain text document. I have done
the final plain-text formatting stage before and it is a slow and
potentially human-error riddled process to manually format to PG
standards (many of the standards I refer to are actually from DP rather
than PG). I believe volunteers could be found to markup texts submitted
(current and past) that lack such enhancements and to work towards
getting the backend of PG more standardized so as to facilitate greater
accessibility and distribution of the content. 

I realize that DP is already producing HTML versions, but that is not
precisely what I am referring to and does not address the multitude of
files without HTML versions and those that follow individual markup
standards. And, I would never suggest that submission of a book be
blocked or release be delayed by a contributor not having the ability or
desire to markup their document. I am only suggesting the PG use a more
re-usable formatting method for archiving and on-the-fly content
generation than the current standard of somewhat-standard plain-text
(and somewhat PG/DP standard HTML).

There has been a lot of talk on here over the years of markup and
preservation of information from the original book.  Markup would enable
the retention of as much (or as little) information beyond the content
as the volunteers wished to provide without interfering in the creation
of a Plain Vanilla text file. It also would enable the creation of one
document with multiple editions nested within (or at least the
difference between editions).  The later being what I believe Michael
just indicated was his ?dream.?

Also, markup, when done with that purpose in mind, is human readable.

<chapter title=?Chapter I? subtitle=?The quest for solitude?>
<paragraph>Helen was <emphasis display=?italics? edition=?1845,
Dover?>not</emphasis> in the mood for company and she resented the
<errata edition=?1845,Dover? type=?replace;typographical error?
details=? nkock?>knock< /errata> at the door?.</paragraph>
<paragraph><quote>Just go away!</quote> she yelled and then went back to
reading her copy of <title type=?magazine?
display=?italics;underline? edition=?1845, Dover;1865,NY
Press?>The Mystery Guild Weekly</title>.</paragraph>
?.
<page number=?1? edition=?1845, Dover? />
<paragraph>She really had no idea why there were so many people coming
<page number=?1? edition=?1865,NY Press? />to her house on such
a day as <errata edition=?1845, Dover? type=?replace?
details=?to-day?>today< /errata>. One would <errata
edition=?1865,NY Press? type=?addition? details=?almost? />
think it was her birthday?.</paragraph>
?.
</chapter>

This is a rather poorly thought out and quickly done markup schema, but
I believe it serves the point in that it is human readable. It would not
be an enjoyable read, but it would be readable. And, long after all of
us are gone, if this were the only copy of this document left on earth,
someone could make sense of it enough to convert it to plain text (or
another more readable format). They could also create a reproduction of
the 1845 book or the 1865 book. Or, create an edition of one year or the
other with footnotes or sidenotes/marginalia of the differences in the
other edition(s). In the meantime, while we are still around, PG could
publish the Plain Vanilla version (and any other versions they choose to
make available). I proposed something similar to this many years ago
over at DP, but there was little interest in it at the time. I?m not
sure there is much interest in it now?.

In any case, thank you to all the volunteers for all of the hard work
over the years and all the books you have provided for my pleasure and
for the pleasure of so many others! I?ve never felt as though you get
thanks often enough from all of us users of the end results of your
dedication. Thank you!

Carel
 

From hart at pglaf.org  Mon Feb 22 19:44:45 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Mon, 22 Feb 2010 19:44:45 -0800 (PST)
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <2362473e1002221502k7a23d04et90c268e3fa3865a0@mail.gmail.com>
References: <1bbf9.4e21db05.38b0c5a5@aol.com> <4B802F64.3040909@teksavvy.com>
	<4B830917.5010602@novomail.net>
	<2362473e1002221502k7a23d04et90c268e3fa3865a0@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002221944330.16916@mail.pglaf.org>


Love it!!!


Michael


On Mon, 22 Feb 2010, Scott Olson wrote:

> On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee at novomail.net> wrote:
>       When ABBYY FineReader saves its OCR output in HTML format it has the option of
>       placing a break (<br>) at the end of each line, and a horizontal rule (<hr>)
>       between each page (an alternative is to save each scanned page as a separate
>       file, but I find that less convenient). I then wrote a short program (could
>       probably be done just as easily with a perl script, or even sed) that replaced
>       each <hr> with an anchor tag indicating the page number (<a name="page##" />),
>       and replaced each <br> with <lb />. Now <lb> is not a valid HTML element (hence
>       the cheat), but I know of no user agent that will fail to render an HTML file
>       just because it has an invalid element in it.
>
> ?
> Since the user agent will take care of rewrapping, you could just leave the linebreaks
> where they are. If you really want to have them encoded, I'd opt for some CSS.
> ?
> br.lb
> ? {display: none}
> ?
> in your <style> section
> ?
> Then <br class="lb" /> wherever you're currently putting <lb />.
>
>

From hart at pglaf.org  Mon Feb 22 20:15:22 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Mon, 22 Feb 2010 20:15:22 -0800 (PST)
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
 pagination?
In-Reply-To: <1b8ef.3a619508.38b47a10@aol.com>
References: <1b8ef.3a619508.38b47a10@aol.com>
Message-ID: <alpine.DEB.2.00.1002221946260.16916@mail.pglaf.org>

On Mon, 22 Feb 2010, Bowerbird at aol.com wrote:

> michael said:
> >?? I had to have read the whole thing
> >?? to get to the part I quoted. . .duh!
>
> you can quote something without reading it.

YOU can. . .or at least without understanding it.


> but i should have said i wish you would have _understood_
> what i was saying, because then you would have understood
> that your response didn't address the point that i was making.

Since it is YOUR point, only YOU c"ould have understood. . . ."


> >?? When you ask people to pay attention, it helps to PAY ATTENTION.
>
> i am paying attention, michael.? even though i've heard
> what you're saying lots and lots and lots of times before.

Pleaes elucidate, without all the verbiage. . . .


> and it makes sense when you are addressing the people
> who are making _other_ points.? but they left long ago...
>
>
> >?? It addresses EXACTLY the point you made that I quoted. . . .
>
> no, it doesn't.

As above. . . .please.


> >?? Then SAY that!!!? Right up front in plain language!!!
>
> i _have_ said it, every time i've talked about this issue...

you're not saying it here and now. . . .


> including once when marcello brought up this same point
> you brought up (think about that, michael), about editions.
> (it was back in september of 2007, if you're curious.)
>
> marcello pointed out that the various different editions of
> "pride and prejudice" had different pagination, and he asked
> which of the editions should be used to do the pagination...
>
> here's the reply i made to marcello's point:
> >?? http://z-m-l.com/go/pap/pride%20and%20prejudice(4).txt
>
> if you don't care to read all of it, here are the guts of that reply:
> >?? the answer to the question as to which set of linebreaks and
> >?? pagebreaks to use is this: the ones in the edition you digitize.
> >
> >?? plain old common sense.? if you didn't already know the answer,
> >?? perhaps you might want to exercise your brain a little bit more...
> >
> >?? if you're digitizing the 1844, use its linebreaks and pagebreaks.
> >?? if you're digitizing the 1853, use its linebreaks and pagebreaks.
> >?? if you're digitizing the 1870, use its linebreaks and pagebreaks.
> >?? if you're digitizing the 1892, use its linebreaks and pagebreaks.
>
> and here is a web-page showing the first page of those 4 editions:
> >?? http://z-m-l.com/go/pap/pride_and_prejudice(4).html

If you had mentioned. . . .


> and yes, in case you're wondering, if a p-book was important enough
> to go through different editions, we should digitize _every_ edition...

I'm still happy to put out one edition that gets across 99% of meaning,
at least until we pass into the millions of books, with few exceptions.

Since I actually remember some of what I read in P&P, I am not at all
sure I would have gotten an extra 1% out of it, no matter how well done.

Not sure I could say the same about Shakespeare, though, though THAT was
botched hugely in a least one edition, eh?

Italians says you only need ONE edition of Dante. . . .


> i'm not going to tell you which of those 4 editions you _should_ use,
> which one is the "right" one.? whichever one you want to use is "right".
> and you should be able to determine if any specific e-book _does_ or
> _does_ not match the edition you want to use, or some other edition.

As mentioned elsewhere, this is someone's SUBJECTIVE DECISION.

Sorry, that's where I get off this train of thought.


>
> ***
>
> by the way, there's another web-page of interest in this directory:
> >?? http://z-m-l.com/go/pap/jose(4).html
>
> this fascinating page shows some work done by jose menendez.
>
> jose adopted my suggestion that the e-book be able to mimic the
> p-book, and he created a series of .pdf books that did just that...
>
> shown on this web-page are some screenshots of his .pdf-books,
> compared to the page-scans from those pages.? he did a great job.
>
> of course, since a .pdf-book is unable to reflow its text, jose's work
> doesn't fit the more-important criterion of reflowability, but it does
> show that the ability to mimic the original can be extremely valuable.

My comments about .pdf are well known.


> >?? However, that still relegates us to being a Xerox machine, no?
>
> no.? because a xerox machine can't do reflow.? or fix typos.
> or pull in spacey contractions.? or change the font, or size.

OK, a fancy Xerox, but still confined to one edition, still makes
it hard to be sure.  What about "freindship" or whatever?


> look, i understand the appeal of digital text _extremely_ well.
> i've made all of the arguments myself, so there's no need for
> you to repeat 'em back at me, you're just wasting your breath.

Then there's no sense talking, is there?


> but there's a problem looming here, a problem that the future
> will have to face, and solve, and i'm telling you what you need
> to do, so that you can _help_ the future _solve_ that problem,
> such that your e-texts will continue to be used, and not tossed.
>
> i'm on _your_ side, michael...? i've got your back, good buddy...
>
> so you need to get that through your skull and start listening...
>
>
> >?? I'm never going to get into any of these semantic arguments!!!!!!!
> >?? Mimic means to copy as closely as possible. . . .? Synonym:? copy.
>
> it's not a semantic argument, michael.? it's protective coloration.
>
> if your copy isn't capable of _assuming_the_look_and_feel_ of
> the thing that it _purports_ to be copying, nobody will trust it.

This is exactly what they said about The Gutenberg Press. . . .

And why Gutenberg wasted so much time putting in scriptorum marks.

"Noboby will trust it"???

When we get to the point where people are arguing about which eBook
of P&P is the best, and I have given away more than one edition, it
becomes moot that eBooks will have already won the day when this is
the case with all eBooks, or even most of them. . . .

At that point _I_ am willing to shout "GAME OVER. . .I WIN. . .!!!

And leave the field of play to the nitpicker ivory tower types.

My audience are those who are being exposed to Shakespeare, Austen,
Dante, Doyle, etc., not those who are already infected. . . .


> you seem to be forgetting that you are claiming to _be_ a copy.
> perhaps you are an "improved" copy, but you are _still_ a copy.
>
> certainly if you came out and said "we rewrote parts of the middle,
> because the original was too boring", you would expect that people
> would throw you away.

Non-sequitur.


> but what if someone points out a few errors in your work, and says
> "see, you can't trust this work, it hasn't been faithfully transcribed,"
> then what is your defense?? you can say that "it was just a few errors",
> but what if they then point out a few more, and a few more after that?
> at what point can you no longer expect the end-user to believe you?

Just a byte sequitur.

Same true of paper editions.


> >?? As I have said before, if you would listen,
> >?? I am not AGAINST keeping a copy with such pagination
> >?? for such purposes
>
> well, good, and bully for you, and all that, but the fact of the matter is
> that project gutenberg is not, at this point in time, actually doing that.

Only a bother to a minimal portion of the audience.


> >?? but I draw the lines, pun intended, at keeping every character
> >?? in the same page position when there is no need for pages,
> >?? in all available PG editions.
>
> and, as i have said before, if you would listen,
> i'm not suggesting, in any way, shape, or form,
> that pagination and linebreaks need to be kept
> "in all available p.g. editions".? that'd be stupid,
> absolutely and totally and ridiculously _stupid_,
> and i don't usually feel a need to rule out stuff
> that is absolutely, totally, ridiculously stupid...

Then say it is an OPTION. . . .


> >?? I want our eBooks to be optimally readable:
> >?? Minimal end of line hyphenation.
> >?? No page headers or footers.
> >?? Just plain reading.
>
> i'm 100% in favor of that.? and i have demonstrated before, and
> will be happy to demonstrate once again, any time that you like,
> how p.g, could save its texts in a format that allows verification
> of the type that i am talking about, _and_ allows the end-user to
> have the text exactly like you specify.

Only in that limited specification did I specify, not in general.


> >?? Once again, I have no stance AGAINST people who want pagination,
> >?? I just don't want for force any such arbitrary formats on anyone
> >?? and neither should you or anyone else.
> >?? STOP TRYING TO FORCE YOUR OPINIONS ON OTHERS,
> >?? MAKE THEM OPTIONS!
>
> see, that's precisely why i said you're not listening to me, because
> there's no way in the world i would try to "force" this on end-users.

/then say OPTION!!!


> i've never, ever, said _anything_ even _remotely_ like that, in all the
> years i've been on this listserve, or the decades i have done e-books,
> so i don't know who you're having this conversation with, michael,
> but it's obviously not me.

Learn to speak more clearly and lucidly to get the answers you want.


> >?? I? CAN? tell you that most of the paper editions' page numbers
> >?? will fade along with the hyphenation.
>
> no, they won't.? because our cultural heritage is full of references
> to page-numbers, and it'll be several orders of magnitude cheaper
> and more efficient to keep track of those page-numbers than to
> attempt to re-do all those references using hyperlinks or whatever.

Only for that minimal audience following such references.

However, if references were made to PHRASES and NOT PAGES, this would
work EVER SO MUCH BETTER!!!!!!!


> >?? Last time I looked there were still pretty ubiquitous programs to
> >?? lay out all such differences.
> >
> >?? IFF you have such deep interests, you can simply put up two editions
> >?? side by side when you look at them. . .I do. . . .
> >
> >?? If not, then you aren't really that interested. . .it's all smoke.
>
> this is very amusing.? i do this kind of work, michael.? regularly.
>
> and i can tell you that it's not nearly as simple as you make it sound.

But it's not DONE by people "nearly as simple as you make it sound."!!!


> >?? "_RIGHT_" copy???
> >?? Now you've contradicted yourself back into the ivory tower. . . .
> >?? "_RIGHT_" copy, indeeeeed. . . .
>
> you just can't _wait_ to jump to the wrong conclusion, can you?
>
> by "right" copy, i mean the one that the person _wants_ to see.

SUBJECTIVE DECISION. . .not my bailiwick.

I'm not getting into making THAT kind of choice.


> >?? This will? ONLY? do you any good if you manage to find
> >?? that edition, out of all the other paper editions in the world.
>
> again, "that edition" is whatever edition the person wants to use.

Not MY person. . .YOUR person.

YOU  address  YOUR  audience, _I_ will address MINE.

You will have the field all to yourself all too soon as I shuffle off.


> i have a paper copy of "catcher in the rye" on my bookshelf now.
>
> let's say, 10 years on, i can find a dozen digital versions online.
> let's also say that some analysis shows differences between them.
> i haven't compared all, not in full, but i know there's some diffs.

Not that it will take you that long to do so. . . .


> i don't want a dozen different versions.? i want the one that matches
> the paper copy that has been sitting on my bookshelf for 4 decades.

I think you meant "matches the most closely," eh?

Rots of ruck finding one that really matches, eh?


> how do i determine which one -- _if_any_ -- is the same version
> as the one that is sitting on my shelf?? that's the difficult question.

Not if you just compare the first and last page.

If you are NOT willing to do that, you have no right to an answer.

You have to LOOK AT THE QUESTION to get an answer, even as a child,
though the steps will be smaller.


> >?? Sorry, but I anticipated ALL of these questions when I first started,
> >?? and have answered, and will continue to answer, at length.
>
> no, you didn't answer the question.? so i just asked it to you again.

You did???  Did you make it obvious???  Does ANYONE here know the question???


> >?? Why can't you just propose your ideas as OPTIONS,
> >?? not CARVED IN STONE?
>
> stop making the thread ridiculous.
>
> no one can carve anything in stone any more.

Then don't make it sound like that's what you mean. . . .

I'm trying to show/teach you how to make your points. . . .

You will appreciate that more after I am gone. . . .


mh

From gbuchana at teksavvy.com  Mon Feb 22 21:12:34 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Tue, 23 Feb 2010 00:12:34 -0500
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
 just for the exercise
In-Reply-To: <1c322.394dbc31.38b47f2e@aol.com>
References: <1c322.394dbc31.38b47f2e@aol.com>
Message-ID: <4B8363C2.70003@teksavvy.com>

On 22-Feb-2010 19:45, Bowerbird at aol.com wrote:
> i did some more work on gardner's book, "the advocate".

>
> if you prefer to view all the edits on one web-page, see:
>  >   http://z-m-l.com/go/gardn/gardn-hybrid5.html
>

Interesting.  The colours go a bit wonky at
"the awful and spectral presence of Mount Royal."
... page 29 (there's that pagination :-))

Markup aside, I picked up a total of 47 fixes.  A lot of extra
commas.  Several full-stop/comma and semi-/full-colon.  Some
spelling issues that I can't see how they got past the spell
check and a couple of word --> sensible-wrong-word scano's.

Thank you!

There seems something wonky about the text in red that is
"mine" in a few places.  Undoubtedly it is mine where I have
substituted the correct spelling "Sainte H?l?ne" where the
book had left off the accents inconsistently.  But there are
places where "my" text contains errors that I do not believe
were in my text.  I will double-check what exactly I sent
you as I might have fixed things during reformatting,
as that is normally my final "smooth reading" stage of proofing.

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From cmiske at ashzfall.com  Mon Feb 22 21:39:41 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Mon, 22 Feb 2010 22:39:41 -0700
Subject: [gutvol-d] Translation project
Message-ID: <20100222223941.0dedd0f3f91314fbc67db20f64e304ca.5624d01ea5.wbe@email05.secureserver.net>

I've just recently started studying a language that is foreign to me.
Although I am a long way off from being able to translate anything
beyond an early reader, I was wondering if there is a translation
project out there that is similar to DP? I looked about the PG site, but
could not find a link to anything of that nature.

Thanks;
Carel


From sly at victoria.tc.ca  Mon Feb 22 22:00:12 2010
From: sly at victoria.tc.ca (Andrew Sly)
Date: Mon, 22 Feb 2010 22:00:12 -0800 (PST)
Subject: [gutvol-d] Re: Translation project
In-Reply-To: <20100222223941.0dedd0f3f91314fbc67db20f64e304ca.5624d01ea5.wbe@email05.secureserver.net>
References: <20100222223941.0dedd0f3f91314fbc67db20f64e304ca.5624d01ea5.wbe@email05.secureserver.net>
Message-ID: <Pine.GSO.4.58.1002222146490.6845@vtn1.victoria.tc.ca>

No particular site springs to mind. Perhaps if we knew the
language you have in mind it might be easier to go searching
for possibilities.

One barrier to that kind of collabarative translation is that
there is usually not one "proper and correct" way to translate
a given phrase. There can be different word order, differences
in shades of meaning and tone, etc. If you have different people
translate different parts of a text, you can end up with great
lack of a consistent style.

It's not impossible though. Take this example:
http://www.gutenberg.org/etext/28971

It was worked on by a small group of people specifically
for inclusion in PG. A couple people translated different
chunks of the text from English, then after the whole thing
was put together, we did a few proof-reading runs, tried to
even out inconsistencies, ran a spell check, etc.

This was done through email and using a wiki.

--Andrew

On Mon, 22 Feb 2010 cmiske at ashzfall.com wrote:

> I've just recently started studying a language that is foreign to me.
> Although I am a long way off from being able to translate anything
> beyond an early reader, I was wondering if there is a translation
> project out there that is similar to DP? I looked about the PG site, but
> could not find a link to anything of that nature.
>
> Thanks;
> Carel
>

From Bowerbird at aol.com  Tue Feb 23 01:18:36 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 04:18:36 EST
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
	pagination?
Message-ID: <6d3.17caba02.38b4f76c@aol.com>

michael-

i said it before, and i'll say it again.   but i'm tired of repeating it.

i'm on _your_ side.   i've got your back.

lately, you're beginning to make as little sense as jon noring used to,
arguing "against" me when you don't even know what you're saying...

i'm really going to have to stop talking to you.   it's a waste of time.

***

>   Then say it is an OPTION. . . .
...
>   /then say OPTION!!!

i don't know how to say it any more clearly than i already have...

the whole point is that some people _want_ pagination, and i said
"why not give it to them?"   what is that, if it's not an _option?_

and as for the people who don't want to be bothered with pagination,
i've said "why not give them what they want as well?"   as _their_ option.

i've even devised a format, and written the code, that will allow _both_
parties to get what they want.   precisely because i believe in _options_.

does p.g. offer people who want pagination a way for them to get it?
no, it doesn't.   and yet you yabber on about "options", like you owned
the concept or something.   why don't you start giving people options?

you even go on to say "love it" when lee passey offers up a convoluted
and cockamamie scheme with complex and unnecessary .html coding.

does project gutenberg have a page on its website that will _unwrap_
an e-text?   no, it doesn't.   but i have a page on my site that'll do 
that...


>    However, if references were made to PHRASES and NOT PAGES, 
>    this would work EVER SO MUCH BETTER!!!!!!!

lay off the exclamation points, old man.   they're making you punch drunk.

making references to "phrases" might work if we did it from here on out,
but there'd still be millions and millions of references to pagenumbers
in the corpus that represents our cultural heritage.   i know you to be a
strong supporter of that cultural heritage, so i know you can't mean what
you appear to be implying here, which is that we lose all those references.

***

as for the rest of your reply, it just degenerates into meaninglessness...

i have made my point, for the people who care to understand it, so i will
opt out of stupid aspects of this conversation for the rest of the thread.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/7cb3fe03/attachment-0001.html>

From Bowerbird at aol.com  Tue Feb 23 01:31:46 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 04:31:46 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <874.46323226.38b4fa82@aol.com>

gardner said:
>   Interesting.? The colours go a bit wonky at

um, actually, i have the colors totally reversed, all the way through.
and the "bad" line was on the bottom, when i usually put it on top...

i confused myself by looking at some diffs where the o.c.r. was right
and your version was wrong.   they are few and far between, but i still
managed to convince myself that up was down and port was starboard.

sorry!


>    substituted the correct spelling "Sainte H?l?ne" where the

that's a different problem.   my text-editor assumes mac-roman,
so i canceled out the few places where there was high-bit text.


>    But there are places where "my" text contains errors 
>    that I do not believe were in my text.

again, that's because i mixed up the colors.   your text was in red.

i got it straightened out, and will be uploading a new version soon,
so wait until i do that before you spend any more time checking it.
(and actually, you don't have to do anything; i will do a full report.)

i'm sorry, i apologize for the confusion.   i ain't perfect...         :+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/44eef6e5/attachment.html>

From lostpaces.dp at googlemail.com  Tue Feb 23 03:28:48 2010
From: lostpaces.dp at googlemail.com (christine)
Date: Tue, 23 Feb 2010 12:28:48 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
Message-ID: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>

I have been following this conversation, and would like it to lead to a
slight change in the PG pagination rule.

As post-processor at DP-Int, I often have the feeling of uploading
unfinished works because I have to remove the pagination from the text
files. I don't mind in novels, but for books with index, it is another
matter.

Some time ago, I received a mail asking me why I format my page numbers as
invisible in the html files. One of my reason is that I can not see the
point of showing them, since most of everything is linked (TOC, LOI, Index,
etc...)

When I started post-processing, I included the page numbers in the text
files, and often had to remove them to have the project posted. When I asked
about the reason, I received the answers "it disturbs", "there is no need
for it, the reader can do a search", etc.

Now, it does not seem very logical to show them in a linked file (html) and
leave them out in the text file. Doing a search is not a very good solution
either, in most of the case, it is like searching for a Smith in a british
telephon book, more finds than you care for.

As example: I am currently working on "The Life of Napoleon", one of the
index entries says "N. returns to Paris". Well, searching for "Napoleon" or
"Paris" in that project would brings too many finds, as for "returns" this
would not bring you to the proper page either, because, there, you find "is
driven back".

So, would it be possible to just relax that rule and let the volunteer
decides *if* the pagination in the text file of a project is needed or not.

Christine (aka Lostpaces)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/81e76d4d/attachment.html>

From joshua at hutchinson.net  Tue Feb 23 06:46:11 2010
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Tue, 23 Feb 2010 14:46:11 +0000 (GMT)
Subject: [gutvol-d] Re: so what is so important about pagination?
References: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>
Message-ID: <541796544.271024.1266936371271.JavaMail.mail@webmail08>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/de9019db/attachment.html>

From jimad at msn.com  Tue Feb 23 07:32:07 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 23 Feb 2010 07:32:07 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Mirroring the firehost? Re: Re: Many solo
	projects out there in gutvol-d land?
In-Reply-To: <alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
References: <20100221193308.GE10824@pglaf.org>
	<alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
Message-ID: <SNT120-DS3E54E1EB836B9E684109CAE420@phx.gbl>


>The two queues that are most interesting because they are the largest are 
the P3 waiting and F2 waiting.

Actually, the PP queue is the longest, but roughly 1/3 each fall on the P3,
F2, and PP queues.

It would be cool if the PP queue could be presented with HTML headers, and
with the ----File.... pagination stuff stripped -- since this queue already
has much of the HTML markup.  The alternative is to strip the HTML markup
back out before presenting.


From greg at durendal.org  Tue Feb 23 07:41:35 2010
From: greg at durendal.org (Greg Weeks)
Date: Tue, 23 Feb 2010 10:41:35 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Mirroring the firehost? Re:
 Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <SNT120-DS3E54E1EB836B9E684109CAE420@phx.gbl>
References: <20100221193308.GE10824@pglaf.org>
	<alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
	<SNT120-DS3E54E1EB836B9E684109CAE420@phx.gbl>
Message-ID: <alpine.DEB.2.00.1002231037330.29893@durendal.durendal.org>

On Tue, 23 Feb 2010, Jim Adcock wrote:

>
>> The two queues that are most interesting because they are the largest are
> the P3 waiting and F2 waiting.
>
> Actually, the PP queue is the longest, but roughly 1/3 each fall on the P3,
> F2, and PP queues.
>
> It would be cool if the PP queue could be presented with HTML headers, and
> with the ----File.... pagination stuff stripped -- since this queue already
> has much of the HTML markup.  The alternative is to strip the HTML markup
> back out before presenting.

I think the text file for all rounds should be processed a bit to strip 
out the page separators as well as to strip in line markup (i.e. <i> and 
it's brothers) and proofer notes. [** something] I have no problems with 
doing the in-PP projects as well, but most of them will fairly quickly get 
posted. There are exceptions, but most will.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From jimad at msn.com  Tue Feb 23 07:47:14 2010
From: jimad at msn.com (Jim Adcock)
Date: Tue, 23 Feb 2010 07:47:14 -0800
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <bfef.7902e780.38b2f8f0@aol.com>
References: <bfef.7902e780.38b2f8f0@aol.com>
Message-ID: <SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>

>however, once nick upgraded his o.c.r. program, he found that about half of
his checks were no longer required.  they had been necessary essentially as
an artifact of an outdated o.c.r. program.

Begs the question why DP doesn't just institute a quality hosted OCR and let
people just submit the page images. Ask people to test run a couple pages by
the hosted OCR before settling on their digitization settings in order to
make sure they know what they are doing.

>but they could find 90% so easily that ...

Not that I totally disagree, but when you take out the easy stuff, the stuff
that's left is harder to find.  Especially at the P1 level too much cruft
makes for painful proofing -- but so does too little cruft.  Either way you
scare off your newbies, who you need to keep around and happy and convince
them to "progress" to the more difficult and less rewarding levels, such as
P3 and F2. Its not just the cruftiness, but that the current interface
doesn't make fixing common cruftiness easy -- neither on the fingers nor on
the eyes.

Working "solo" I find there are lots of clever schemes I can do to reduce
the amount of "P1" cruft I need to fix -- but "P1" really only takes about
20% of the time and effort I find necessary to make a book.


From dakretz at gmail.com  Tue Feb 23 08:23:30 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 23 Feb 2010 08:23:30 -0800
Subject: [gutvol-d] {Disarmed} Re: [SPAM] Re: Re: [SPAM] Re: Mirroring the
	firehost? Re: Re: Many solo projects out there in gutvol-d land?
In-Reply-To: <alpine.DEB.2.00.1002231037330.29893@durendal.durendal.org>
References: <20100221193308.GE10824@pglaf.org>
	<alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
	<SNT120-DS3E54E1EB836B9E684109CAE420@phx.gbl>
	<alpine.DEB.2.00.1002231037330.29893@durendal.durendal.org>
Message-ID: <627d59b81002230823i699adf47j230306f2f4477206@mail.gmail.com>

If you strip out the page delimiters you'll probably want to do something
intelligent with
text flow across the break for footnotes etc.

On Tue, Feb 23, 2010 at 7:41 AM, Greg Weeks <greg at durendal.org> wrote:

> On Tue, 23 Feb 2010, Jim Adcock wrote:
>
>
>>  The two queues that are most interesting because they are the largest are
>>>
>> the P3 waiting and F2 waiting.
>>
>> Actually, the PP queue is the longest, but roughly 1/3 each fall on the
>> P3,
>> F2, and PP queues.
>>
>> It would be cool if the PP queue could be presented with HTML headers, and
>> with the ----File.... pagination stuff stripped -- since this queue
>> already
>> has much of the HTML markup.  The alternative is to strip the HTML markup
>> back out before presenting.
>>
>
> I think the text file for all rounds should be processed a bit to strip out
> the page separators as well as to strip in line markup (i.e. <i> and it's
> brothers) and proofer notes. [** something] I have no problems with doing
> the in-PP projects as well, but most of them will fairly quickly get posted.
> There are exceptions, but most will.
>
>
> --
> Greg Weeks
> http://durendal.org:8080/greg/
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/62fa9bba/attachment-0001.html>

From lostpaces.dp at googlemail.com  Tue Feb 23 08:39:40 2010
From: lostpaces.dp at googlemail.com (christine)
Date: Tue, 23 Feb 2010 17:39:40 +0100
Subject: [gutvol-d] so what is so important about pagination?
Message-ID: <b35f6ee51002230839s5651a646l62c07d7b031a498a@mail.gmail.com>

The PG template should then be taken into account, but it differs depending
on the WWer ...

It would also require the reader having a reader with line numbers, mine
does not have them.

As for the page numbers disturbing the flow of reading, that is only true
for the first couple of pages, you soon get use to it and do not see them
anymore.

Christine


> ---------- Forwarded message ----------
> From: Joshua Hutchinson <joshua at hutchinson.net>
> To: gutvol-d at lists.pglaf.org
> Date: Tue, 23 Feb 2010 14:46:11 +0000 (GMT)
> Subject: [gutvol-d] Re: so what is so important about pagination?
> I'd suggest a different direction.
>
> If you have a situation like that, don't bother with page numbers (since
> they do disturb the "flow" of reading).  Rather, in the index, use line
> numbers.  Most modern editors will happily tell you what line number you are
> currently on and even jump to a specific line number.  The line number in a
> text file is fixed, so it will be a consistent number to jump to and yet
> doesn't disturb the flow of reading in the text itself.
>
> Now, before you say it, I realize this would require someone to write up a
> quick and dirty script/program to convert page numbers to line numbers in an
> index ... but it's doable.
>
> Josh
>
> On Feb 23, 2010, *christine* <lostpaces.dp at googlemail.com> wrote:
>
> I have been following this conversation, and would like it to lead to a
> slight change in the PG pagination rule.
>
> As post-processor at DP-Int, I often have the feeling of uploading
> unfinished works because I have to remove the pagination from the text
> files. I don't mind in novels, but for books with index, it is another
> matter.
>
> Some time ago, I received a mail asking me why I format my page numbers as
> invisible in the html files. One of my reason is that I can not see the
> point of showing them, since most of everything is linked (TOC, LOI, Index,
> etc...)
>
> When I started post-processing, I included the page numbers in the text
> files, and often had to remove them to have the project posted. When I asked
> about the reason, I received the answers "it disturbs", "there is no need
> for it, the reader can do a search", etc.
>
> Now, it does not seem very logical to show them in a linked file (html) and
> leave them out in the text file. Doing a search is not a very good solution
> either, in most of the case, it is like searching for a Smith in a british
> telephon book, more finds than you care for.
>
> As example: I am currently working on "The Life of Napoleon", one of the
> index entries says "N. returns to Paris". Well, searching for "Napoleon" or
> "Paris" in that project would brings too many finds, as for "returns" this
> would not bring you to the proper page either, because, there, you find "is
> driven back".
>
> So, would it be possible to just relax that rule and let the volunteer
> decides *if* the pagination in the text file of a project is needed or not.
>
> Christine (aka Lostpaces)
>
>
>
> ------------------------------
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/815a607a/attachment.html>

From greg at durendal.org  Tue Feb 23 08:40:16 2010
From: greg at durendal.org (Greg Weeks)
Date: Tue, 23 Feb 2010 11:40:16 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: {Disarmed} Re: [SPAM] Re: Re: [SPAM] Re:
 Mirroring the firehost? Re: Re: Many solo projects out there in gutvol-d
 land?
In-Reply-To: <627d59b81002230823i699adf47j230306f2f4477206@mail.gmail.com>
References: <20100221193308.GE10824@pglaf.org>
	<alpine.DEB.2.00.1002211442010.17501@durendal.durendal.org>
	<SNT120-DS3E54E1EB836B9E684109CAE420@phx.gbl>
	<alpine.DEB.2.00.1002231037330.29893@durendal.durendal.org>
	<627d59b81002230823i699adf47j230306f2f4477206@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002231139450.30115@durendal.durendal.org>

On Tue, 23 Feb 2010, don kretz wrote:

> If you strip out the page delimiters you'll probably want to do something
> intelligent with
> text flow across the break for footnotes etc.

If you can automate it great, but if not then don't bother.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From hart at pglaf.org  Tue Feb 23 09:20:53 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 23 Feb 2010 09:20:53 -0800 (PST)
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <b35f6ee51002230839s5651a646l62c07d7b031a498a@mail.gmail.com>
References: <b35f6ee51002230839s5651a646l62c07d7b031a498a@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002230916570.10422@mail.pglaf.org>


I'm reading in a 34 line window.

Let's say there are only 2 blank lines before and after page numbers,
that's 5 lines, or about 14% of my screen, when they go by.

Anything over 10% is going to be distracting, according to perception
psychologists. . .I find my eye is attracted there. . . .


On Tue, 23 Feb 2010, christine wrote:

> The PG template should then be taken into account, but it differs depending on the WWer ...
>
> It would also require the reader having a reader with line numbers, mine does not have
> them.
>
> As for the page numbers disturbing the flow of reading, that is only true for the first
> couple of pages, you soon get use to it and do not see them anymore.
>
> Christine
>
>
>
>       ---------- Forwarded message ----------
>       From:?Joshua Hutchinson <joshua at hutchinson.net>
>       To:?gutvol-d at lists.pglaf.org
>       Date:?Tue, 23 Feb 2010 14:46:11 +0000 (GMT)
>       Subject:?[gutvol-d] Re: so what is so important about pagination?
>       I'd suggest a different direction.
> If you have a situation like that, don't bother with page numbers (since they do
> disturb the "flow" of reading). ?Rather, in the index, use line numbers. ?Most modern
> editors will happily tell you what line number you are currently on and even jump to
> a specific line number. ?The line number in a text file is fixed, so it will be a
> consistent number to jump to and yet doesn't disturb the flow of reading in the text
> itself.
>
> Now, before you say it, I realize this would require someone to write up a quick and
> dirty script/program to convert page numbers to line numbers in an index ... but it's
> doable.
>
> Josh
>
> On Feb 23, 2010, christine <lostpaces.dp at googlemail.com> wrote:
>
>       I have been following this conversation, and would like it to lead to a
>       slight change in the PG pagination rule.
>
>       As post-processor at DP-Int, I often have the feeling of uploading
>       unfinished works because I have to remove the pagination from the text
>       files. I don't mind in novels, but for books with index, it is another
>       matter.
>
>       Some time ago, I received a mail asking me why I format my page numbers
>       as invisible in the html files. One of my reason is that I can not see
>       the point of showing them, since most of everything is linked (TOC, LOI,
>       Index, etc...)
>
>       When I started post-processing, I included the page numbers in the text
>       files, and often had to remove them to have the project posted. When I
>       asked about the reason, I received the answers "it disturbs", "there is
>       no need for it, the reader can do a search", etc.
>
>       Now, it does not seem very logical to show them in a linked file (html)
>       and leave them out in the text file. Doing a search is not a very good
>       solution either, in most of the case, it is like searching for a Smith in
>       a british telephon book, more finds than you care for.
>
>       As example: I am currently working on "The Life of Napoleon", one of the
>       index entries says "N. returns to Paris". Well, searching for "Napoleon"
>       or "Paris" in that project would brings too many finds, as for "returns"
>       this would not bring you to the proper page either, because, there, you
>       find "is driven back".
>
>       So, would it be possible to just relax that rule and let the volunteer
>       decides *if* the pagination in the text file of a project is needed or
>       not.
>
>       Christine (aka Lostpaces)
>
>
>
> ___________________________________________________________________________________________
>
>       _______________________________________________
>       gutvol-d mailing list
>       gutvol-d at lists.pglaf.org
>       http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
>
>
>

From hart at pglaf.org  Tue Feb 23 09:23:22 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 23 Feb 2010 09:23:22 -0800 (PST)
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <541796544.271024.1266936371271.JavaMail.mail@webmail08>
References: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>
	<541796544.271024.1266936371271.JavaMail.mail@webmail08>
Message-ID: <alpine.DEB.2.00.1002230922350.10422@mail.pglaf.org>


I agree, line numbers would be OK, since usually invisible.

BUT!

Those who reflow would get different line numbers. . . .


On Tue, 23 Feb 2010, Joshua Hutchinson wrote:

> I'd suggest a different direction.
> If you have a situation like that, don't bother with page numbers (since they do disturb
> the "flow" of reading). ?Rather, in the index, use line numbers. ?Most modern editors will
> happily tell you what line number you are currently on and even jump to a specific line
> number. ?The line number in a text file is fixed, so it will be a consistent number to jump
> to and yet doesn't disturb the flow of reading in the text itself.
>
> Now, before you say it, I realize this would require someone to write up a quick and dirty
> script/program to convert page numbers to line numbers in an index ... but it's doable.
>
> Josh
>
> On Feb 23, 2010, christine <lostpaces.dp at googlemail.com> wrote:
>
>       I have been following this conversation, and would like it to lead to a slight
>       change in the PG pagination rule.
>
>       As post-processor at DP-Int, I often have the feeling of uploading unfinished
>       works because I have to remove the pagination from the text files. I don't mind
>       in novels, but for books with index, it is another matter.
>
>       Some time ago, I received a mail asking me why I format my page numbers as
>       invisible in the html files. One of my reason is that I can not see the point
>       of showing them, since most of everything is linked (TOC, LOI, Index, etc...)
>
>       When I started post-processing, I included the page numbers in the text files,
>       and often had to remove them to have the project posted. When I asked about the
>       reason, I received the answers "it disturbs", "there is no need for it, the
>       reader can do a search", etc.
>
>       Now, it does not seem very logical to show them in a linked file (html) and
>       leave them out in the text file. Doing a search is not a very good solution
>       either, in most of the case, it is like searching for a Smith in a british
>       telephon book, more finds than you care for.
>
>       As example: I am currently working on "The Life of Napoleon", one of the index
>       entries says "N. returns to Paris". Well, searching for "Napoleon" or "Paris"
>       in that project would brings too many finds, as for "returns" this would not
>       bring you to the proper page either, because, there, you find "is driven back".
>
>       So, would it be possible to just relax that rule and let the volunteer decides
>       *if* the pagination in the text file of a project is needed or not.
>
>       Christine (aka Lostpaces)
>
>
>
> ___________________________________________________________________________________________
>
>       _______________________________________________
>       gutvol-d mailing list
>       gutvol-d at lists.pglaf.org
>       http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
>

From malcolm.farmer at gmail.com  Tue Feb 23 09:26:19 2010
From: malcolm.farmer at gmail.com (Malcolm Farmer)
Date: Tue, 23 Feb 2010 17:26:19 +0000
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
References: <bfef.7902e780.38b2f8f0@aol.com>
	<SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
Message-ID: <8baaac1d1002230926g2b26171dp84eed4061f21676a@mail.gmail.com>

On 23 February 2010 15:47, Jim Adcock <jimad at msn.com> wrote:

>
> Begs the question why DP doesn't just institute a quality hosted OCR and
> let
> people just submit the page images. Ask people to test run a couple pages
> by
> the hosted OCR before settling on their digitization settings in order to
> make sure they know what they are doing.
>

This was discussed on DP back in 2003. If you have a DP login see here:
http://www.pgdp.net/phpBB2/viewtopic.php?t=5840

And the flaw?
I quote from from a post about finereader in that thread:

------------------------------
I asked the price of Linux development kit. It is 9000 Euro, plus some more
money to get a licence for a fixed number of page/month (500 euro for 25k
pages/month)


(Tesseract might be the way to go, but there's still the chronic shortage of
programmers to implement new DP features.)

Malcolm


Malcolm
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/8abfe39f/attachment.html>

From joshua at hutchinson.net  Tue Feb 23 09:41:23 2010
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Tue, 23 Feb 2010 17:41:23 +0000 (GMT)
Subject: [gutvol-d] Re: so what is so important about pagination?
References: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>
	<541796544.271024.1266936371271.JavaMail.mail@webmail08>
	<alpine.DEB.2.00.1002230922350.10422@mail.pglaf.org>
Message-ID: <1218827037.274545.1266946884045.JavaMail.mail@webmail08>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/1fac52ae/attachment-0001.html>

From marcello at perathoner.de  Tue Feb 23 09:41:44 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 23 Feb 2010 18:41:44 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <b35f6ee51002230839s5651a646l62c07d7b031a498a@mail.gmail.com>
References: <b35f6ee51002230839s5651a646l62c07d7b031a498a@mail.gmail.com>
Message-ID: <4B841358.8000500@perathoner.de>

christine wrote:

> As for the page numbers disturbing the flow of reading, that is only true for 
> the first couple of pages, you soon get use to it and do not see them anymore.

The first error report I got about our mobi files was that there were 
digits in the text flow that didn't belong there.

If you take into account how annoyed a person must be before actually 
writing a complaint, you can judge how big an irritation those page 
numbers are.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From hart at pglaf.org  Tue Feb 23 09:45:28 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 23 Feb 2010 09:45:28 -0800 (PST)
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
 pagination?
In-Reply-To: <6d3.17caba02.38b4f76c@aol.com>
References: <6d3.17caba02.38b4f76c@aol.com>
Message-ID: <alpine.DEB.2.00.1002230944100.10422@mail.pglaf.org>


I'm ready to stop.

You are not talking to me.

You say I am not talking to you.

Who has more practice?

Who started the non-conversation?

If you have a point to make, you missed.

Take better aim and try again, or give it up.


On Tue, 23 Feb 2010, Bowerbird at aol.com wrote:

> michael-
>
> i said it before, and i'll say it again.? but i'm tired of repeating it.
>
> i'm on _your_ side.? i've got your back.
>
> lately, you're beginning to make as little sense as jon noring used to,
> arguing "against" me when you don't even know what you're saying...
>
> i'm really going to have to stop talking to you.? it's a waste of time.
>
> ***
>
> >?? Then say it is an OPTION. . . .
> ...
> >?? /then say OPTION!!!
>
> i don't know how to say it any more clearly than i already have...
>
> the whole point is that some people _want_ pagination, and i said
> "why not give it to them?"? what is that, if it's not an _option?_
>
> and as for the people who don't want to be bothered with pagination,
> i've said "why not give them what they want as well?"? as _their_ option.
>
> i've even devised a format, and written the code, that will allow _both_
> parties to get what they want.? precisely because i believe in _options_.
>
> does p.g. offer people who want pagination a way for them to get it?
> no, it doesn't.? and yet you yabber on about "options", like you owned
> the concept or something.? why don't you start giving people options?
>
> you even go on to say "love it" when lee passey offers up a convoluted
> and cockamamie scheme with complex and unnecessary .html coding.
>
> does project gutenberg have a page on its website that will _unwrap_
> an e-text?? no, it doesn't.? but i have a page on my site that'll do that...
>
>
> >?? However, if references were made to PHRASES and NOT PAGES,
> >?? this would work EVER SO MUCH BETTER!!!!!!!
>
> lay off the exclamation points, old man.? they're making you punch drunk.
>
> making references to "phrases" might work if we did it from here on out,
> but there'd still be millions and millions of references to pagenumbers
> in the corpus that represents our cultural heritage.? i know you to be a
> strong supporter of that cultural heritage, so i know you can't mean what
> you appear to be implying here, which is that we lose all those references.
>
> ***
>
> as for the rest of your reply, it just degenerates into meaninglessness...
>
> i have made my point, for the people who care to understand it, so i will
> opt out of stupid aspects of this conversation for the rest of the thread.
>
> -bowerbird
>
>

From lostpaces.dp at googlemail.com  Tue Feb 23 10:03:09 2010
From: lostpaces.dp at googlemail.com (christine)
Date: Tue, 23 Feb 2010 19:03:09 +0100
Subject: [gutvol-d] so what is so important about pagination?
Message-ID: <b35f6ee51002231003y475d61e1w61a7b13ffcae5fa6@mail.gmail.com>

I am not talking about adding big chunk of blank lines, but of either
including the page number like that [p.101] in the text, I do not like this
solution since it disturb when doing a search;

or adding the page number at the start of the 1. paragraph starting on a
page, e.g.:

change of page,
blah blah end of paragraph.

[p.101]Text of the new paragraph.

With a TN explaining that the page numbers are not accuratly placed at the
start of the page, but at the start of the 1. paragraph starting on that
page. The aim is to make the use of the files as easy and comfortable to the
reader as possible.

Christine

>From: "Michael S. Hart" <hart at pglaf.org>
>
>
>I'm reading in a 34 line window.
>
>Let's say there are only 2 blank lines before and after page numbers,
>that's 5 lines, or about 14% of my screen, when they go by.
>
>Anything over 10% is going to be distracting, according to perception
>psychologists. . .I find my eye is attracted there. . . .
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/d289baf7/attachment.html>

From grythumn at gmail.com  Tue Feb 23 10:14:49 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Tue, 23 Feb 2010 13:14:49 -0500
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <1218827037.274545.1266946884045.JavaMail.mail@webmail08>
References: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>
	<541796544.271024.1266936371271.JavaMail.mail@webmail08>
	<alpine.DEB.2.00.1002230922350.10422@mail.pglaf.org>
	<1218827037.274545.1266946884045.JavaMail.mail@webmail08>
Message-ID: <15cfa2a51002231014q42e15292sf863f60d0578bcff@mail.gmail.com>

I've seen paragraph numbers used as anchors in a few contexts. Line
numbers also work if you use hidden anchors in the HTML. Webscriptions
uses something along these lines, IIRC, with a small javascript box in
the navigation frame displaying the current paragraph number.

Won't work in epub or other derivative formats, though.

-Bob

On Tue, Feb 23, 2010 at 12:41 PM, Joshua Hutchinson
<joshua at hutchinson.net> wrote:
> True, but the text files usually aren't reflowed (we've all talked ad-naseum
> about how easy/hard that is to do, so I won't go back into that).
> Eventually, you run into limitations. ?If you chose path X, the path Y
> people will not get what they want.
> Josh
>
> On Feb 23, 2010, Michael S. Hart <hart at pglaf.org> wrote:
>
> I agree, line numbers would be OK, since usually invisible.
>
> BUT!
>
> Those who reflow would get different line numbers. . . .

From marcello at perathoner.de  Tue Feb 23 10:22:27 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 23 Feb 2010 19:22:27 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <b35f6ee51002231003y475d61e1w61a7b13ffcae5fa6@mail.gmail.com>
References: <b35f6ee51002231003y475d61e1w61a7b13ffcae5fa6@mail.gmail.com>
Message-ID: <4B841CE3.3060306@perathoner.de>

christine wrote:
> I am not talking about adding big chunk of blank lines, but of either including 
> the page number like that [p.101] in the text, I do not like this solution since 
> it disturb when doing a search;
> 
> or adding the page number at the start of the 1. paragraph starting on a page, e.g.:
> 
> change of page,
> blah blah end of paragraph.
> 
> [p.101]Text of the new paragraph.
> 
> With a TN explaining that the page numbers are not accuratly placed at the start 
> of the page, but at the start of the 1. paragraph starting on that page. The aim 
> is to make the use of the files as easy and comfortable to the reader as possible.

Probably would not work with Ulysses or Kant or many other texts that 
have multi-page paragraphs.


So we are left with:

1. put the page number where it belongs, even if its in the middle of a 
word, or

2. put the page number where it irritates less, even if its up to one 
whole paragraph off.

Of course 2. would make it useless for indices, the use case you 
advocated in the first place.

Both cases are unsatisfying, IMO.


The only place where page numbers belong is in invisible markup that can 
be made visible on demand (CSS edit or so).


-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Tue Feb 23 10:27:08 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 13:27:08 EST
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
	pagination?
Message-ID: <ef6c.db2697d.38b577fc@aol.com>

michael said:
>   If you have a point to make, you missed.

i think lots of people got the point.   but you evidently missed it.


>    Take better aim and try again, or give it up.

my aim is true.   i've done enough.   i have no need to "give up",
_or_ to continue.   the future is the entity that will deal with this.
doesn't matter how much friction we generate.   the future will deal.

and this is how the future will do that dealing:

1.   it will need to tell which texts are accurate and which are not,
because it will find itself awash in e-books without provenance...

2.   it will do this determination by comparing digital text to scans.
(it will test the validity of the scans by comparing them to p-books.)

3.   digital text which retained pagination/linebreaks will lend itself
better to this comparison process, so it will generally tend to "win",
and the "best practice" will come to require this aspect, and strictly.

4.   digital text which is difficult to get to (e.g., zipped up in .epub,
obscured by some opaque format, unduly burdened by d.r.m., etc.)
and text which has discarded its pagination/linebreaks will lose out.

i have no interest in engaging in "debate" on these 4 points, 
because it's perfectly clear to me that they're all _inevitable_.

the question here is whether p.g. wants its e-texts in group #3 or #4.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/a950dca7/attachment.html>

From Bowerbird at aol.com  Tue Feb 23 10:27:56 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 13:27:56 EST
Subject: [gutvol-d] Re: let us not be confused
Message-ID: <efe7.a6d588a.38b5782c@aol.com>

jim said:
>   Not that I totally disagree, but 
>    when you take out the easy stuff, 
>    the stuff that's left is harder to find.

where's your data that validates that?

i think when you clear the page of
"the easy stuff", it becomes _easier_
to find "the stuff that's left", because
you're not distracted by the easy stuff.

it might _feel_ "less effective", since
you're making fewer corrections, but
the ones you do make are more vital.

what usually happens is that it takes
one round of proofing to remove the
easy stuff, and another for the rest...

what i say is to make the first round
a tool-aided round, to preserve your
human resources, so the humans
only have to do one word-by-word
round, which is the difficult process.

besides, it's not as if zero satisfaction
comes in the tool-aided round.   myself,
i feel _greater_ satisfaction there, since
my efficiency is boosted _considerably._

and in a roundless system, anything that
moves a page closer to "finished status"
is a good thing, because that's the goal.

just _offer_ people a good tool to use;
you will find they enjoy it immensely...

by the way, dkretz has a new version of
his tool available now, at the usual place:
>    http://code.google.com/p/dp50/downloads/list

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/77ff4baa/attachment-0001.html>

From lostpaces.dp at googlemail.com  Tue Feb 23 10:38:23 2010
From: lostpaces.dp at googlemail.com (christine)
Date: Tue, 23 Feb 2010 19:38:23 +0100
Subject: [gutvol-d] so what is so important about pagination?
Message-ID: <b35f6ee51002231038j4e35a040g1c7efc981d522371@mail.gmail.com>

The format text file is unsatisfying anyway, but as long as we insist on
having it, we should make the best of it. Even if the page numbers are only
given every second pages because of long paragraphs, it is IMO still better
than not giving any at all.

And putting it at the start of a paragraph would not make it useless in case
of indices. It would reduce the search of the user to a couple of pages
instead of a lot of them.

Christine


>
>
> ---------- Forwarded message ----------
> From: Marcello Perathoner <marcello at perathoner.de>
> To: Project Gutenberg Volunteer Discussion <gutvol-d at lists.pglaf.org>
> Date: Tue, 23 Feb 2010 19:22:27 +0100
> Subject: [gutvol-d] Re: so what is so important about pagination?
> christine wrote:
>
>> I am not talking about adding big chunk of blank lines, but of either
>> including the page number like that [p.101] in the text, I do not like this
>> solution since it disturb when doing a search;
>>
>> or adding the page number at the start of the 1. paragraph starting on a
>> page, e.g.:
>>
>> change of page,
>> blah blah end of paragraph.
>>
>> [p.101]Text of the new paragraph.
>>
>> With a TN explaining that the page numbers are not accuratly placed at the
>> start of the page, but at the start of the 1. paragraph starting on that
>> page. The aim is to make the use of the files as easy and comfortable to the
>> reader as possible.
>>
>
> Probably would not work with Ulysses or Kant or many other texts that have
> multi-page paragraphs.
>
>
>
> So we are left with:
>
> 1. put the page number where it belongs, even if its in the middle of a
> word, or
>
> 2. put the page number where it irritates less, even if its up to one whole
> paragraph off.
>
> Of course 2. would make it useless for indices, the use case you advocated
> in the first place.
>
> Both cases are unsatisfying, IMO.
>
>
> The only place where page numbers belong is in invisible markup that can be
> made visible on demand (CSS edit or so).
>
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/21743851/attachment.html>

From vze3rknp at verizon.net  Tue Feb 23 11:29:19 2010
From: vze3rknp at verizon.net (Juliet Sutherland)
Date: Tue, 23 Feb 2010 14:29:19 -0500
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
References: <bfef.7902e780.38b2f8f0@aol.com>
	<SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
Message-ID: <4B842C8F.5020106@verizon.net>


On 2/23/2010 10:47 AM, Jim Adcock wrote:
>
> Begs the question why DP doesn't just institute a quality hosted OCR and let
> people just submit the page images. Ask people to test run a couple pages by
> the hosted OCR before settling on their digitization settings in order to
> make sure they know what they are doing.
>    
Having done OCR on several thousand books, I can safely say that even 
with the most advanced OCR programs currently available, this is NOT a 
good idea for books of any complexity at all. It might be OK for 
straight fiction. The big stumbling block is that ABBYY often segments 
the page incorrectly or orders the segments incorrectly. A classic 
example often comes up in the Table of Contents where it may group all 
of the chapter titles into one block and then the page numbers into 
another block. When this is saved as plain text for proofing, that will 
make lines of chapter titles appear, followed by lines of the page 
numbers where what we really want is for the page number to appear on 
the same line as the chapter title. Not much fun for the proofers to 
clean up. ABBYY 10 does much better than the previous version I was 
using, but still sometimes gets things wrong. Getting blocks of text in 
the wrong order, as can sometimes happen when there are multiple 
illustrations on a page dividing the text up into separate blocks, is 
equally bad. Another common OCR error is missing the last word of a 
paragraph when it appears by itself on a line.

When I scan a book, I keep an eye out for any pages having anything 
other than a single solid block of text. If the book has any, I'll then 
go through page by page to make sure that the OCR got the text block 
segmentation and order correct. I often end up redrawing the text 
blocks, sometimes re-ordering them, and then running the OCR a second 
time on that page. I would  not trust a "batch" or "remote" OCR program 
to do this correctly. Despite assertions to the contrary, the content 
providers at DP do go to some considerable lengths to make things easier 
for the other volunteers.

There are other problems with providing a central OCR service, which 
include expense, processing load, etc. But to my mind, the definitive 
problem is what I outlined above. Without an interactive capability OCR 
results often are not good enough for books of any complexity. And 
before someone says "so make the OCR engine on the server be 
interactive", let me say that communication and processing costs would 
be prohibitively expensive, and further, the OCR engines that are sold 
for that kind of multiple user, production environment use, don't (as 
far as I know) make that kind of interaction easy to accomplish.

JulietS

From Bowerbird at aol.com  Tue Feb 23 11:52:15 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 14:52:15 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <12071.4fc084b3.38b58bef@aol.com>

i did some more work on gardner's book, "the advocate". 

the (presumably incorrect) lines from the o.c.r. are in red,
while the lines from gardner's proofed copy are in blue...
(at least i hope i didn't screw up the colors like last time.)

here's a sample url, using page-by-page mode:
>?? http://z-m-l.com/go/gardn/gardnp123.html

here's a single web-page containing the whole book:
>    http://z-m-l.com/go/gardn/gardn-hybrid6.html

and here's a web-page with _just_ the diffs on it:
>    http://z-m-l.com/go/gardn/gardn-159diffs.html

we've got 159 diffs.   that's about all the preprocessing
that can be done on this file.   159 diffs on ~4000 lines is
_great._   saves a ton of time on word-by-word proofing.

in this last round of preprocessing, i did a spellcheck...

i should explain that i do not use a regular spellchecker.

i copy all the text into a program (which i coded) that
spits out a list of all the words _not_ in its dictionary.
if you're interested, here's the dictionary that it uses:
>    http://z-m-l.com/go/regulardictionary.txt

then i go through the list and cull out the words that
-- for whatever reason -- look to me to be correct...
(most typically, they're names and compound words,
neither of which class happens to be in my dictionary;
doing this culling saves me a lot of time that would be
spent checking and then by-passing these valid words;
it's basically a quick way to create a custom dictionary.)

this leaves me with a list of words that i want to check,
which i then feed into my banana-cream app, so it will
show me all the occurrences of those words in the text,
with the relevant page-scan displayed at the same time,
so that i can check the text against its scan.   it's handy.

by the way, i can also feed banana-cream a list of lines,
which it will then search for and display automatically...
this is what i'll do to resolve those diffs displayed here:
>    http://z-m-l.com/go/gardn/gardn-159diffs.html

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/0fad88bc/attachment.html>

From traverso at posso.dm.unipi.it  Tue Feb 23 13:11:02 2010
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Tue, 23 Feb 2010 22:11:02 +0100 (CET)
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <4B842C8F.5020106@verizon.net> (message from Juliet Sutherland on
	Tue, 23 Feb 2010 14:29:19 -0500)
References: <bfef.7902e780.38b2f8f0@aol.com>
	<SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
	<4B842C8F.5020106@verizon.net>
Message-ID: <20100223211102.22765FFB3@cardano.dm.unipi.it>


Making an OCR batch service at DP would be pointless, since the
Internet Archive already does it: their OCR is as good as what you can
do yourself if you don't act manually during the recognition
(training, drawing blocks, etc.).

And for contents not taken from TIA you can just donate the scans at
TIA and let them do the OCR for you.

Carlo

From Bowerbird at aol.com  Tue Feb 23 13:29:05 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 16:29:05 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <14cce.3233a3c9.38b5a2a1@aol.com>

i've analyzed the 159 diffs on gardner's books,
and placed them into the following categories:

    60 comma/period diffs
    33 quotemark diffs
    29 stealth-scanno diffs
    12 various-punctuation diffs
    12 formatting diffs
    09 real-spelling-question diffs
    04 should-have-been-preprocessed diffs
----------
159 total

note that i haven't resolved the diffs yet, i've just
classified them, so we can understand their nature.

***

more than anything else, these finds reflect the fact
that when archive.org does o.c.r. on a google scanset,
the results are not as impressive as they might be...

the o.c.r. attained by archive.org on its own scansets
is _much_ better.

although the comma/period problem is a common one,
the high number of confusions in this text is strange...
ditto with the quotemark diffs.   (but, for the record, i 
could've detected many of those with my quote-check,
but i didn't bother to run it, since i had gardner's text.)

i don't know if the google scans need to be despeckled,
or if the contrast is bad on them, or what the problem is,
and since google probably wouldn't improve them for us
anyway, even if we knew the reason and asked 'em to fix it,
it's probably not worth figuring out.   or maybe it would be,
if there was something that we could do to fix it ourselves.

lucky for me, i don't have much investment in any book,
and if i did have one, i would be willing to scan it myself,
so i'm not desperate enough to deal with a google scanset;
i can afford to wait until archive.org scans the book instead.

one final note is that the number of stealth scannos is high,
at least relative to the other archive.org books i've checked.

again, this might be due to the low quality of google's scans,
but it might also be that archive.org has activated a kind of
"intelligence" that tries to _guess_ a word it cannot recognize,
and the guessing routine is bad, at least from our standpoint.
we should certainly hope and pray that's _not_ the case, since
stealth scannos are hidden icebergs to a comparison method.

you can find my breakdown of the 159 diffs into categories at:
>    http://z-m-l.com/go/gardn/gardn-diffs-classes.html

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/76988208/attachment.html>

From hart at pglaf.org  Tue Feb 23 15:17:02 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 23 Feb 2010 15:17:02 -0800 (PST)
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
 pagination?
In-Reply-To: <ef6c.db2697d.38b577fc@aol.com>
References: <ef6c.db2697d.38b577fc@aol.com>
Message-ID: <alpine.DEB.2.00.1002231506310.25761@mail.pglaf.org>


On Tue, 23 Feb 2010, Bowerbird at aol.com wrote:

> michael said:
> >?? If you have a point to make, you missed.
>
> i think lots of people got the point.? but you evidently missed it.

You missed your real point, that it was already time for you to quit.


> >?? Take better aim and try again, or give it up.
>
> my aim is true.? i've done enough.? i have no need to "give up",
> _or_ to continue.? the future is the entity that will deal with this.
> doesn't matter how much friction we generate.? the future will deal.

Gee, I wonder from whom you learned to say that?


> and this is how the future will do that dealing:
>
> 1.? it will need to tell which texts are accurate and which are not,
> because it will find itself awash in e-books without provenance...

As Paul Klipsch is famous for saying:

"There is no such thing as hi-fi.
It is fi or not fi, that is all."

As I am famous for saying, there aren't any accurate books on paper,
in computers, on parchment, scrolls, etc., at least not many, and it
appears to me that I should perhaps even include written in stone.

However, when all is said and done with eBooks, or even half of it--
eBooks will be more correct than the originals.

The paper books will never become as correct/accurate.


> 2.? it will do this determination by comparing digital text to scans.
> (it will test the validity of the scans by comparing them to p-books.)

Valid. . .what a concept!

Reliable. . .a little better.

Correct. . .only as far as mimic, copy, etc.

Accurate. . .you have to ask the authors, they always say NO!


> 3.? digital text which retained pagination/linebreaks will lend itself
> better to this comparison process, so it will generally tend to "win",
> and the "best practice" will come to require this aspect, and strictly.

You just cannot get up there and SAY. . ."OPTION". . .can you???

It will be a tool, but most of those page numbers will not exist for most
readers in the future, just as no one knows the page numbers of Homer.

;-)

OK. . .of Shakespeare. . . .


> 4.? digital text which is difficult to get to (e.g., zipped up in .epub,
> obscured by some opaque format, unduly burdened by d.r.m., etc.)
> and text which has discarded its pagination/linebreaks will lose out.

A digital text wich is difficult to get to, i.e. headers, footers, page #s,
will lose out to the majority.


>
> i have no interest in engaging in "debate" on these 4 points,
> because it's perfectly clear to me that they're all _inevitable_.

> the question here is whether p.g. wants its e-texts in group #3 or #4.


The question is will you put your money where your mouth is???


Wagers?

From hart at pglaf.org  Tue Feb 23 15:19:42 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 23 Feb 2010 15:19:42 -0800 (PST)
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <b35f6ee51002231003y475d61e1w61a7b13ffcae5fa6@mail.gmail.com>
References: <b35f6ee51002231003y475d61e1w61a7b13ffcae5fa6@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002231518230.25761@mail.pglaf.org>


Won't ANY kind of header, footer, page indicator "disturb when doing a search"

???


On Tue, 23 Feb 2010, christine wrote:

> I am not talking about adding big chunk of blank lines, but of either including the page
> number like that [p.101] in the text, I do not like this solution since it disturb when
> doing a search;
>
> or adding the page number at the start of the 1. paragraph starting on a page, e.g.:
>
> change of page,
> blah blah end of paragraph.
>
> [p.101]Text of the new paragraph.
>
> With a TN explaining that the page numbers are not accuratly placed at the start of the
> page, but at the start of the 1. paragraph starting on that page. The aim is to make the
> use of the files as easy and comfortable to the reader as possible.
>
> Christine
>
> >From:?"Michael S. Hart" <hart at pglaf.org>
> >
> >
> >I'm reading in a 34 line window.
> >
> >Let's say there are only 2 blank lines before and after page numbers,
> >that's 5 lines, or about 14% of my screen, when they go by.
> >
> >Anything over 10% is going to be distracting, according to perception
> >psychologists. . .I find my eye is attracted there. . . .
>
>
>
>
>
>
>

From prosfilaes at gmail.com  Tue Feb 23 15:51:32 2010
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 23 Feb 2010 18:51:32 -0500
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <541796544.271024.1266936371271.JavaMail.mail@webmail08>
References: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>
	<541796544.271024.1266936371271.JavaMail.mail@webmail08>
Message-ID: <6d99d1fd1002231551k5454de6es87bf3d0c63a94146@mail.gmail.com>

On Tue, Feb 23, 2010 at 9:46 AM, Joshua Hutchinson
<joshua at hutchinson.net> wrote:
> ?Rather, in the index, use line numbers.

Two problems: you don't know what the line numbers are until you have
the header put on. Secondly, anything that removes that header
destroys the line numbers.

-- 
Kie ekzistas vivo, ekzistas espero.

From Bowerbird at aol.com  Tue Feb 23 15:57:20 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 18:57:20 EST
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
	pagination?
Message-ID: <19c21.2f59db3e.38b5c560@aol.com>

michael said:
>   You missed your real point, that it was already time for you to quit.

i seem to have found it now.

you're wrong, michael, and the future will deliver the _real_ point.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/76873db0/attachment.html>

From Bowerbird at aol.com  Tue Feb 23 16:00:12 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 19:00:12 EST
Subject: [gutvol-d] Re: so what is so important about pagination?
Message-ID: <19d99.58cae61c.38b5c60c@aol.com>

michael said:
>   Won't ANY kind of header, footer, page indicator 
>    "disturb when doing a search"

only if the viewer-app is too stupid to know 
it should skip such things when searching...
(unless you tell it specifically to include 'em).

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/ea354896/attachment.html>

From dakretz at gmail.com  Tue Feb 23 16:04:27 2010
From: dakretz at gmail.com (don kretz)
Date: Tue, 23 Feb 2010 16:04:27 -0800
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <19d99.58cae61c.38b5c60c@aol.com>
References: <19d99.58cae61c.38b5c60c@aol.com>
Message-ID: <627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>

And anything with line numbers won't have consumer appeal -
it was obviously produced by geeks for geeks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/9cdbf483/attachment.html>

From hart at pglaf.org  Tue Feb 23 16:10:12 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Tue, 23 Feb 2010 16:10:12 -0800 (PST)
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
 pagination?
In-Reply-To: <19c21.2f59db3e.38b5c560@aol.com>
References: <19c21.2f59db3e.38b5c560@aol.com>
Message-ID: <alpine.DEB.2.00.1002231607250.25761@mail.pglaf.org>


On Tue, 23 Feb 2010, Bowerbird at aol.com wrote:

> michael said:
> >?? You missed your real point, that it was already time for you to quit.
>
> i seem to have found it now.
>
> you're wrong, michael, and the future will deliver the _real_ point.

The future WILL deliver the real point, but it's rarely as cool as we think.

My pocket terabyte is one of the only things that is as cool as I expected,
the again, it's just plain hardware.

Just think how much we could do with the new computers if we all ran commands
instead of pointing and clicking and watching that damn hourglass. . . .

From Bowerbird at aol.com  Tue Feb 23 16:22:40 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 23 Feb 2010 19:22:40 EST
Subject: [gutvol-d] the d.p. opinion on "prerelease" of e-texts
Message-ID: <1a810.2f4a8a5b.38b5cb50@aol.com>

just in case you were wondering, dkretz _did_
copy michael hart's saturday post over to d.p.,
giving michael's position in favor of prerelease
of texts while they're still "in progress" at d.p.

that was on saturday evening, and since then
there have been over 100 posts in response:
>    http://www.pgdp.net/phpBB2/viewtopic.php?t=42940

there was an earlier discussion on the issue too:
>    http://www.pgdp.net/phpBB2/viewtopic.php?t=42695

the earlier discussion had over 100 posts as well,
and had a poll, which got a high number of votes.

the poll question?

>    Making preview texts available on a dedicated site, 
>    with a clear label that they're work in progress. 
>    Good idea/bad idea?

the poll results?

>    45% -- I'd rather you didn't
>    17% -- I couldn't care less
>    36% -- Great idea!

some people are quite agitated at the thought of the idea.
they drag out the old rot about how "d.p. means _quality_".

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100223/972f9251/attachment.html>

From schultzk at uni-trier.de  Wed Feb 24 01:57:11 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 24 Feb 2010 10:57:11 +0100
Subject: [gutvol-d] Re: Translation project
In-Reply-To: <Pine.GSO.4.58.1002222146490.6845@vtn1.victoria.tc.ca>
References: <20100222223941.0dedd0f3f91314fbc67db20f64e304ca.5624d01ea5.wbe@email05.secureserver.net>
	<Pine.GSO.4.58.1002222146490.6845@vtn1.victoria.tc.ca>
Message-ID: <099319D1-95DE-4E05-8C09-E5A0BF62FD2B@uni-trier.de>

Hi Carel,

	Translation is a fine art. You have to be proficient in
	both languages and should be a nativer speaker of the
	language you translate into.
	Furthermore there is more to translation than taking 
	the text and translating the words, systanx, and semantics.
	You have to do research on the author, historical research
	to understand the mood in which a text was written.
	As proof of fact just take a look at Shakespeare.
	
	I have to agree with Andrew that a collabarative approach is not
	good. You must have strict rules in place when a group is
	working together. 

	I do not mean to discourage you, but want to make
	sure you understand the task of translating.

	regards
		Keith.

	
Am 23.02.2010 um 07:00 schrieb Andrew Sly:

> No particular site springs to mind. Perhaps if we knew the
> language you have in mind it might be easier to go searching
> for possibilities.
> 
> One barrier to that kind of collabarative translation is that
> there is usually not one "proper and correct" way to translate
> a given phrase. There can be different word order, differences
> in shades of meaning and tone, etc. If you have different people
> translate different parts of a text, you can end up with great
> lack of a consistent style.
> 
> It's not impossible though. Take this example:
> http://www.gutenberg.org/etext/28971
> 
> It was worked on by a small group of people specifically
> for inclusion in PG. A couple people translated different
> chunks of the text from English, then after the whole thing
> was put together, we did a few proof-reading runs, tried to
> even out inconsistencies, ran a spell check, etc.
> 
> This was done through email and using a wiki.
> 
> --Andrew
> 
> On Mon, 22 Feb 2010 cmiske at ashzfall.com wrote:
> 
>> I've just recently started studying a language that is foreign to me.
>> Although I am a long way off from being able to translate anything
>> beyond an early reader, I was wondering if there is a translation
>> project out there that is similar to DP? I looked about the PG site, but
>> could not find a link to anything of that nature.
>> 
>> Thanks;
>> Carel
>> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From schultzk at uni-trier.de  Wed Feb 24 02:23:18 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 24 Feb 2010 11:23:18 +0100
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
	pagination?
In-Reply-To: <ef6c.db2697d.38b577fc@aol.com>
References: <ef6c.db2697d.38b577fc@aol.com>
Message-ID: <89C594A9-5D5B-420F-8598-2760F4E95AFD@uni-trier.de>

Hi All,

	Been following this thread loosely.

	But the fact remains that the problem is the encoding
	of the digital text!!! NOT, the output. 
	
	I have said it time and time again, the the encoding has
	to encode the information of the original scan or p-text.
	Thats is a rendering of such an encoding will produce
	a fairly close copy of the original. (I use rendering
	here as a way of producing the text; either in text or
	display)

	For the output you simply use those parts of the
	encoding for your output that you need.
	
	The page number problem can be handled because
	out of convention they are placed in the header or 
	footer of a page. So if you do not want the original
	pagenumbers then just skip them. 
	If you want the original page breaks use them and 
	output it where you want it.

	I agree with BB, but we have different ways about doing it.

	
Am 23.02.2010 um 19:27 schrieb Bowerbird at aol.com:

> michael said:
> >   If you have a point to make, you missed.
> 
> i think lots of people got the point.  but you evidently missed it.
> 
> 
> >   Take better aim and try again, or give it up.
> 
> my aim is true.  i've done enough.  i have no need to "give up",
> _or_ to continue.  the future is the entity that will deal with this.
> doesn't matter how much friction we generate.  the future will deal.
> 
> and this is how the future will do that dealing:
> 
> 1.  it will need to tell which texts are accurate and which are not,
> because it will find itself awash in e-books without provenance...
> 
> 2.  it will do this determination by comparing digital text to scans.
> (it will test the validity of the scans by comparing them to p-books.)
> 
> 3.  digital text which retained pagination/linebreaks will lend itself
> better to this comparison process, so it will generally tend to "win",
> and the "best practice" will come to require this aspect, and strictly.
> 
> 4.  digital text which is difficult to get to (e.g., zipped up in .epub,
> obscured by some opaque format, unduly burdened by d.r.m., etc.)
> and text which has discarded its pagination/linebreaks will lose out.
> 
> i have no interest in engaging in "debate" on these 4 points, 
> because it's perfectly clear to me that they're all _inevitable_.
> 
> the question here is whether p.g. wants its e-texts in group #3 or #4.
> 
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/10b0939e/attachment-0001.html>

From schultzk at uni-trier.de  Wed Feb 24 02:31:27 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 24 Feb 2010 11:31:27 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
References: <19d99.58cae61c.38b5c60c@aol.com>
	<627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
Message-ID: <9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>

Hi Don,

	if I remember correctly the discussion of using editors and
	you can turn linenumber on and off.

	Also, if you have ever been in a literature class you would
	know that in poems and drama one uses linenumbers as
	references. So it is not that geeky.

	On another point. Could you please add in so quotes
	from the original post you are referring to. It is sometimes
	hard to follow to whom and what actually you are
	commenting on.    Thanx
	
	regards
		Keith.


Am 24.02.2010 um 01:04 schrieb don kretz:

> And anything with line numbers won't have consumer appeal -
> it was obviously produced by geeks for geeks.


From hart at pglaf.org  Wed Feb 24 04:18:31 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 24 Feb 2010 04:18:31 -0800 (PST)
Subject: [gutvol-d] Our 400th Portuguese eBook
Message-ID: <alpine.DEB.2.00.1002240416410.14865@mail.pglaf.org>


We are up to #398 and I have not received any suggestions for #400.

Should we reserve one?

Just go with whatever comes along?

Other suggestions?


Thanks!!!


Michael

From joshua at hutchinson.net  Wed Feb 24 05:34:00 2010
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed, 24 Feb 2010 13:34:00 +0000 (GMT)
Subject: [gutvol-d] Re: so what is so important about pagination?
References: <b35f6ee51002230328l14af53aeu3406929738cc197e@mail.gmail.com>
	<541796544.271024.1266936371271.JavaMail.mail@webmail08>
	<6d99d1fd1002231551k5454de6es87bf3d0c63a94146@mail.gmail.com>
Message-ID: <1360839564.287591.1267018440992.JavaMail.mail@webmail08>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/fada0bb5/attachment.html>

From dakretz at gmail.com  Wed Feb 24 09:18:07 2010
From: dakretz at gmail.com (don kretz)
Date: Wed, 24 Feb 2010 09:18:07 -0800
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>
References: <19d99.58cae61c.38b5c60c@aol.com>
	<627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
	<9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>
Message-ID: <627d59b81002240918t6fc38b26lb63460419c3f8a3e@mail.gmail.com>

OK. It's only geeky if you put line numbers where they weren't in the
original.

How's that?

On Wed, Feb 24, 2010 at 2:31 AM, Keith J. Schultz <schultzk at uni-trier.de>wrote:

> Hi Don,
>
>        if I remember correctly the discussion of using editors and
>        you can turn linenumber on and off.
>
>        Also, if you have ever been in a literature class you would
>        know that in poems and drama one uses linenumbers as
>        references. So it is not that geeky.
>
>        On another point. Could you please add in so quotes
>        from the original post you are referring to. It is sometimes
>        hard to follow to whom and what actually you are
>        commenting on.    Thanx
>
>        regards
>                Keith.
>
>
> Am 24.02.2010 um 01:04 schrieb don kretz:
>
> > And anything with line numbers won't have consumer appeal -
> > it was obviously produced by geeks for geeks.
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/a1f1c06f/attachment.html>

From cmiske at ashzfall.com  Wed Feb 24 09:51:34 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 24 Feb 2010 10:51:34 -0700
Subject: [gutvol-d] Re: Translation project
Message-ID: <20100224105134.0dedd0f3f91314fbc67db20f64e304ca.7a021e77f6.wbe@email05.secureserver.net>


Keith and Andrew: Thank you so much for your replies.
 
I am in the process of learning Spanish (the 'textbook' version they
tend to teach in the US school system). In my readings I have already
encountered many moments of "I know what it is saying, but what does it
mean?" I lack the cultural background and ability to 'play' with words
that is necessary for high-level translation.
 
What I was actually thinking of was more along the lines of the
translation of early readers (such as the Dick and Jane series) into a
multitude of languages. As books of this nature are intended to have a
rather limited vocabulary scope and to facilitate the learning of an
individual's 'native' language, I thought these books might also be
helpful if translated into multiple languages. When reading these books
it would also be possible for the language learners to refer to the
edition in their native language for guidance. I'm not sure that this
sort of translation process would actually create something that PG
would even want. 

Is there much interest beyond my own to see "See Spot. See Spot run...."
translated into a dozen or more languages? I assumed that if there were
much interest, there would already be such a project out there on the
internet.
 
There are, of course, books of this nature on the market, but they are
rather limited in availability and somewhat pricey. A free archive would
be nice.
 
I agree that the majority of works of literature should be left to the
experts and native speakers of both languages. Andrew's project was a
fine example of a collaborative effort. I hope more people get together
in small groups (or individually) to translate more works. Their efforts
are appreciated.
 
Carel
 
-------- Original Message --------
Subject: [gutvol-d] Re: Translation project
From: "Keith J. Schultz" <schultzk at uni-trier.de>
Date: Wed, February 24, 2010 2:57 am
To: Project Gutenberg Volunteer Discussion <gutvol-d at lists.pglaf.org>

Hi Carel,

Translation is a fine art. You have to be proficient in
both languages and should be a nativer speaker of the
language you translate into.
Furthermore there is more to translation than taking 
the text and translating the words, systanx, and semantics.
You have to do research on the author, historical research
to understand the mood in which a text was written.
As proof of fact just take a look at Shakespeare.

I have to agree with Andrew that a collabarative approach is not
good. You must have strict rules in place when a group is
working together. 

I do not mean to discourage you, but want to make
sure you understand the task of translating.

regards
Keith.


Am 23.02.2010 um 07:00 schrieb Andrew Sly:

> No particular site springs to mind. Perhaps if we knew the
> language you have in mind it might be easier to go searching
> for possibilities.
> 
> One barrier to that kind of collabarative translation is that
> there is usually not one "proper and correct" way to translate
> a given phrase. There can be different word order, differences
> in shades of meaning and tone, etc. If you have different people
> translate different parts of a text, you can end up with great
> lack of a consistent style.
> 
> It's not impossible though. Take this example:
> http://www.gutenberg.org/etext/28971
> 
> It was worked on by a small group of people specifically
> for inclusion in PG. A couple people translated different
> chunks of the text from English, then after the whole thing
> was put together, we did a few proof-reading runs, tried to
> even out inconsistencies, ran a spell check, etc.
> 
> This was done through email and using a wiki.
> 
> --Andrew
> 
> On Mon, 22 Feb 2010 cmiske at ashzfall.com wrote:
> 
>> I've just recently started studying a language that is foreign to me.
>> Although I am a long way off from being able to translate anything
>> beyond an early reader, I was wondering if there is a translation
>> project out there that is similar to DP? I looked about the PG site, but
>> could not find a link to anything of that nature.
>> 
>> Thanks;
>> Carel
>> 
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d


From cmiske at ashzfall.com  Wed Feb 24 09:58:17 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 24 Feb 2010 10:58:17 -0700
Subject: [gutvol-d] Re:
 =?utf-8?q?!=40!!=40!!=40!Re=3A_Re=3A_so_what_is_so_important_?=
 =?utf-8?q?about_pagination=3F?=
Message-ID: <20100224105817.0dedd0f3f91314fbc67db20f64e304ca.5e400bb5c6.wbe@email05.secureserver.net>

I agree as well that the original digitalization should be as lossless
as possible. The various outputs of this original can then be as lossy
as anyone cares to make them. If the original is lossy, then the
information is lost 'forever.' Lossy originals reduce the choices of
readers and forces a convention upon them of what is and is not
important. The readers of the books should decide what they want/need.
 
Carel
 
-------- Original Message --------
Subject: [gutvol-d] Re: !@!!@!!@!Re: Re: so what is so important about
pagination?
From: "Keith J. Schultz" <schultzk at uni-trier.de>
Date: Wed, February 24, 2010 3:23 am
To: Project Gutenberg Volunteer Discussion <gutvol-d at lists.pglaf.org>

Hi All,

Been following this thread loosely.


But the fact remains that the problem is the encoding
of the digital text!!! NOT, the output. 

I have said it time and time again, the the encoding has
to encode the information of the original scan or p-text.
Thats is a rendering of such an encoding will produce
a fairly close copy of the original. (I use rendering
here as a way of producing the text; either in text or
display)


For the output you simply use those parts of the
encoding for your output that you need.

The page number problem can be handled because
out of convention they are placed in the header or 
footer of a page. So if you do not want the original
pagenumbers then just skip them. 
If you want the original page breaks use them and 
output it where you want it.


I agree with BB, but we have different ways about doing it.


From cmiske at ashzfall.com  Wed Feb 24 11:37:15 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 24 Feb 2010 12:37:15 -0700
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>

> some people are quite agitated at the thought of the idea.
> they drag out the old rot about how "d.p. means _quality_".
> -bowerbird
 
DP does have a standard of quality similar to many current publishers of
books. I frankly cannot recall the last time I read a book in any format
or from any publisher/producer of books that did not contain errors of
some sort or another. Humans are not perfect. Writing, editing, and
proofing are human endeavors and, as such, are limited by our
imperfections. The fact that they even strive for quality at DP is a
mark in their favor. There are many eBook producers of public domain
books that hold quantity so far above quality as to make a majority of
their books almost unreadable. I would worry that a pre-release version
of a text might result in PG getting a reputation for being yet another
supplier of illegible eBooks (even if these books are only available in
a 'special area'). As Michael asked only for 'serious objections'
against a pre-release section being created, I cannot think of any
objections that go outside of simply wondering if you are willing to
have these books circulated as they stand (people may grab them and
distribute them 'as-is'). Would they have a header that specifies that
they are 'not ready for prime time' as well as being in a special area?
(Many people strip the headers, so the source of origin issue often
becomes a moot point.)

The more I think about it, the more I feel that it would actually be
rather nice to see something along the lines of the 'roundless
experiment' produce decent quality 'first release' books and then have
people nitpick them to 'perfection' later. The point would be the format
of storage so that the rough version could be compared to the original
scans and corrections/updates be done with ease. Also, if a book is
backlogged at DP, it would make sense to have a method for someone
outside of DP to adopt the book and finish it.

Carel

-------- Original Message --------
Subject: [gutvol-d] the d.p. opinion on "prerelease" of e-texts
From: Bowerbird at aol.com
Date: Tue, February 23, 2010 5:22 pm
To: gutvol-d at lists.pglaf.org, bowerbird at aol.com

just in case you were wondering, dkretz _did_
copy michael hart's saturday post over to d.p.,
giving michael's position in favor of prerelease
of texts while they're still "in progress" at d.p.

that was on saturday evening, and since then
there have been over 100 posts in response:
>   http://www.pgdp.net/phpBB2/viewtopic.php?t=42940

there was an earlier discussion on the issue too:
>   http://www.pgdp.net/phpBB2/viewtopic.php?t=42695

the earlier discussion had over 100 posts as well,
and had a poll, which got a high number of votes.

the poll question?

>   Making preview texts available on a dedicated site, 
>   with a clear label that they're work in progress. 
>   Good idea/bad idea?

the poll results?

>   45% -- I'd rather you didn't
>   17% -- I couldn't care less
>   36% -- Great idea!

some people are quite agitated at the thought of the idea.
they drag out the old rot about how "d.p. means _quality_".

-bowerbird
_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d


From lee at novomail.net  Wed Feb 24 11:49:57 2010
From: lee at novomail.net (Lee Passey)
Date: Wed, 24 Feb 2010 12:49:57 -0700
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
Message-ID: <4B8582E5.5050103@novomail.net>

On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:

[snip]

> The more I think about it, the more I feel that it would actually be
> rather nice to see something along the lines of the 'roundless
> experiment' produce decent quality 'first release' books and then have
> people nitpick them to 'perfection' later. The point would be the format
> of storage so that the rough version could be compared to the original
> scans and corrections/updates be done with ease. Also, if a book is
> backlogged at DP, it would make sense to have a method for someone
> outside of DP to adopt the book and finish it.

I agree with you completely, and what you have described is what 
bowerbird has been agitating for for many years now. A book would go 
into 'first release' when the number of changes in the roundless system 
dropped below a certain rate. It would then go into general distribution 
in a system that allowed "continuous proofreading" in a wiki-like system

Good luck in convincing either DP or PG to adopt these kind of reforms.

From hart at pglaf.org  Wed Feb 24 13:21:27 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 24 Feb 2010 13:21:27 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <4B8582E5.5050103@novomail.net>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
Message-ID: <alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>


I can tell you that both Greg Newby, our CEO, and myself,
are most definitely FOR doing such prereleases!!!

We are prepared to start making directories for them and
getting them loaded in as soon as we have permission.

We would put up a number of disclaimers abou them to put
a protective system up for DP and various volunteers.

We would appreciate any opportuty to do this with any of
the three sections of books mentioned earlier.

Many thanks!!!


Michael


On Wed, 24 Feb 2010, Lee Passey wrote:

> On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
>
> [snip]
>
> > The more I think about it, the more I feel that it would actually be
> > rather nice to see something along the lines of the 'roundless
> > experiment' produce decent quality 'first release' books and then have
> > people nitpick them to 'perfection' later. The point would be the format
> > of storage so that the rough version could be compared to the original
> > scans and corrections/updates be done with ease. Also, if a book is
> > backlogged at DP, it would make sense to have a method for someone
> > outside of DP to adopt the book and finish it.
>
> I agree with you completely, and what you have described is what bowerbird has
> been agitating for for many years now. A book would go into 'first release'
> when the number of changes in the roundless system dropped below a certain
> rate. It would then go into general distribution in a system that allowed
> "continuous proofreading" in a wiki-like system
>
> Good luck in convincing either DP or PG to adopt these kind of reforms.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From ajhaines at shaw.ca  Wed Feb 24 14:00:31 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 24 Feb 2010 14:00:31 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
Message-ID: <B592E6447AAB45C891E108507D6811FA@alp2400>

Would it not be simpler for DP itself to have a Pre-releases page, similar 
to its Smooth-Read page?

I would think that if pre-releases are copied from DP into some PG 
environment, similar to Preprints, there would need to be some coordination 
to remove them from that environment when they're posted into PG as finished 
products.

Al


----- Original Message ----- 
From: "Michael S. Hart" <hart at pglaf.org>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
Sent: Wednesday, February 24, 2010 1:21 PM
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts


>
>
>
> I can tell you that both Greg Newby, our CEO, and myself,
> are most definitely FOR doing such prereleases!!!
>
> We are prepared to start making directories for them and
> getting them loaded in as soon as we have permission.
>
> We would put up a number of disclaimers abou them to put
> a protective system up for DP and various volunteers.
>
> We would appreciate any opportuty to do this with any of
> the three sections of books mentioned earlier.
>
> Many thanks!!!
>
>
> Michael
>
>
>
> On Wed, 24 Feb 2010, Lee Passey wrote:
>
>> On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
>>
>> [snip]
>>
>> > The more I think about it, the more I feel that it would actually be
>> > rather nice to see something along the lines of the 'roundless
>> > experiment' produce decent quality 'first release' books and then have
>> > people nitpick them to 'perfection' later. The point would be the 
>> > format
>> > of storage so that the rough version could be compared to the original
>> > scans and corrections/updates be done with ease. Also, if a book is
>> > backlogged at DP, it would make sense to have a method for someone
>> > outside of DP to adopt the book and finish it.
>>
>> I agree with you completely, and what you have described is what 
>> bowerbird has
>> been agitating for for many years now. A book would go into 'first 
>> release'
>> when the number of changes in the roundless system dropped below a 
>> certain
>> rate. It would then go into general distribution in a system that allowed
>> "continuous proofreading" in a wiki-like system
>>
>> Good luck in convincing either DP or PG to adopt these kind of reforms.
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From dakretz at gmail.com  Wed Feb 24 14:19:55 2010
From: dakretz at gmail.com (don kretz)
Date: Wed, 24 Feb 2010 14:19:55 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <B592E6447AAB45C891E108507D6811FA@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
Message-ID: <627d59b81002241419j519fe429q77e5a0ab32605400@mail.gmail.com>

I think you're talking about a major technical challenge for the DP
developers
to undertake something like that, even if there weren't as much highly
vocal opposition as there is. There hasn't even been a single response
from the developers about assisting with a transfer to the PG site. Or
from anyone else with significant authority there.

On Wed, Feb 24, 2010 at 2:00 PM, Al Haines (shaw) <ajhaines at shaw.ca> wrote:

> Would it not be simpler for DP itself to have a Pre-releases page, similar
> to its Smooth-Read page?
>
> I would think that if pre-releases are copied from DP into some PG
> environment, similar to Preprints, there would need to be some coordination
> to remove them from that environment when they're posted into PG as finished
> products.
>
> Al
>
>
> ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
> To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
> Sent: Wednesday, February 24, 2010 1:21 PM
> Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
>
>
>
>
>>
>>
>> I can tell you that both Greg Newby, our CEO, and myself,
>> are most definitely FOR doing such prereleases!!!
>>
>> We are prepared to start making directories for them and
>> getting them loaded in as soon as we have permission.
>>
>> We would put up a number of disclaimers abou them to put
>> a protective system up for DP and various volunteers.
>>
>> We would appreciate any opportuty to do this with any of
>> the three sections of books mentioned earlier.
>>
>> Many thanks!!!
>>
>>
>> Michael
>>
>>
>>
>> On Wed, 24 Feb 2010, Lee Passey wrote:
>>
>>  On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
>>>
>>> [snip]
>>>
>>> > The more I think about it, the more I feel that it would actually be
>>> > rather nice to see something along the lines of the 'roundless
>>> > experiment' produce decent quality 'first release' books and then have
>>> > people nitpick them to 'perfection' later. The point would be the >
>>> format
>>> > of storage so that the rough version could be compared to the original
>>> > scans and corrections/updates be done with ease. Also, if a book is
>>> > backlogged at DP, it would make sense to have a method for someone
>>> > outside of DP to adopt the book and finish it.
>>>
>>> I agree with you completely, and what you have described is what
>>> bowerbird has
>>> been agitating for for many years now. A book would go into 'first
>>> release'
>>> when the number of changes in the roundless system dropped below a
>>> certain
>>> rate. It would then go into general distribution in a system that allowed
>>> "continuous proofreading" in a wiki-like system
>>>
>>> Good luck in convincing either DP or PG to adopt these kind of reforms.
>>> _______________________________________________
>>> gutvol-d mailing list
>>> gutvol-d at lists.pglaf.org
>>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>>
>>>  _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
>>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/33d34e89/attachment.html>

From lee at novomail.net  Wed Feb 24 14:21:20 2010
From: lee at novomail.net (Lee Passey)
Date: Wed, 24 Feb 2010 15:21:20 -0700
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <B592E6447AAB45C891E108507D6811FA@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<4B8582E5.5050103@novomail.net>	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
Message-ID: <4B85A660.306@novomail.net>

On 2/24/2010 3:00 PM, Al Haines (shaw) wrote:

> Would it not be simpler for DP itself to have a Pre-releases page,
> similar to its Smooth-Read page?
>
> I would think that if pre-releases are copied from DP into some PG
> environment, similar to Preprints, there would need to be some
> coordination to remove them from that environment when they're posted
> into PG as finished products.

True, if that were the proposal. But it wasn't. The proposal was to let 
the files churn at DP until they were in a "mostly finished" state.

Then, let them sit at PG in a "mostly finished" state forever, because 
no one can say definitively when they /are/ finished; and to put them in 
some kind of a Wiki-like environment so "tweaks" can be made 
incrementally by the "unwashed masses".

Or, PG could encourage DP to release its work product (including page 
scans) to organizations other than PG (e.g. IA, Wikisource) where these 
kind of incremental changes /could/ be made and PG could harvest them 
from these other sources at regular intervals.

From Bowerbird at aol.com  Wed Feb 24 15:06:14 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 24 Feb 2010 18:06:14 EST
Subject: [gutvol-d] Re: so what is so important about pagination?
Message-ID: <417d1.1faab1b0.38b70ae6@aol.com>

keith said:
>    I agree with BB, but we have different ways about doing it.

well, to put it in a way that is just a bit more accurate...

i have actually developed a system for doing it, and coded tools
that prove both the viability and the efficiency of that system...

and you've written some posts where you waved your hands around...

so, yes, in a sense, we do have "different ways" of "doing it"...        
;+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/0768c5d9/attachment.html>

From schultzk at uni-trier.de  Wed Feb 24 15:28:10 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 25 Feb 2010 00:28:10 +0100
Subject: [gutvol-d] Re: Translation project
In-Reply-To: <20100224105134.0dedd0f3f91314fbc67db20f64e304ca.7a021e77f6.wbe@email05.secureserver.net>
References: <20100224105134.0dedd0f3f91314fbc67db20f64e304ca.7a021e77f6.wbe@email05.secureserver.net>
Message-ID: <AE8E4606-4CDD-41BC-BF22-03CBC94184AA@uni-trier.de>

Hi Carel,

	What you are forgetting that every language has its
	cultural idiosyncrasies and idioms. You can not take
	a pre-school text book and simply translate it.
	Furthermore, there are also cultural differences in 
	teaching. 

	Take your example: 
		In German: Schau Spot. Schau Spot l?uft.
	No German would say that to a child. Also, schau is actually
	look and not see. Another translation with see would be
	Sehe Spot. Sehe Spot laufen.
	This is not good German and under normal circumstances
	a German would never speak that way.
	More appropriate would be:
	Das ist Spot. Spot l?uft.
	In english we have "That is Spot. Spot is running"
	Of course I could try "Seht Spot. Seht wie Spot l?uft"
	In English:See Spot. See how Spot runs"
	As you can see completely different grammatical
	forms. It is not that simple.

	Furthermore, a German normallydoes call their dogs
	Spot. Or translate Dick and Jane. Well I could start
	with "Every dick and Jane" being Hinz und Kunze, but 
	Hinz und Kunze are last names. 

	It does not work that way. Sorry.

	regards
		Keith.
	
Am 24.02.2010 um 18:51 schrieb cmiske at ashzfall.com:

> 
> Keith and Andrew: Thank you so much for your replies.
> 
> I am in the process of learning Spanish (the 'textbook' version they
> tend to teach in the US school system). In my readings I have already
> encountered many moments of "I know what it is saying, but what does it
> mean?" I lack the cultural background and ability to 'play' with words
> that is necessary for high-level translation.
> 
> What I was actually thinking of was more along the lines of the
> translation of early readers (such as the Dick and Jane series) into a
> multitude of languages. As books of this nature are intended to have a
> rather limited vocabulary scope and to facilitate the learning of an
> individual's 'native' language, I thought these books might also be
> helpful if translated into multiple languages. When reading these books
> it would also be possible for the language learners to refer to the
> edition in their native language for guidance. I'm not sure that this
> sort of translation process would actually create something that PG
> would even want. 
> 
> Is there much interest beyond my own to see "See Spot. See Spot run...."
> translated into a dozen or more languages? I assumed that if there were
> much interest, there would already be such a project out there on the
> internet.
> 
> There are, of course, books of this nature on the market, but they are
> rather limited in availability and somewhat pricey. A free archive would
> be nice.
> 
> I agree that the majority of works of literature should be left to the
> experts and native speakers of both languages. Andrew's project was a
> fine example of a collaborative effort. I hope more people get together
> in small groups (or individually) to translate more works. Their efforts
> are appreciated.


From Bowerbird at aol.com  Wed Feb 24 16:12:32 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 24 Feb 2010 19:12:32 EST
Subject: [gutvol-d] Re: d.p.'s undeserved superior attitude about "quality"
Message-ID: <455e3.24b0a0aa.38b71a70@aol.com>

carel said:
>    The fact that they even strive for quality at DP is a mark in their 
favor.

not really.   everyone is gonna tell you they strive for quality...
(trust me, nobody's gonna say, "quality doesn't matter to us.")

besides...

it depends on how you go about trying to attain that goal.
and d.p. has a lot of problems -- a lot -- in that respect...

but it's not the notion that d.p. "strives" for quality that's irritating;
it is their blind stubbornness that they have already _attained_ it,
coupled with an unwillingness to measure the price they pay for it.

let's take the first item first -- the idea that d.p. quality is _great_.
this has been a staple of the propaganda served on the d.p. forums
for _years_ now.   equally troubling is their accompanying attitude
that solo producers create shitty e-texts.   if you want, i can point to
some listserve posts from years back where this led to some highly
embarrassing interactions for d.p., where one solo producer showed
just exactly how _crappy_ some d.p. e-texts were, and dared d.p. to
uncover any errors in the work that he had done.   they couldn't do it.

that led to a big shakeup at d.p., as d.p. got spooked about "quality",
and started piling on the rounds to make sure they got things right.

it also resulted in a slight reworking of the standard propaganda,
where "the early d.p. e-texts had flaws, but the later ones are great."
this is even better, because it's even more insidious towards rebuff,
and you see this superior attitude expressed regularly on the forums.

now, i do believe d.p. quality is better these days.   but now they are
paying a _very_ high price for it, in terms of the time and energy used.

when d.p. piled on the rounds, and made people "earn" a p3 badge,
they significantly increased the time and energy used, and they also
bought themselves backlogs that are threatening to strangle them...

which means that in spite of a constant influx of people (thanks to
that banner on the p.g. website that directs volunteers over to d.p.),
production hasn't increased as a result (it's been flat for years now),
and the burnout factor has become palpable.   not a good situation...

i don't mind if _you_, as a _volunteer_, decide that you want to spend
hours and days and weeks and months getting your e-text _perfect_.

but when you start wasting the time and energy of _other_people_
-- who are _volunteers_ who've donated time and energy to you --
by putting them into a workflow that is decidedly inefficient, i mind.

meanwhile, the attitude about "quality" at d.p. has gotten ridiculous.

i ran a poll there years back, on "how much quality would you like?"
>    http://www.pgdp.net/phpBB2/viewtopic.php?t=32479

the question:
>    for a 200-page average-difficulty book, 
>    how many errors should we 
>    tolerate as "acceptable" for
>    posting that book to project gutenberg?

the poll gave people options, which received the following votes:
>    none -- it must be perfect --?4% --?[2]
>    1 error in 200 pages is good enough -- 16% [7]
>    2-4 errors in 200 pages is good enough --?30%? [13]
>    5-10 errors in 200 pages is good enough --?23%? [10]
>    10-20 errors in 200 pages is good enough --?9%? [4]
>    20-40 errors in 200 pages is good enough --?2%? [1]
>    40-50 errors in 200 pages is good enough --?0%? [0]
>    50-100 errors in 200 pages is good enough --?0%? [0]
>    100-200 errors in 200 pages is good enough --?0%? [0]
>    you can't have your pudding until you eat your meat! --?13%? [6]
>    Total Votes : 43

as you can see, even though i gave a lot of options trying to
stretch them out, of the d.p. people who expressed an opinion,
the majority wanted no more than 4 errors in a 200-page book,
and roughly 9 out of 10 of them wanted no more than 10 errors,
and i have no doubt they think they are _meeting_that_standard_.

which is just plain ridiculous.   4 errors in a 200-page book is
a _very_ polished book -- most digitizations aren't that good.
(many aren't anywhere close to being that good.)

but d.p. people _think_ their work is that good.   they _believe_ it.
they _proclaim_ it, as if it were a proven fact.   but it's not a fact.

many years back, "zora" from d.p. was one of the loudest people
proclaiming the superiority of d.p. quality, so i analyzed a book
that she had postprocessed.   i found _hundreds_ of errors in it,
even _obvious_ stuff, like out-of-sequence footnote-numbers...

i haven't analyzed any d.p. books lately, but given the fact that
people there have an undeserved sense of their _superiority_
when it comes to the issue of quality, perhaps i should do that.

it might be fun to knock them off that high horse.

of course, they won't listen over there.   but we will know here...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/6c449e6b/attachment.html>

From Bowerbird at aol.com  Wed Feb 24 16:36:37 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 24 Feb 2010 19:36:37 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <46ac8.49d09f13.38b72015@aol.com>

let me now _contrast_ the undeserved superior attitude
that some d.p. members manifest about their product
with the _humble_ nature of the request gardner made,
where he _invited_ me to analyze one of his books and
report back to him how he could improve his workflow.

my general attitude here is to try to be of _assistance_.

but so much of the time, i am vilified, and the nature
of the interaction turns confrontational and viscous.

so it's nice when someone like gardner comes along,
and reminds us all that it doesn't have to be that way,
that we can talk pleasantly about how to do better...

***

you might recall that i found 159 differences between
the archive.org o.c.r. and gardner's proofed version...

>    http://z-m-l.com/go/gardn/gardn-159diffs.html

i've resolved those differences, and found that they
split almost exactly down the middle between cases
where the o.c.r. was wrong, and gardner was wrong.

this was an _intermediate_ version of gardner's file
-- the last one he had before he rewrapped lines,
which he generously gave to me to save me time --
and he likely made some corrections after that act,
plus the whitewashers might've fixed some errors.

so i'm going to now test my diff-resolved version
against the file that was actually posted up at p.g.

what i _can_ say now, after resolving the diffs, is
that gardner's o.c.r. program didn't do very well.
("very well" sounds so stilted; i prefer "very good".)
whether this is because the scans weren't good,
i cannot say...   gardner, what app/version is it?

in particular, there was a problem with he/be diffs,
and other h/b glitches, as well as with punctuation
(in general).   so i'd want to track down that aspect...

and gardner, next time you might want to hold off
on the rewrapping until later in your workflow, as
that'll make it easier to compare against the scans.

i should have a final report by tomorrow or friday.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/f0347834/attachment.html>

From cmiske at ashzfall.com  Wed Feb 24 16:51:53 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 24 Feb 2010 17:51:53 -0700
Subject: [gutvol-d] Re: Translation project
Message-ID: <20100224175153.0dedd0f3f91314fbc67db20f64e304ca.35a09c84d1.wbe@email05.secureserver.net>

 
>-------- Original Message --------
>Subject: [gutvol-d] Re: Translation project
>From: "Keith J. Schultz" <schultzk at uni-trier.de>
>Date: Wed, February 24, 2010 4:28 pm
>To: Project Gutenberg Volunteer Discussion <gutvol-d at lists.pglaf.org>

>Hi Carel,

>What you are forgetting that every language has its
>cultural idiosyncrasies and idioms. You can not take
>a pre-school text book and simply translate it.
>Furthermore, there are also cultural differences in 
>teaching. 

I deal with people of a variety of cultures and languages on a daily
basis. Problems and frequent episodes of humor occur as a result of
these differences. I also realize the existance of idiosyncrasies and
idioms or I would have thought that google translate was the answer
rather than human translations.

I have instructors from many different cultures and they openly discuss
and often use the teaching methods of their culture of origin. So, that
is something I would have a hard time forgetting as I have several
decades of personal experience with it (I'm wht is known as a
'life-student' or 'professional student' --depending on how rude you
feel like being).

[snip]

>Furthermore, a German normallydoes call their dogs
>Spot. Or translate Dick and Jane. Well I could start
>with "Every dick and Jane" being Hinz und Kunze, but 
>Hinz und Kunze are last names. 

Exactly. Again, the reason google translate is not adequate. As books
intended to help a native speaker of one language gain some familiarity
with the sentence structure of another (by reading books they are
familiar with from their own language), I fail to see how most of this
applies in any real depth. The individuals translating the books would
have to decide how best to deal with the issues of how "See Spot" should
best be phrased in another language: I believe that is the role and
purpose of a translator is it not? And, I've rarely seen names changed
in translation as that would remove a large portion of the cultural of
origin.

The purpose and nature of what I am looking for is why I doubted that PG
would be interested in the end result: It would not be an
'authoritative' translation, but rather a learning aid.

>It does not work that way. Sorry.

I'll be sure to let the publishers of books that do exactly what I am
describing know that they are doing the impossible. Since my question
was actually whether or not such a translation project exists, I will
take your response as the simple 'no' that all of this can be translated
to mean.

I'll search around a bit more for an existing project and then start one
of my own. If it serves no other purpose than to help me (and some of my
fellow students) to get a firmer grasp on a few languages and to meet a
few interesting people who are willing to help out, then the project
will be a success.

>regards
>Keith.

Yes, thank you....
Carel


From gbuchana at teksavvy.com  Wed Feb 24 17:13:39 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Wed, 24 Feb 2010 20:13:39 -0500
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
 just for the exercise
In-Reply-To: <46ac8.49d09f13.38b72015@aol.com>
References: <46ac8.49d09f13.38b72015@aol.com>
Message-ID: <4B85CEC3.3070206@teksavvy.com>

On 24-Feb-2010 19:36, Bowerbird at aol.com wrote:

> i cannot say... gardner, what app/version is it?

Abbyy 5.0.  Also keep in mind that the Internet Archive
used a different scan set from the ECO one I used.  The
ECO one is via microform and IIs is -- I think -- straight
from paper.

I'd already tackled the first proof when it discovered
the Internet Archive version.  I used IIs for the images as
the ECO ones from the microform are really awful.

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From cmiske at ashzfall.com  Wed Feb 24 17:47:25 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Wed, 24 Feb 2010 18:47:25 -0700
Subject: [gutvol-d] Re: d.p.'s undeserved superior attitude about "quality"
Message-ID: <20100224184725.0dedd0f3f91314fbc67db20f64e304ca.5965319e6a.wbe@email05.secureserver.net>

>-------- Original Message --------
>Subject: [gutvol-d] Re: d.p.'s undeserved superior attitude about
>"quality"
>From: Bowerbird at aol.com
>Date: Wed, February 24, 2010 5:12 pm
>To: gutvol-d at lists.pglaf.org, bowerbird at aol.com

>carel said:
>   The fact that they even strive for quality at DP is a mark in their favor.

>not really.  everyone is gonna tell you they strive for quality...
>(trust me, nobody's gonna say, "quality doesn't matter to us.")

Actually, I think google more or less did say this when they initially
scanned the books (with fingers visible and all), but they now correct
bad pages, so they do want quality. :)

[snip]
>but it's not the notion that d.p. "strives" for quality that's irritating;
>it is their blind stubbornness that they have already _attained_ it,
>coupled with an unwillingness to measure the price they pay for it.

Yes, that is an entirely different matter. To believe one has attained
perfection is quite different than the pursuit of it....

[snip]

>but when you start wasting the time and energy of _other_people_
>-- who are _volunteers_ who've donated time and energy to you --
>by putting them into a workflow that is decidedly inefficient, i mind.

You don't need to talk to me about having your time wasted. I am the
somewhat infamous Carel that left DP nearly a decade ago (after spending
around 500 hours creating a front-end and revamping the horrific code
that resided on the back-end). I also tried to get them to add markup
and additional computer generated pre- and post-processing and formating
at the time, but they resisted. It was nice to see them add HTML a few
years later.

When I left DP, I was going to create my own project that had a more
logical flow and used more computer processing (so as to reduce the
workload for the volunteers), but the point seemed moot due to the sheer
power that DP already had in the PG community. I assumed it would be
quite difficult to recruit volunteers. Every once in a while, I toss the
idea around again, but it always comes back to the point that the
majority of volunteers willing to proof in such a fashion are already
cemented in the DP process. (Now that I think about it, I believe I
still have a copy of the test site for my own proofing community on some
CD some place. It might be worth a chuckle to see how outdated my
efforts now look.)

It is nice that DP strives for quality in the work that they do. I am
not however surprised that they ultimately became backlogged. DP has
managed to produce a huge number of etexts and for that their efforts
should be appreciated. Their etexts also have a decent level of quality.
If the volunteers at DP are blissfully unaware that their time is being
wasted then they do not feel like victims of the process, but rather
champions of the cause. So, efficient or not, no harm is being done and
books are being produced. If the inefficiency bothers you and me (and
others), it's a moot point because it is not our time that is being
wasted. And, we have no power to change what is at DP. At least that is
the way I look at it.

The only thing that does bother me is having all those books 'stuck' in
the system rather than being available to the public. But the
pre-release idea takes care of that.

And, thank you for explaining the process that led DP to have so many
rounds and member levels. That is quite the workflow, but I now
understand how it came to exist.

Carel


From hart at pglaf.org  Wed Feb 24 17:50:46 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 24 Feb 2010 17:50:46 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <B592E6447AAB45C891E108507D6811FA@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
Message-ID: <alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>


Greg Newby takes care of deleting from PrePrints, should be easy enough
just to send him notes of what has been completed.


mh


On Wed, 24 Feb 2010, Al Haines (shaw) wrote:

> Would it not be simpler for DP itself to have a Pre-releases page, similar to
> its Smooth-Read page?
>
> I would think that if pre-releases are copied from DP into some PG
> environment, similar to Preprints, there would need to be some coordination to
> remove them from that environment when they're posted into PG as finished
> products.
>
> Al
>
>
> ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
> To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
> Sent: Wednesday, February 24, 2010 1:21 PM
> Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
>
>
> >
> >
> >
> > I can tell you that both Greg Newby, our CEO, and myself,
> > are most definitely FOR doing such prereleases!!!
> >
> > We are prepared to start making directories for them and
> > getting them loaded in as soon as we have permission.
> >
> > We would put up a number of disclaimers abou them to put
> > a protective system up for DP and various volunteers.
> >
> > We would appreciate any opportuty to do this with any of
> > the three sections of books mentioned earlier.
> >
> > Many thanks!!!
> >
> >
> > Michael
> >
> >
> >
> > On Wed, 24 Feb 2010, Lee Passey wrote:
> >
> > > On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
> > >
> > > [snip]
> > >
> > > > The more I think about it, the more I feel that it would actually be
> > > > rather nice to see something along the lines of the 'roundless
> > > > experiment' produce decent quality 'first release' books and then have
> > > > people nitpick them to 'perfection' later. The point would be the >
> > > format
> > > > of storage so that the rough version could be compared to the original
> > > > scans and corrections/updates be done with ease. Also, if a book is
> > > > backlogged at DP, it would make sense to have a method for someone
> > > > outside of DP to adopt the book and finish it.
> > >
> > > I agree with you completely, and what you have described is what bowerbird
> > > has
> > > been agitating for for many years now. A book would go into 'first
> > > release'
> > > when the number of changes in the roundless system dropped below a certain
> > > rate. It would then go into general distribution in a system that allowed
> > > "continuous proofreading" in a wiki-like system
> > >
> > > Good luck in convincing either DP or PG to adopt these kind of reforms.
> > > _______________________________________________
> > > gutvol-d mailing list
> > > gutvol-d at lists.pglaf.org
> > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> > >
> > _______________________________________________
> > gutvol-d mailing list
> > gutvol-d at lists.pglaf.org
> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> >
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From prosfilaes at gmail.com  Wed Feb 24 17:53:04 2010
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 24 Feb 2010 20:53:04 -0500
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <627d59b81002240918t6fc38b26lb63460419c3f8a3e@mail.gmail.com>
References: <19d99.58cae61c.38b5c60c@aol.com>
	<627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
	<9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>
	<627d59b81002240918t6fc38b26lb63460419c3f8a3e@mail.gmail.com>
Message-ID: <6d99d1fd1002241753h23f4cad5i69fedbd5672fef2@mail.gmail.com>

 On Wed, Feb 24, 2010 at 2:31 AM, Keith J. Schultz <schultzk at uni-trier.de>
wrote:
> ? ? ? ?Also, if you have ever been in a literature class you would
> ? ? ? ?know that in poems and drama one uses linenumbers as
> ? ? ? ?references. So it is not that geeky.

Yeah, because only computer geeks are geeks. Old English geeks, well,
they aren't the REAL geeks.

-- 
Kie ekzistas vivo, ekzistas espero.

From Bowerbird at aol.com  Wed Feb 24 18:23:02 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 24 Feb 2010 21:23:02 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <4c3a5.61ca50ca.38b73906@aol.com>

gardner said:
>   Abbyy 5.0.

yeah, i could tell it was an old one. (he/be confusion was the sign.)

i think most people say abby v7 is where recognition hit a plateau.
(versions later than that have added features, yet no better o.c.r.,
while all of the versions earlier than v7 are decidedly inferior.)

if you can find a used copy of v7, it would help you immensely...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/d74c8f40/attachment.html>

From joshua at hutchinson.net  Wed Feb 24 18:24:02 2010
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Thu, 25 Feb 2010 02:24:02 +0000 (GMT)
Subject: [gutvol-d] Re: so what is so important about pagination?
References: <19d99.58cae61c.38b5c60c@aol.com>
	<627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
	<9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>
	<627d59b81002240918t6fc38b26lb63460419c3f8a3e@mail.gmail.com>
	<6d99d1fd1002241753h23f4cad5i69fedbd5672fef2@mail.gmail.com>
Message-ID: <919654776.89498.1267064642717.JavaMail.mail@webmail12>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/9575b90e/attachment.html>

From Bowerbird at aol.com  Wed Feb 24 18:51:11 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 24 Feb 2010 21:51:11 EST
Subject: [gutvol-d] Re: d.p.'s undeserved superior attitude about "quality"
Message-ID: <4da14.62f08611.38b73f9f@aol.com>

carel said:
>   You don't need to talk to me about having your time wasted. 
>    I am the somewhat infamous Carel that left DP nearly a decade ago 
>    (after spending around 500 hours creating a front-end and 
>    revamping the horrific code that resided on the back-end).

oh, hey, glad to meet you.           :+)

i do believe i am the one who has grabbed the "infamous" crown
at this point in time, however.   i see they still let you post there.
myself, i've been banished, so as not to corrupt the youth there.


>   It is nice that DP strives for quality in the work that they do. 

yes, it is.         :+)

and it's nice that vancouver is finally getting some snow...       ;+)


>    I am not however surprised that 
>    they ultimately became backlogged. 

well, yeah, quite a few people saw that coming...

when you have 3 rounds of proofing, and you want all of your
material to go through all 3 rounds, and you have thousands
of proofers in p1, and hundreds in p2, and only dozens in p3,
a backlog is pretty inevitable...

especially if you expect your p3 people to take _more_ time
on each page than your p1 proofers.   it just will not add up...


>    DP has managed to produce a huge number of etexts

and they could have produced 3 times as many if their workflow
were 3 times more efficient, something it could _easily_ attain...

according to brewster, internet archive scans 1,000 books a day.
(google probably scans twice that many before their lunch break.)

d.p. digitizes 2400 books a _year_.   i predict they cannot keep up.


>    and for that their efforts should be appreciated. 

i appreciate the proofers who volunteer their time _immensely_...
and the people who format, and postprocess.   they're champions.

but the "powers that be", who _refuse_ to change a bad workflow,
and clamp down on dissent that reveals why that workflow is bad?
i have no respect for them at all.   not even one shred of respect...


>    Their etexts also have a decent level of quality.

i'm with michael hart on this.   (even if he refuses to see that.)

i think the first mission is to get text up with acceptable quality.
(which, by the way, most of the in-process stuff at d.p. is _not_)

and _then_ we can use the _public_ to drive it toward perfection.

(and no, i don't get the impression that that's what the current
"pre-release" plans will entail; seems to me it's "post and forget",
which is the attitude that p.g. takes towards _all_ of its e-texts.)


>    If the volunteers at DP are blissfully unaware that their time 
>    is being wasted then they do not feel like victims of the process

but they _are_ victims, whether they _feel_like_ victims or not...

and because they are being used so inefficiently, they are _not_
receiving anything near the satisfaction they _could_ be getting.

and, like i said, despite a constant influx of volunteers sent there
by p.g., the steady d.p. base seems to be a fairly constant number,
which means they are experiencing severe "churn" and "burnout",
neither of which bodes well for the future.   the well _will_ go dry.
(especially since more and more people will come to see google as
"the source" of the books that they try to find online, since google
offers people several million more titles than project gutenberg.)


>    If the volunteers at DP are blissfully unaware that their time 
>    is being wasted then they do not feel like victims of the process,
>    but rather champions of the cause.

well, they _are_ champions.   too bad they're not treated as such.


>    So, efficient or not, no harm is being done

well, as i just argued, harm _is_ being done.   it most certainly is.


>    and books are being produced. 

some.   not nearly enough.   and not nearly efficiently enough.


>    If the inefficiency bothers you and me (and others), 
>    it's a moot point because it is not our time that is being wasted. 

but the potential of our society _is_ being spent inefficiently, and 
the time of _good_people_ is being wasted, for no good reason.

so i think the point is _not_ moot.   i believe i should _speak_up_.


>    And, we have no power to change what is at DP. 
>    At least that is the way I look at it.

i'm not looking for "power".   i'm looking to enlighten people.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100224/0f3c9456/attachment.html>

From ajhaines at shaw.ca  Wed Feb 24 19:09:03 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 24 Feb 2010 19:09:03 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
Message-ID: <A871E93708224C1FAB0E62A643357D64@alp2400>

Hmm... I figured Greg was busy enough! <g>

As for the "major technical challenge" suggested elsewhere in this topic, 
why can't DP put up a wiki page for pre-releases, similar to its Harvesting 
wiki page?  Project Managers (or whoever) could put links on the page to 
their pre-release candidates, and when a pre-release was ready for 
submission to PG, or had been posted, the PM could remove the link.

Al


----- Original Message ----- 
From: "Michael S. Hart" <hart at pglaf.org>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
Sent: Wednesday, February 24, 2010 5:50 PM
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts


>
> Greg Newby takes care of deleting from PrePrints, should be easy enough
> just to send him notes of what has been completed.
>
>
> mh
>
>
>
> On Wed, 24 Feb 2010, Al Haines (shaw) wrote:
>
>> Would it not be simpler for DP itself to have a Pre-releases page, 
>> similar to
>> its Smooth-Read page?
>>
>> I would think that if pre-releases are copied from DP into some PG
>> environment, similar to Preprints, there would need to be some 
>> coordination to
>> remove them from that environment when they're posted into PG as finished
>> products.
>>
>> Al
>>
>>
>> ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
>> To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
>> Sent: Wednesday, February 24, 2010 1:21 PM
>> Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
>>
>>
>> >
>> >
>> >
>> > I can tell you that both Greg Newby, our CEO, and myself,
>> > are most definitely FOR doing such prereleases!!!
>> >
>> > We are prepared to start making directories for them and
>> > getting them loaded in as soon as we have permission.
>> >
>> > We would put up a number of disclaimers abou them to put
>> > a protective system up for DP and various volunteers.
>> >
>> > We would appreciate any opportuty to do this with any of
>> > the three sections of books mentioned earlier.
>> >
>> > Many thanks!!!
>> >
>> >
>> > Michael
>> >
>> >
>> >
>> > On Wed, 24 Feb 2010, Lee Passey wrote:
>> >
>> > > On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
>> > >
>> > > [snip]
>> > >
>> > > > The more I think about it, the more I feel that it would actually 
>> > > > be
>> > > > rather nice to see something along the lines of the 'roundless
>> > > > experiment' produce decent quality 'first release' books and then 
>> > > > have
>> > > > people nitpick them to 'perfection' later. The point would be the >
>> > > format
>> > > > of storage so that the rough version could be compared to the 
>> > > > original
>> > > > scans and corrections/updates be done with ease. Also, if a book is
>> > > > backlogged at DP, it would make sense to have a method for someone
>> > > > outside of DP to adopt the book and finish it.
>> > >
>> > > I agree with you completely, and what you have described is what 
>> > > bowerbird
>> > > has
>> > > been agitating for for many years now. A book would go into 'first
>> > > release'
>> > > when the number of changes in the roundless system dropped below a 
>> > > certain
>> > > rate. It would then go into general distribution in a system that 
>> > > allowed
>> > > "continuous proofreading" in a wiki-like system
>> > >
>> > > Good luck in convincing either DP or PG to adopt these kind of 
>> > > reforms.
>> > > _______________________________________________
>> > > gutvol-d mailing list
>> > > gutvol-d at lists.pglaf.org
>> > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>> > >
>> > _______________________________________________
>> > gutvol-d mailing list
>> > gutvol-d at lists.pglaf.org
>> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>> >
>>
>>
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From hart at pglaf.org  Wed Feb 24 20:10:47 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Wed, 24 Feb 2010 20:10:47 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <A871E93708224C1FAB0E62A643357D64@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
Message-ID: <alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>


On Wed, 24 Feb 2010, Al Haines (shaw) wrote:

> Hmm... I figured Greg was busy enough! <g>

I talked to Greg about it, he's fine with it.
Might be best to let him do it once a week or something, tho.


> As for the "major technical challenge" suggested elsewhere in this topic, why
> can't DP put up a wiki page for pre-releases, similar to its Harvesting wiki
> page?  Project Managers (or whoever) could put links on the page to their
> pre-release candidates, and when a pre-release was ready for submission to PG,
> or had been posted, the PM could remove the link.

I'm up for any way we can get more eBooks to more people, sooner, than later.

I'm willing to try them both, see how our readers respond.

After all, it's not trivial to find the PrePrints page, though Google does
seem to find it ok, which should be ok for most of our readers.

Let's try. . . .


Michael

>
> Al
>
>
> ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
> To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
> Sent: Wednesday, February 24, 2010 5:50 PM
> Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
>
>
> >
> > Greg Newby takes care of deleting from PrePrints, should be easy enough
> > just to send him notes of what has been completed.
> >
> >
> > mh
> >
> >
> >
> > On Wed, 24 Feb 2010, Al Haines (shaw) wrote:
> >
> > > Would it not be simpler for DP itself to have a Pre-releases page, similar
> > > to
> > > its Smooth-Read page?
> > >
> > > I would think that if pre-releases are copied from DP into some PG
> > > environment, similar to Preprints, there would need to be some
> > > coordination to
> > > remove them from that environment when they're posted into PG as finished
> > > products.
> > >
> > > Al
> > >
> > >
> > > ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
> > > To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
> > > Sent: Wednesday, February 24, 2010 1:21 PM
> > > Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
> > >
> > >
> > > >
> > > >
> > > >
> > > > I can tell you that both Greg Newby, our CEO, and myself,
> > > > are most definitely FOR doing such prereleases!!!
> > > >
> > > > We are prepared to start making directories for them and
> > > > getting them loaded in as soon as we have permission.
> > > >
> > > > We would put up a number of disclaimers abou them to put
> > > > a protective system up for DP and various volunteers.
> > > >
> > > > We would appreciate any opportuty to do this with any of
> > > > the three sections of books mentioned earlier.
> > > >
> > > > Many thanks!!!
> > > >
> > > >
> > > > Michael
> > > >
> > > >
> > > >
> > > > On Wed, 24 Feb 2010, Lee Passey wrote:
> > > >
> > > > > On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
> > > > >
> > > > > [snip]
> > > > >
> > > > > > The more I think about it, the more I feel that it would actually >
> > > > > be
> > > > > > rather nice to see something along the lines of the 'roundless
> > > > > > experiment' produce decent quality 'first release' books and then >
> > > > > have
> > > > > > people nitpick them to 'perfection' later. The point would be the >
> > > > > format
> > > > > > of storage so that the rough version could be compared to the > > >
> > > original
> > > > > > scans and corrections/updates be done with ease. Also, if a book is
> > > > > > backlogged at DP, it would make sense to have a method for someone
> > > > > > outside of DP to adopt the book and finish it.
> > > > >
> > > > > I agree with you completely, and what you have described is what > >
> > > bowerbird
> > > > > has
> > > > > been agitating for for many years now. A book would go into 'first
> > > > > release'
> > > > > when the number of changes in the roundless system dropped below a > >
> > > certain
> > > > > rate. It would then go into general distribution in a system that > >
> > > allowed
> > > > > "continuous proofreading" in a wiki-like system
> > > > >
> > > > > Good luck in convincing either DP or PG to adopt these kind of > >
> > > reforms.
> > > > > _______________________________________________
> > > > > gutvol-d mailing list
> > > > > gutvol-d at lists.pglaf.org
> > > > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> > > > >
> > > > _______________________________________________
> > > > gutvol-d mailing list
> > > > gutvol-d at lists.pglaf.org
> > > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> > > >
> > >
> > >
> > > _______________________________________________
> > > gutvol-d mailing list
> > > gutvol-d at lists.pglaf.org
> > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> > >
> > _______________________________________________
> > gutvol-d mailing list
> > gutvol-d at lists.pglaf.org
> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
> >
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From cmiske at ashzfall.com  Wed Feb 24 23:07:37 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Thu, 25 Feb 2010 00:07:37 -0700
Subject: [gutvol-d] Re: d.p.'s undeserved superior attitude about "quality"
Message-ID: <20100225000737.0dedd0f3f91314fbc67db20f64e304ca.f88872c91f.wbe@email05.secureserver.net>


Bowerbird:

>carel said:
>> You don't need to talk to me about having your time wasted. 
>> I am the somewhat infamous Carel that left DP nearly a decade ago 
>> (after spending around 500 hours creating a front-end and 
>> revamping the horrific code that resided on the back-end).
>
>oh, hey, glad to meet you. :+)
>
>i do believe i am the one who has grabbed the "infamous" crown
>at this point in time, however. i see they still let you post there.
>myself, i've been banished, so as not to corrupt the youth there.

LOL. It's nice to meet you as well. I've read your posts for many years,
but I don't believe we have ever directly 'spoken.'

>> DP has managed to produce a huge number of etexts

>and they could have produced 3 times as many if their workflow
>were 3 times more efficient, something it could _easily_ attain...
>according to brewster, internet archive scans 1,000 books a day.
>(google probably scans twice that many before their lunch break.)
>d.p. digitizes 2400 books a _year_. i predict they cannot keep up.

I'm not sure the most efficient system on earth could keep up with the
scan rate. No matter how much you automate pre- and post-processing and
formatting, it comes down to the speed of humans when it comes to
proofing. And, if you speed things up by skipping the manual proofing
and use automation completely for the first release then you will have
text little to no better than what google already has.

I realize that is not what you are proposing, but what google has done
is the most efficient with little regard to the potential for the
quality added by human interaction. Wikimedia then takes this further by
adding the ability for humans to modify the text in a never-ending
cycle.

>> Their etexts also have a decent level of quality.
>
>i'm with michael hart on this. (even if he refuses to see that.)
>
>i think the first mission is to get text up with acceptable quality.
>(which, by the way, most of the in-process stuff at d.p. is _not_)

>and _then_ we can use the _public_ to drive it toward perfection.

I'm am very much with you on this one. 

However, it rather duplicates the efforts of the wikimedia foundation
(at least the last time I looked into their efforts), so I am not sure
of the 'need' for it. In what manner do you suggest that this would be
differentiated from the current wiki book editing process at wikimedia
that justifies the creation of the project for the purpose of PG? 

I ask because I am considering giving it a try if I can find a single
reason to do so that would not make me feel as though I am reinventing
the wheel. I have not been able to find an acceptable reason on my own.
I love PG and promote it to all that will listen, but love alone is not
enough: There must be logic as well. :)

For that matter, it wouldn't be that hard to cull the data from google
books, cut their 'plain text' into pages (and/or re-ocr and dif) and
just slap the result into a wiki (which is about what it looks like
wikimedia has done). I believe the google use license allows for this
for non-commercial entities such as PG? 

Then, someone can verify the google 'plain text' images as inline images
or as content that could not be processed by the OCR (type in missing
text and annotate images). Tidy it up (scannos, etc.), add some assumed
markup for lines, paragraphs, pages, etc. using scripts with no human
interaction. Have a human assisted 'spell check' fix the majority of
remaining errors and end of line hyphenation issues, etc. and add some
basic markup for italics and whatnot (although the google OCR text
appears to already contain this). Run the text through an 'X number of
people say this page matches the scan' proofing process (roudless
process). Have a human assisted markup process to wrap the content in
properly formed (but easy) XML. Convert the document to plain text (and
any other formats that are desired). On a 300 page work of uncomplicated
fiction, depending on the scan quality and speed of the humans doing the
interactive processes, I would estimate this would take about 2-6 hours
_plus_ the proofing time to create a document ready for first release
and wiki storage. Then, you let it sit in a wiki for eternity being
edited into perfection (or until a very large X number of people again
say a page is perfect, create a final edition when all pages in a book
pass this test, post the output, archive the XML version and scans and
consider the book done for all time - or at least until a PG reader
reports an error that must be checked).

For user supplied scans, same process with OCR done on the server (if
desired, can diff against the scan submitter's OCR, if supplied). I
haven't thought about the process in a while and the new press for speed
cuts out some of the way I recall my original ideas. I'm probably
completely forgetting a topic or two, but this would pretty much work
(after a few logic tweeks) to accelerate the release process and should
retain some quality factors in the first release.

As you know, ultimately, in regards to quality, you will always have to
rely on the quality judgements of individuals working on the page or
project to determine the quality of the final output. Quality is a human
concept and so there is no way to automate it. I wish there were. :) 

I am very interested in your text processing scripts though and always
have been. I just haven't been processing any OCR text for a while.

>> If the volunteers at DP are blissfully unaware that their time 
>> is being wasted then they do not feel like victims of the process
>
>but they _are_ victims, whether they _feel_like_ victims or not...

They are not victims in the sense that an abused child is an abused
child whether the child realizes it or not. We are talking about
proofreading and wasting manual hours doing tasks that could have been
done by (or assisted by) automation. I may lose sleep over the former,
but I'll never lose sleep over the latter. Not that I do not regret the
wasting of volunteer time, I just don't have it at the top of my
'causes' list.

>and, like i said, despite a constant influx of volunteers sent there
>by p.g., the steady d.p. base seems to be a fairly constant number,
>which means they are experiencing severe "churn" and "burnout",
>neither of which bodes well for the future. the well _will_ go dry.
>(especially since more and more people will come to see google as
>"the source" of the books that they try to find online, since google
>offers people several million more titles than project gutenberg.)

Loss of volunteers is not good, obviously. Just as with any venture, it
is twice as hard to win someone back as it was to gain them in the first
place. The last I checked, the wiki book project also appears to have
very few active members, so it is also possible that book proofing has a
cap on the number of people willing to do it with any regularity that
has nothing to do with the format of presentation or the process.

As I said before, I don't believe it is possible to catch up to google.
If DP produces 2400 books per year, then even if the process I outlined
above (or your own or anyone else's) increased the production by 5x or
10x or even 25x (for first releases), we would still be far behind the
scanners. (Although I do believe your scan figures for daily output by
IA and google include a lot of material that is not public domain.) I
would very much love for someone to prove me wrong on this point and
have a million texts in PG in no time flat. 

>> If the inefficiency bothers you and me (and others), 
>> it's a moot point because it is not our time that is being wasted. 
>
>but the potential of our society _is_ being spent inefficiently, and 
>the time of _good_people_ is being wasted, for no good reason.
>
>so i think the point is _not_ moot. i believe i should _speak_up_.

Yes, you should do as you please. We all should. You do fight the good
fight as they say. As much as I may agree with you most of the time, it
just isn't in me to voice out about it. After all, you took my crown. ;)

>> And, we have no power to change what is at DP. 
>> At least that is the way I look at it.
>
>i'm not looking for "power". i'm looking to enlighten people.

'Power' is often the easier goal to attain. ;) I merely meant that we
are not in a position to have any say in what DP does with itself. Nor
in what PG does with itself.

Carel


From schultzk at uni-trier.de  Thu Feb 25 03:14:17 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 25 Feb 2010 12:14:17 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <6d99d1fd1002241753h23f4cad5i69fedbd5672fef2@mail.gmail.com>
References: <19d99.58cae61c.38b5c60c@aol.com>
	<627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
	<9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>
	<627d59b81002240918t6fc38b26lb63460419c3f8a3e@mail.gmail.com>
	<6d99d1fd1002241753h23f4cad5i69fedbd5672fef2@mail.gmail.com>
Message-ID: <6FFBAA40-DB26-4452-A445-26B709649717@uni-trier.de>


Am 25.02.2010 um 02:53 schrieb David Starner:

> On Wed, Feb 24, 2010 at 2:31 AM, Keith J. Schultz <schultzk at uni-trier.de>
> wrote:
>>        Also, if you have ever been in a literature class you would
>>        know that in poems and drama one uses linenumbers as
>>        references. So it is not that geeky.
> 
> Yeah, because only computer geeks are geeks. Old English geeks, well,
> they aren't the REAL geeks.
	Old English!!! I have not seen a high school text books for english lately,
	but I am sure they have line numbers in them for longer poems. But,
	maybe all students are probably geeks. 
	Just to be geeky. Shakespeare is not old english, not even middle
	english, actually "Modern English". Of course the the question is 
	what language are we writing in?!!!
	Before I forget in the original folios there are NO line numbers.
	I wonder why we use them. Must be some geeky thing I do not understand.
	Maybe, you can enlighten me! ;-))

	regards
		Keith.


From schultzk at uni-trier.de  Thu Feb 25 03:17:02 2010
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 25 Feb 2010 12:17:02 +0100
Subject: [gutvol-d] Re: so what is so important about pagination?
In-Reply-To: <919654776.89498.1267064642717.JavaMail.mail@webmail12>
References: <19d99.58cae61c.38b5c60c@aol.com>
	<627d59b81002231604u6079e880x9ee1625963baba8d@mail.gmail.com>
	<9B13A81D-3C69-48E0-9C18-E17C9CC20883@uni-trier.de>
	<627d59b81002240918t6fc38b26lb63460419c3f8a3e@mail.gmail.com>
	<6d99d1fd1002241753h23f4cad5i69fedbd5672fef2@mail.gmail.com>
	<919654776.89498.1267064642717.JavaMail.mail@webmail12>
Message-ID: <C5A8D8D7-FB47-4BB9-AE0B-1DB33AD5BAEB@uni-trier.de>

Actually, I go to the cemetery to meet them. I think
Old English People are dead!!

;-))

Keith.

Am 25.02.2010 um 03:24 schrieb Joshua Hutchinson:

> Of course they're not.  Old English people are NERDS!
> 
> ;)
> 
> Josh


From greg at durendal.org  Thu Feb 25 05:07:52 2010
From: greg at durendal.org (Greg Weeks)
Date: Thu, 25 Feb 2010 08:07:52 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <A871E93708224C1FAB0E62A643357D64@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
Message-ID: <alpine.DEB.2.00.1002250806010.10627@durendal.durendal.org>

On Wed, 24 Feb 2010, Al Haines (shaw) wrote:

> Hmm... I figured Greg was busy enough! <g>
>
> As for the "major technical challenge" suggested elsewhere in this topic, why 
> can't DP put up a wiki page for pre-releases, similar to its Harvesting wiki 
> page?  Project Managers (or whoever) could put links on the page to their 
> pre-release candidates, and when a pre-release was ready for submission to 
> PG, or had been posted, the PM could remove the link.

I'm not going to hand edit a wiki for every project. I suspect no one else 
will either and the wiki would end up totally out of date. The PG 
bookshelf wiki pages are basically in that state now.


-- 
Greg Weeks
http://durendal.org:8080/greg/


From ajhaines at shaw.ca  Thu Feb 25 10:17:19 2010
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Thu, 25 Feb 2010 10:17:19 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
Message-ID: <AA902F07008A4680BD1E6D65699E82CD@alp2400>

Some of these questions may overlap a bit--bear with me...


Who's going to monitor this pre-prelease page for when projects on it are 
posted?  The WWers?  DP? Greg?

Will it contain only pre-release text files, or all working files associated 
with a given project (page scans, illustrations, text, HTML, etc, etc)?

If the latter, what's to stop someone from taking those files, getting their 
own clearance, and submitting them to PG as their own work?  Or is DP going 
to consider that they're, in effect, abandoned projects, and up for grabs? 
(I can only imagine the reaction that would cause.)

Related to a couple of the above questions, would the WWers be expected to 
check to see if a given submission is one that's also in progress at DP, or 
would it be a case of first-come, first-posted, and let DP take its lumps?


Rather than dumping who knows how many pre-releases into Preprints, I'd 
suggest a separate Prerelease page.  (Speaking personally, I regularly check 
Preprints for interesting/doable projects, and have drawn a number of 
projects from it.  I doubt I'd be interested in looking through a raft of 
ex-DP items, searching for non-DP Preprint items.)

Al


----- Original Message ----- 
From: "Michael S. Hart" <hart at pglaf.org>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
Sent: Wednesday, February 24, 2010 8:10 PM
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts


>
> On Wed, 24 Feb 2010, Al Haines (shaw) wrote:
>
>> Hmm... I figured Greg was busy enough! <g>
>
> I talked to Greg about it, he's fine with it.
> Might be best to let him do it once a week or something, tho.
>
>
>> As for the "major technical challenge" suggested elsewhere in this topic, 
>> why
>> can't DP put up a wiki page for pre-releases, similar to its Harvesting 
>> wiki
>> page?  Project Managers (or whoever) could put links on the page to their
>> pre-release candidates, and when a pre-release was ready for submission 
>> to PG,
>> or had been posted, the PM could remove the link.
>
> I'm up for any way we can get more eBooks to more people, sooner, than 
> later.
>
> I'm willing to try them both, see how our readers respond.
>
> After all, it's not trivial to find the PrePrints page, though Google does
> seem to find it ok, which should be ok for most of our readers.
>
> Let's try. . . .
>
>
> Michael
>
>>
>> Al
>>
>>
>> ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
>> To: "Project Gutenberg Volunteer Discussion" <gutvol-d at lists.pglaf.org>
>> Sent: Wednesday, February 24, 2010 5:50 PM
>> Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
>>
>>
>> >
>> > Greg Newby takes care of deleting from PrePrints, should be easy enough
>> > just to send him notes of what has been completed.
>> >
>> >
>> > mh
>> >
>> >
>> >
>> > On Wed, 24 Feb 2010, Al Haines (shaw) wrote:
>> >
>> > > Would it not be simpler for DP itself to have a Pre-releases page, 
>> > > similar
>> > > to
>> > > its Smooth-Read page?
>> > >
>> > > I would think that if pre-releases are copied from DP into some PG
>> > > environment, similar to Preprints, there would need to be some
>> > > coordination to
>> > > remove them from that environment when they're posted into PG as 
>> > > finished
>> > > products.
>> > >
>> > > Al
>> > >
>> > >
>> > > ----- Original Message ----- From: "Michael S. Hart" <hart at pglaf.org>
>> > > To: "Project Gutenberg Volunteer Discussion" 
>> > > <gutvol-d at lists.pglaf.org>
>> > > Sent: Wednesday, February 24, 2010 1:21 PM
>> > > Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
>> > >
>> > >
>> > > >
>> > > >
>> > > >
>> > > > I can tell you that both Greg Newby, our CEO, and myself,
>> > > > are most definitely FOR doing such prereleases!!!
>> > > >
>> > > > We are prepared to start making directories for them and
>> > > > getting them loaded in as soon as we have permission.
>> > > >
>> > > > We would put up a number of disclaimers abou them to put
>> > > > a protective system up for DP and various volunteers.
>> > > >
>> > > > We would appreciate any opportuty to do this with any of
>> > > > the three sections of books mentioned earlier.
>> > > >
>> > > > Many thanks!!!
>> > > >
>> > > >
>> > > > Michael
>> > > >
>> > > >
>> > > >
>> > > > On Wed, 24 Feb 2010, Lee Passey wrote:
>> > > >
>> > > > > On 2/24/2010 12:37 PM, cmiske at ashzfall.com wrote:
>> > > > >
>> > > > > [snip]
>> > > > >
>> > > > > > The more I think about it, the more I feel that it would 
>> > > > > > actually >
>> > > > > be
>> > > > > > rather nice to see something along the lines of the 'roundless
>> > > > > > experiment' produce decent quality 'first release' books and 
>> > > > > > then >
>> > > > > have
>> > > > > > people nitpick them to 'perfection' later. The point would be 
>> > > > > > the >
>> > > > > format
>> > > > > > of storage so that the rough version could be compared to the > 
>> > > > > >  > >
>> > > original
>> > > > > > scans and corrections/updates be done with ease. Also, if a 
>> > > > > > book is
>> > > > > > backlogged at DP, it would make sense to have a method for 
>> > > > > > someone
>> > > > > > outside of DP to adopt the book and finish it.
>> > > > >
>> > > > > I agree with you completely, and what you have described is what 
>> > > > >  > >
>> > > bowerbird
>> > > > > has
>> > > > > been agitating for for many years now. A book would go into 
>> > > > > 'first
>> > > > > release'
>> > > > > when the number of changes in the roundless system dropped below 
>> > > > > a > >
>> > > certain
>> > > > > rate. It would then go into general distribution in a system that 
>> > > > >  > >
>> > > allowed
>> > > > > "continuous proofreading" in a wiki-like system
>> > > > >
>> > > > > Good luck in convincing either DP or PG to adopt these kind of > 
>> > > > >  >
>> > > reforms.
>> > > > > _______________________________________________
>> > > > > gutvol-d mailing list
>> > > > > gutvol-d at lists.pglaf.org
>> > > > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>> > > > >
>> > > > _______________________________________________
>> > > > gutvol-d mailing list
>> > > > gutvol-d at lists.pglaf.org
>> > > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>> > > >
>> > >
>> > >
>> > > _______________________________________________
>> > > gutvol-d mailing list
>> > > gutvol-d at lists.pglaf.org
>> > > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>> > >
>> > _______________________________________________
>> > gutvol-d mailing list
>> > gutvol-d at lists.pglaf.org
>> > http://lists.pglaf.org/mailman/listinfo/gutvol-d
>> >
>>
>>
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


From jimad at msn.com  Thu Feb 25 11:22:32 2010
From: jimad at msn.com (Jim Adcock)
Date: Thu, 25 Feb 2010 11:22:32 -0800
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <8baaac1d1002230926g2b26171dp84eed4061f21676a@mail.gmail.com>
References: <bfef.7902e780.38b2f8f0@aol.com>	<SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
	<8baaac1d1002230926g2b26171dp84eed4061f21676a@mail.gmail.com>
Message-ID: <SNT120-DS2ED617A754A87DF3F2B5DAE400@phx.gbl>

>I asked the price of Linux development kit. It is 9000 Euro, plus some more
money to get a licence for a fixed number of page/month (500 euro for 25k
pages/month)

As opposed to $400 a pop per volunteer buying the recommended ABBYY
Finereader in order to do a good job of OCR, or $0 to do a bad job of OCR
using whatever came free with one's scanner -- leaving the P1's to be the
ones who are actually doing the OCR!

Maybe we should do a straw poll of the volunteers about who is willing to
donate to a hosted OCR. If each DP volunteer were willing to donate $3 we
would be there.


From grythumn at gmail.com  Thu Feb 25 11:28:14 2010
From: grythumn at gmail.com (Robert Cicconetti)
Date: Thu, 25 Feb 2010 14:28:14 -0500
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <SNT120-DS2ED617A754A87DF3F2B5DAE400@phx.gbl>
References: <bfef.7902e780.38b2f8f0@aol.com>
	<SNT120-DS359BF0FBA2E88D39C9DF7AE420@phx.gbl>
	<8baaac1d1002230926g2b26171dp84eed4061f21676a@mail.gmail.com>
	<SNT120-DS2ED617A754A87DF3F2B5DAE400@phx.gbl>
Message-ID: <15cfa2a51002251128o78e538ek2c343bbcd2e2a9e@mail.gmail.com>

Who pays $400 for FR? Older versions that work quite well can be had
on ebay for under $50, last time I looked. And there's always the OCR
pool.

-Bob

On Thu, Feb 25, 2010 at 2:22 PM, Jim Adcock <jimad at msn.com> wrote:
>>I asked the price of Linux development kit. It is 9000 Euro, plus some more
> money to get a licence for a fixed number of page/month (500 euro for 25k
> pages/month)
>
> As opposed to $400 a pop per volunteer buying the recommended ABBYY
> Finereader in order to do a good job of OCR, or $0 to do a bad job of OCR
> using whatever came free with one's scanner -- leaving the P1's to be the
> ones who are actually doing the OCR!
>
> Maybe we should do a straw poll of the volunteers about who is willing to
> donate to a hosted OCR. If each DP volunteer were willing to donate $3 we
> would be there.
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From jimad at msn.com  Thu Feb 25 11:48:03 2010
From: jimad at msn.com (Jim Adcock)
Date: Thu, 25 Feb 2010 11:48:03 -0800
Subject: [gutvol-d] Re: let us not be confused
In-Reply-To: <efe7.a6d588a.38b5782c@aol.com>
References: <efe7.a6d588a.38b5782c@aol.com>
Message-ID: <SNT120-DS10508F8A9ED5BA5593B655AE400@phx.gbl>

>jim said:
>>   Not that I totally disagree, but 
>>   when you take out the easy stuff, 
>>   the stuff that's left is harder to find.
>
>where's your data that validates that?

Simply my personal experience finishing a book taking this exact approach.
When you take out the easy stuff, you are left with the stuff that typically
the P2s and P3s find -- hopefully! When you do the machine marking of errors
you double the amount of errors that need to be fixed -- since each input
file contributes its errors.  But now many of the "P1" type errors are
marked, which could be a win if the P1 people were presented with a
highlighted version of those errors, similar to what is currently presented
in WordCheck. To me a surprising amount of work goes into fixing
hyphenation/dehyphenation linebreak errors, where it should be possible to
make a more intelligent tool to fix most of these problems -- or does
someone claim to already have such an intelligent tool?  But again, when I'm
doing a "solo" project, I don't really spend that much of my total effort on
a "P1" type pass to begin with, and a surprisingly large amount of time and
effort goes into doing "PP" stuff -- including supporting that GD PG "txt70"
format!


From Bowerbird at aol.com  Thu Feb 25 11:52:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Feb 2010 14:52:03 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <20644.f0a15b3.38b82ee3@aol.com>

al said:
>   there would need to be some coordination to remove them from 
>    that environment when they're posted into PG as finished products.

michael said:
>    Greg Newby takes care of deleting from PrePrints, should be 
>    easy enough just to send him notes of what has been completed

my gawd, what a remarkable exchange.

we're talking about _files_, on the _internet_, with _known_ u.r.l.,
going through a standard progression of states.

and y'all think you need to handle this _manually_, by "sending notes"?

you haven't learned one thing about why all of your processes are
_so_ inefficient, have you?   not a single thing!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/f7bccc09/attachment.html>

From hart at pglaf.org  Thu Feb 25 12:17:56 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 25 Feb 2010 12:17:56 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <20644.f0a15b3.38b82ee3@aol.com>
References: <20644.f0a15b3.38b82ee3@aol.com>
Message-ID: <alpine.DEB.2.00.1002251214090.23263@mail.pglaf.org>


We complile notes into computer instructions all the time,
it's not so terribly inefficient.  How do you think the
"posted" list gets created in the first place?

I'm sure we will improve any such process a few times when
we are first getting started, but in the end the important
line can be parsed out of the message just like one of the
volunteers always says the eBooks should be parsable, only
on a much easier scale. . .use the same POV on both sides,
when you are making your points, it's more believable.


On Thu, 25 Feb 2010, Bowerbird at aol.com wrote:

> al said:
> >?? there would need to be some coordination to remove them from
> >?? that environment when they're posted into PG as finished products.
>
> michael said:
> >?? Greg Newby takes care of deleting from PrePrints, should be
> >?? easy enough just to send him notes of what has been completed
>
> my gawd, what a remarkable exchange.
>
> we're talking about _files_, on the _internet_, with _known_ u.r.l.,
> going through a standard progression of states.
>
> and y'all think you need to handle this _manually_, by "sending notes"?
>
> you haven't learned one thing about why all of your processes are
> _so_ inefficient, have you?? not a single thing!
>
> -bowerbird
>
>

From cmiske at ashzfall.com  Thu Feb 25 12:26:29 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Thu, 25 Feb 2010 13:26:29 -0700
Subject: [gutvol-d] Re: let us not be confused
Message-ID: <20100225132629.0dedd0f3f91314fbc67db20f64e304ca.d197b1c7de.wbe@email05.secureserver.net>

>Jim said:
>Simply my personal experience finishing a book taking this exact approach.
>When you take out the easy stuff, you are left with the stuff that typically
>the P2s and P3s find -- hopefully! When you do the machine marking of errors
>you double the amount of errors that need to be fixed -- since each input
>file contributes its errors. But now many of the "P1" type errors are
>marked, which could be a win if the P1 people were presented with a
>highlighted version of those errors, similar to what is currently presented
>in WordCheck. To me a surprising amount of work goes into fixing
>hyphenation/dehyphenation linebreak errors, where it should be possible to
>make a more intelligent tool to fix most of these problems -- or does
>someone claim to already have such an intelligent tool? 

I made several such tools many years ago. I'm not sure I can locate them
in my backups. In any case, I am in the process of creating some tools
to assist in the process of proofing and formating. These would be for
the use of 'solo' producers as well as DP users.

I do not believe in full automation for anything complex (as it can
induce as many errors as it fixes), so many of these tools would
primarily prompt the user to make choices about things that 'confuse'
the software.

Carel


_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d


From cmiske at ashzfall.com  Thu Feb 25 12:54:16 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Thu, 25 Feb 2010 13:54:16 -0700
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <20100225135416.0dedd0f3f91314fbc67db20f64e304ca.bf7a801c99.wbe@email05.secureserver.net>

>Al said: 
>Will it contain only pre-release text files, or all working files associated 
>with a given project (page scans, illustrations, text, HTML, etc, etc)?
>
>If the latter, what's to stop someone from taking those files, getting their 
>>own clearance, and submitting them to PG as their own work? Or is DP going 
>to consider that they're, in effect, abandoned projects, and up for grabs? 

I actually hope the files are made available for others to finish. If
the backlog is deep enough at DP as to warrant the creation of the
pre-release area, then it would actually relieve their burden if
individuals adopted some of the projects. In any case, I sincerely doubt
there would be a mass wave of people grabbing up projects.

If someone does finish one of these projects, I assume an attribution to
DP would be included in the header to credit the work that they have
done. To me, the issues of 'ownership' and 'credit' are secondary to
what I believe is the main goal of getting finished works into PG so
that people can read them. 

Carel


From Bowerbird at aol.com  Thu Feb 25 12:59:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Feb 2010 15:59:03 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <252db.3203e702.38b83e97@aol.com>

michael said:
>    We complile notes into computer instructions all the time,
>    it's not so terribly inefficient.

um, ok, if you say so.


>    How do you think the "posted" list 
>    gets created in the first place?

i thought it was in some sensible way.
but now i'm just plain afraid to ask...

***

anyway, here's how the questions should be answered:

1.   programs scrape the d.p. website for info about
which books have been posted into which rounds...

2.   when a book is posted into a new round, its file
is copied to the prerelease site, and the prerelease
file posted for the previous round (if there was one)
is deleted.

3.   when a book is posted to p.g., the latest file on
the prerelease site is replaced with a pointer to p.g.

4.   all of this happens _automatically_, with no need
for any human intervention at all.

do you see the difference?   because it's a huge difference.

of course, this doesn't answer the _political_ questions
al asked.   but those answers are above my pay grade...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/c4f266eb/attachment.html>

From Bowerbird at aol.com  Thu Feb 25 13:55:31 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Feb 2010 16:55:31 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <14b0a.4237863c.38b84bd3@aol.com>

ok, you might remember that gardner invited me to
take a look at a book he posted -- "the advocate"...

>    http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
>    http://www.archive.org/details/advocateanovel00heavgoog
>    http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
>    http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8

***

gardner's since told me that he just expected that i'd
run his text through my clean-up script and report
things that it found.   but, as i told him a while back,
i don't really have any such general clean-up script.

besides...

when i _really_ want to find out if a text is accurate,
i compare it against another digitization, and then
resolve any differences by looking at the page-scan.

i've found it's an excellent way to approach perfection.

i've also documented the process quite thoroughly,
but i haven't gone through that exercise lately, so i
thought i'd use gardner's invitation to remedy that.

it's also true that i knew it'd be a good opportunity to
demonstrate the importance of retaining the original
linebreaks and pagebreaks when you digitize a book,
so i did it for that reason as well.

***

i cleaned up the o.c.r. i got from internet archive,
and then compared it against gardner's proofing.
i resolved the diffs by looking at the page-scan,
and ended up with my final version, which is here:
>    http://z-m-l.com/go/gardn/gardn.zml

comparing this against the p.g. e-text, i found
about 79 places where the p.g. text is incorrect.
i appended the list to this post, plus it is here:
>   http://z-m-l.com/go/gardn/gardn-79diffs.html

(there _are_ 79 differences, but somebody _might_
challenge a few of them as _my_text_ being wrong,
and not the p.g. e-text.   i welcome dialog on them.
there are about 70 cases where there is no doubt.)

the vast majority of the diffs were on punctuation.
abbyy v5 read many marks on the page as commas,
and often garbled the semicolon/colon distinction.
it recognized most letters correctly, except for h/b.
despeckling might (or might not) help performance,
but upgrading to abby v7 is probably the best bet...

i salute gardner's courage in seeking to _improve_
his workflow.   because of his openness to criticism,
he's learned that his o.c.r. app does him no favors,
which is a lesson that shouldn't really hurt much...
especially since he ends up with a nice little list that
points to some 79 possible errors in his digitization.

***

it's worth noting that my cleaned-up version of the
o.c.r. from archive.org was as bad as gardner's file.

in the same way that the diffs helped me to find the
errors in _his_ file, the diffs helped locate the errors
in _mine_ -- the beauty of the comparison method
is that two flawed files can produce one that is better,
to the point the merged product can be nearly-perfect.

given the plethora of flawed digitizations in the world,
the comparison method gives us real hope for change.
the fact that it is boatloads more efficient than using
the word-by-word proof method is icing on the cake.

if you don't understand this is the way of the future,
you're not paying sufficient attention, my friend, and
i suggest you get out of the way or you'll get run over.

***

now, on to my last point, about pagination/linebreaks.

gardner has said he will fix the errors that were found,
and resubmit the book to project gutenberg; that's fine.

but now there are two versions out there in the world:
the one i created with original pagination/linebreaks,
and a p.g. version where the text has been rewrapped.

now, a person can-- with some degree of difficulty --
verify that the two versions have the exact same text...

but what i have done is to juxtapose _my_ version with
the page-scans of the book, so as to highlight the fact
that the text i've presented does indeed match the scan:
>    http://z-m-l.com/go/gardn/gardnp123.html

with the p.g. version, the user must take your word that
the version actually does correspond well to the p-book.

given the starkness of their choice, the answer is clear...

in the past, when the p.g. version was the _only_ version
that was available as digital text, people were _happy_
to accept that it was a copy that was accurate enough...

with a choice, however, and users now _have_ a choice,
i'm not sure they'll be so willing to take it on pure faith.

i _am_ sure michael will want to argue about this.   fine.
but i don't see anything to argue about; it's a slam dunk.

***

i'll do a follow-up report (tomorrow?) discussing how
you can generate other versions from the .zml master,
as well as how to remove the pagination and rewrap...

-bowerbird

p.s.   ok, here are the 79 diffs between me and p.g.
use a monospaced font, so the pointer-line works:

looked up and sighed, and he continued:
looked up and sighed, and he continued: <-- missing paragraph break here
=======================================xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

soon after their union, his wife died in giving
soon, after their union, his wife died in giving
====x===========================================

to hide her shame.
to hide her shame,
=================x

jest, all sat to listen to and applaud their host's inimitable stories, his
jest, all sat to listen to and applaud, their host's inimitable stories, 
his
======================================x=====================================


or,--leaving both revelation and the pandects,--become the
or,--leaving both revelation, and the pandects,--become the
============================x==============================

[Illustration: "As if at the jests of another
[Illustration "As if at the jests of another <-- missing colon after 
illustration tag
=============x===============================

of birds, the lowing of cattle, the barking of dogs, the churr of the 
bullfrog,
of birds, the lowing of cattle, the barking, of dogs, the churr of the 
bullfrog,
===========================================x================================
====

creeper half concealed its rugged exterior, and clothed in
creeper half concealed; its rugged exterior, and clothed in
======================x====================================

done;--what is it that alarms you?
done;--what is it that alarms you?"
==================================x

swearing, into the vestibule.
swearing, into the vestibule
============================x

and attempted to bear her off, as of old the treacherous
and attempted to tear her off, as of old the treacherous
=================x======================================

and the affair that had been begun in vulgar, aimless, frolic, might 
and the affair that had been, begun in vulgar, aimless, frolic, might 
============================x=========================================

in which the gallant stranger had disappeared.
in which the gallant stranger, had disappeared.
=============================x=================

fool, that rent the kingdom,--Rehoboam
fool, that rent the kingdom,--Rehoboam.
======================================x

him for a block-head, a little black-browed beetle, a
him for a block head, a little black-browed beetle, a
===============x=====================================

my blister, my settled, ceaseless source of irritation: the cause, the 
cause--of
my blister, my settled, ceaseless source, of irritation: the cause, the 
cause--of
========================================x===================================
=====

against its cause, the gallant stranger.
against its cause, the gallant stranger,
=======================================x

and exagerrated form. Anger and shame contended in the old law
and exagerrated form. Auger and shame contended in the old law
=======================x======================================

sir, come,--ah, will you resist your"--father he was about to say, but he 
recoiled 
sir, come,--ah, will you resist your--" father he was about to say, but he 
recoiled 
====================================xxxx====================================
========

thou shalt suffer, thou shalt bend, or
thou shall suffer, thou shalt bend, or
=========x============================

with the cup his own hands have fashioned?
with the cup his own hands have, fashioned?
===============================x===========

at last growing calmer he exclaimed: "Down, down, ye cruel thoughts, ye 
at last growing calmer he exclaimed; "Down, down, ye cruel thoughts, ye 
===================================x====================================

not have you murdered in your old age.
not have you murdered in your old, age.
=================================x=====

her eyes dwelt abstractedly on the sight, then,
her eves dwelt abstractedly on the sight, then,
=====x=========================================

"Can I have heard aright, or do
"Can have heard aright, or do
=====xx========================

am the seignieur
am the seigneur
============x===

of the beloved in rock, not sand."
of the beloved in rock, not sand.'
=================================x

nothing heard but you," replied
nothing heard but you." replied
=====================x=========

"Nothing, I hope," she answered, falteringly.
"Nothing. I hope," she answered, falteringly.
========x====================================

beware, beware, Amanda.
beware, beware, Amanda,
======================x

a gentleman, who announced himself as
a "gentleman, who announced himself as
==x===================================

are giddy," remarked the seigneur gravely.
are giddy." remarked the seigneur gravely.
=========x================================

sir, no; we are fumigated, ventilated, scented,
sir, no: we are fumigated, ventilated, scented,
=======x=======================================

assert, 'All flesh is glass.'"
assert. 'All flesh is glass.'"
======x=======================

have as little for your son," said the lawyer sarcastically.
have as little for your son," said, the lawyer sarcastically.
==================================x==========================

the manner of one who is going to make a confidential proposal: "Either 
remove 
the manner of one who is going to make, a confidential proposal: "Either 
remove 
======================================x=====================================
====

parent made no answer, but secretly groaned in his dilemma, and at length 
excl
parent made no answer, but secretly, groaned in his dilemma, and at length 
excl
===================================x========================================
===

buy brass from you at the price of gold; I will not subsidize you to avoid 
your ward."
buy brass from you at the price of gold; will not subsidize you to avoid 
your ward."
=========================================x==================================
==========

of life you keep a care,
of life you keep a care.
=======================x

midst of three sons and a daughter; the former being dissipated and
midst of three sons and a daughter: the former being dissipated and
==================================x================================

and not ill-disposed youth; whom his
and Hot ill-disposed youth; whom his
====x===============================

their peccadilloes, and entertained other ideas of foreign travel than that
their peccadilloes, and entertained, other ideas of foreign travel than 
that
===================================x========================================


had told her of the predilection and hopes of
had told her of the predilection, and hopes of
================================x=============

as of one who intends no longer to be checked, nor submit to unmerited
as of one who intends no longer to lie checked, nor submit to unmerited
===================================xx==================================

his breast charged with a spiteful purpose; and going straight to the 
lodgings of
his breast charged with a spiteful purpose: and going straight to the 
lodgings of
==========================================x=================================
=====

that night, and in the morning, having obtained leave of absence, rode
that night, and in the morning, haying obtained leave of absence, rode
==================================x===================================

solitary, and filled with chequered thoughts, continued his way,
solitary, and filled with checquered thoughts, continued his way,
=============================x===================================

you are there unto me will still be heaven.
you are there unto me will still he heaven.
=================================x=========

before you could have asked forgiveness," replied
before you could have asked forgiveness." replied
=======================================x=========

boundless goodness and free grace, remits the debts and manifold 
boundless goodness and, free grace, remits the debts and manifold 
=====================x===========================================

not libel love, nor our sweet fortunes," cried
not libel love, nor our sweet fortunes." cried
======================================x=======

It _is_ because _it must_ be; it is unselfish; nay, unto itself
It is because _it must_ be; it is unselfish; nay, unto itself
===xxxx========================================================

its riders; and prompted as it seemed by fear of a rescue, the 
its riders; and prompted is it seemed by fear of a rescue, the 
=========================x=====================================

to swallow the ground, until again over all burst,
to swallow the ground, until again overall burst,
=======================================x==========

gnome, from beneath her envelopement,
gnome, from beneath her envelopment,
===============================x=====

will not require to find your way back this year."
will not require to find your way hack this year."
==================================x===============

gazed wistfully around to discover some glimpse of dawn.
gazed wistfully around to discover, some glimpse of dawn.
==================================x======================

implore you, from this man," and with the words she sprang towards
implore you, from this man," find with the words she sprang towards
=============================xx====================================

"Keep quiet, gentle lady; have patience, bashful beauty; sit down, sit 
down; come p
"Keep quiet, gentle lady; have patience, bashful, beauty; sit down, sit 
down; come p
================================================x===========================
========

akin to contempt, he demanded:
akin to contempt, he demanded: <-- missing paragraph break here
==============================xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

No. Besides at present you have taken a long-
No, Besides at present you have taken a long-
==x==========================================

and taking to their heels concealed themselves amongst the trees that 
covere
and taking to their heels concealed, themselves amongst the trees that 
covere
===================================x========================================
=

answered; "worse than your worst suspicions,
answered "worse than your worst suspicions,
========x===================================

serve? who, except the father of yon boy, the
serve? who, except the father of you boy, the
===================================x=========

this, thinking it to be but the jest or boast, or, at furthest, merely the 
loose anno
this, thinking it to be but the jest or toast, or, at furthest, merely the 
loose anno
========================================x===================================
=========

dubious dwelling where, some hours before,
dubious dwelling where, some hours before.
=========================================x

have her face flayed; her hair shall be plucked up by the roots;" and she
have her face flayed; her hair, shall be plucked up by the roots;" and she
==============================x===========================================

are, as you seem to be, a gentleman, do not leave me;" she exclaimed be
are, as you seem to be, a gentleman. do not leave me;" she exclaimed be
===================================x===================================

calm thyself, girl," echoed the ponderous
calm thyself, girl." echoed the ponderous
==================x======================

advanced, and bending over her whilst his voice fell,
advanced and bending over her whilst his voice fell,
========x============================================

who shall blame the sun and moon
who shall blame the sun and moon <-- missing paragraph break here
================================xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

in surprise and terror. "No,
in surprise and terror. "No.
===========================x

called for anything of earth, but
called for any thing of earth, but
==============x===================

shall ooze out drop by drop, and each drop be a portion of your life."
shall ooze out drop by drop, and each, drop be a portion of your life."
=====================================x=================================

missing, had been caught by the advocate's keen eye, and convinced
missing, had been caught by, the advocate's keen eye, and convinced
===========================x=======================================

until, detected, he stood, too nigh to retreat, too terrified to advance, 
an
until, detected, he stood, too nigh, to retreat, too terrified to advance, 
an
===================================x========================================
=

he cried: "Demon, degenerate dog, where hast thou been walking to and fro 
in the 
he cried: "Demon, degenerate dog, where bast thou been walking to and fro 
in the 
========================================x===================================
=====

my son, my son," he cried in agony; "Oh,
my son, my son." he cried in agony; "Oh,
==============x=========================

did it, all you curious crowd.
did it, all you curious crowd
=============================x
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/4619a01e/attachment-0001.html>

From Bowerbird at aol.com  Thu Feb 25 14:34:46 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Feb 2010 17:34:46 EST
Subject: [gutvol-d] Re: d.p.'s undeserved superior attitude about "quality"
Message-ID: <15beb.3c741dcd.38b85506@aol.com>

carel said:
>   I've read your posts for many years

so _you're_ the one who's been reading them!            ;+)


>   I'm not sure the most efficient system on earth 
>    could keep up with the scan rate.

sure we can.   just have 10 million people do 1 book each.

piece'a'cake.


>   And, if you speed things up by skipping the manual proofing
>    and use automation completely for the first release then you 
>    will have text little to no better than what google already has.

oh, don't kid yourself.   google's text is _much_ better than
what they're showing to us members of the general public.


>    Wikimedia then takes this further by adding the ability 
>    for humans to modify the text in a never-ending cycle.

i haven't been impressed by the wiki approach, honestly...


>    In what manner do you suggest that this would be
>    differentiated from the current wiki book editing 
>    process at wikimedia that justifies the creation of 
>    the project for the purpose of PG?

i'm not suggesting this for p.g.   it could be for anybody.
including google.   as for how it differs, please see below.


>    it wouldn't be that hard to cull the data from google

well, it's not that "hard", but it's not that _easy_ either...
first, they've started rewrapping the lines, the dirty rats.
second, and more importantly, their terms prohibit it...
people on this listserve have been cut off for scraping...
you can still do it, of course, but you have to be careful.


>    Then, someone can

yeah, what you've laid out is basically the standard plan.

i would bother to nitpick certain parts of it, like the x.m.l.,
but really, my objection is at the more fundamental level.

yes, i believe every page of every book should be online,
on its own webpage with unchanging u.r.l., text and scan,
with an error-report form that the general public can use
to detail any problems with the page, or ask questions, or
even make annotations and have dialog about that page...

but i also believe that the text for the vast majority of these
pages should be _perfect_ at the time that text is mounted.

and i've demonstrated, repeatedly, that this is fully possible.

one person can take an average book to near-perfection in
the space of a couple hours, if they are given a good tool...
("near-perfection" means 1-error-or-less-every-10-pages.)

i just showed how you can do this, using gardner's book...

so, if 9 out of 10 pages are perfect to begin with, then you
simply don't need a full-on wiki format.   a wiki is good when
you are constantly editing something.   but book digitization
isn't like that.   corrections will be very few and far between.
meanwhile, there's no reason to put every page up for grabs.

even on a first pass, straight outta o.c.r., assuming good scans
and a good o.c.r. app, _half_ the pages have no errors on 'em.

in 2001, the idea for distributed proofreaders was brilliant.

(i was once a graduate student at u.c.l.a. in social psychology,
studying cooperation/competition in the arena of the internet,
so i'm probably the best person on the planet to understand
just precisely how brilliant the idea was, and it was brilliant.)

in 2001, scansets were rare, and o.c.r. stunk (and cost a lot),
and it was hard -- very hard -- to digitize a book by yourself.
moreover, the primary method of connection was slow-speed.

so doling out pages one-at-a-time to people online was
a relatively efficient way of doing book digitization in 2001,
especially as it allowed us to _cumulate_ the _partial_effort_
of many people.   simply put, it made sense to go distributed.
a good number of people were willing to "proof a page a day".

in 2010, the situation has been turned completely on its head.
scansets are no longer rare; indeed, we are swimming in them.
o.c.r. no longer stinks, nor does it cost a lot.   indeed, many of
those scansets now _come_ with the o.c.r. already being done,
and -- after some preprocessing, which isn't difficult to do --
the high level of the accuracy of that o.c.r. is rather startling.
and perhaps most importantly, broadband is now widespread;
we think nothing of throwing tens of megabytes around now.

so, in 2010, doling out pages one-at-a-time to people online
is just a silly idea.   i can download in the background now, but
even if i _watch_, i can download a whole scanset in minutes.
and a 300k text-file that represents a book?   in just seconds.

these days, with the right tools at your disposal, it's possible
for a person to digitize a book all by yourself in _one_hour_.
i proved this -- with a stopwatch -- and documented it fully.

so here's how i see things going...

when you find a scanset online of a book you want to digitize,
you'll download it, along with the o.c.r.   (if it comes from a site
that won't give you the o.c.r., you can upload it to archive.org,
and they'll o.c.r. it for you.   providing you have no o.c.r. app.)

you'll use a handy-dandy tool to digitize it in an hour or two,
and then you'll make the results available to everyone online.

or... your alternative is waiting 3 years for d.p. to get it done.

and carel, since you're interested in programming such tools,
we'll have plenty to talk about...


>    I am very interested in your text processing scripts 
>    though and always have been. I just haven't been 
>    processing any OCR text for a while.

well, like i told gardner just the other day, i really don't
_have_ any collection of scripts.   i use a text-editor and
start looking at the book.   i see an error, and devise a
global search-and-replace to deal with that type of error.
when they're all fixed, i look again to find the next error.

i think of this process as "listening to the book", in that
the book itself will tell you what kinds of errors it has...

and yeah, there are some things that you do to every book,
like fix spacey quotemarks, and pull in spacey contractions.
and it would be good to write macros or scripts or routines
to do these.   but it's fairly trivial to type them in each time.
it's also the case that a lot of the time you _must_ examine
the text in order to approve or disapprove the change, and
-- while it is not impossible to code an interface for that --
your typical text-editor _has_ one, which you know and love.


>   I merely meant that we are not in a position to 
>    have any say in what DP does with itself. 
>    Nor in what PG does with itself.

but i love telling them what they should be doing, and why.

because once they decide to do it, and they _will_, because
i'm never wrong about these things, i can say "i told you so".

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/3d910d92/attachment.html>

From jimad at msn.com  Thu Feb 25 14:51:15 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 25 Feb 2010 14:51:15 -0800
Subject: [gutvol-d] Inkmesh
In-Reply-To: <20100224184725.0dedd0f3f91314fbc67db20f64e304ca.5965319e6a.wbe@email05.secureserver.net>
References: <20100224184725.0dedd0f3f91314fbc67db20f64e304ca.5965319e6a.wbe@email05.secureserver.net>
Message-ID: <SNT120-DS242A51D9B28A00501B27DBAE400@phx.gbl>

Just found about this site which is a cross-site search engine for free
ebooks in various formats:

http://inkmesh.com


From jimad at msn.com  Thu Feb 25 15:19:37 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 25 Feb 2010 15:19:37 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <AA902F07008A4680BD1E6D65699E82CD@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<4B8582E5.5050103@novomail.net>	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>	<B592E6447AAB45C891E108507D6811FA@alp2400>	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>	<A871E93708224C1FAB0E62A643357D64@alp2400>	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
Message-ID: <SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>

>If the latter, what's to stop someone from taking those files, getting
their 
own clearance, and submitting them to PG as their own work?

What's to stop a PM and Content Provider who is sick to death of having
their hard work stuck in limbo year after year at DP from taking a SR copy,
cleaning it up and submitting it to PG -- given that they are the holder of
the CC in the first place and the person who did the lion's share of the
work cleaning it up for submission to DP in the first place?

Answer: Simple Integrity, and the desire to play fair with DP even when DP
is not playing fair with that PM and Content Provider.  What is NOT fair
IMHO is when works that volunteers have put their blood sweat and tears into
gets stuck forever at DP while apparently a commercial entity has taken the
SR from DP and is selling it on Amazon.  Work that volunteers put into the
public domain should go there first, and THEN back to the commercial
providers. 

But, this is what happens when you take years sitting on books instead of
allowing them to be finished.


From cmiske at ashzfall.com  Thu Feb 25 15:22:57 2010
From: cmiske at ashzfall.com (Carel)
Date: Thu, 25 Feb 2010 23:22:57 -0000
Subject: [gutvol-d] Re: =?utf-8?q?Inkmesh?=
Message-ID: <mailman.1.1267140179.2333.gutvol-d@lists.pglaf.org>

Very nice resource. Thank you!
It would be nice if it had a mobi version. 
Carel

Sent from my HTC on the Now Network from Sprint!

----- Reply message -----
From: "James Adcock" <jimad at msn.com>
Date: Thu, Feb 25, 2010 2:51 PM
Subject: [gutvol-d] Inkmesh
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>

Just found about this site which is a cross-site search engine for free
ebooks in various formats:

http://inkmesh.com


_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/1ea2fc5f/attachment.html>

From jimad at msn.com  Thu Feb 25 15:38:42 2010
From: jimad at msn.com (James Adcock)
Date: Thu, 25 Feb 2010 15:38:42 -0800
Subject: [gutvol-d] Re: Inkmesh
In-Reply-To: <20100225232259.6BD34392@pglaf.org>
References: <20100225232259.6BD34392@pglaf.org>
Message-ID: <SNT120-DS191BA44D66FC71B1F73354AE400@phx.gbl>


>It would be nice if it had a mobi version. 

Not sure I understand this comment.  If the result says ?Kindle? and it?s a free version, then it is a .mobi file, or else is a .prc file -- which in practice is the same as a .mobi file if you change the file extension.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/d1ac7f8b/attachment-0001.html>

From hart at pglaf.org  Thu Feb 25 15:52:50 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 25 Feb 2010 15:52:50 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
Message-ID: <alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>


We have always invited people to take our completed books and redo them
into their own editions, and hopefully resubmit them to redistribute.

If we do this for books that are done, why not for those undone?

Are we worried more about who gets the credit and getting books out?

Michael


On Thu, 25 Feb 2010, James Adcock wrote:

> >If the latter, what's to stop someone from taking those files, getting
> their
> own clearance, and submitting them to PG as their own work?
>
> What's to stop a PM and Content Provider who is sick to death of having
> their hard work stuck in limbo year after year at DP from taking a SR copy,
> cleaning it up and submitting it to PG -- given that they are the holder of
> the CC in the first place and the person who did the lion's share of the
> work cleaning it up for submission to DP in the first place?
>
> Answer: Simple Integrity, and the desire to play fair with DP even when DP
> is not playing fair with that PM and Content Provider.  What is NOT fair
> IMHO is when works that volunteers have put their blood sweat and tears into
> gets stuck forever at DP while apparently a commercial entity has taken the
> SR from DP and is selling it on Amazon.  Work that volunteers put into the
> public domain should go there first, and THEN back to the commercial
> providers.
>
> But, this is what happens when you take years sitting on books instead of
> allowing them to be finished.
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From greg at durendal.org  Thu Feb 25 16:35:54 2010
From: greg at durendal.org (Greg Weeks)
Date: Thu, 25 Feb 2010 19:35:54 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
Message-ID: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>

On Thu, 25 Feb 2010, Michael S. Hart wrote:

> Are we worried more about who gets the credit and getting books out?

I think we're worried about the fact that the only version available is one 
that you have to BUY that's based on our volunteer labor.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From cmiske at ashzfall.com  Thu Feb 25 16:40:17 2010
From: cmiske at ashzfall.com (Carel)
Date: Fri, 26 Feb 2010 00:40:17 -0000
Subject: [gutvol-d] Re: =?utf-8?q?Inkmesh?=
Message-ID: <mailman.2.1267144819.2333.gutvol-d@lists.pglaf.org>

I meant that it would be nice if the site had a version for smartphones. I was failed to realize for a moment that mini has many meanings.

Carel

Sent from my HTC on the Now Network from Sprint!

----- Reply message -----
From: "James Adcock" <jimad at msn.com>
Date: Thu, Feb 25, 2010 3:38 PM
Subject: [gutvol-d] Re: Inkmesh
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>


>It would be nice if it had a mobi version. 

Not sure I understand this comment.  If the result says ?Kindle? and it?s a free version, then it is a ..mobi file, or else is a .prc file -- which in practice is the same as a ..mobi file if you change the file extension.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/fdbcd33c/attachment.html>

From gbuchana at teksavvy.com  Thu Feb 25 17:33:19 2010
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Thu, 25 Feb 2010 20:33:19 -0500
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
 just for the exercise
In-Reply-To: <14b0a.4237863c.38b84bd3@aol.com>
References: <14b0a.4237863c.38b84bd3@aol.com>
Message-ID: <4B8724DF.3050902@teksavvy.com>

Thank you BB.

I've made a good share of these changes already based
on the line by line comparison you did earlier.  I'll
merge all the rest and re-post as time allows.

On 25-Feb-2010 16:55, Bowerbird at aol.com wrote:
> comparing this against the p.g. e-text, i found
> about 79 places where the p.g. text is incorrect.
> i appended the list to this post, plus it is here:
>  > http://z-m-l.com/go/gardn/gardn-79diffs.html
>
> (there _are_ 79 differences, but somebody _might_
> challenge a few of them as _my_text_ being wrong,

You missed the place where I corrected the spelling of
Ste H?l?ne.  That's 80.

>
> am the seignieur
> am the seigneur
> ============x===

Here, the correct spelling "seigneur" was used several other
places in the text.  It seemed reasonable to correct this
as a typo.

Definitely an improvement all 'round though.

See you,

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From Bowerbird at aol.com  Thu Feb 25 18:15:44 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Feb 2010 21:15:44 EST
Subject: [gutvol-d] Re: ok, let's take a look at gardner's book,
	just for the exercise
Message-ID: <370dc.6ad2fd79.38b888d0@aol.com>

gardner said:
>    Thank you BB.

you're welcome.   and thank you.   been a pleasure interacting with you.


>    You missed the place where I corrected the spelling of Ste H?l?ne.
> ? That's 80.

or 78, depending on how you count, since i stripped it to "helene".     ;+)


>    Here, the correct spelling "seigneur" was used several other places 
>    in the text.? It seemed reasonable to correct this as a typo.

you're absolutely correct.

and see how we join xeno in our approach to perfection?

for all of you lurkers following along with all the drama here,
i might add that gardner caught a lot of errors in the p-book,
especially regarding quotemarks they'd missed.   he is sharp...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/88ff2cdf/attachment.html>

From Bowerbird at aol.com  Thu Feb 25 18:27:32 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 25 Feb 2010 21:27:32 EST
Subject: [gutvol-d] roundlessness -- 006
Message-ID: <37998.1d6084e0.38b88b94@aol.com>

in looking at some of the text from rfrank's "roundless" experiment,
there's a chunk of a book which contained 192 lines that got changed.

a closer analysis showed _80_ of those lines (over 40% of the total)
were unchanged except that a spacey-quotemark had been fixed...

however, since the repair of spacey-quotemarks can be _automated_,
that means all of those 80 lines could have been corrected in advance.

if you really want roundless to work, you have to do it correctly...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/3e8a0a85/attachment.html>

From ke at gnu.franken.de  Thu Feb 25 19:17:37 2010
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Fri, 26 Feb 2010 04:17:37 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org> (Greg
	Weeks's message of "Thu, 25 Feb 2010 19:35:54 -0500 (EST)")
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
Message-ID: <m2ljeg1y9a.fsf@gnu.franken.de>

Greg Weeks <greg at durendal.org> writes:

> I think we're worried about the fact that the only version available
> is one that you have to BUY that's based on our volunteer labor.

If that's at least an option, why not?  Nobody forces you to buy it,
though.

-- 
Karl Eichwalder

From dakretz at gmail.com  Thu Feb 25 19:23:52 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 25 Feb 2010 19:23:52 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <m2ljeg1y9a.fsf@gnu.franken.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
Message-ID: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>

On DP I've called it the "Let Them Eat Cake" approach. Some people think
if they can't wait for the best, then anything less shouldn't be available
derived from their work. It's an option they shouldn't be permitted to have.


On Thu, Feb 25, 2010 at 7:17 PM, Karl Eichwalder <ke at gnu.franken.de> wrote:

> Greg Weeks <greg at durendal.org> writes:
>
> > I think we're worried about the fact that the only version available
> > is one that you have to BUY that's based on our volunteer labor.
>
> If that's at least an option, why not?  Nobody forces you to buy it,
> though.
>
> --
> Karl Eichwalder
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/0433fbd8/attachment.html>

From dakretz at gmail.com  Thu Feb 25 19:29:49 2010
From: dakretz at gmail.com (don kretz)
Date: Thu, 25 Feb 2010 19:29:49 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
Message-ID: <627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>

It's not uncommon among programmers, either, I've found, apropo of nothing,
present company excepted, etc.

On Thu, Feb 25, 2010 at 7:23 PM, don kretz <dakretz at gmail.com> wrote:

> On DP I've called it the "Let Them Eat Cake" approach. Some people think
> if they can't wait for the best, then anything less shouldn't be available
> derived from their work. It's an option they shouldn't be permitted to
> have.
>
>
>
> On Thu, Feb 25, 2010 at 7:17 PM, Karl Eichwalder <ke at gnu.franken.de>wrote:
>
>> Greg Weeks <greg at durendal.org> writes:
>>
>> > I think we're worried about the fact that the only version available
>> > is one that you have to BUY that's based on our volunteer labor.
>>
>> If that's at least an option, why not?  Nobody forces you to buy it,
>> though.
>>
>> --
>> Karl Eichwalder
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/677fdd2b/attachment-0001.html>

From pterandon at gmail.com  Thu Feb 25 19:56:33 2010
From: pterandon at gmail.com (Greg M. Johnson)
Date: Thu, 25 Feb 2010 22:56:33 -0500
Subject: [gutvol-d] Re: opinions on pre-release
Message-ID: <a0bf3e961002251956g41990be7gcc475a36c31d69e1@mail.gmail.com>

> Are we more worried about who
> gets credit or getting books out?

I don't mind seeing "produced by" in the intro to a book, but it is a bit
odd to see people getting editor and author citations in the PG database.
o__O

The workers at Googlebooks afaics have made completely anonymous
contributions-- their names aren't in the epubs.  But I think we are all at
a loss for not knowing the name of that lady whose lovely hand is preserved
forever in so many photocopiess in their collection.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100225/5c840647/attachment.html>

From donovan at abs.net  Thu Feb 25 21:20:25 2010
From: donovan at abs.net (D Garcia)
Date: Fri, 26 Feb 2010 00:20:25 -0500
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
Message-ID: <201002260020.25754.donovan@abs.net>

DISCLAIMER: The following _data_ comes directly from DP site statistics. All 
opinions following are my own.

Of the "8000" works "trapped at DP":

(rounded to nearest hundreds)
4100 from TIA.
700 from Gallica/BNF.
1000 from Google.
400 from the next 5 most represented online sources.
(6200 in total.)

Those 6200+ works already are available to the public, at minimum in scanned 
pages form, and most of them with OCR available. The argument that these works 
are "trapped" is a red herring stemming from frustration over how long it now 
takes the DP process to produce a "finished" version of the text.

On Thu, 25 Feb 2010, Michael Hart wrote:
>We have always invited people to take our completed books and redo them
>into their own editions, and hopefully resubmit them to redistribute.
>
>If we do this for books that are done, why not for those undone?

PG is of course welcome to continue their status quo practice with respect to 
those completed texts. However, I know of nothing that entitles PG to take 
advantage of the efforts of the volunteers of a separate organization before 
the results of those efforts are freely and willingly offered to them.

>Are we worried more about who gets the credit and getting books out?

Considering that the proposed scheme essentially serves to inflate PG's number 
of texts "available," with little significant benefit to the public, and with a 
real risk of significant detriment to DP as an organization and to its 
individual volunteers, can you honestly expect a reasonable person to take 
your question seriously?

There are (as is often pointed out on gutvol-d) hundreds of thousands of works 
in the various book scanning repositories, all "undone" as you would have 
people believe. If PG were truly interested in making large numbers of 
"undone" books available in "pre-print" then perhaps they should take 
advantage of their organizational clout and forge partnerships to have direct 
access to that material in all forms. But that would require effort.

What seems to have happened instead is that PG has decided that the DP in-
process text, even though unfinished, is desirable low-hanging fruit, and 
*that* requires only minimal effort. All that's required is to convince the 
'right' people at DP to either A) expend limited resources towards that end 
instead of where they're needed, or, B) stand aside and allow PG to take 
unreasonable advantage of what have so far been amiable terms of relationship.

Michael, I honestly respect your vision, but your ethic is sorely lacking at 
the moment.

David (donovan)

From hart at pglaf.org  Thu Feb 25 21:52:14 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 25 Feb 2010 21:52:14 -0800 (PST)
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <201002260020.25754.donovan@abs.net>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<201002260020.25754.donovan@abs.net>
Message-ID: <alpine.DEB.2.00.1002252148460.19226@mail.pglaf.org>


What these numbers and comments do NOT reflect is that even after
only a portion of processing if we make these available they will
be in additional formats and/or improved quality. . .not to leave
out that some people may find them here rather than not at all.

It's not as if this is some kind of secret process we must hide--

mh


On Fri, 26 Feb 2010, D Garcia wrote:

> DISCLAIMER: The following _data_ comes directly from DP site statistics. All
> opinions following are my own.
>
> Of the "8000" works "trapped at DP":
>
> (rounded to nearest hundreds)
> 4100 from TIA.
> 700 from Gallica/BNF.
> 1000 from Google.
> 400 from the next 5 most represented online sources.
> (6200 in total.)
>
> Those 6200+ works already are available to the public, at minimum in scanned
> pages form, and most of them with OCR available. The argument that these works
> are "trapped" is a red herring stemming from frustration over how long it now
> takes the DP process to produce a "finished" version of the text.
>
> On Thu, 25 Feb 2010, Michael Hart wrote:
> >We have always invited people to take our completed books and redo them
> >into their own editions, and hopefully resubmit them to redistribute.
> >
> >If we do this for books that are done, why not for those undone?
>
> PG is of course welcome to continue their status quo practice with respect to
> those completed texts. However, I know of nothing that entitles PG to take
> advantage of the efforts of the volunteers of a separate organization before
> the results of those efforts are freely and willingly offered to them.
>
> >Are we worried more about who gets the credit and getting books out?
>
> Considering that the proposed scheme essentially serves to inflate PG's number
> of texts "available," with little significant benefit to the public, and with a
> real risk of significant detriment to DP as an organization and to its
> individual volunteers, can you honestly expect a reasonable person to take
> your question seriously?
>
> There are (as is often pointed out on gutvol-d) hundreds of thousands of works
> in the various book scanning repositories, all "undone" as you would have
> people believe. If PG were truly interested in making large numbers of
> "undone" books available in "pre-print" then perhaps they should take
> advantage of their organizational clout and forge partnerships to have direct
> access to that material in all forms. But that would require effort.
>
> What seems to have happened instead is that PG has decided that the DP in-
> process text, even though unfinished, is desirable low-hanging fruit, and
> *that* requires only minimal effort. All that's required is to convince the
> 'right' people at DP to either A) expend limited resources towards that end
> instead of where they're needed, or, B) stand aside and allow PG to take
> unreasonable advantage of what have so far been amiable terms of relationship.
>
> Michael, I honestly respect your vision, but your ethic is sorely lacking at
> the moment.
>
> David (donovan)
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From Bowerbird at aol.com  Thu Feb 25 22:02:03 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Feb 2010 01:02:03 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <3ea02.41e90581.38b8bddb@aol.com>

david (donovan) said:
>    All that's required is to convince the 'right' people at DP to either 
>    A) expend limited resources towards that end
>             _instead_of_ where they're needed, or, 
>    B) stand aside and allow PG to take unreasonable advantage of 
>             what have _so_far_ been amiable terms of relationship.

emphasis added...

here's an obvious ploy to push things to the ultimatum stage...

behold the biggest bottleneck at distributed proofreaders: donovan.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/b9899fdc/attachment.html>

From hart at pglaf.org  Thu Feb 25 22:22:28 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 25 Feb 2010 22:22:28 -0800 (PST)
Subject: [gutvol-d] Re: opinions on pre-release
In-Reply-To: <a0bf3e961002251956g41990be7gcc475a36c31d69e1@mail.gmail.com>
References: <a0bf3e961002251956g41990be7gcc475a36c31d69e1@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002252220210.19226@mail.pglaf.org>

On Thu, 25 Feb 2010, Greg M. Johnson wrote:

>
> > Are we more worried about who
> > gets credit or getting books out?
>
> I don't mind seeing "produced by" in the intro to a book, but it is a bit odd to see people
> getting editor and author citations in the PG database.? o__O

Some of our contributors of materials don't like that either.


> The workers at Googlebooks afaics have made completely anonymous contributions-- their
> names aren't in the epubs.? But I think we are all at a loss for not knowing the name of
> that lady whose lovely hand is preserved forever in so many photocopiess in their
> collection.

I rarely put my name on the books I work on, other than "Anonymous,"
unless I prepare the entire thing from end to end, and even then I'm
still sometimes an "Anonymous Project Gutenberg Volunteer."

It's kinda fun like that.


mh

From hart at pglaf.org  Thu Feb 25 22:34:14 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Thu, 25 Feb 2010 22:34:14 -0800 (PST)
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
 e-texts
In-Reply-To: <alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
Message-ID: <alpine.DEB.2.00.1002252231450.19226@mail.pglaf.org>

On Thu, 25 Feb 2010, Greg Weeks wrote:

> On Thu, 25 Feb 2010, Michael S. Hart wrote:
>
> > Are we worried more about who gets the credit and getting books out?
>
> I think we're worried about the fact that the only version available is one
> that you have to BUY that's based on our volunteer labor.

We've always let people sell our eBooks. . .period.

We just don't let them use our name.

I think it's more important to get the books out,
however they get out, than anything else. . . .

mh

From marcello at perathoner.de  Thu Feb 25 23:02:15 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri, 26 Feb 2010 08:02:15 +0100
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease"
 of	e-texts
In-Reply-To: <627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>	<A871E93708224C1FAB0E62A643357D64@alp2400>	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>	<AA902F07008A4680BD1E6D65699E82CD@alp2400>	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>	<m2ljeg1y9a.fsf@gnu.franken.de>	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>
Message-ID: <4B8771F7.9070501@perathoner.de>

don kretz wrote:

> It's not uncommon among programmers, either, I've found, apropo of nothing,
> present company excepted, etc.

The Free Software community says: "Release early, release often."

Its the proprietary software producers that let you wait forever and 
then release crap.

Now, which of these is DP?


-- 
Marcello Perathoner
webmaster at gutenberg.org

From sankarrukku at gmail.com  Fri Feb 26 01:21:02 2010
From: sankarrukku at gmail.com (Sankar Viswanathan)
Date: Fri, 26 Feb 2010 14:51:02 +0530
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <4B8771F7.9070501@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>
	<4B8771F7.9070501@perathoner.de>
Message-ID: <e45c9fe71002260121u1975164as3f9740aa5f085396@mail.gmail.com>

As soon as a book is released in PG, a number of other sites copy this, make
their own formats and release it.

Now when PG pre-releases a book this process will repeated. The other sites
will make their own format and release the book.

When the final corrected book is posted to PG, very few sites would update
their version. Our experience in correcting the earlier texts bears this
out. Even when an updated and corrected version is posted, not many sites
update their version. The old error ridden version continues.
Also the older version comes up when you google for the book.

For example Sense and Sensibility by Jane Austen is EText#161. This was
corrected and a new illustrated version posted in 2007 (#21839). But when
you google you get only Etext#161. Fortunately this has also been updated in
January, 2009.

Having done the illustrated version Solo, I know.

So in effect PG will be releasing a number of error-ridden books.

Again the side effects of this could be

*Many active volunteers in D.P  may lose interest. These are the proofers
and formatters who do not get their name in the credit line. A drop in the
active volunteers of DP is definitely not in the best interests of PG.*
*
*
*IMHO, it is for DP to decide whether the pre-release version should be
released taking into account the sensitivities of the volunteers. Forum
discussions in D.P can not be taken as the representative opinion of D.P
volunteers.
*
*
*
D.P does not release crap unlike some of the software producers (including
open source) who release software full of bugs and what not.*
*

Sankar Viswanathan


On Fri, Feb 26, 2010 at 12:32 PM, Marcello Perathoner <
marcello at perathoner.de> wrote:

> don kretz wrote:
>
>  It's not uncommon among programmers, either, I've found, apropo of
>> nothing,
>> present company excepted, etc.
>>
>
> The Free Software community says: "Release early, release often."
>
> Its the proprietary software producers that let you wait forever and then
> release crap.
>
> Now, which of these is DP?
>
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>


-- 
Sankar

Service to Humanity is Service to God
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/2762f42d/attachment-0001.html>

From hart at pglaf.org  Fri Feb 26 02:47:36 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Fri, 26 Feb 2010 02:47:36 -0800 (PST)
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
 e-texts
In-Reply-To: <e45c9fe71002260121u1975164as3f9740aa5f085396@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>
	<4B8771F7.9070501@perathoner.de>
	<e45c9fe71002260121u1975164as3f9740aa5f085396@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1002260229340.26749@mail.pglaf.org>


I don't know how you use Google, but your #21839 appears for me:

Sense and Sensibility by Jane Austen - Project Gutenberg
Jun 15, 2007 ... Download the free ebook: Sense and Sensibility by Jane
Austen.    www.gutenberg.org/etext/21839 - Cached - Similar

The Project Gutenberg eBook of Sense & Sensibility, by Jane Austen
End of the Project Gutenberg EBook of Sense and Sensibility, by Jane Austen
...    www.gutenberg.org/files/21839/21839-h/21839-h.htm

The Project Gutenberg EBook of Sense and Sensibility, by Jane ...
Jan 18, 2009 ... Project Gutenberg is a registered trademark, and may not be
...    www.gutenberg.org/files/21839/21839-8.txt


Interestingly enough, our audio book comes up very high, as well.


Now, in all fairness, I will be the first to admit that the order in which
the various editions appears fluctuates from time to time as Google change
policies affect the results.  [See Microsoft suing Google in Europe, etc.]

Today, for example, your edition shows up around #30 in all the results of
"project gutenberg" "sense and sensibility by jane austen" which I did cut
and paste from the exact words in your message.

I have asked in the past for volunteers who know how to make higher ranked
hits for us, but all in all, those hits all refer to us, though, as said:
some of the hits, even the first one for your edition, comes from outside,
not directly from Project Gutenberg.

All you have to do to make YOUR edition move up the charts is all the same
things millions of other sites do, and I have no objection to you trying a
manipulation of this kind.

If it works out, perhaps we should modify our headers and footers to quite
literally pull ourselves up by our bootstraps.

As for those who never, or rarely, update their Project Gutenberg eBooks--
you can say it makes us look better by comparison, and thus encourage more
people to come directly to our sites.

However, to keep doors closed that could be open is not the way to get the
most eBooks to the most people.


Michael


On Fri, 26 Feb 2010, Sankar Viswanathan wrote:

> As soon as a book is released in PG, a number of other sites copy this, make
> their own formats and release it.
>
> Now when PG pre-releases a book this process will repeated. The other sites
> will make their own format and release the book.
>
> When the final corrected book is posted to PG, very few sites would update
> their version.

> Our experience in correcting the earlier texts bears this out. Even when an
> updated and corrected version is posted, not many sites update their
> version. The old error ridden version continues. Also the older version
> comes up when you google for the book.
>
> For example Sense and Sensibility by Jane Austen is EText#161. This was
> corrected and a new illustrated version posted in 2007 (#21839). But when
> you google you get only Etext#161. Fortunately this has also been updated in
> January, 2009.
>
> Having done the illustrated version Solo, I know.
>
> So in effect PG will be releasing a number of error-ridden books.
>
> Again the side effects of this could be
>
> Many active volunteers in D.P ?may lose interest. These are the proofers and
> formatters who do not get their name in the credit line. A drop in the
> active volunteers of DP is definitely not in the best interests of PG.
>
> IMHO, it is for DP to decide whether the pre-release version should be
> released taking into account the sensitivities of the volunteers. Forum
> discussions in D.P can not be taken as the representative opinion of D.P
> volunteers.
>
> D.P does not release crap unlike some of the software producers (including
> open source) who release software full of bugs and what not.
>
> Sankar Viswanathan
>
>
>
>
>
> On Fri, Feb 26, 2010 at 12:32 PM, Marcello Perathoner
> <marcello at perathoner.de> wrote:
>       don kretz wrote:
>
>             It's not uncommon among programmers, either, I've found, apropo of
>             nothing,
>             present company excepted, etc.
>
>
> The Free Software community says: "Release early, release often."
>
> Its the proprietary software producers that let you wait forever and then release
> crap.
>
> Now, which of these is DP?
>
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
>
>
> --
> Sankar
>
> Service to Humanity is Service to God
>
>

From dakretz at gmail.com  Fri Feb 26 02:48:00 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 26 Feb 2010 02:48:00 -0800
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
	e-texts
In-Reply-To: <e45c9fe71002260121u1975164as3f9740aa5f085396@mail.gmail.com>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>
	<4B8771F7.9070501@perathoner.de>
	<e45c9fe71002260121u1975164as3f9740aa5f085396@mail.gmail.com>
Message-ID: <627d59b81002260248u4457ee52y80efb2927b9c0f23@mail.gmail.com>

It's been true in the past that this point of view has held the strongest
position
in shaping DP's policy and practices. If PG wants to publish "early and
often",
I think it's unlikely DP will be the partner of choice, unless there's
someone
willing to offer a stronger voice than Mr. Garcia and Mr. Viswanathan.
There's
been no evidence that such a voice exists. If that's really PG's goal, it
probably
should find another partner. Unfortunately.

On Fri, Feb 26, 2010 at 1:21 AM, Sankar Viswanathan
<sankarrukku at gmail.com>wrote:

> As soon as a book is released in PG, a number of other sites copy this,
> make their own formats and release it.
>
> Now when PG pre-releases a book this process will repeated. The other sites
> will make their own format and release the book.
>
> When the final corrected book is posted to PG, very few sites would update
> their version. Our experience in correcting the earlier texts bears this
> out. Even when an updated and corrected version is posted, not many sites
> update their version. The old error ridden version continues.
> Also the older version comes up when you google for the book.
>
> For example Sense and Sensibility by Jane Austen is EText#161. This was
> corrected and a new illustrated version posted in 2007 (#21839). But when
> you google you get only Etext#161. Fortunately this has also been updated in
> January, 2009.
>
> Having done the illustrated version Solo, I know.
>
> So in effect PG will be releasing a number of error-ridden books.
>
> Again the side effects of this could be
>
> *Many active volunteers in D.P  may lose interest. These are the proofers
> and formatters who do not get their name in the credit line. A drop in the
> active volunteers of DP is definitely not in the best interests of PG.*
> *
> *
> *IMHO, it is for DP to decide whether the pre-release version should be
> released taking into account the sensitivities of the volunteers. Forum
> discussions in D.P can not be taken as the representative opinion of D.P
> volunteers.
> *
> *
> *
> D.P does not release crap unlike some of the software producers (including
> open source) who release software full of bugs and what not.*
> *
>
> Sankar Viswanathan
>
>
>
>
>
> On Fri, Feb 26, 2010 at 12:32 PM, Marcello Perathoner <
> marcello at perathoner.de> wrote:
>
>> don kretz wrote:
>>
>>  It's not uncommon among programmers, either, I've found, apropo of
>>> nothing,
>>> present company excepted, etc.
>>>
>>
>> The Free Software community says: "Release early, release often."
>>
>> Its the proprietary software producers that let you wait forever and then
>> release crap.
>>
>> Now, which of these is DP?
>>
>>
>> --
>> Marcello Perathoner
>> webmaster at gutenberg.org
>>
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d at lists.pglaf.org
>> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>>
>
>
>
> --
> Sankar
>
> Service to Humanity is Service to God
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/ba3a06b5/attachment.html>

From hart at pglaf.org  Fri Feb 26 03:07:19 2010
From: hart at pglaf.org (Michael S. Hart)
Date: Fri, 26 Feb 2010 03:07:19 -0800 (PST)
Subject: [gutvol-d] Re: [SPAM] Re: Re: the d.p. opinion on "prerelease" of
 e-texts
In-Reply-To: <4B8771F7.9070501@perathoner.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
	<627d59b81002251923m7737601fv871796be951a4fb8@mail.gmail.com>
	<627d59b81002251929p62ce2631we8c42642602dbc3d@mail.gmail.com>
	<4B8771F7.9070501@perathoner.de>
Message-ID: <alpine.DEB.2.00.1002260250570.26749@mail.pglaf.org>


On Fri, 26 Feb 2010, Marcello Perathoner wrote:

> The Free Software community says: "Release early, release often."

I was saying this long before they were.

If you recall, Alice in Wonderland, our breakthrough eBook,
appeared in 30 revised editions in just a few years leading
our readers to the conclusion that they could come back any
time and get revised versions of our eBooks.

No one complained out in the real world, but eventually the
insider complainers in PG decided there should be a final--
once and future--Alice in Wonderland.

There were plenty of errors to go around in the early days,
but it turned out that the biggest complainers were just an
assortment of eBook insiders, but that the public was a big
fan of both Project Gutenberg and of eBooks, and were happy
to send us error reports and get the new editions.

This whole idea/ideal of waiting, waiting, waiting for some
"perfect" edition we could release has thus caused problems
in the extreme that we would never have encountered if this
final edition business had never gotten started.

I don't know how many of you know PG history all that well,
but the first editions of all of our early works had labels
like "Alice in Wonderland 0.1" to "Alice in Wonderland 0.9"
before they were ever "officially" released, simply because
we all KNEW there would be errors to correct.

I never believed in trying to START with "perfect" eBooks--
I just figured they would perfect themselves in growing up,
through the natural process of our reader sending errors.

Now we want to pretend there ARE no errors, even to points
where our bigmouth says "perfect" in referring to this.

This pretense is causing us HUGE problems and denying book
access to thousands of titles we could release as "0.x."

By the way, as for the count being 2,000 in excess, do not
forget the 2008, or so, currently in "PrePrints."

Counting those it is a little over 8,000.


> Its the proprietary software producers that let you wait forever and then
> release crap.
>
> Now, which of these is DP?


I would like to see PG & DP be a little less proprietary,
a little less about who gets how much credit and a little
more about getting the books out there ASAP and then work
them up to a never ending Xeno's progress to perfection.


Please. . . .


I hope to be thanking you for your consideration,


I'd like to get our CEO, Greg Newby, started testing this
out in the near future, and we'll see how it works.

Until we actually try it, all this is just conjecture....


Michael S. Hart
Founder
Project Gutenberg

From greg at durendal.org  Fri Feb 26 05:10:49 2010
From: greg at durendal.org (Greg Weeks)
Date: Fri, 26 Feb 2010 08:10:49 -0500 (EST)
Subject: [gutvol-d] [SPAM] Re: Re: [SPAM] Re: Re: the d.p. opinion on
 "prerelease" of e-texts
In-Reply-To: <m2ljeg1y9a.fsf@gnu.franken.de>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<SNT120-DS20D6F243141087CD7545BCAE400@phx.gbl>
	<alpine.DEB.2.00.1002251551070.6859@mail.pglaf.org>
	<alpine.DEB.2.00.1002251934400.14271@durendal.durendal.org>
	<m2ljeg1y9a.fsf@gnu.franken.de>
Message-ID: <alpine.DEB.2.00.1002260809480.17831@durendal.durendal.org>

On Fri, 26 Feb 2010, Karl Eichwalder wrote:

> Greg Weeks <greg at durendal.org> writes:
>
>> I think we're worried about the fact that the only version available
>> is one that you have to BUY that's based on our volunteer labor.
>
> If that's at least an option, why not?  Nobody forces you to buy it,
> though.

Because it's the ONLY way it's available. I should have to PAY someone 
else to see the work I did. It's incredibly irritating.

-- 
Greg Weeks
http://durendal.org:8080/greg/


From gbnewby at pglaf.org  Fri Feb 26 07:43:42 2010
From: gbnewby at pglaf.org (Greg Newby)
Date: Fri, 26 Feb 2010 07:43:42 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <AA902F07008A4680BD1E6D65699E82CD@alp2400>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
Message-ID: <20100226154342.GC4301@pglaf.org>

Apologies for not chiming in as much as I'd like on this debate.  There
have been many excellent comments.  This whole discussion is
"pre-release," not just about pre-releases.  I don't think we have
enough information yet to decide how, whether, or for which items it
would make sense to have pre-release DP items more widely available.

There are various issues involved (technical, moral, conceptual,
practical...).  Some sort of proof-of-concept is what I'd like to work
toward, to try to see what we're really talking about, and whether many
(often conflicting) goals could be met.  More:

On Thu, Feb 25, 2010 at 10:17:19AM -0800, Al Haines (shaw) wrote:
> Some of these questions may overlap a bit--bear with me...
> 
> 
> Who's going to monitor this pre-prelease page for when projects on
> it are posted?  The WWers?  DP? Greg?

Needs to be cron (the Linux/Unix automated task scheduler).  If it's not
automated, it's not going to achieve anyone's goals.  I realize this
applies to *removal* not just harvesting.

> Will it contain only pre-release text files, or all working files
> associated with a given project (page scans, illustrations, text,
> HTML, etc, etc)?

Unknown, which is why I'd like to see an experiment or two before
figuring out how, whether, when, etc.

I can also envision some sort of "by permission" from the PM
in charge of a given project.  There are several phases in the
DP workflow.   I was (always) just thinking about the items that
are stuck (i.e., delayed enough to, statistically, be thought
of as stuck), but have significant value added and a reasonable
level of completion.

Somewhat related: page scans have been welcome as part of eBooks for
years, but are seldom delivered by DP with a new eBook.  (There are a
few people who add them separately, later.)  Maybe efforts towards
getting pre-release items could also be helpful with adding page scans.

> If the latter, what's to stop someone from taking those files,
> getting their own clearance, and submitting them to PG as their own
> work?  Or is DP going to consider that they're, in effect, abandoned
> projects, and up for grabs? (I can only imagine the reaction that
> would cause.)

I'm not sure how likely that is, but I would discourage it and attempt
to make sure further efforts on an item go back to DP & credit DP.  We
do reasonably well at spotting duplicate copyright clearances, and could
have some README-type info about the "proper" way to take pre-release
items and get them completed.

My experience is that volunteers tend to honor such requests.  After
all, there are plenty of items to work on, and the DP front door
is open to those with interests in particular items.

> Related to a couple of the above questions, would the WWers be
> expected to check to see if a given submission is one that's also in
> progress at DP, or would it be a case of first-come, first-posted,
> and let DP take its lumps?

No, such checks need to be automated.  It won't be perfect, but it's
doable, well enough to raise a flag at submission time that
an item might be a harvested DP item.

In our harvesting how-to at www.gutenberg.org, we talk about
asking permission, and about honoring requests to not add items
to the collection -- even when they are clearly public domain.
I don't favor allowing back-dooring of DP in-progress items
by non-DP sources.  

> Rather than dumping who knows how many pre-releases into Preprints,
> I'd suggest a separate Prerelease page.  (Speaking personally, I
> regularly check Preprints for interesting/doable projects, and have
> drawn a number of projects from it.  I doubt I'd be interested in
> looking through a raft of ex-DP items, searching for non-DP Preprint
> items.)

You're solo efforts are amazing and appreciated, but unusual.  I
don't think the fact that a number (even thousands) of items appear
in reprints will result in a lot of separate solo submissions.  As
others have mentioned, there are any number of sources of items
available that motivated individuals could select from.  I think
that most such people will honor requests to keep prerelease items
with DP (including, of course, a link to sign up and get to work
on the DP workflow).

  -- Greg

From dakretz at gmail.com  Fri Feb 26 08:11:28 2010
From: dakretz at gmail.com (don kretz)
Date: Fri, 26 Feb 2010 08:11:28 -0800
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <20100226154342.GC4301@pglaf.org>
References: <20100224123715.0dedd0f3f91314fbc67db20f64e304ca.94c625de9a.wbe@email05.secureserver.net>
	<4B8582E5.5050103@novomail.net>
	<alpine.DEB.2.00.1002241317320.7483@mail.pglaf.org>
	<B592E6447AAB45C891E108507D6811FA@alp2400>
	<alpine.DEB.2.00.1002241749590.18059@mail.pglaf.org>
	<A871E93708224C1FAB0E62A643357D64@alp2400>
	<alpine.DEB.2.00.1002242007430.23593@mail.pglaf.org>
	<AA902F07008A4680BD1E6D65699E82CD@alp2400>
	<20100226154342.GC4301@pglaf.org>
Message-ID: <627d59b81002260811iec2c116u2838390c98abf4d5@mail.gmail.com>

Since I seem to be conveying messages, Someone With Authority has
finally spoken, at DP, and the word is pretty clear. This is from Louise
Davies, the General Manager:

There are many arguments for and against the idea of making
not-quite-finished texts available sooner than they would be otherwise.

I will not list all the pros and cons here, as it would only be repeating
most of the points already made. Here are my thoughts on it.

1. Once the preprints have been posted, anyone--even non-DP members--can
pick those up, apply for a copyright clearance and post them themselves.
Those clearances are not, and never have been, a reservation for the
clearance holder. To the best of my knowledge, PG has not offered to make
them exclusive, either. Whoever posts first, wins. We have deleted many a
project because someone else has beaten us to the posting. That is one of
the worst ways that our resources can be wasted and our morale shot down.

I deleted a three-volume set of projects last week, because DP-EU posted two
of them to PG first and the third one is in R2. (And yes, the two persons
listed in the DP-EU credits are also members here, so I did not see the
point in comparing quality. I only verified that they were the same edition,
and they were even from the same source as well.) It cuts me deeply every
time I have to do that. So, do we really want to put our texts out there in
preprint-land, which would encourage more of this?

2. Transferring text files en masse to an off-site preprints area would be
counter to the current site policy. Policy changes are the dominion of the
DP Board. These discussions, both here and on gutvol-d, have been brought to
their attention. So unless we hear differently, I would say it is business
as usual. (And please, please, please, do not construe the opinion of an
individual who also happens to be a Board member as being that of the entire
Board. Unless there is an official stamped and notarized announcement, it
remains an individual opinion.)

3. While it is possible for anyone with a little ingenuity to harvest our
text files in bulk, this would also not be a good thing to encourage. It
would add a certain (though possibly negligible) load on our servers. It
would upset a good number of PMs and PPers ('violated' is the word that
comes to mind). And, it would have the potential of compromising our
members' privacy which we maintain so carefully through the application of
our Privacy Policy <http://www.pgdp.net/c/faq/privacy.php>.

4. If a PM decides to make their text files available in the preprints area,
and later--while a PPer is polishing it up--someone else grabs the preprint
and posts it to PG, I think the PPer would be a might upset. At the very
least the PM might place a warning for the PPer that it has been uploaded to
preprints and it could possibly be posted by someone else.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/24346cd2/attachment.html>

From Bowerbird at aol.com  Fri Feb 26 10:25:41 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Feb 2010 13:25:41 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <1b4ab.445cf8dd.38b96c25@aol.com>

michael said:
>    I just figured they would perfect themselves in growing up,

how?   by _magic_?


>    I just figured they would perfect themselves in growing up,
>    through the natural process of our reader sending errors.

but you do nothing to encourage the process of reporting errors.

and it often takes _years_ for reported errors to be corrected...

(and i have _never_ seen a person credited for error-reporting.)


>    Now we want to pretend there ARE no errors, even to points
>    where our bigmouth says "perfect" in referring to this.

well, my big mouth _often_ says "perfect", because that is
what i am aiming at, and working toward, and _attaining_,
in some measure, by using methods that find and fix errors.

where is the equivalent push by project gutenberg?

the whitewashers announce "corrected editions" which they
_know_ (or _should_ know) are woefully lacking in accuracy,
because all they did was apply the corrections readers sent,
instead of aggressively using known methods to find errors.

nor are the whitewashers (and p.g. more generally) the sole
problem in this equation...

the d.p. people _love_ to go on and on about their "quality",
but they have shown _absolutely_no_ interest in going back
and fixing their own early crappy books.   and let's not even
talk about going back and fixing the early crappy p.g. books.
so the d.p. people are full of horse shit on "quality" as well.


>    This pretense is causing us HUGE problems and denying 
>    book access to thousands of titles we could release as "0.x."

i haven't taken a stand on this issue because, as i already said,
i _want_ d.p. to choke on their backlog.   as a big reminder that
something is very, very much wrong in their current workflow.

plus there is some merit to the argument that most of these
books are already available to the public in one form or other.

but i wouldn't be opposed to releasing the _semi-improved_
versions that've come out of the various rounds over at d.p.
(although i think some of us are kidding ourselves greatly about
exactly how much that text has been "improved", i.e., very little.
the typos have been removed, yes.   but i could do the same job,
for the most part, with a simple spellcheck, so that's no big deal.)

and i think it would be great if these in-progress works were
being released so people "on the outside" could improve them,
but i doubt p.g. or d.p. could build the proper infrastructure
for the general public to _do_ that, so i think that's nonsense.

i believe everyone has good motivations in this situation.
(well, except donovan, and you better watch out for him.)

unfortunately, good motivations can't prevent an implosion.


>    I would like to see PG & DP be a little less proprietary,
>    a little less about who gets how much credit and a little
>    more about getting the books out there ASAP and then work
>    them up to a never ending Xeno's progress to perfection.

couldn't avoid using the "p" word yourself, could you, michael?       ;+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/41fe8a5f/attachment.html>

From donovan at abs.net  Fri Feb 26 10:45:34 2010
From: donovan at abs.net (D Garcia)
Date: Fri, 26 Feb 2010 13:45:34 -0500
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
In-Reply-To: <3ea02.41e90581.38b8bddb@aol.com>
References: <3ea02.41e90581.38b8bddb@aol.com>
Message-ID: <201002261345.34791.donovan@abs.net>

bowerbird abandoned logic to say:
>david (donovan) said:
>>    All that's required is to convince the 'right' people at DP to either
>>    A) expend limited resources towards that end
>>             _instead_of_ where they're needed, or,
>>    B) stand aside and allow PG to take unreasonable advantage of
>>             what have _so_far_ been amiable terms of relationship.
>
>emphasis added...
>here's an obvious ploy to push things to the ultimatum stage...
>behold the biggest bottleneck at distributed proofreaders: donovan.
>
>-bowerbird

I can't envision any set of circumstances that would lead to an 'ultimatum 
stage' between DP and PG as even being possible, so your ploy to cast it as 
the prelude to one is especially amusing since you're using hyperbole against 
my own use of it. I stand by my characterization of PG's pre-press scheme as 
both unnecessary and a disturbing shift in attitude towards DP.

As for your other remark, the facts and evidence indicate otherwise.

David (donovan)

From ricardofdiogo at gmail.com  Fri Feb 26 11:25:35 2010
From: ricardofdiogo at gmail.com (Ricardo F Diogo)
Date: Fri, 26 Feb 2010 19:25:35 +0000
Subject: [gutvol-d] Re: Our 400th Portuguese eBook
In-Reply-To: <alpine.DEB.2.00.1002240416410.14865@mail.pglaf.org>
References: <alpine.DEB.2.00.1002240416410.14865@mail.pglaf.org>
Message-ID: <9c6138c51002261125s56573betdfb647cb5dff0a30@mail.gmail.com>

Yes, please. "Os Maias", by E?a de Queir?s.

Ricardo F. Diogo

From Bowerbird at aol.com  Fri Feb 26 11:28:47 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Feb 2010 14:28:47 EST
Subject: [gutvol-d] Re: the d.p. opinion on "prerelease" of e-texts
Message-ID: <1ff38.39475b94.38b97aef@aol.com>

donovan said:
>   I stand by my characterization of PG's pre-press scheme as
>   both unnecessary and a disturbing shift in attitude towards DP.

i don't have a dog in this fight.   but i'm glad it's all "amiable" and 
such.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/58f36d5b/attachment.html>

From Bowerbird at aol.com  Fri Feb 26 11:36:38 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Feb 2010 14:36:38 EST
Subject: [gutvol-d] roundlessness -- 007
Message-ID: <20709.60b214b3.38b97cc6@aol.com>

ok now, remember how i said that preprocessing is a _simple_ thing,
typically with no need for a large set of checks that need to be done?

on the current chunk of text i'm analyzing from rfrank's experiment,
it looks like a dozen simple checks would've found 75% of the errors,
right off the bat, with no need for a word-by-word proofing round...

save the time and energy and _focus_ of your proofers:   preprocess.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/9b64a769/attachment-0001.html>

From Bowerbird at aol.com  Fri Feb 26 11:41:00 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Feb 2010 14:41:00 EST
Subject: [gutvol-d] roundlessness -- 008
Message-ID: <20bb1.752cc690.38b97dcc@aol.com>

by the way, i've coded an offline program that hooks into rfrank's system.
if anyone would be interested in using such a beast, take it up with roger.

or, if you just want to use it to scrape the books on his site, talk to 
me...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/c6bb40b8/attachment.html>

From cmiske at ashzfall.com  Fri Feb 26 13:53:37 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Fri, 26 Feb 2010 14:53:37 -0700
Subject: [gutvol-d] Processing eTexts (was RE: Re: d.p.'s undeserved
 superior attitude about "quality")
Message-ID: <20100226145337.0dedd0f3f91314fbc67db20f64e304ca.cf1ebe2825.wbe@email05.secureserver.net>

Bowerbird said:
>carel said:
>>   I've read your posts for many years
>
>so _you're_ the one who's been reading them!           ;+)

I'm sure many people read them. :)

>>   I'm not sure the most efficient system on earth 
>>   could keep up with the scan rate.
>
>sure we can.  just have 10 million people do 1 book each.
>
>piece'a'cake.

It sounds easy when you put it that way.... ;)

>>   And, if you speed things up by skipping the manual proofing
>>   and use automation completely for the first release then you 
>>   will have text little to no better than what google already has.
>
>oh, don't kid yourself.  google's text is _much_ better than
>what they're showing to us members of the general public.

Their OCR software requires training, so I agree with you that they
always have an at least one step better version on the backburner. They
probably have a stepped up version of everything we see....

>i haven't been impressed by the wiki approach, honestly...

The standard wiki layout is not a proper environment for editing. Better
would be something that allows the vast majority of readers to just read
the page and click a button to be presented with an interface for
making/submitting corrections if such is required. Direct comparison to
the scans is part of the formal proofing process and, although available
during the 'quest for perfection,' should not be a forced issue.

I think calling it a wiki makes it easier for quick comprehension rather
than explaining the whole process of storage, editing, etc. Rather than
working an editing interface into an out-of-the-box wiki system, it
would be better to write a proprietary system with some of the social
networking features of a wiki. That part would take a very firm second
place to just getting some tools out there for people to produce decent
quality texts in a minimal amount of time. Because of that fact, I
haven't really given much thought to the 'wiki' interface as it will
evolve by itself from wants/needs.

[snip]

in re: google:
>people on this listserve have been cut off for scraping...
>you can still do it, of course, but you have to be careful.

Has anyone ever just asked google if they could use the scans and OCR
text for PG? In regards to the scans, I believe they just require you to
leave the branding.

>i would bother to nitpick certain parts of it, like the x.m.l.,
>but really, my objection is at the more fundamental level.

XML makes a nice standard for storage and conversion and can be human
readable. I'm not one of those people who believes that XML is the
answer to everything, but I do think it works well for this use-case
scenerio. Opinions on this will vary, etc.
 
>yes, i believe every page of every book should be online,
>on its own webpage with unchanging u.r.l., text and scan,
>with an error-report form that the general public can use
>to detail any problems with the page, or ask questions, or
>even make annotations and have dialog about that page...

Agreed.

>but i also believe that the text for the vast majority of these
>pages should be _perfect_ at the time that text is mounted.

We are in agreement on this point as well. The majority of the work
towards corrections should be done via automation and human interaction
with a software process so as to reduce the error level to a minimm
before it is put in the hands of a proofreader (or the general public).

>and i've demonstrated, repeatedly, that this is fully possible.
It is very possible.

[snip]
>so here's how i see things going...

>when you find a scanset online of a book you want to digitize,
>you'll download it, along with the o.c.r.  (if it comes from a site
>that won't give you the o.c.r., you can upload it to archive.org,
>and they'll o.c.r. it for you.  providing you have no o.c.r. app.)

>you'll use a handy-dandy tool to digitize it in an hour or two,
>and then you'll make the results available to everyone online.

My only protest is that I believe a human proofing stage is still a good
choice. For one thing, it would be a way of catching errors that could
be induced by the human assisted automation process. Otherwise, I would
say to have two people run the same correction process, diff the results
and then present the mis-matches to a third set of eyes. If that doesn't
result in a book that is ready for release, I'm not sure what would.

Ready for release != perfect. :)

On a point in favor of the proofing round that really has nothing to do
with the potential for automation, the existance of a proofing round
creates a sense of belonging to the project and could assist in
recruiting people to try the 'harder' process of running the
'automation' scripts and submitting books, etc. People need a sense that
they are not just assisting tools to create book editions, but are a
fundamental part of the process. If you try to proof after a release has
been done, there will be little interests in it because you lose the
motivation to 'get the book out there."

>and carel, since you're interested in programming such tools,
>we'll have plenty to talk about...

Yes. :)
It's very much a 'will do' at this point. I just have to survive my
mid-terms. I look forward to your feedback and your input on the
project.

>well, like i told gardner just the other day, i really don't
>_have_ any collection of scripts.  i use a text-editor and
>start looking at the book.  i see an error, and devise a
>global search-and-replace to deal with that type of error.
>when they're all fixed, i look again to find the next error.

That brings up the point that pretty much all of what my scripts would
be doing could be done in word processing software....

>i think of this process as "listening to the book", in that
>the book itself will tell you what kinds of errors it has...

I like that. I'm not sure I'm ready to program an AI engine capable of
"listening to the book." Which is why I do not believe in leaving people
out of the process, but rather giving them tools to assist them with the
process. And, the tools can 'learn' from the people who use them.

I'll be busy finishing up a group project for my classes, so it'll be
late March before I get any real time to work on scripts for this
project. I'll work on some of the research, logic, and flow in the
meantime so that I am ready to roll when I do have the time.

Back to my programming....

Carel


From Bowerbird at aol.com  Fri Feb 26 15:37:23 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 26 Feb 2010 18:37:23 EST
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <2e38f.36883dde.38b9b533@aol.com>

carel said:
>    Direct comparison to the scans is part of 
>    the formal proofing process and, 
>    although available during the 'quest for perfection,' 
>    should not be a forced issue.

i disagree.

it's largely a matter of "framing" the issue and its perception.

i don't think we should ask the user to help with "proofing" per se.

i don't want them to think of this as "the formal proofing process".

i believe we need to put the books in front of them _to_read_...

_that_ should be our understanding with them, how we "frame" it.

they're there because _they_ want to _read_ this particular book.

now, as part of the interface by which we present the book to them,
we give them both the digital text _and_ the page-scan, together...

most interfaces make the user _choose_ one of the other, but
i see no reason (especially no _benefit_) in doing it that way...

i think it's wise to give them _both_, and have them understand that
there may have been some transcription errors during digitization,
so we're giving them the page-scan too, for any cases they _suspect_
might be errors.   in most cases, they'll see that we matched the scan.
or they'll understand why we made the change that we did, if we did.

_or_ -- in some rare cases, in which we request their assistance --
they'll find an actual transcription error that we made, and _tell_ us.

this is the thing that project gutenberg doesn't do well.   it doesn't
_invite_ the reader to become a part of the transcription process,
to become one of the instruments that marches texts to perfection.

indeed, you can _feel_ deep in your gut the _contempt_and_scorn_
that jim tinsley and al haines have for the stupid people who _do_
report errors, because _half_ of those "errors" are not, in fact, errors.

well, gee, big surprise.   you haven't given them the page-scans, so
how in the world are they gonna know if an apparent error _is_ real?


>    I think calling it a wiki makes it easier for quick comprehension 
>    rather than explaining the whole process of storage, editing, etc.

we should never _have_to_ "explain the whole process" to the public.
for them, it should be as easy as clicking a button to say "check this!"
they shouldn't even need to explain the error.   (we can find it, right?)


>    Rather than working an editing interface 
>    into an out-of-the-box wiki system

we don't need to make "an editing interface"...


>   Has anyone ever just asked google if they could 
>    use the scans and OCR text for PG? 

michael had a meeting with google when they were just starting out...

he reported that they treated him rather rudely, hustling them out of
their famous cafeteria before he had even finished his lunch.   it's sad,
but i don't think they really appreciate project gutenberg very much...


>   XML makes a nice standard for storage and conversion

it's obtuse, and unnecessary obstructive to the text itself...


>    My only protest is that I believe 
>    a human proofing stage is still a good choice.

if you mean a word-by-word reading of the page by a person
who has no real interest in reading the entire book, i disagree.

a comparison of different digitizations is a much better process,
in terms of finding errors, and it takes less work, and is more fun.

i can't think of one single reason for a word-by-word proofing,
let alone two or three rounds of it.   it's simply not efficient, and
anybody who explores the options can learn that for themselves.

between aggressive clean-up of o.c.r., and the comparison method,
and "smoothreading" by people from the general public, we're good.


>    Otherwise, I would say to have two people 
>    run the same correction process, diff the results and
>    then present the mis-matches to a third set of eyes.

close...   but it's unnecessary for "people" to do the digitizations.
two sets of cleaned-up o.c.r., diffed against each other, is enough...

(but yes, it's necessary to have a sentient human doing the clean-up.)


>    the existance of a proofing round
>    creates a sense of belonging to the project and 
>    could assist in recruiting people to try the 'harder' process of 
>    running the 'automation' scripts and submitting books

word-by-word proofing is a slow, boring, humdrum task.

o.c.r. clean-up is a fast-paced, active, and energetic task.
which does _not_ mean that it's "harder".   just more fun...

doing the former task won't help you at all with the latter.
(except perhaps make you _extremely_ grateful that you
don't have to do such a boring, miserable job ever again.)


>    People need a sense that they are not just 
>    assisting tools to create book editions, 
>    but are a fundamental part of the process.

is the hammer more important than the carpenter?

i doubt it, whether you ask the hammer _or_ the carpenter...

i get the impression you think these clean-up tools are just
a bunch of scripts that people turn on, and the results fall out.
but that's not anything like what actually goes on.


>    I'm not sure I'm ready to program an AI engine 
>    capable of "listening to the book." 

the software can't listen to the book.   _you_ have to do that.


>    Which is why I do not believe in leaving people out of the process

again, you have a profound misunderstanding of the process.

the hammer can't do a thing without the active involvement of
the carpenter.   even then, the carpenter has to know what s/he
is doing, or the hammer won't be able to do one bit of good...

the best thing you can do, carel, is to play around with twister...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100226/39a09830/attachment.html>

From cmiske at ashzfall.com  Fri Feb 26 17:28:14 2010
From: cmiske at ashzfall.com (cmiske at ashzfall.com)
Date: Fri, 26 Feb 2010 18:28:14 -0700
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <20100226182814.0dedd0f3f91314fbc67db20f64e304ca.346916d325.wbe@email05.secureserver.net>

 Bowerbird said:
>i don't want them to think of this as "the formal proofing process".

>i believe we need to put the books in front of them _to_read_...

Which is exactly why a book that is ready for release to PG should not
force upon an individual the original scan. When the reader spots
something they think is an error, they should be able to call up the
scan to see if it is an actual error before reporting it as such, etc.
Otherwise their reading is distracted by the scan. A proper interface
would allow users to select what they want to see. An interface should
be about options.

In any case, I have no desire to host an archive that duplicates the
content of PG and others. The purpose in coming to my project site would
be to prepare books for release to PG and to polish released books so
that they can be read for leisure and pleasure off of _my_ bandwidth.

>>   I think calling it a wiki makes it easier for quick comprehension 
>>   rather than explaining the whole process of storage, editing, etc.
>
>we should never _have_to_ "explain the whole process" to the public.
>for them, it should be as easy as clicking a button to say "check this!"
>they shouldn't even need to explain the error.  (we can find it, right?)

Of course, but it is hard to develop it and request assistance with the
development of it without explaining the basic concepts, etc. A project
must be explained and defined or it cannot be realized.

>>   Rather than working an editing interface 
>>   into an out-of-the-box wiki system

>we don't need to make "an editing interface"...

It's rather hard for people to make/submit corrections without having an
interface for doing so.

>>   Has anyone ever just asked google if they could 
>>   use the scans and OCR text for PG? 
>
>michael had a meeting with google when they were just starting out...
>
>he reported that they treated him rather rudely, hustling them out of
>their famous cafeteria before he had even finished his lunch.  it's sad,
>but i don't think they really appreciate project gutenberg very much...

That is too bad.

>>   XML makes a nice standard for storage and conversion
>
>it's obtuse, and unnecessary obstructive to the text itself...

It's also one of many methods for the storage of 'intelligent' data and
is easy to do. Once put in place, it would never be seen again by anyone
except someone correcting the markup (which one hopes would be rare).
InDesign uses markup, MS Word uses markup, and on and on. I don't think
the users of this software find the markup obstructive to the text
itself.

>>   My only protest is that I believe 
>>   a human proofing stage is still a good choice.
[snip]
>i can't think of one single reason for a word-by-word proofing,
>let alone two or three rounds of it.  it's simply not efficient, and
>anybody who explores the options can learn that for themselves.

I don't believe I said anything about multiple rounds of proofing, etc.
I believe I discussed the main processing of a document being done
through human interaction with scripts and then people actually looking
at the results of the process before the output is released (proofing).

I feel that a human looking at a smaller subset of a large document is a
good thing in the error finding process. You apparently do not think it
is. Neither of us is right or wrong: It is a matter of perspective and
opinion.

>between aggressive clean-up of o.c.r., and the comparison method,
>and "smoothreading" by people from the general public, we're good.

Yes and no. It depends on your definition of 'good.'

>>   Otherwise, I would say to have two people 
>>   run the same correction process, diff the results and
>>   then present the mis-matches to a third set of eyes.

>close...  but it's unnecessary for "people" to do the digitizations.
>two sets of cleaned-up o.c.r., diffed against each other, is enough...

That depends on a lot of factors including the assumption that two OCR
programs would not make the same mistake and that, if they did, it would
be a mistake that could be caught by another process. Etc.

>(but yes, it's necessary to have a sentient human doing the clean-up.)
Exactly. And, if that human makes a major error, the entire text in now
a mess. If two people do the same process it reduces, but does not
eliminate, the risk for 'the human error factor.'

>o.c.r. clean-up is a fast-paced, active, and energetic task.
>which does _not_ mean that it's "harder".  just more fun...

People actually do enjoy proofing. You do not. Again, neither is wrong
in their viewpoint. Volunteers should be involved in a manner that gives
them pleasure and satisfaction, etc. I say to give people tools and
options and let them be the judge of what is best for them.

>i get the impression you think these clean-up tools are just
>a bunch of scripts that people turn on, and the results fall out.
>but that's not anything like what actually goes on.

Since I am planning to write scripts for such an environment, one would
hope that I do not "think the results will just fall out." I think much
of this will be very difficult to program and that the software must be
capable of 'learning' and that it will require a human to interact with
it so that the job can be done well and the software upgraded to do the
job even better in future.

I actually wrote several scripts like this many years ago. They are
archived on my CDs, but my manifest was damaged in a fire, so I am not
sure where to even begin to retrieve them. It would be faster and easier
to just start from scratch than to dig up outdated software anyway, so I
do not regard it as a great loss.

>>   I'm not sure I'm ready to program an AI engine 
>>   capable of "listening to the book." 

>the software can't listen to the book.  _you_ have to do that.

I pretty much said that in what you snipped. :)


>>   Which is why I do not believe in leaving people out of the process

>again, you have a profound misunderstanding of the process.

I'll just laugh and pretend you didn't say that....

Perhaps I have a profound misunderstanding of _your_ proposed process,
but I have a very firm understanding of my own. There is no _the_
process for me to understand. In any case, if you would like to create
your own project, feel free. "The road is free to all." And Michael and
PG would be happy to recieve the texts submitted through any process. 

And I would be happy to receive constructive feedback on my efforts as
well from any source that cares to give it. I am just interested in
creating a method for people to complete initial editions of books
rapidly, to avoid putting features in place that 'waste' the time of
volunteers in doing tasks that could be done more rapidly and easily
through the use of software, and to have an environment that offers the
opportunity to correct any errors that were missed in the initial
release. Also, it should offer a sense of community for volunteers to
interact. The _process_ is not a complex one by any means.

As a human, I expect that my project process will not be perfect nor
will it agree with the opinions of all others on how it _should_ be
done, but it will be a process that produces books. So, it will be part
of the solution no matter how many problems it has. :)

Nor do I feel that the process at DP is flawed. They are doing the
process the way they have chosen to do it and it is a project that
acutally exists and is productive. They simply created a bottleneck in
their workflow that they need to deal with. I actually hope that some of
my tools are of assistance to those who use DP to process books. I am
about getting books into PG and care very little how one goes about
doing it so long as one _is_ going about doing it. :)

Carel


From Bowerbird at aol.com  Sat Feb 27 12:39:09 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 27 Feb 2010 15:39:09 EST
Subject: [gutvol-d] Re: Processing eTexts
Message-ID: <571a1.687ad2ef.38badced@aol.com>

well, it's kinda silly for us to talk about this in generalities, because
i'm capable of demonstrating what i mean specifically and exactly...

and when i'm done, i'm quite sure that you will agree completely.
(and you might say that's what you meant all along, for all i know.)

so let's get specific.

i've posted a file of text taken from rfrank's roundless experiment:
>    http://z-m-l.com/go/campf/campf-001.zml

as you can see, this book is called "pemrose lorry, camp fire girl"...

the book is now in-progress, so it's half-done and half-undone...
(it looks like it's been proofed up through about page 80 so far, but
since the pages go out multiple times, it's difficult to be certain of it,
but after page 80, it's fairly clear that the pages haven't been edited.)

that first version had the page-separators fixed, and other basic stuff.

the next version has more stuff cleaned up, but it's still rather basic:
>    http://z-m-l.com/go/campf/campf-002.zml

as you can easily see, if you look through either one of the versions,
starting after page 80, there are a bunch of paragraphing problems.

specifically, there are often blank lines inserted -- incorrectly --
between the lines of a single paragraph.   you can find instances of
this problem on pages __, __, __, and __.   it's a not uncommon glitch.

paragraphing is one of the first things i try to correct, because it's
necessary to have the paragraphs correct to fix any spacey quotes.

fortunately, it's rather easy to locate these bad paragraphs via search.
find a blank line followed by a line that starts with a lowercase letter.
(in other words, you search for two newlines followed by lowercase.)

now, one way of doing the search would be to automatically replace
_all_ of these occurrences by simply deleting one of the two newlines.
you could click "replace all" on that find-and-replace, and change all
of the occurrences without even looking at them.   you _could_ do that.
the proper changes would almost always outnumber improper ones.

whenever anyone talks about "taking the human out of the equation",
blind global changes is what immediately springs into my little brain.

and i am definitely _not_ a fan of blind global changes.

there are some changes you can make blindly, but they are few and
far between, and they certainly don't characterize the usual process.

and this change -- deleting those excess lines -- i don't do blind...

i will step through the changes one-by-one, and look at all of them
_against_the_scan_.   (even though, in most cases, i wouldn't need to
look at the scan, because it's pretty obvious how to fix the problem.)

sometimes i even delete the excess line manually, rather than merely
approve the find-and-replace, just to keep my grubby fingers busy...

but -- here's the very important part -- i do a lot _more_ than that.

at every one of these spots to which i'm automatically transported,
i look around the immediate problem area, to see if it spilled over.

and, if it did, then i will clean up that neighboring area right away.

so i'm walking a fine line here between doing a specialized search
and a general correction routine.   i start with the specialized check,
so i have a laser focus on the nature of what the check will show me,
so i don't have to waste mental energy figuring out what the glitch is.
but then i also use my peripheral vision to see what else needs fixing.

and this particular check in this particular text is a very good example.

because it ends up that a _lot_ of the surrounding areas had problems.

(that's not uncommon with these excess blank lines, because it's often
the other glitches that _caused_ the o.c.r. its paragraphing difficulties.)

there were a good number of problem-lines caught by this routine --
about 85 if i remember correctly -- but i probably corrected the text
on 85 surrounding lines as well, because i saw bugs, so i fixed them.

i encourage people to "play along at home" and _do_ this search on
this text, so you can really see how you can spot neighboring glitches.

what you will _not_ know, however, unless you are also looking at the
_scan_ for each page, is that there were often _entire_words_missing_
from the o.c.r.   they were cut off entirely from the left-hand margin.
this isn't the kind of thing for which you can do a search to find them.
they're just missing from the o.c.r., and you don't know they're gone.

however, since you are paying attention during this _other_ search,
you can become aware of these missing words, via peripheral vision.

if you _do_ "play along at home", and actually step through that check,
you will also see that there's yet another check that needs to be done
to catch _all_ of these paragraphing problems from excess blank lines.

this second check is the "flip side" of the first one.   the first one was
looking for supposedly-first-lines of paragraphs that were lowercase.
the second one checks the _termination_ of supposedly-last-lines...
so you're checking for a letter or a comma at the _end_ of a line which
is followed by a blank line.   if it were _really_ the end of a paragraph,
as is implied by that blank line, it should be punctuation-terminated.

so you're looking now for a letter/comma followed by two newlines...

this search turns up roughly 50 instances in this particular text-file.

and, once again, when you make these corrections, you look around.
roughly half of the 50 instances were cases where the blank line that
was deleted was followed itself by a line beginning with a single-dash,
which was a misrecognition for an _em-dash_ at the start of that line.
(the em-dash is represented by a double-dash.)   so those corrections
were made at the same time that the excess blank line was eliminated.

if you were a party-pooper, and didn't play along at home, i have
uploaded a copy of the file where the lines that were flagged by the
first page are indicated by an "@" sign, which you can search for:
>    http://z-m-l.com/go/campf/campf-003.zml

and here's one with the lines from the second check tagged with "%":
>    http://z-m-l.com/go/campf/campf-004.zml

you can also view the book page-by-page, as usual.   for instance:
>    http://z-m-l.com/go/campf/campfp123.html

(you will see, right there on page 123, an example of a page which
has some of the words cut off from the right-margin.   when i find
this type of problem in a scanset, it just makes me want to scream!
when you scan a book, you need to make sure you do it _carefully_.
sloppy work such as this is tremendously uncharacteristic of rfrank.)

at any rate...

it's this "looking around and fixing other errors in the neighborhood"
that makes this whole preprocessing thing such a _vibrant_ procedure.

...which brings me squarely back to the major point of this post...

***

so yeah, i'm doing a computerized check, and it's darn _efficient_, but
i'm surely not "taking humans out of the process".   not by a long shot.
i am a mentally alert human, who is _actively_ engaged with the text...
my cursor is _flying_ all around that document.   i move at warp speed.

and i am _efficient_ too.   i am _tremedously_ efficient.   i am rocking!
all the other carpenters are using hammers, and i have a _nail-gun_.

part of the efficiency is due to that _focus_ that i just talked about...
but most is simply because i am being directly transported to errors.
i don't waste any of my time _looking_ for them; they're _presented_.
i don't waste any time positioning my cursor -- it is _prepositioned_.
and the scan is right there, ready for me to look at it, immediately...

and it's _fun_.   downright _exhilarating_.   like driving a sports car.
(whereas word-by-word proofing is akin to pushing a baby stroller.)

and the results are _better_.   not just a little bit better, a lot better.
i've shown time after time after time that i can find (lots of) errors
proofers _miss_, not just in one round, but two and three and four.

and, just to repeat the message yet again, it's _simple_ to do this.
a couple dozen _simple_ checks finds a high percentage of errors.

so yeah, i know i am explaining all of this with accurate language.

but people always seem to misunderstand.

on the one hand, they think the tool is making all the changes...

on the other hand, they think you need a human to find errors...

both of those positions are wrong, and when i argue against one,
people jump to the other, and then i have to argue against _that_.

i know that i am capable of walking the tight-rope between them,
but everyone else seems to get tangled up in the dialog semantics.

the tool can find almost _all_ the errors, so you don't need to look.
(you're welcome to look, and i'm sure that you'll find a few errors,
but at some point, you have to ask whether it was worth your time.)

on the other hand, _you_ have to be the active agent that mentally
makes the decision whether or not each change should be made...

like i said, there are only a few global changes that i make blindly.

(to give you a for-instance, i will globally delete any space that
follows any doublequote located at the very beginning of a line,
or any space preceding a doublequote at the very end of a line.
i'll also blindly change spacey-quotes if the open/close pairing
within the rest of the paragraph makes the change unambiguous.
i'll also close up contractions blindly.   maybe a few more things.
but for the most part, i really want to look at the changes i make.
maybe if i had the luxury of knowing i'd have human volunteers
following up with a word-by-word proofing of everything i did,
i would be more willing to make blind changes.   or maybe not,
because that just seems wasteful of donated time and energy.)

now, i'm sure that dkretz will confirm what i've been saying here,
as his experience with his tool has led him to the same thoughts.

and maybe carel will say that this is exactly what she has meant,
and we've been doing a semantic dance around the same thought.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20100227/912f12b0/attachment.html>