From gbuchana at teksavvy.com Tue Jul 1 20:20:47 2008 From: gbuchana at teksavvy.com (Gardner Buchanan) Date: Tue, 01 Jul 2008 23:20:47 -0400 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: References: Message-ID: <486AF40F.7050606@teksavvy.com> Bowerbird at aol.com wrote: > > rfrank (roger frank) said: > > If page after page goes by, does a proofer's attention fade? > > I believe it does. > Fade, maybe. Or it might not have been there to begin with. I PM'd a project in which timestamps showed that a good number of pages were proofed in less time than it would have taken to read them normally -- > 10 pages per minute or so -- let alone proofread. I would like to see a system like DP actually _introduce_ a specific known error or two into each page and not accept the page until the proofers had found and corrected it. I want the system to be able to verify that a known level of dilligence is being taken. ============================================================ Gardner Buchanan Ottawa, ON FreeBSD: Where you want to go. Today. From hyphen at hyphenologist.co.uk Tue Jul 1 23:04:15 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Wed, 2 Jul 2008 07:04:15 +0100 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: <486AF40F.7050606@teksavvy.com> References: <486AF40F.7050606@teksavvy.com> Message-ID: <002701c8dc09$7c543950$74fcabf0$@co.uk> Gardner Buchanan wrote > Bowerbird at aol.com wrote: > > > > rfrank (roger frank) said: > > > If page after page goes by, does a proofer's attention fade? > > > I believe it does. > > Fade, maybe. Or it might not have been there to begin with. > I PM'd a project in which timestamps showed that a good number > of pages were proofed in less time than it would have taken > to read them normally -- > 10 pages per minute or so -- let > alone proofread.
> I would like to see a system like DP actually _introduce_ a > specific known error or two into each page and not accept > the page until the proofers had found and corrected it. I > want the system to be able to verify that a known level of > dilligence is being taken. I use a orogrammers editor for the same error on multiple pages. This allows me to edit several dozen pages/chapters at a time. When I find a repeating error and I use "replace all occurances" facility with regular expressions for the tricky changes. Difficult to do with DPs single page system. Dave Fawthrop. From Bowerbird at aol.com Wed Jul 2 00:10:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 03:10:58 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: gardner said: > I want the system to be able to verify that > a known level of dilligence is being taken. the proof is in the pudding. the proofers do an excellent job. if you examine it closely, as i have, you will be continually surprised how well they do. they couldn't do that if they weren't executing with "dilligence". they aren't perfect, but they do accumulate to it quite rapidly. and, frankly, it's just ridiculous for someone like roger frank to imply that the proofers are not paying sufficient attention. if individual proofers wanted to be subjected to injected errors, and informed when they weren't proofing up to par, i would be in favor of that... but it would have to be a _voluntary_ system... to force it on people, who are _volunteers_, would be unseemly, i would think, _especially_ since they _are_ doing such a fine job. (if they were doing a crappy job, that might be another matter...) the problem is not the proofers. the problem is in the workflow. > I use a orogrammers editor for the same error on multiple pages. > This allows me to edit several dozen pages/chapters at a time.
> When I find a repeating error and I use "replace all occurances" > facility with regular expressions for the tricky changes. "programmer's" and "occurrences". are you doing this on purpose? ;+) > Difficult to do with DPs single page system. right. and that's just one of many problems with their workflow... whenever an error is found, a book-wide search should be done to see if that error is systematic, and -- if so -- fixed throughout. but that capability isn't baked into their infrastructure. even worse is the fact that even if you _know_ there is an error on a specific page, you can't just go in and fix it. that's one big killer. but there are dozens of such d.p. inadequacies, so discussing them hit-and-miss is like swatting individual flies on a hot summer night. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/d9cf47f9/attachment.htm From Bowerbird at aol.com Wed Jul 2 03:35:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 06:35:36 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: dave said: > When I find a repeating error and I use "replace all occurances" > facility with regular expressions for the tricky changes. sorry, i spaced on this in my last message. i meant to say that this is one of the descriptions of preprocessing -- to use programmed search routines (a la reg-ex) to locate flaws, and fix 'em en masse, usually after a very-quick consult of the scans (e.g., verifying "fagade" to "facade" 4 times doesn't take much time)... indeed, it might be the essence of preprocessing, under a microscope. 
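A minimal sketch of the kind of book-wide, fix-'em-en-masse pass described above. The scanno list and sample text here are invented for illustration; this is not DP's tooling or anyone's actual preprocessing script, just one way the idea could look in Python.

```python
import re

# Hypothetical list of found errors: once one scanno is spotted (e.g.
# "fagade" for "facade"), the same fix is applied across the whole book
# rather than page by page. In practice each pattern would be added
# after a quick consult of the scans, as described in the thread.
SCANNOS = {r"\bfagade\b": "facade"}

def fix_bookwide(text: str) -> tuple[str, int]:
    """Apply each known correction across the entire text; return the
    corrected text and the total number of replacements made."""
    total = 0
    for pattern, replacement in SCANNOS.items():
        text, n = re.subn(pattern, replacement, text)
        total += n
    return text, total

sample = "The fagade was grey. Behind the fagade, nothing."
fixed, count = fix_bookwide(sample)
print(count)
print(fixed)
```

The returned count matters: it is what tells you whether an error was a one-off or systematic throughout the book.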
(the main concept lacking is explicit focus on _book-wide_ process; it's more than just global changes; it's that some phenomenon _only_ emerge at the book-level, like whether a word should be hyphenated, or whether a name is high-frequency enough that we assume it's ok.) and what is phenomenal, what d.p. has no way of even _knowing_, since they haven't done the research that i've performed, is that it is amazing how _few_ routines it takes to move o.c.r. to high quality... this is _not_ hard to do; on the contrary, it's _so_ easy that it's crazy! (that's why i keep banging away on this topic; it's low-hanging fruit.) in a clean book, with clear typography, and relatively simple text, the o.c.r. will be highly accurate to begin with, and the flaws that you will preprocess away will be highly predictable as well, meaning you will start with very accurate text, even before line-by-line proofing... at that point, the only reliably big chunk of errors are stealth scannos, and they generally boil down to a handful in a book, at the very most. (the other big class: flecks causing flawed but harmless punctuation.) so you do the o.c.r. and you preprocess, and boom, you've got text that's clean up in the nine nine nine nines, and you're just _starting_. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/d8a3a78a/attachment.htm From grythumn at gmail.com Wed Jul 2 05:10:12 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Wed, 2 Jul 2008 08:10:12 -0400 Subject: [gutvol-d] continued confusion over at distributed proofreaders In-Reply-To: <002701c8dc09$7c543950$74fcabf0$@co.uk> References: <486AF40F.7050606@teksavvy.com> <002701c8dc09$7c543950$74fcabf0$@co.uk> Message-ID: <15cfa2a50807020510x364c83a1o310f3972964fbf6a@mail.gmail.com> On Wed, Jul 2, 2008 at 2:04 AM, Dave Fawthrop wrote: > I use a orogrammers editor for the same error on multiple pages. > This allows me to edit several dozen pages/chapters at a time. > When I find a repeating error and I use "replace all occurances" > facility with regular expressions for the tricky changes. > > Difficult to do with DPs single page system. There are easy tools to do that on the front end (prep, PM) and backend (PP) and ways for a PM to do it while in the rounds (harder, but possible). There are also a lot of new tools introduced with Wordcheck. R C From gegut at edwardjohnson.com Wed Jul 2 06:35:33 2008 From: gegut at edwardjohnson.com (G. Edward Johnson) Date: Wed, 2 Jul 2008 09:35:33 -0400 (EDT) Subject: [gutvol-d] Open source and the Kindle Message-ID: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> http://news.cnet.com/8301-13505_3-9982318-16.html?tag=nefd.top This week I tried downloading Jane Austen's Northanger Abbey from Project Gutenberg... The content is free. But it's not pretty. Line breaks aren't formatted for the Kindle, making the normally exceptional Kindle-reading experience...much less exceptional. For $1.60, I can have that exact same book with everything pre-formatted for me. He does seem to confuse open source with the public domain, but otherwise, it seems like a valid complaint. Not sure if it is PG's problem for having the linebreaks, or Kindle's problem for not doing a decent job of un-wrapping. Edward. 
http://edwardjohnson.com/ From marcello at perathoner.de Wed Jul 2 08:16:08 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 02 Jul 2008 17:16:08 +0200 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: <486B9BB8.1060909@perathoner.de> G. Edward Johnson wrote: > http://news.cnet.com/8301-13505_3-9982318-16.html?tag=nefd.top > > This week I tried downloading Jane Austen's Northanger Abbey from Project > Gutenberg... The content is free. But it's not pretty. Line breaks aren't > formatted for the Kindle, making the normally exceptional Kindle-reading > experience...much less exceptional. For $1.60, I can have that exact same > book with everything pre-formatted for me. > > > > He does seem to confuse open source with the public domain, but otherwise, > it seems like a valid complaint. Not sure if it is PG's problem for > having the linebreaks, or Kindle's problem for not doing a decent job of > un-wrapping. He should have complained about the cretinous design of the kindle, which does not read HTML. What would it have cost to port WebKit or Gecko to the kindle? Heck, even my cellphone groks HTML. We have Northanger Abbey in both HTML and plucker for pleasurable reading on less cretinous devices. -- Marcello Perathoner webmaster at gutenberg.org From eve-news at shaw.ca Wed Jul 2 07:39:18 2008 From: eve-news at shaw.ca (Eve M. Behr) Date: Wed, 02 Jul 2008 08:39:18 -0600 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: On Wed, 02 Jul 2008 09:35:33 -0400 (EDT), "G. Edward Johnson" wrote: >This week I tried downloading Jane Austen's Northanger Abbey from Project >Gutenberg... The content is free. But it's not pretty.
Line breaks aren't >formatted for the Kindle, making the normally exceptional Kindle-reading >experience...much less exceptional. For $1.60, I can have that exact same >book with everything pre-formatted for me. Have you tried downloading this title from http://manybooks.net which seems to function as a mirror for Gutenberg. It's where I go to get Gutenberg titles for my Palm Pilot and it comes in a properly wrapped form that flows onto my screen correctly. Eve M. Behr EveB on DP ebehr at shaw.ca From sly at victoria.tc.ca Wed Jul 2 10:04:37 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed, 2 Jul 2008 10:04:37 -0700 (PDT) Subject: [gutvol-d] Open source and the Kindle In-Reply-To: References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: On Wed, 2 Jul 2008, Eve M. Behr wrote: > Have you tried downloading this title from http://manybooks.net which > seems to function as a mirror for Gutenberg. It's where I go to get > Gutenberg titles for my Palm Pilot and it comes in a properly wrapped > form that flows onto my screen correctly. > Yes, manybooks is interesting. They do a good job at offering PG texts in different formats for people. I only have two caveats. They appear to take the "lowest common denominator" file, that is the plain ascii text file as their basis, possibly losing some information in the process. Also because texts are converted automatically, there are sometimes problems with word-wrap happening where it should not. Andrew From dlowry8 at comcast.net Wed Jul 2 10:23:41 2008 From: dlowry8 at comcast.net (Douglas Lowry) Date: Wed, 2 Jul 2008 13:23:41 -0400 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> References: <56688.65.242.47.185.1215005733.squirrel@webmail11.pair.com> Message-ID: <001201c8dc68$60013320$6400a8c0@dlowry> For $1.60? No, everything pre-formatted for free. Try the layout of Northanger Abbey at www.wctse.com. 
Click on 'A', then "Austen, Jane", then on "Northanger Abbey". Zero out the voluntary payment. That makes it free. Make sure you download the WCT Reader program, also free. If you want to know what you can do with this version, try the manual at http://www.wordsclosetogether.com/HowTOC.asp. Feedback is always appreciated at dlowry8 at comcast.net. Doug Lowry -----Original Message----- From: G. Edward Johnson [mailto:gegut at edwardjohnson.com] Sent: Wednesday, July 02, 2008 9:36 AM To: gutvol-d at lists.pglaf.org Subject: [gutvol-d] Open source and the Kindle http://news.cnet.com/8301-13505_3-9982318-16.html?tag=nefd.top This week I tried downloading Jane Austen's Northanger Abbey from Project Gutenberg... The content is free. But it's not pretty. Line breaks aren't formatted for the Kindle, making the normally exceptional Kindle-reading experience...much less exceptional. For $1.60, I can have that exact same book with everything pre-formatted for me. He does seem to confuse open source with the public domain, but otherwise, it seems like a valid complaint. Not sure if it is PG's problem for having the linebreaks, or Kindle's problem for not doing a decent job of un-wrapping. Edward. http://edwardjohnson.com/ From Bowerbird at aol.com Wed Jul 2 13:48:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 16:48:21 EDT Subject: [gutvol-d] Open source and the Kindle Message-ID: edward said: > it seems like a valid complaint oh, absolutely. > Not sure if it is PG's problem for having the linebreaks, > or Kindle's problem for not doing a decent job of un-wrapping. the number-one rule for having happy users is "don't blame the user". so let's not blame the kindle. (but i'll bet you a nickel that marcello did; that's what technoids do -- blame the user for any problems that arise.) if you want people to do a "decent" job of un-wrapping (of whatever), give 'em _a_tool_ that does a decent job of unwrapping (or whatever). 
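One shape such an unwrapping tool could take, as a rough sketch: it assumes the convention discussed later in this thread that lines beginning with whitespace are pre-formatted (poetry, tables, address blocks) and must not be rejoined. The function is illustrative only, not an existing PG or Kindle utility.

```python
def unwrap(text: str) -> list[str]:
    """Rejoin hard-wrapped paragraphs into single strings; lines that
    start with whitespace are assumed pre-formatted and passed through
    untouched. Blank lines separate paragraphs."""
    out, para = [], []
    for line in text.splitlines():
        if line.startswith((" ", "\t")) or not line.strip():
            if para:                      # flush the pending paragraph
                out.append(" ".join(para))
                para = []
            if line.strip():              # keep an indented line as-is
                out.append(line)
        else:
            para.append(line.strip())
    if para:
        out.append(" ".join(para))
    return out

sample = "It was a dark\nand stormy night.\n\n  roses are red\n  violets are blue\n\nThe end."
for piece in unwrap(sample):
    print(piece)
```

The point of the sketch is that the tool is trivial once the no-wrap lines are marked; without that marking, no tool can tell poetry from prose.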
so, the first failing of p.g. here is that it hasn't given people such a tool. but the more-important failing of p.g. here is that its e-texts are _not_ designed in a way that lets p.g. (or anyone) even _create_ such a tool, because there's no marking on the lines which should not be wrapped. so the user finds that sections of poetry have been wrapped incorrectly, as have blocks of various types (like address blocks, tables, and so on). i've described this problem _many_ times before, to no good resolution. the solution is quite simple -- just include one or more leading spaces on any line that should not be wrapped -- but nobody who could write this sensible rule into the guidelines has been smart enough to do it... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/d884a898/attachment.htm From Bowerbird at aol.com Wed Jul 2 14:00:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 17:00:03 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: dave said: > > Difficult to do with DPs single page system. robert said: > There are easy tools to do that on the front end (prep, PM) > and backend (PP) dave was talking about _within_ the single-page system. > and ways for a PM to do it while in the rounds > (harder, but possible). yeah, that's what he said, it's "difficult". he could have also added that the proofer who _finds_ an error is the person who logically should be doing the search for other similar errors. while you here have the project-manager doing it, which is why _i_ said that this capability is not baked into your system. 
so basically, although you would like to leave some _impression_ that you have "countered" what was said, really you've done nothing but _confirm_ its accuracy... and we haven't even gotten around to social pressures (e.g., to "keep the diffs straight") that lead to the fact that project-managers almost _never_ actually do this mid-round, despite that it is "possible", as you put it... > There are also a lot of new tools > introduced with Wordcheck. again, this is mere distraction that has very little to do with the points that were raised. if you're not going to bring anything of substance to the thread, you might as well just stay silent like the rest of your d.p. counterparts. -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/098173e4/attachment.htm From ajhaines at shaw.ca Wed Jul 2 15:32:59 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Wed, 02 Jul 2008 15:32:59 -0700 Subject: [gutvol-d] Open source and the Kindle References: Message-ID: <002601c8dc93$94fcd540$6401a8c0@ahainesp2400> Concerning the indenting of text to prevent unwanted wrapping - this article has been in PG's Volunteer FAQ for some years: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F Al ----- Original Message ----- From: Bowerbird at aol.com To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com Sent: Wednesday, July 02, 2008 1:48 PM Subject: Re: [gutvol-d] Open source and the Kindle edward said: > it seems like a valid complaint oh, absolutely. > Not sure if it is PG's problem for having the linebreaks, > or Kindle's problem for not doing a decent job of un-wrapping. the number-one rule for having happy users is "don't blame the user". 
so let's not blame the kindle. (but i'll bet you a nickel that marcello did; that's what technoids do -- blame the user for any problems that arise.) if you want people to do a "decent" job of un-wrapping (of whatever), give 'em _a_tool_ that does a decent job of unwrapping (or whatever). so, the first failing of p.g. here is that it hasn't given people such a tool. but the more-important failing of p.g. here is that its e-texts are _not_ designed in a way that lets p.g. (or anyone) even _create_ such a tool, because there's no marking on the lines which should not be wrapped. so the user finds that sections of poetry have been wrapped incorrectly, as have blocks of various types (like address blocks, tables, and so on). i've described this problem _many_ times before, to no good resolution. the solution is quite simple -- just include one or more leading spaces on any line that should not be wrapped -- but nobody who could write this sensible rule into the guidelines has been smart enough to do it... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d at lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/7de7ce15/attachment.htm From Bowerbird at aol.com Wed Jul 2 17:14:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 2 Jul 2008 20:14:22 EDT Subject: [gutvol-d] Open source and the Kindle Message-ID: al said: > this article has been in PG's Volunteer FAQ for some years: sorry, i spoke "metaphorically" about "writing it into the guidelines". 
what i _really_ meant was _enforcing_ the policy in the actual e-texts. you know, so actual users could actually unwrap those actual e-texts. if anyone needs, i can show you actual e-texts posted in the last week where this was not done... and thousands posted in the last decade... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (www.tourtracker.com ?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080702/f2173ec4/attachment.htm From marcello at perathoner.de Wed Jul 2 17:42:11 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 03 Jul 2008 02:42:11 +0200 Subject: [gutvol-d] Open source and the Kindle In-Reply-To: <002601c8dc93$94fcd540$6401a8c0@ahainesp2400> References: <002601c8dc93$94fcd540$6401a8c0@ahainesp2400> Message-ID: <486C2063.7090300@perathoner.de> BB wrote: >> so let's not blame the kindle. (but i'll bet you a nickel that >> marcello did; that's what technoids do -- blame the user for any >> problems that arise.) So blaming the kindle is blaming the user? This is sub-standard thinking, even by your standards. The following is more like you. Without understanding, without having done any research whatsoever you just open your tusked snout and let out the voice of God: >> the solution is quite simple -- just include one or more leading >> spaces on any line that should not be wrapped -- but nobody who >> could write this sensible rule into the guidelines has been smart >> enough to do it... Then Al Haines wrote: > Concerning the indenting of text to prevent unwanted wrapping - this > article has been in PG's Volunteer FAQ for some years: > > http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F That was a very unfortunate oversight. 
But no, No, NEVER, try to weasel your way back out: BB wrote: > sorry, i spoke "metaphorically" about "writing it into the guidelines". Aaaaarghhh! Never admit defeat, never. Wrong, wrong, wrong. Bad troll! -- Marcello Perathoner webmaster at gutenberg.org From tb at baechler.net Thu Jul 3 02:26:41 2008 From: tb at baechler.net (Tony Baechler) Date: Thu, 03 Jul 2008 02:26:41 -0700 Subject: [gutvol-d] Apertium: Open source machine translation Message-ID: <486C9B51.3010603@baechler.net> All, I'm not 100% sure exactly what this is, but I thought it might be of interest to some here who have commented on machine translations in the past. I haven't used the software and I don't know anymore than what it says below. It would be interesting to see how accurate the translation engine is. google for example can translate text but doesn't do a great job of it. [Apertium][1] is an open source shallow-transfer machine translation (MT) system. In addition to the translation engine, it also provides tools for manipulating linguistic data, and translators designed to run using the engine. At the time of writing, there are stable bilingual translators available for English-Catalan, English-Spanish, Catalan-Spanish, Catalan-French, Spanish-Portuguese, Spanish-Galician, and French-Spanish; as well as monolingual translators that translate from Esperanto to Catalan and to Spanish, and from Romanian to Spanish. There are also a number of unstable translators in various stages of development. (A [list of language pairs][2], updated daily, is available on the [Apertium wiki][3]). [1]: http://www.apertium.org [2]: http://wiki.apertium.org/wiki/List_of_language_pairs [3]: http://wiki.apertium.org/wiki/Main_Page URL: http://linuxgazette.net/152/oregan.html From Bowerbird at aol.com Thu Jul 3 11:05:20 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 3 Jul 2008 14:05:20 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. 
for a book -- 001 Message-ID: ok, here's a new series, on how to clean up text from o.c.r. distributed proofreaders calls such clean-up "preprocessing", because it's done prior to the text being sent off to proofers... i'll show you how you can use a text-editor to do the clean-up. mostly i'll be using plain-english to tell you what to search for, but sometimes i'll have to resort to reg-ex (regular expressions) for simplicity, so you should have a text-editor that does reg-ex. i'll be using "blood mountain", the latest test-book by roger frank. to kick off this series, i'll repeat a tip i offered a while back... 1. search for all lines that start with a semi-colon. in the o.c.r. from "blood mountain", there were three such lines: > "I have been shot at," Valentine Simmons replied > ; "behind my back. The men who fail are like > Her breathing increasingly grew labored, oppressed > ; a little sob escaped, softly miserable. She > The lines on Gordon's thin, dark face had multiplied > ; his eyes, in the shadow of his bony forehead, for all three, it's pretty obvious the semicolon belongs on the previous line, and merely needs to be shifted up there. so, in my text-editor -- where "^c" stands for a line-end -- i just do a global change of "^c; " to ";^c", and step through the 3 occurrences and approve each one individually. done. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080703/fdfbaad8/attachment-0001.htm From schultzk at uni-trier.de Fri Jul 4 01:17:38 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Fri, 4 Jul 2008 10:17:38 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. 
for a book -- 001 In-Reply-To: References: Message-ID: Hi Bowerbird, As a general rule almost all punctuation in english should not be at the beginning of a line. Except quote marks and opening brackets! Dashes I guess are O.K. regards Keith. Am 03.07.2008 um 20:05 schrieb Bowerbird at aol.com: > ok, here's a new series, on how to clean up text from o.c.r. > > distributed proofreaders calls such clean-up "preprocessing", > because it's done prior to the text being sent off to proofers... > > i'll show you how you can use a text-editor to do the clean-up. > mostly i'll be using plain-english to tell you what to search for, > but sometimes i'll have to resort to reg-ex (regular expressions) > for simplicity, so you should have a text-editor that does reg-ex. > > i'll be using "blood mountain", the latest test-book by roger frank. > > to kick off this series, i'll repeat a tip i offered a while back... > > 1. search for all lines that start with a semi-colon. > > in the o.c.r. from "blood mountain", there were three such lines: > > > "I have been shot at," Valentine Simmons replied > > ; "behind my back. The men who fail are like > > > Her breathing increasingly grew labored, oppressed > > ; a little sob escaped, softly miserable. She > > > The lines on Gordon's thin, dark face had multiplied > > ; his eyes, in the shadow of his bony forehead, > > for all three, it's pretty obvious the semicolon belongs on > the previous line, and merely needs to be shifted up there. > > so, in my text-editor -- where "^c" stands for a line-end -- > i just do a global change of "^c; " to ";^c", and step through > the 3 occurrences and approve each one individually. done. > > -bowerbird > > > > ************** > Gas prices getting you down? Search AOL Autos for fuel-efficient > used cars. 
> (http://autos.aol.com/used?ncid=aolaut00050000000007) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/84d1972e/attachment.htm From Bowerbird at aol.com Fri Jul 4 02:55:27 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 05:55:27 EDT Subject: [gutvol-d] happy bird-day Message-ID: happy birthday to project gutenberg! thank you michael! thank you, anonymous grocery-chain marketers, for printing the declaration of independence on those shopping bags! -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/ed5c744c/attachment.htm From Bowerbird at aol.com Fri Jul 4 03:18:14 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 06:18:14 EDT Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID: > A Million Plus Books Free for the Taking! > July 4 2008 you know, in years past i've scoffed at that number. and i would continue to scoff at it this year, because let's face the facts, there are a lot of duplicates there, so the number is vastly inflated. however, just the other day, i was reading a piece that a group of publishers is now reporting they have found a truly massive number of scanned textbooks online... let's see if i can dig it up... > http://chronicle.com/free/2008/07/3623n.htm it says: > the Association of American Publishers hired an outside law firm > this summer to scour the Web for illegally offered textbooks. 
> Already the firm has identified thousands of instances of > book piracy and has sent legal notices to Web sites hosting the files > demanding that they be removed. The group is looking for all types of > books, though trade books and textbooks, which generally have high > price tags, are the most frequent books offered on peer-to-peer sites. > > "In any given two-week period we found from 60,000 files > all the way up to 250,000 files," said Edward McCoyd, > director of digital policy for the publishing association. so maybe there really _are_ a lot of scanned books out there. or maybe this is just one of those scary numbers that the corporations like to throw out, to make it sound like they're losing scads of money... indeed, the website mentioned -- textbook torrents dot com -- says: > "There are very few scanned textbooks in circulation, and > that's what we're here to change," says a welcome message > on the Textbook Torrents site. "Chances are you have some > textbooks sitting around, so pick up a scanner and start scanning it!" they actually only declare 5000 scanned textbooks so far, which is -- shall we say -- a far cry from the 60,000-250,000 the industry claims to have found. but hey, 5000 is better than nothing, isn't it? and boy, you have to love their "start scanning" attitude! :+) with the book industry escalating its attempts to rip off their customers, maybe more and more people will soon be "picking up" their scanners... and maybe by next year, there really _will_ be a million books out there... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/a39873c2/attachment.htm From Bowerbird at aol.com Fri Jul 4 03:22:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 06:22:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 001 Message-ID: keith said: > As a general rule almost all punctuation in english should not be > at the beginning of a line. Except quote marks and opening brackets! you're absolutely right, keith. which means we could formulate a reg-ex that'd search for all of them in one pass, if we wanted to. but i've found it's better to do them one-at-a-time. first, that helps focus in your attention much better. second, the actions required for each one often vary, between them, but are consistent within themselves. this will become clear as we step through all of them. third, it's just kind of interesting (to me, anyway) to see just how many instances turn up for each mark. > Dashes I guess are O.K. we'll separate single-dashes from double-dashes. single-dashes at the beginning of a line are usually scanning mistakes, since single-dashes are typically hyphens used in hyphenated words, so they will be located at the _end_ of the line, and not at the start. em-dashes -- or double-dashes to an o.c.r. app -- are however quite often found at the start of a line. so a single-dash at the start of the line is probably a double-dash improperly recognized by the o.c.r. indeed, of the 22 cases of this in "blood mountain", all but one of 'em should have been a double-dash. so that will be a global change from "^c-" to "^c--", which is different from the global change for "^c;", and thus is an example of what i mentioned above, where i said the remedy varies for different marks... 
also, in terms of the "focus" point, when i know that i'm now looking for a dash at the beginning of a line -- and nothing else -- i'm more likely to spot it when -- as actually occurs on p#346 of "blood mountain" --

> http://www.z-m-l.com/go/mount/mountp346.html

the o.c.r. _missed_ the dash at the start of the 4th line. however, since i was looking at the second dash there, and my attention was focused on that particular mark, i was alerted to the fact that there should have been two instances highlighted on that page, so i caught it. whenever you do preprocessing, you have to be alert for errors that are on the periphery of your attention.

***

d.p. has an _extremely_stupid_ policy on em-dashes, where they will bring them up from the start of a line to the end of the previous line, and then will _also_ bring up a word from that next line as well, and they have no spaces on either side of the em-dash, meaning they often have very long lines followed by short ones. (of course, this also happens with their dehyphenation.) while it's easy enough for me to change "--" to " -- ", thus enabling more-esthetic linewrapping to happen -- it doesn't matter if the dash is at the end of one line or the start of the next, it means the same darn thing -- i wish they'd just leave the original linebreaks alone... this unnecessary rewrapping from the original breaks is -- in the long run -- going to cause me to _jettison_ the p.g. e-texts completely, good only for proofing my re-done o.c.r., where i've retained original linebreaks... users need the _option_ of rewrapping, or of retaining the linebreaks from the original book. users don't need to have rewrapping forced on them. -bowerbird
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/ca6dfbd1/attachment.htm From hyphen at hyphenologist.co.uk Fri Jul 4 06:52:54 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Fri, 4 Jul 2008 14:52:54 +0100 Subject: [gutvol-d] World eBook Fair July 4-August 4 In-Reply-To: References: Message-ID: <002401c8dddd$483c6e50$d8b54af0$@co.uk> Bowerbird at aol.com wrote >> A Million Plus Books Free for the Taking! >> July 4 2008 >you know, in years past i've scoffed at that number. >and i would continue to scoff at it this year, because >let's face the facts, there are a lot of duplicates there, >so the number is vastly inflated. If you *read* mh's post you will note that the claim was 1,000,000 Books whereas the count was 1,210,000, and increasing. 20% for duplicates seems reasonable to me. Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080704/63b568b2/attachment-0001.htm From Bowerbird at aol.com Fri Jul 4 10:21:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 13:21:41 EDT Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID: dave said: > If you *read* mh?s post you will note that > the claim was 1,000,000 Books whereas > the count was 1,210,000, and increasing. i did read it. > 20% for duplicates seems reasonable to me. i put the number of duplicates at more like 50%. here's the inflated count: > ? ~100,000+ from Project Gutenberg > ? ~500,000+ from The World Public Library > ? ~450,000? from The Internet Archive > ? ~160,000? from eBooks About Everything > ---------- > ~1,210,000+ Grand Total as of July 1, 2008 here's a much more reasonable guess: > ? ~050,000 from Project Gutenberg > ? ~200,000 from The World Public Library > ? ~250,000? from The Internet Archive > ? ~050,000? 
from eBooks About Everything > --------- > ~550,000 Grand Total as of July 1, 2008

multiple copies of a project gutenberg e-text count as 1 book, just 1, no matter how many libraries make a copy, by any rational census... -bowerbird

From Bowerbird at aol.com Fri Jul 4 11:06:54 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 14:06:54 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 002 Message-ID:

ok, since keith brought it up, here are the 22 cases in "blood mountain" of a single-dash at a linestart... 2. search for all lines that start with a semi-colon. the dash on the pagenumber for page 125 was a speck, so it was the sole exception to a global change from "-" to "--". and, as i mentioned earlier, i noticed the absence of one em-dash at the beginning of an o.c.r. line, so i added it... 27 corrections made, on 2 routines... by the way, you can get a copy of the o.c.r. text here:

> http://www.pgdp.net/c/project.php?id=projectID4865040815e01

if you'd like to follow along and verify my statements. i'll be back tomorrow with the next tip in this series... -bowerbird

-- singledash at linestart --

On the occasions when he was too drunk to drive -not over often--a substitute was quietly found "is always made a target for the abuse of the -the thoughtless. But he usually comes to the -----File: 073.png -Clare dangerously ill ... a question of dying, eighty dollars in his pocket. He had another vision
He had another vision -of Simmons; it was two hundred and fifty dollars nuto passage of a symphony; "but it's all one to me -there's nothing else they can take; I'm free, free Delaying his expression of gratitude to the priest -he could stop on his return with trout--Gordon business with it, a ... a gun store,--I like guns, -here in Greenstream. And I'd sharpen scythes, together, and he made a list of what I would need -files and vises and parts of guns. If I mailed my in the dark house.... He shut his eyes for a -[125] "Wouldn't I?" she exclaimed; "oh, wouldn't I? -smart crowds and gay streets and shops on fire laboriously polite, "the next time--I'll do it! -when I'm in Stenton again I'll bring you a pair from the rough, minor forms into the bigger sweep -it was like a great, green bed half filled with a got penetrating as a musket. Rose is just like her -she's all taffy now on that young man, but in a "whenever you like. Of course it's a fine article -all strung on gold wire. I won't be surprised a little from his blood. She demanded a great deal -a man could never return. He bitterly cursed his was driving, and by her side ... Lettice! Lettice -riding over the rough field, over the dark stony -----File: 267.png -what man had not?--but this was different; this man with his crop a failure on the field like, well -we'll say, Cannon does, with a note in my hand inhibition had arisen in the negotiations -he had destroyed him with Gordon's own blindness, thousand dollars to get them, and they're worth -that," he flung them with a quick gesture into the Stenton stage," he shrilled; "and I made out to ask -you can take it or leave it--if you'd drive again? ... others ... new courage, example of bigness -Why! what's the matter with you, Makimmon? ====================================== plus replacement of 1 missing em-dash on page 346 "Have you got the options?" Entriken demanded "all them that Pompey had and you bought?" "Have you got the options?" 
Entriken demanded -- "all them that Pompey had and you bought?" ======================================

From Bowerbird at aol.com Fri Jul 4 11:57:12 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 4 Jul 2008 14:57:12 EDT Subject: [gutvol-d] speaking of happy bird-day Message-ID:

speaking of bird-days, it's also the birthday of "alice in wonderland". meaning that it's probably a good time to report that the p.g. e-text of this wonderful story has recently had a facelift from david widger. i haven't checked out the new version, but i'm sure it's quite spiffy... so happy bird-day, alice. you sure don't look 146 years old... -bowerbird

From prosfilaes at gmail.com Fri Jul 4 18:23:30 2008 From: prosfilaes at gmail.com (David Starner) Date: Fri, 4 Jul 2008 21:23:30 -0400 Subject: [gutvol-d] World eBook Fair July 4-August 4 In-Reply-To: <002401c8dddd$483c6e50$d8b54af0$@co.uk> References: <002401c8dddd$483c6e50$d8b54af0$@co.uk> Message-ID: <6d99d1fd0807041823l6edf8ffcu1e3f6cd38e40b064@mail.gmail.com>

On Fri, Jul 4, 2008 at 9:52 AM, Dave Fawthrop wrote: > If you *read* mh's post you will note that the claim was 1,000,000 > > Books whereas the count was 1,210,000, and increasing. > > 20% for duplicates seems reasonable to me.

It also claimed that PG donated 100,000 books, which would indicate the duplicate count was a bit higher than that.
From Bowerbird at aol.com Sat Jul 5 12:44:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 5 Jul 2008 15:44:28 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 003 Message-ID:

and we continue with our series on routines to preprocess o.c.r. 3. search for a lowercase letter followed by an uppercase letter.

> PHWTED IN THE T7NITED STATES OlfAMEEIOA > and playing him out. Come here, General JacK-son." > George Gordon MacKimmon, resting on the porch *** embedded cap correct > everywhere; Gordon had pitched the headstalFinto > by George Gordon MacKimmon from world-old *** embedded cap correct > General Jackson moved forward over the porcK.

we see that 2 of the 6 presenting cases are _correct_ -- as the last name of "mackimmon" has an embedded cap, such last names are a common "false alarm" with this routine -- so we are left with 4 lines that need to be corrected... 4 more lines corrected, for a grand total of 31, on 3 routines... i'll be back tomorrow with the next tip in this series... -bowerbird

From hart at pglaf.org Sat Jul 5 15:54:41 2008 From: hart at pglaf.org (Michael Hart) Date: Sat, 5 Jul 2008 15:54:41 -0700 (PDT) Subject: [gutvol-d] World eBook Fair July 4-August 4 In-Reply-To: References: Message-ID:

OK, it's one thing for you to trash me, I can, and have, walked away from that. But when you start trying to trash mathematics and logic, well, then something must be said. Starting with the largest collections:

1. A serious look at World Public Library would indicate the number of duplications at approximately 10%.
Even if you discounted those from different paper editions of the same title, you couldn't say 15% in reality. Then again, there are more than just formatting diffs for the .txt, .html, .pdf, etc. files in many cases. Personally, I think we will end up with 30 legitimate editions of all the great books, and should keep and count them all-- but even with discounting a total 10% that leaves:

Starting with: 500,000+ as their grand total subtracting out 10% as duplicates: 450,000 that are not duplicates.

2. Then counting TWICE as many duplications at Internet Archive, due to those also at The World Public Library: 450,000+ as their grand total subtracting out 20% as duplicates: 370,000 that are not duplicates. Thus, the first sub-total is: 820,000 that are not duplicates.

3. The 17,000 music scores are obviously NOT dupes. Thus, the second sub-total is: 837,000 that are not duplicates.

4. Project Gutenberg. Out of some 28,000 - 29,000 files, only the first 17,000 are listed by either Internet Archive/World Public Library. This leaves about 12,500 not listed in the above libraries. Thus the third sub-total is: ~850,000 that are not duplicates.

5. As for the commercial eBooks, these all have their own editors, artwork, etc., though I cannot say how like an earlier paper or ebook edition they are; as far as I am informed, each has its own copyright and must be listed in a totally separate way by any library. Thus the fourth sub-total is: 1,010,000 that are not duplicates.

6. Internet archive is adding about 1,000 per day. Given only 20 total business days from July 4 - Aug 4-- they plan to add about 20,000 more books to the: ~453,000 already there as of the last business day. Unless these new items are duplicates, which I doubt is the case with the current batch, their final total will be approximately: 473,000 And we should add perhaps 20,000 to the above sub-total:

7.
Thus our final grand total, wiping out 200,000 or so from the highest possible grand total of no more than: 1,250,000 [given all other additions] to create a "deflated total" of: 1,050,000 [or somewhere thereabouts].

Obviously there is room to argue a few percent, but unless you go more overboard on duplications than any of the librarians I have asked, it will not be all that different a grand total. The illogical examples given below want to take out any possible duplication over and over, but with only 17,000 even LISTED from PG, you could not change the grand total by much over 50,000, even if you took them ALL OUT THREE TIMES OVER. And even having taken them out TWICE, we still have 50,000 to spare.

Again, it's not exact counting, and none of our standards seem to be met by many of these books listed here, but I'll put over half these books up against the better half of what U Michigan's report claimed a month or so ago, or Google's -- or any other collection over a million.

HOWEVER. . . I would LOVE to come back a year or two or three from now and hear that there are a whole million well proofread full text eBooks! Thank You!!! Give the world eBooks for 2008!!! Don't forget: Over 1.2 million eBooks starting July 4 at: http://www.worldebookfair.org Ends August 4.

Michael S. Hart Founder Project Gutenberg Inventor of eBooks

100,000 eBooks easy to download at: http://www.gutenberg.org [over 28,000 eBooks] http://www.gutenberg.cc [over 75,000 eBooks] http://gutenberg.net.au Project Gutenberg of Australia ~1640+ http://pge.rastko.net 65 languages PG of Europe ~500+ http://gutenberg.ca Project Gutenberg of Canada ~100+ http://preprints.readingroo.ms Not Primetime Ready ~387

>>> Your Project Gutenberg Site Could Be Listed Here <<<

Don't forget Project Runeberg for Scandinavian languages.
Blog at http://hart.pglaf.org

From keichwa at gmx.net Sat Jul 5 18:18:38 2008 From: keichwa at gmx.net (Karl Eichwalder) Date: Sun, 06 Jul 2008 03:18:38 +0200 Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID:

"David Starner" writes: > It also claimed that PG donated 100,000 books, which would indicate > the duplicate count was a bit higher than that.

It's the same nonsense Mr Hart has been posting for several years. Just ignore it. -- Karl Eichwalder

From Bowerbird at aol.com Sat Jul 5 23:18:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 6 Jul 2008 02:18:41 EDT Subject: [gutvol-d] World eBook Fair July 4-August 4 Message-ID:

michael- i'm gonna give you a night or two to sleep on it, and if you still want to stand behind that last post, i'll send my response... you let me know...
-bowerbird

From Morasch at aol.com Sun Jul 6 13:45:43 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Sun, 6 Jul 2008 16:45:43 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 004 Message-ID:

here's another twist on casing anomalies: 4. search for two uppercase letters followed by a lowercase letter. 3 lines presented, and were corrected as indicated:

> ITwas his own home to which he returned, the > It was his own home to which he returned, the > commonplace. He saw TOPable sitting on > commonplace. He saw Tol'able sitting on > JBeggs added; "your money's tight around his neck." > Beggs added; "your money's tight around his neck."

3 more lines corrected, for a grand total of 34, on 4 routines... i'll be back tomorrow with the next routine in this series... -bowerbird

From Bowerbird at aol.com Sun Jul 6 23:56:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jul 2008 02:56:35 EDT Subject: [gutvol-d] i sent this over to the discussion listserve at openlibrary Message-ID:

i sent this over to the discussion listserve at openlibrary

***

the o.c.r. on the books you scan is _still_ fatally flawed! it's missing em-dashes, and probably more characters too, but i can't stand to look at it, because it turns my stomach... _when_ are you going to clear up this _significant_ problem?
it's getting extremely difficult for me to shake the feeling that absolutely nobody there cares about the quality of what you do. and that's a crying shame, because so many people depend on you. -bowerbird

From Bowerbird at aol.com Mon Jul 7 08:26:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jul 2008 11:26:02 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 005 Message-ID:

5. search for lines containing multiple spaces...

> the trap. _{t}

one line. ok. fixed. 1 more line corrected, for a grand total of 35, on 5 routines... i'll be back tomorrow with the next tip in this series... -bowerbird

From Bowerbird at aol.com Mon Jul 7 09:46:08 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 7 Jul 2008 12:46:08 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- roadmap Message-ID:

just so you know where we're going with this... when we're done with the clean-up for this book, there will be nothing but a handful of errors left that the human proofers will have to find and fix... -bowerbird
From hart at pglaf.org Mon Jul 7 13:00:12 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 7 Jul 2008 13:00:12 -0700 (PDT) Subject: [gutvol-d] !@! Source files possible? Message-ID:

---------- Forwarded message ---------- Date: Mon, 7 Jul 2008 09:41:55 -0400 From: Vincent Terreri To: hart at pobox.com Subject: Source files possible?

First of all, let me tell you that you are doing a wonderful work in this project. I have recently signed up to volunteer for proofing. Is there any way of getting a view of the scanned pages from which the following etext was developed?

Title: Ritchie's Fabulae Faciles A First Latin Reader Author: John Kirtland, ed. Release Date: September, 2005 [EBook #8997] [Yes, we are more than one year ahead of schedule] [This file was first posted on August 31, 2003] Edition: 10 Language: English Character set encoding: ASCII

End of Project Gutenberg's Ritchie's Fabulae Faciles, by John Kirtland, ed. *** END OF THE PROJECT GUTENBERG EBOOK RITCHIE'S FABULAE FACILES *** This file should be named 7flrd10.txt or 7flrd10.zip Corrected EDITIONS of our eBooks get a new NUMBER, 7flrd11.txt VERSIONS based on separate sources get new LETTER, 7flrd10a.txt Produced by Karl Hagen, Tapio Riikonen and Online Distributed Proofreaders

I would like to use the text in my Latin classes next year, but am interested in how the lines are numbered in the original text. It would help considerably in preparing text that was more user friendly to my middle school students. Anything you can do would be greatly appreciated. Please let me know if there is another address I should send this request.

All the best, Vincent

Vincent Terreri 703-431-7467 mobile phone 540-668-7157 home phone 801-459-3733 fax number

No virus found in this outgoing message. Checked by AVG.
Version: 7.5.524 / Virus Database: 270.4.5/1537 - Release Date: 7/6/2008 5:26 AM

From grythumn at gmail.com Mon Jul 7 13:21:40 2008 From: grythumn at gmail.com (Robert Cicconetti) Date: Mon, 7 Jul 2008 16:21:40 -0400 Subject: [gutvol-d] !@! Source files possible? In-Reply-To: References: Message-ID: <15cfa2a50807071321q138fcd3fx9b1fa5bd81947e33@mail.gmail.com>

Page images are at: http://www.pgdp.org/ols/tools/display.php?book=3f2145e242d4f&nextpage=001.png&numpages=150 The R2 output is archived somewhere at IA, but that requires human intervention to retrieve. R C

On Mon, Jul 7, 2008 at 4:00 PM, Michael Hart wrote: > Is there any way of getting a view of the scanned pages from which the > following etext was developed?
From ajhaines at shaw.ca Mon Jul 7 13:26:52 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Mon, 07 Jul 2008 13:26:52 -0700 Subject: [gutvol-d] !@! Source files possible? References: Message-ID: <000901c8e06f$caf00b60$6401a8c0@ahainesp2400>

This book is at Internet Archive: http://www.archive.org/details/fabulaefacilesfi00ritcrich PDF versions are available from that page, in the "View the book" box. Clicking on the HTTP link (at the bottom of that box) gives access to GIF, JP2 (JPEG 2000), and assorted other versions. It's probably best to ignore IA's text version in favor of PG's - IA's text files are pretty raw, to put it mildly. Al

----- Original Message ----- From: "Michael Hart" To: "The gutvol-d Mailing List" Cc: Sent: Monday, July 07, 2008 1:00 PM Subject: [gutvol-d] !@! Source files possible? > Is there any way of getting a view of the scanned pages from which the > following etext was developed?
From Bowerbird at aol.com Tue Jul 8 00:15:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 8 Jul 2008 03:15:39 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 006 Message-ID:

it's very good to do an early search for garbage characters... this is a search that benefits from using regular expressions.
the reg-ex is something along these lines: [\&\*\<\>\\\/\|\{\}\_] you'll usually also want to search for high-bit ascii characters...

6. search for lines with garbage characters, and edit as necessary.

> In this manner his father, |ust such another, had > _Mr. Ottinger elected to imbibe his "straight" > _{cr}ippling, the other. A chair fell, sliding across the > harsh, lik_{t}e a discordant bell clashing in the soste- > of men; envy was perceptible, bitterness *" ... for > "A 'little stroll.' *" Buckley produced a heavy > It enraged him that she was so collected; her body,* > *?217] > employed Mrs. Caley. The grea vast, indefinable peril, blacker than night, lo&ming > \ > the throes of a new piece, Mc*Ginty, and Gordon > *?330} > the trap. _{t} > "Sim," Gordon demanded sharply, "_{you} never > of wrath, his arm rose, with a finger indicating the* > "'Give it to him,' *" Gordon repeated thinly. "I > "?' dam' idiot," Gordon mumbled, "if I die out

18 hits. of note is that the second line -- "elected to imbibe" -- had also been separated from its paragraph, on both ends, so was reunited. in a similar vein, the seventh line -- "she was so collected" -- was improperly joined to the paragraph above, so i added a blank line. moreover, the line with "\" also involved a badly broken paragraph. we'll do a check on paragraphing later, but when you see a glitch, you should correct it right away, even if it's not the type you were "looking for" at the current moment. the most efficient time to do a fix is right when you see an error. don't wait until later to fix it.

22 more lines corrected, for a grand total of 57, on 6 routines... i'll be back tomorrow with the next routine in this series... -bowerbird
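the garbage-character sweep just described could be scripted along these lines -- a sketch only; the character class mirrors the post's reg-ex, and the function name and sample text are made up for illustration:

```python
import re

# character class from the post: ampersand, asterisk, angle brackets,
# backslash, slash, pipe, braces, underscore -- all rare in real prose.
# a separate pass for high-bit (non-ascii) bytes, e.g. [^\x00-\x7f],
# is also worth running, as the post suggests.
GARBAGE = re.compile(r"[&*<>\\/|{}_]")

def garbage_lines(text):
    """Return (line_number, line) pairs for lines holding garbage
    characters, so each hit can be checked against the page scan
    and edited by hand rather than changed globally."""
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if GARBAGE.search(line)]

sample = "a clean line\nhis father, |ust such another, had\nthe trap. _{t}\n"
for n, line in garbage_lines(sample):
    print(n, line)
```

reporting hits rather than auto-replacing matches the workflow above: each garbage character has a different correct fix, so a human decides.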
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080708/228933ee/attachment.htm From Bowerbird at aol.com Wed Jul 9 02:46:24 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 05:46:24 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 007 Message-ID: 7. search for letter-number or number-letter combos... (in reg-ex terms, this will be "[a-z][0-9]" or "[0-9][a-z]".) letter-number > PHWTED IN THE T7NITED STATES OlfAMEEIOA > 12Q4J > [247J > P25J number-letter > PHWTED IN THE T7NITED STATES OlfAMEEIOA > fl04] > 12Q4J > P25J 4 of each, with 3 overlapping, for a total of 5 unique lines... of these 5 unique lines, 4 of them involved _pagenumbers_. > PHWTED IN THE T7NITED STATES OlfAMEEIOA > fl04] > 12Q4J > [247J > P25J 5 more lines corrected, for a grand total of 62, on 7 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/62147f10/attachment.htm From Bowerbird at aol.com Wed Jul 9 03:33:45 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 06:33:45 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 001 Message-ID: i can see it's time for me to rewind and review the whole process, from the start. voila, "how to digitize a book, step by step..." first off, choose a book that you really want to get intimate with. you're gonna be doing an internal inspection of the thing's guts, so it might as well be a book you really like, or think you might... next, get a clean copy. try to get several copies, and use the cleanest. third, thumb through the book and familiarize yourself with it fully... are there illustration pages? are they numbered or unnumbered? 
how many pages are there in it? what is the first numbered page? how many unnumbered front-matter pages come before that page? number back from the first numbered page, hopefully back to _1_, which will be the title-page. if there's anything before the title-page, like a frontispiece, you should shuffle it so that it comes _afterward_. it's perfectly ok to shuffle unnumbered frontmatter pages, and even delete unnecessary pages (especially blank pages) so the numbers will come out right. fourth, start scanning... do a careful job. it _does_ make a difference. try to make the scans as straight as possible, so you'll get good o.c.r. also try to position 'em consistently on the scan-bed, for even margins. start scanning at the first numbered page, and set the default filename to that number. after that, the o.c.r. will increment the default filename _automatically_, which means you'll want to _skip_ unnumbered pages on this pass. you'll come back to 'em later, and name them _manually_. you'll also do the frontmatter pages, and name those manually as well... when you're done, every scan will be _named_ with its _pagenumber_, meaning you won't have to do any guesswork to know what file is what. -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/5d7b3797/attachment.htm From Bowerbird at aol.com Wed Jul 9 03:35:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 06:35:02 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 002 Message-ID: now, let's see what happens when you _don't_ name your files wisely... i'm gonna use this book, which is currently in-process over at d.p.: > http://www.pgdp.net/c/project.php?id=projectID4870dfe646daf it's called "the cabin on the prairie"... 
so, i've explored this book, and i can tell you a little bit about it... the numbered pages start at page 5. there are 5 pages _before_ that page, but one of them is blank (whew), so it can (and should) be eliminated, of course. so toss the blank page. there. now we have 4 unnumbered pages, which we'll name p001-p004, and then our numbered pages start on p005. so the numbers should work fine... _except_... no, they don't. why not? well, first, because we have two illustration pages, which were unnumbered in the original p-book... one is named 066.png, and the other is 191.png... (but remember that the numbering is whack here.) they came after pages 64 and 186, respectively, so we will rename them 064a.png and 186a.png, of course. and then, to preserve the recto/verso in the book, we'll introduce blank page-scans as 064b and 186b. that should fix our numbering... but no, it still doesn't work. why not? we have to dig a little deeper... well, because pages 176 and 177, who are known as 180.png and 181.png in the badly-named scan-set, are repeats of 178.png and 179.png. oh-oh, a glitch. (i don't make this up. if i did, you wouldn't believe it. go check for yourself, and you will see it for yourself.) this bug -- accidentally scanning a page-spread twice -- is relatively common. face it, human beings make errors. and it's much better to scan a spread twice than not at all. (which is another relatively-common error.) one of the reasons you want to name your scans wisely is so you can _catch_ these errors as soon as possible... when you are using an intelligent filenaming convention, and external filename mismatches internal pagenumber, you immediately know that you've got an error on the line. the content provider here was using opaque filenames, so he didn't have a clue that he had made that mistake... anyway, so i tossed out the duplicates, and now it works. 
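with transparent names, a script can catch that duplicate-spread glitch for you; here's a sketch, on the assumption that scans are named like cabinp005.png (the filenames below are made up):

```python
import re

def check_scan_sequence(filenames):
    """report duplicates and gaps in a pagenumber-named scan-set."""
    pat = re.compile(r'p(\d+)\.(?:png|jpg)$')
    numbers = sorted(int(m.group(1))
                     for f in filenames if (m := pat.search(f)))
    problems = []
    for prev, cur in zip(numbers, numbers[1:]):
        if cur == prev:
            problems.append(f"duplicate page {cur}")
        elif cur > prev + 1:
            problems.append(f"gap between pages {prev} and {cur}")
    return problems

# a doubled spread and a missing page surface immediately:
print(check_scan_sequence(
    ["cabinp005.png", "cabinp006.png", "cabinp006.png", "cabinp008.png"]))
```

with opaque filenames there is nothing for a check like this to grab onto, which is the whole point.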
i've uploaded the scans here, so you can see how it works: > http://z-m-l.com/go/cabin you see i also gave the files full-fledged names, starting with "cabin" as my unique filename used exclusively for this book, and then with the "p" prefix -- for "page"... the unnumbered illustration pages are "cabinp064a.png" and "cabinp186a.png", and their versos are "cabinp064b.png" and "cabinp186b.png"... all other names are transparent: "cabinp###.png" for page ###. so we've got a neatly-structured scan-set, and can go to work. this is the kind of filenaming structure you want to aim for, and it's easiest if you plan ahead so the o.c.r. app will do it for you... -bowerbird ************** Gas prices getting you down? Search AOL Autos for fuel-efficient used cars. (http://autos.aol.com/used?ncid=aolaut00050000000007) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/04e479e6/attachment.htm From Bowerbird at aol.com Wed Jul 9 10:25:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 9 Jul 2008 13:25:19 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 003 Message-ID: another in-process book being used as a test over at d.p. is this one: > http://www.pgdp.net/c/project.php?id=projectID4873f471bb8c9& detail_level=4 it's called "the crevice". i prepared a properly-named version of it too: > http://z-m-l.com/go/crvic the first numbered page here was page 1 of chapter 1 -- p001.png -- meaning the unnumbered frontmatter pages needed to be named with another prefix. i usually use "f" -- f001.png -- for frontmatter pages... i only kept 4 of the frontmatter pages -- you have to have an _even_ number of pages in each prefix so as to retain the recto/verso mode -- so they're named "crvicf001.png" through "crvicf004.png", of course. (well, actually, f004 is an illustration, so it's a .jpg, so it's "crvicf004.jpg".) 
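a convention this regular is easy to mechanize; a tiny sketch (the three-digit zero-padding is an assumption read off the names above):

```python
def scan_name(book, page, prefix="p", suffix="", ext="png"):
    """build names like 'crvicf004.jpg' or 'cabinp064a.png':
    book id + prefix (p=page, f=frontmatter) + zero-padded
    pagenumber + optional a/b suffix + extension."""
    return f"{book}{prefix}{page:03d}{suffix}.{ext}"

print(scan_name("crvic", 4, prefix="f", ext="jpg"))  # crvicf004.jpg
print(scan_name("cabin", 64, suffix="a"))            # cabinp064a.png
```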
so far, in my library, i always include "c001" and "c002" files as well
-- for "cover", so it's the front-cover and "hot" (i.e., linked) contents.
i will frequently simply duplicate the "f001" or "p001" page for "c001",
but in this book there was a scan of the cover, so i used it for c001...
anyway, this is just to demonstrate how to handle multiple prefixes
in the naming convention. there are lots more wrinkles that can be
engineered to handle any special cases you might encounter, but for
the most part the filenaming convention is very straightforward, because
i intentionally designed it that way, for it to be transparent.
you'll also note that this book, like the other one, has _illustrations_
on _unnumbered_pages_, this time located after pages 94 and 262...
-bowerbird
************** Get the scoop on last night's hottest shows and the live music
scene in your area - Check out TourTracker.com!
(www.tourtracker.com?NCID=aolmus00050000000112)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080709/2062142b/attachment.htm
From Bowerbird at aol.com Thu Jul 10 10:31:55 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 10 Jul 2008 13:31:55 EDT
Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 008
Message-ID:
we shouldn't have a paragraph that starts with a lowercase letter...
8. search for a double-line-end followed by a lowercase letter...
12 cases present. the double-line-end is signified here by "//".
> 'Most people go a good length before fighting with//me."
> see in my ramshackle house and used up ground, is//over me."
> _Mr. Ottinger elected to imbibe his "straight"//from the bottle--it was drunk...
> slowly and rolled like a flash over her plastered//skin.
> cern now was to get away, to take the money with//him.
> sake," Otty gasped, "get to him, the town'll be on//us."
> to the door; it said, "Gone fishing. Back to-//morrow."
> interior which absorbed them.//fl04]
> the other men would hate him; they would all want//me."
> that would go twice about the neck and then hang//some."
> \//he wouldn't have gone, anyway.
> fell sooner and night lingered late into//morning.
as you can see, this happens with _short_ lines, one or two words.
all 12 of these lines were in error, so were pulled up. in addition,
the "mr. ottinger" line was improperly broken _above_ itself as well,
and the "\" line was a glitch, so it was deleted, and the "fl04]" line
was a botched pagenumber (104), so that was corrected as well...
it is worth noting that this paragraphing check should be done
_early_ in the process, because some of the later routines involve
checks aimed at the _paragraph_level_ -- e.g., balanced quotes --
so it's important that the paragraphs be correct for them to work.
of course, the paragraphs need to be correct for their _own_ sake;
that should go without saying. since it is just as easy to ensure that
they are correct at the _beginning_ of the workflow as at the _end_,
you might as well do it at the beginning.
15 more lines corrected, for a grand total of 77, on 8 routines...
i'll be back tomorrow with the next tip in this series...
-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/149cf091/attachment.htm
From Bowerbird at aol.com Thu Jul 10 13:59:12 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 10 Jul 2008 16:59:12 EDT
Subject: [gutvol-d] that didn't take long
Message-ID:
well, in case you were wondering how long it would take --
once apple opened up iphone for independent developers --
for iphone e-book apps to debut, the answer is "not long".
i don't think the store will "officially" open until tomorrow, but there's already news e-books have made an appearance. here's the link for "jane eyre": > http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=284928522&mt=8 and here's "pride and prejudice": > http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=284922530&mt=8 $.99 each. for public-domain books. boing-boing weighs in: > http://gadgets.boingboing.net/2008/07/10/iphone-app-store-sel.html -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/fbb6a4cc/attachment.htm From Bowerbird at aol.com Thu Jul 10 15:45:43 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jul 2008 18:45:43 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 004 Message-ID: ideally, it's best if you straighten and crop your scans, because you will get much better o.c.r. results if you do. it's probably obvious why straight lines are better, since the letters will more closely resemble the "idealized" forms. well-cropped scans also give better results, because margin-marks are not misrecognized, and you can set up separate "zones" to capture the runheads and the pagenumbers that might be located down at the bottom of any pages. that's the point of this message, that you should _scan_and_retain_runheads_. the policy at d.p. is to chop them off before the pages go in front of proofers. that's just misguided. those runheads and pagenumbers give you _bearing_ in navigating the book. they help you avoid getting things badly screwed up. they can be deleted later down in the workflow. leave them in there for now... 
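and deleting them later really is cheap; here's a sketch of what "later" might look like, on the assumption that each page's text starts with an all-caps runhead line and ends with a bare pagenumber line:

```python
def strip_runhead_and_pagenumber(page_text):
    """drop a leading all-caps runhead line and a trailing
    bare-number pagenumber line, once they've served their purpose."""
    lines = page_text.splitlines()
    if lines and lines[0].strip().isupper():   # runhead, e.g. "MOUNTAIN BLOOD"
        lines = lines[1:]
    if lines and lines[-1].strip().isdigit():  # pagenumber, e.g. "217"
        lines = lines[:-1]
    return "\n".join(lines)

page = "MOUNTAIN BLOOD\nthe body of the page goes here.\n217"
print(strip_runhead_and_pagenumber(page))  # the body of the page goes here.
```

(a real book would need a smarter runhead test -- an all-caps first line can also be a chapter heading -- but the point stands: throwing runheads away is a trivial last step, while getting them back is impossible.)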
-bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/b0dd5b88/attachment.htm From Bowerbird at aol.com Thu Jul 10 17:04:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 10 Jul 2008 20:04:52 EDT Subject: [gutvol-d] how to digitize a book, step by step -- 005 Message-ID: over at d.p., someone said this: > In pre-processing, I make corrections that can be done: > (a) Without too much effort. If something's going to take a lot of time > to fix in preprocessing, it's a better use of time to have P1 fix it. > (And we're not short of P1'ers). > (b) And there is a high probability that the correction is right, > rather than introducing an error. I don't make automatic changes > in prep that are likely to introduce errors. off the top of the dome, this sounds rather reasonable... but on reflection, it's almost 180-degrees wrong. (about 150.) first, on point (a), it's almost _always_ faster to fix something in preprocessing than having it go in front of the proofers instead. at least it _should_ be. the main reason is that, in preprocessing, the computer is doing the grunt-work of _finding_ the glitches, and -- in the realm of a good text-editor or dedicated tool -- applying the fix is quite straightforward and maximally efficient. moreover, when you step through each type of glitch individually, the process of applying the fix becomes even more streamlined... on point (b), the plain fact is that almost no fix can be done "blind". this doesn't necessarily mean that you have to examine the _scan_, but it _does_ mean that you have to grok the content of the context, and the computer just can't do that. 
and even if you make a flock of changes without looking at each one
_before_ it gets enacted, you must peruse the list of them _afterward_,
just to make sure...
so, how did this person get derailed into saying what they said?
easy. they don't have a good tool to do "preprocessing" at d.p.,
so they're working under a severe handicap clouding their thinking.
their blinders mean they can't see how useful preprocessing can be.
a good interface for a decent o.c.r. clean-up tool is fairly simple.
you need an editing capability, side-by-side with a scan viewer,
and a solid means of isolating and jumping to problematic text...
i programmed such a tool years ago -- called "banana cream" --
and i've decided that in the light of recent improvements, i will be
releasing a stripped-down version of it to the public very soon...
and as the series on "how to do preprocessing" continues, i will
incorporate those routines into the program to flesh it out a bit.
i could've released this program years ago -- and intended to --
but since there were several d.p. people among my antagonists
here on this listserve, i decided to hold it back instead. in view of
their silence recently, there's no need for continued punishment...
given my app, d.p. should be able to see how to do preprocessing.
-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080710/a6ed5fcb/attachment.htm
From vlsimpson at gmail.com Thu Jul 10 22:58:21 2008
From: vlsimpson at gmail.com (V. L. Simpson)
Date: Fri, 11 Jul 2008 00:58:21 -0500
Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 006
In-Reply-To:
References:
Message-ID:
On Tue, Jul 8, 2008 at 2:15 AM, wrote:
> it's very good to do an early search for garbage characters...
> this is a search that benefits from using regular expressions. > the reg-ex is something along these lines: [\&\*\<\>\\\/\|\*\{\}\_] Why all the backslashes in a character class? And why "*" character twice? From Bowerbird at aol.com Thu Jul 10 23:45:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jul 2008 02:45:21 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 006 Message-ID: v.l. said: > Why all the backslashes in a character class? because i don't speak reg-ex fluently? because i tried it and it seemed to work, so heck, that was good enough for me? because i know someone will correct me when i am wrong, or even just somewhat inefficient? because real programmers code our own find routines, due to reg-ex being messy-complexy _and_ poke-slow? > And why "*" character twice? because i want to make sure it gets them _all_... :+) -bowerbird p.s. but seriously, thank you for your input on this; indeed, if you would be so kind as to turn all of my plain-english descriptions into reg-ex, it'd be great for those people out there who _do_ rely on reg-ex. ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080711/a3d7bb3d/attachment.htm From Bowerbird at aol.com Fri Jul 11 00:42:08 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jul 2008 03:42:08 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- roadmap2 Message-ID: ok, the two parallel proofings of "mountain blood" have concluded. there are 129 differences between the two proofings, shown here: > http://z-m-l.com/go/mount/129_differences_a-vs-b_total.html considering there were 9000+ lines in this book, that's pretty good... 
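(a comparison like that is straightforward to reproduce with python's standard difflib; the two little strings below just stand in for the two proofed texts:)

```python
import difflib

proofing_a = "he wag lying there\nthe road to town\n"
proofing_b = "he was lying there\nthe road to town\n"

# unified_diff reports only the lines where the two proofings
# disagree -- the 129-difference report above, in miniature.
diff = list(difflib.unified_diff(
    proofing_a.splitlines(), proofing_b.splitlines(),
    fromfile="mountain-a", tofile="mountain-b", lineterm=""))
print("\n".join(diff))
```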
more importantly, from my perspective, the preprocessing i did seems to have left _just_3_errors_ in this book of over 360 pages, which these human p1 proofers detected... here they are: > If they took away the chair, Gordon knew, he wag should be: > If they took away the chair, Gordon knew, he was > "Why, damn it fell, Gord!" exclaimed an individual, should be: > "Why, damn it t'ell, Gord!" exclaimed an individual, > grip of these blood-money men; we'll have a state > la wed bank; a rate of interest a man can carry without should be: > grip of these blood-money men; we'll have a state > lawed bank; a rate of interest a man can carry without i'll have more to say about this tomorrow; this is enough for now... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080711/32defd66/attachment.htm From Bowerbird at aol.com Fri Jul 11 09:43:13 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 11 Jul 2008 12:43:13 EDT Subject: [gutvol-d] data on the mountain experiment Message-ID: here's some more data on the differences between the two parallel proofings on the "mountain blood" experiment that roger frank ran over on the d.p. site. before i just showed you the differences: > http://z-m-l.com/go/mount/129_differences_a-vs-b_total.html appended, i show which of the two parallel-proofings got which lines wrong... -bowerbird page:line --> these are the 59 lines that mountain-a got wrong 0003:0011 ALFRED.A.KNOPF 0004:0003 ALFRED A. KNOPF, inc. 0020:0015 It was well known that the first George Gordon Mac-Kimmon--the 0028:0002 'Most people go a good length before fighting with 0041:0001 -*ing or benevolent sentences; these, with appropriate 0091:0020 the palpitating day. 
One of Gordon's nephews-a 0097:0003 you for that amount,'the skinflint says, and sells 0098:0001 *nuto passage of a symphony; "but it's all one to me 0098:0002 --there's nothing else they can take; I'm free, free 0100:0003 CLARE'S funeral deducted a further sum 0100:0011 or for pleasure. It was the hottest hour of the 0100:0015 been cropping the grass in. the broad, shallow gutter 0116:0001 *-ers, from the farm. As he approached he saw that 0119:0001 *-flected in the warmer tones of his replies; a new 0121:0031 spice. Still his grasp tightened upon her.hand, drew 0124:0001 *-luctant eagerness. He kissed her again and again, 0129:0032 elements, to the bitter mountain winters, the ruthless 0130:0001 suns of the August valleys. He was as seasoned, 0130:0024 Opposite Gordon Malummon sat a slight, feminine 0135:0026 -smart crowds and gay streets and shops on fire 0142:0001 *-ing that it must be a messenger from the village, dispatched 0159:0006 were all stirring him up a little; you didn't say any-thing--" 0160:0004 IT was his own home to which he returned, the 0161:0002 Lattice, in white, with a dark shawl drawn about 0166:0007 an effort to keep his impatience from his voice, "I 0174:0001 *-ing General Jackson at his heels, he picked the dog 0178:0004 THE spring night was potent, warm and 0182:0004 THE memory of Meta Beggs was woven like a 0182:0012 He wished to repay her for that injury to his selfesteem. 0184:0004 HE drove over the road that lay at the base 0191:0004 META BEGGS saw Gordon at the same 0196:0017 "A 'little stroll.' " Buckley produced a heavy 0197:0030 It was seen immediately that the skull was broken-a 0205:0004 ON Sunday he strolled soon after breakfast 0214:0001 *-atory position. He would extract the last penny of 0219:0014 hundred per cent, increase." 0227:0001 *-nolia flowers, would never thicken and grow rough. 
0234:0003 RUTHERFORD BERRY and Effie, Barnwell 0238:0001 accomplished fact; Lattice's wishes, her quality of 0249:0004 "I'VE got something for you," Gordon said suddenly. 0249:0030 "I've been thinking of you in-those pretty clothes," 0252:0004 BUT, curiously, sitting alone, he gave little 0253:0013 a little from his blood. She demanded a great deal---a 0255:0004 was insanity. Simeon Caley's wife should 0256:0003 GORDON MAKIMMON made one step toward 0256:0004 her. Lattice held the box in an extended 0264:0003 A HOARSE, thin cry sounded from within 0289:0004 TWENTY-SEVEN hundred and ninety 0291:0013 everywhere; Gordon had pitched the headstal into 0301:0013 Alexander 'll take your horse. He's only at the back 0314:0003 THE year, in the immemorial, minute shifting 0319:0003 GORDON MAKIMMON, absorbed in the 0326:0003 EVEN if he proved able to buy out Simmons, 0333:0002 Mrs. Hollidew in. the sitting room. He would wake 0347:0007 "The two hundred dollar dog!The joke on 0349:0003 THE cold sharpened; the sky, toward evening, 0351:0025 -you can take it or leave it--if you'd drive again? 0356:0004 BUCKLEY SIMMONS was late in arriving 0361:0004 GORDON MAKIMMON rose to a sitting page:line --> these are the 24 lines that mountain-b got wrong 0037:0007 stood the Makimmon dwelling. Originally a foursquare, 0098:0009 They stood before the dark, porchless fagade 0100:0025 in that banal setting, suddenly grew unbearable.... 0100:0026 There was no life in Greenstream.... 0122:0009 But the things I want to hear may not 0123:0003 my heart, something has gone, and 0125:0014 medicine. Wait here for me, I will come 0125:0016 in you. Love makes everything 0142:0032 away, leaving her pale. Her lips trembled, A palpable, 0151:0011 in silkaleen and back in Al mohair, it'll stand you 0155:0035 the options, bring you the result in a couple of weeks.] 0213:0029 the astute storekeeper into such a satisfactory, retail-* 0232:0019 unintelligible period about French widows and pink.... 
0232:0020 "Buried before my time," he proclaimed. He 0275:0024 denned her breasts and a hip as crisply as though 0327:0027 He might get them all together, explain, persuade.... 0327:0028 Goddy! it was for their good. They needn't 0331:0025 but not Kenny's for nineteen years." Another bore, 0341:0007 the prospect of release from, its bewildering fullness. 0341:0010 in. the return of the options to a county enhanced 0346:0005 "all them that Pompey had and you bought?" 0349:0012 A thread of light appeared against the fagade of 0363:0019 "Cm on," he called impatiently; "you'll take no 0368:0002 him to where, on. the bureau, a lamp had been left. ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080711/3132ba1d/attachment-0001.htm From Bowerbird at aol.com Fri Jul 11 22:29:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 12 Jul 2008 01:29:42 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 009 Message-ID: 9. search for all lines that start with a space. > ... the strain of lawlessness brought so many years > ... rascal," Gordon heard him mutter, "spendthrift. > ... There was no life in Greenstream.... > ... the medicine. Wait here for me, I will come > ... trust in you. Love makes everything > ... Pompey left one of the solidest estates in this > ... What would you say to a flat eight dollars an > ... was a lucky man, > ... never saw the women > ... Won't you come up and smoke a cigarette? > ... it's rising," he proclaimed, in a loud, singsong > ... now." > ... nobody saw." > ... by the South Fork entrance ... through > ... that is all the Stenton doctor will say; a piece > ... "Buried before my time," he proclaimed. He > ... waiting ... 
I couldn't wait any longer, Gordon,
> ... quick as you can ... the doctor."
> ... this time. Tell your husband he can pay me
> ... I've got a lot of money laid out. What's been
> ... it's the blood. I've studied considerable about
> ... never again! I want--"
> ... Goddy! it was for their good. They needn't
> ... others ... new courage, example of bigness
24 of them. on all of them, the ellipsis was at the start of the line,
so the space simply needed to be deleted. easy enough.
in addition, however, 4 lines had dropped an opening quote:
> "... rascal," Gordon heard him mutter, "spendthrift.
> "... it's rising," he proclaimed, in a loud, singsong
> "... by the South Fork entrance ... through
> "... quick as you can ... the doctor."
also, 2 more lines were in a poem, so needed to be indented,
along with another 2 lines that accompanied them.
> ... was a lucky man,
> Rip van Winkle ... grummmble
> ... never saw the women
> At Coney Island swimming ...
32 more lines corrected, for a grand total of 109, on 9 routines...
i'll be back tomorrow with the next tip in this series...
-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080712/d9122497/attachment.htm
From Bowerbird at aol.com Sat Jul 12 13:54:43 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 12 Jul 2008 16:54:43 EDT
Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 010
Message-ID:
10. search for all one-character lines.
8 lines present, 3 of which were correct.
> X (correct, for chapter x)
> O (deleted as incorrect)
> X (correct, for chapter x)
> V (deleted as incorrect)
> \ (deleted as incorrect)
> X (correct, for chapter x)
> T (deleted as incorrect)
> : (moved up to end of previous line)
surrounding blank lines were also closed up, where appropriate.
5 more lines corrected, for a grand total of 114, on 10 routines...
-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080712/42bb67f4/attachment.htm
From Bowerbird at aol.com Sun Jul 13 23:27:37 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 14 Jul 2008 02:27:37 EDT
Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 011
Message-ID:
11. search for all lines with a period-comma, or a comma-period.
period-comma:
> Gordon's lips formed a silent exclamation.,.
> to throw it away--the vultures, Hollidew and Co.,
> girl until--until Buckley.,. until to-night, now.
> Barnwell K., through an oversight, was defrauded
> Barnwell K., valiantly endeavoring to emulate his
> to its goal,., Gordon saw now that Mrs. Caley
> your wife. Miss Beggs oughtn't.,. she isn't anything
comma-period:
> Gordon's lips formed a silent exclamation.,.
> girl until--until Buckley.,. until to-night, now.
> to its goal,., Gordon saw now that Mrs. Caley
> your wife. Miss Beggs oughtn't.,. she isn't anything
7 lines presented with a period-comma. of those 7, 4 of 'em also
presented as containing a comma-period. those 4 lines were the
incorrect ones; in each, the misrecognized punctuation should
have been an ellipsis.
4 lines corrected, for a grand total of 118, on 11 routines...
i'll be back tomorrow with the next tip in this series...
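for the reg-ex inclined, this routine sketches out like so -- it assumes a mixed period/comma run is a mangled ellipsis, so every change still wants a look at the scan:

```python
import re

def flag_mangled(line):
    """find '.,' or ',.' pairs -- the likely ocr'd ellipses."""
    return re.findall(r'\.,|,\.', line)

def fix_mangled(line):
    """replace each mixed period/comma run with an ellipsis.
    (a guess -- review every change against the page scan.)"""
    return re.sub(r'\.[.,]*,[.,]*|,[.,]*\.[.,]*', '...', line)

print(fix_mangled("Gordon's lips formed a silent exclamation.,."))
```

that yields the line with a proper three-dot ellipsis, while lines like "Hollidew and Co.," -- a legitimate abbreviation before a comma -- never match the flagging pattern's adjacent pair in the first place only when the period and comma touch, which is why the human check stays in the loop.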
-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/320f6428/attachment.htm
From Bowerbird at aol.com Mon Jul 14 09:55:40 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 14 Jul 2008 12:55:40 EDT
Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 012
Message-ID:
12. search for all lines with a period at the start of the line...
> . I wanted to see you; ah, yes." He 41 line-beginning-ellipsis
> . in the beginning they had let their wide share 44 line-beginning-ellipsis
> . or for pleasure. It was the hottest hour of the 100 line-beginning-ellipsis
> . it seemed so useless. You were like a ... a 122 line-beginning-ellipsis
> . in my heart, something has gone, and 123 line-beginning-ellipsis
> .stones, wedding bands, gold pins and 238 nothing(speck)
> . it was insanity. Simeon Caley's wife should 255 line-beginning-ellipsis
of the 7 lines presenting, 6 were cases where the line-starting period
was actually a line-starting ellipsis, and they were changed accordingly.
in the 7th (.stones), the line-starting period was a speck, so was deleted...
7 more lines corrected, for a grand total of 125, on 12 routines...
i'll be back tomorrow with the next suggestion in this series...
-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/4e11dcc3/attachment.htm
From Gutenberg9443 at aol.com  Mon Jul 14 12:16:26 2008
From: Gutenberg9443 at aol.com (Gutenberg9443 at aol.com)
Date: Mon, 14 Jul 2008 15:16:26 EDT
Subject: [gutvol-d] continued confusion over at distributed proofreaders
Message-ID: 

In a message dated 7/1/2008 9:30:32 P.M. Mountain Daylight Time,
gbuchana at teksavvy.com writes:

would like to see a system like DP actually _introduce_ a
specific known error or two into each page and not accept
the page until the proofers had found and corrected it.  I
want the system to be able to verify that a known level of
dilligence is being taken.

This is an extremely good idea. When I was a police officer
I was taught to put a deliberate typo on every page of a
statement or confession and then have the person making
the statement correct and initial the typo. That way I could
demonstrate to a jury that the person did have the chance
to read and if necessary correct errors--and since people
never tell the same story twice exactly the same way,
unless they're bards, often real errors were caught and
corrected while doing this.

Anne

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/c8fc6e25/attachment.htm
From jeroen.mailinglist at bohol.ph  Mon Jul 14 14:47:46 2008
From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account))
Date: Mon, 14 Jul 2008 23:47:46 +0200
Subject: [gutvol-d] continued confusion over at distributed proofreaders
In-Reply-To: 
References: 
Message-ID: <487BC982.5070003@bohol.ph>

A good way to get controversial or bad reports approved is to
introduce some obvious problems (typos, etc.) in some places.
This will then distract the attention from the real issues, and move the thing through bureaucracy, with people happy they've been able to make some comments and improve it.... It can work both ways..... Jeroen. Gutenberg9443 at aol.com wrote: > > In a message dated 7/1/2008 9:30:32 P.M. Mountain Daylight Time, > gbuchana at teksavvy.com writes: > > would like to see a system like DP actually _introduce_ a > specific known error or two into each page and not accept > the page until the proofers had found and corrected it. I > want the system to be able to verify that a known level of > dilligence is being taken. > > > > This is an extremely good idea. When I was a police officer > I was taught to put a deliberate typo on every page of a > statement or confession and then have the person making > the statement correct and initial the typo. That way I could > demonstrate to a jury that the person did have the chance > to read and if necessary correct errors--and since people > never tell the same story twice exactly the same way, > unless they're bards, often real errors were caught and > corrected while doing this. > > Anne > > > > **************Get the scoop on last night's hottest shows and the live music > scene in your area - Check out TourTracker.com! > (http://www.tourtracker.com?NCID=aolmus00050000000112) > > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Mon Jul 14 15:26:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 14 Jul 2008 18:26:50 EDT Subject: [gutvol-d] continued confusion over at distributed proofreaders Message-ID: as i said in response to this originally, this "solution" isn't needed, because proofers _are_ paying good attention, as evidenced by their _high_accuracy_. 
moreover, with good "preprocessing",
which can take the error-rate down to
next-to-nothing before proofers get it,
a one-round proofing will be sufficient.
actually, it's more like a smooth-reading.

as example, consider "blood mountain".

there were 3 errors after preprocessing.
both parallel proofings found all of 'em.

you're being distracted by a concern over
a nonexistent "problem".  open your eyes.

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080714/f993ef48/attachment.htm
From schultzk at uni-trier.de  Tue Jul 15 01:48:11 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Tue, 15 Jul 2008 10:48:11 +0200
Subject: [gutvol-d] continued confusion over at distributed proofreaders
In-Reply-To: 
References: 
Message-ID: <4F77EA90-819A-4F32-A995-91F3278EC13E@uni-trier.de>

Hi Everybody,

I agree with Bowerbird. There is no need for such "quality control".
I do admit that some of the arguments mentioned have their place
in the situations mentioned, yet they are not applicable to DP.
DP texts are proofed, in general, by a couple of proofers.
Therefore injected errors would be redundant, and that kind of
quality control excessive. Furthermore, the proofers are not
pressed to finish up, and they are allowed to go back and
check again in case they are unsure!

regards
Keith.

On 15.07.2008 at 00:26, Bowerbird at aol.com wrote:

> as i said in response to this originally,
> this "solution" isn't needed, because
> proofers _are_ paying good attention,
> as evidenced by their _high_accuracy_.
>
> moreover, with good "preprocessing",
> which can take the error-rate down to
> next-to-nothing before proofers get it,
> a one-round proofing will be sufficient.
> actually, it's more like a smooth-reading.
> > as example, consider "blood mountain". > > there were 3 errors after preprocessing. > both parallel proofings found all of 'em. > > you're being distracted by a concern over > a nonexistent "problem". open your eyes. > > -bowerbird > > > > ************** > Get the scoop on last night's hottest shows and the live music > scene in your area - Check out TourTracker.com! > (http://www.tourtracker.com?NCID=aolmus00050000000112) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/ce65d24a/attachment.htm From Bowerbird at aol.com Tue Jul 15 02:37:02 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jul 2008 05:37:02 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 014 Message-ID: capital-i, followed by a number of low-probability letters... 14. I[abcdeghijklopquvwxyz] > face, with its heavy, good features and slow-Idndling > Ill > Ill > "I do! Idol" He turned and left them, striding > Ill 5 lines presented, with each of the 5 containing an error... 5 more lines corrected, for a grand total of 130, on 14 routines... i'll be back tomorrow with the next tip in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/4e20692a/attachment.htm From Bowerbird at aol.com Tue Jul 15 09:33:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jul 2008 12:33:50 EDT Subject: [gutvol-d] the proofreaders are not the problem at distributed proofreaders Message-ID: the proofreaders are _not_ the problem at distributed proofreaders, no sir... the problem is the awful workflow to which the proofers are being subjected. _all_ of the data i have analyzed from the various d.p. experiments -- and it has been a lot of data, i know, i don't blame you if you haven't followed it all -- has made it abundantly clear that the proofers are doing a _great_ job... if i were to grade their performance, i would give them a good solid "a"... they don't get it all right the first time, but they rarely introduce any errors. the d.p. _administrators_, though, have a significantly worse track-record. in 2003, i would have given them a "b", based on the big implicit potential. by 2004, their grade had dropped to a "b-". by 2005, a "c+". 2006, a "c". 2007 would have netted them a "c-". and now in 2008, it's clearly a "d"... consider the "how-to-preprocess" series that i've been running recently... i've already listed over a dozen simple, predictable routines to find errors. all of 'em should be immediately obvious to any person familiar with o.c.r. every one has found errors in the text against which they're being tested, and returned very few false-alarms. so one simple question poses itself: why were _none_ of these routines used in the preprocessing of this text? seriously, haven't the administrators at d.p. learned _anything_ about finding and fixing errors in o.c.r.? they've digitized literally _thousands_ of books, yet they don't have the most primitive of routines in place yet... they should be _extremely_embarrassed_ by their miserable performance. 
instead of using the computer to find and fix glitches, they leave it to their human volunteers. this is a waste of the resources being donated to them. indeed, more than being embarrassed, they should be ashamed. -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/d47d43f3/attachment.htm From Bowerbird at aol.com Tue Jul 15 10:47:08 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 15 Jul 2008 13:47:08 EDT Subject: [gutvol-d] what d.p. and rfrank need to do -- a 10-point plan Message-ID: here's what roger frank needs to do to get his preprocessing program going... 1. clean up the paragraphs. (have to do it sooner or later, so do it sooner.) 2. put the top-blank-line on appropriate pages. (so proofers don't have to.) 3. clear up the spacey quotes. (literally _hundreds_and_hundreds_ of these.) 4. standardize ellipses. (so proofers skip the merry-go-round of changes.) 5. standardize em-dashes. (here too, skip the changes merry-go-round.) 6. dehyphenate. (or, better yet, delay that step until _after_ the proofing.) 7. "clothe" hyphens. (or, better yet, just stop doing that stupid d.p. policy.) 8. run the routines that find the obvious o.c.r. errors (as i've demonstrated.) 9. do a much better job of formulating the "good words" list. (saves time!) 10. congratulate yourselves for a job well done... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! (http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080715/f6316429/attachment-0001.htm From Bowerbird at aol.com Wed Jul 16 01:38:15 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 16 Jul 2008 04:38:15 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 015 Message-ID: 15. search for all lines with a period-whitespace followed by lowercase, except controlling for cases where it was ellipse-whitespace-lowercase... 6 of the presenting cases were specks misrecognized as a period, which, of course, is one of the main targets of this search routine: > been cropping the grass in. the broad, shallow gutter > himself pointedly in. its defiance. > Mrs. Hollidew in. the sitting room. He would wake > in. the return of the options to a county enhanced > quickly away; the. house was without a > him to where, on. the bureau, a lamp had been left. 5 of the presenting cases were ones where the period was really an ellipse: > . in the beginning they had let their wide share > . or for pleasure. It was the hottest hour of the > . it seemed so useless. You were like a -- a > . in my heart, something has gone, and > . it was insanity. Simeon Caley's wife should 2 of the presenting cases were other instances of a misrecognized ellipse: > girl until--until Buckley.,. until to-night, now. > your wife. Miss Beggs oughtn't.,. she isn't anything the remaining 3 of the presenting cases were _correct_, as they involved a last-name represented as a single letter: > I had promised to bring Barnwell K. the next time." > red cloth; on one side Barnwell K. sat flanked by > K. and the delicate Rose, left after 13 more lines corrected, for a grand total of 143, on 15 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get the scoop on last night's hottest shows and the live music scene in your area - Check out TourTracker.com! 
(http://www.tourtracker.com?NCID=aolmus00050000000112) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080716/c1129ed5/attachment.htm From russellbell at gmail.com Wed Jul 16 17:46:53 2008 From: russellbell at gmail.com (Russell Bell) Date: Wed, 16 Jul 2008 18:46:53 -0600 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' Message-ID: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' but it is not included in Gutenberg's edition thereof. Why? From gbnewby at pglaf.org Wed Jul 16 19:03:03 2008 From: gbnewby at pglaf.org (Greg Newby) Date: Wed, 16 Jul 2008 19:03:03 -0700 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> Message-ID: <20080717020302.GA15135@mail.pglaf.org> On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: > 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' > but it is not included in Gutenberg's edition thereof. Why? I don't really know. But most likely this was because whoever digitized the text didn't provide the appendix. 
-- Greg From hyphen at hyphenologist.co.uk Wed Jul 16 23:50:46 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Thu, 17 Jul 2008 07:50:46 +0100 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <20080717020302.GA15135@mail.pglaf.org> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> <20080717020302.GA15135@mail.pglaf.org> Message-ID: <000901c8e7d9$7d4f2f00$77ed8d00$@co.uk> Greg Newby wrote >On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: >> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman' >> but it is not included in Gutenberg's edition thereof. Why? >I don't really know. But most likely this was because whoever >digitized the text didn't provide the appendix. -- Greg Is Russell volunteering to do it? Somebody should! Dave Fawthrop. From jayvdb at gmail.com Thu Jul 17 00:05:11 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 17 Jul 2008 17:05:11 +1000 Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman' In-Reply-To: <000901c8e7d9$7d4f2f00$77ed8d00$@co.uk> References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com> <20080717020302.GA15135@mail.pglaf.org> <000901c8e7d9$7d4f2f00$77ed8d00$@co.uk> Message-ID: On Thu, Jul 17, 2008 at 4:50 PM, Dave Fawthrop wrote: > > Greg Newby wrote > >>On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote: >>> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and > Superman' >>> but it is not included in Gutenberg's edition thereof. Why? > >>I don't really know. But most likely this was because whoever >>digitized the text didn't provide the appendix. > -- Greg > > Is Russell volunteering to do it? Somebody should! 
Wikisource has the complete text; we used the PG text and bartleby for the appendixes: http://www.bartleby.com/157/index.html -- John From Bowerbird at aol.com Thu Jul 17 01:13:13 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 17 Jul 2008 04:13:13 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 016 Message-ID: ok, let's spend a few days talking about how to clean up the paragraphing. basically, this means that you've got an empty line between all paragraphs, and no empty lines within any paragraphs. the o.c.r. gets most of this right, most of the time, and the exceptions are fairly easy to detect automatically... once you've got the paragraphing correct, you can then have the machine go ahead and fix almost all the spacey quotes automatically and correctly... that's a good thing. (in our current test book, there were no spacey quotes, but in some of the other test books, there are many, sometimes over 1000.) our first test is lines after a blank line which start with a lower-case character. 16. double-line-end (here signified by "//") followed by lowercase > 'Most people go a good length before fighting with//me." > see in my ramshackle house and used up ground, is//over me." > _Mr. Ottinger elected to imbibe his "straight"//from the bottle--it was drunk with > mutual assurances > slowly and rolled like a flash over her plastered//skin. > cern now was to get away, to take the money with//him. > sake," Otty gasped, "get to him, the town'll be on//us." > to the door; it said, "Gone fishing. Back to-//morrow." > interior which absorbed them.//fl04] > the other men would hate him; they would all want//me." > that would go twice about the neck and then hang//some." > \//he wouldn't have gone, anyway. > fell sooner and night lingered late into//morning. all of these 12 cases were ones where a paragraph was incorrectly split, so they were rejoined. 12 more lines corrected, for a grand total of 155, on 16 routines... 
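for the curious, the double-line-end check can be sketched in a few lines of python. this version closes the break automatically whenever the continuation starts with a lowercase letter; in practice you would want to eyeball each hit first (the function name is mine):

```python
def close_false_breaks(lines):
    # if a blank line is followed by a line starting with a lowercase
    # letter, the paragraph was almost certainly split incorrectly,
    # so drop the blank line and let the paragraph rejoin
    out = []
    for line in lines:
        if out and out[-1] == '' and line[:1].islower():
            out.pop()              # remove the spurious paragraph break
        out.append(line)
    return out

before = ["'Most people go a good length before fighting with",
          '',
          'me."']
print(close_false_breaks(before))
```

a continuation starting with a capital or a quotemark is left alone, so genuine paragraph breaks survive untouched.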
be back tomorrow with the next tip in this series...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080717/87d477d1/attachment.htm
From hart at pglaf.org  Thu Jul 17 09:44:01 2008
From: hart at pglaf.org (Michael Hart)
Date: Thu, 17 Jul 2008 09:44:01 -0700 (PDT)
Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman'
In-Reply-To: <20080717020302.GA15135@mail.pglaf.org>
References: <688269960807161746ib821bd2n1febc78df04d96ca@mail.gmail.com>
	<20080717020302.GA15135@mail.pglaf.org>
Message-ID: 

Actually, this is a GREAT appendix, and if anyone
is willing to work on it, I will help walk it through. . . .

Thanks!!!

Michael

On Wed, 16 Jul 2008, Greg Newby wrote:

> On Wed, Jul 16, 2008 at 06:46:53PM -0600, Russell Bell wrote:
>> 'Maxims for Revolutionists' is an appendix to Shaw's 'Man and Superman'
>> but it is not included in Gutenberg's edition thereof. Why?
>
> I don't really know. But most likely this was because whoever
> digitized the text didn't provide the appendix.
>   -- Greg
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d
>
From russellbell at gmail.com  Thu Jul 17 18:26:12 2008
From: russellbell at gmail.com (Russell Bell)
Date: Thu, 17 Jul 2008 19:26:12 -0600
Subject: [gutvol-d] Maxims for Revolutionists from Shaw's 'Man and Superman'
Message-ID: <688269960807171826s1c133dfx41cc90288a81ae88@mail.gmail.com>

I just downloaded it and the other appendix from Bartleby's. I'll
look up Gutenberg's rules for submission. Should I make them separate
items or add them to 'Man and Superman'?
From Bowerbird at aol.com Fri Jul 18 02:51:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Jul 2008 05:51:52 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 017 Message-ID: as our next step in ensuring the integrity of our paragraphing, we will search for paragraphs that were not terminated with something we consider "reasonable", which would be a period, exclamation-point, or question-mark, or any of those three things followed by a single-quote-mark and/or a double-quote-mark... 17. paragraphs terminated incorrectly you'll see that we get a lot of hits here, so many that i've appended them, after sorting them into some basic categories, which i will discuss here... so, we churned up lots of stuff... first, titles and forward matter spring up, but we weed those out quickly, since they aren't really "paragraphs" at all... next, there are a lot of lines that "end" with a colon. they're very plentiful. the colon signifies that "a block of some type follows this". sometimes it's just a brief statement from a person, mere dialog. but other times it can be quite an extensive block, such as a letter or a sign or a telegram or whatever. so the colon is legitimate here, but it will also be useful to us later, when we do _formatting_ -- because that "block that follows" will need to be treated -- so it's a good thing we discovered this quick method of finding those blocks. the next category is the _em-dash_, which is also a legitimate termination, so we'll add those checks to the routine in the future. this is how you learn. we also get one _en-dash_, which would be an invalid termination, except it's actually an o.c.r. error, so we will just fix it, thereby removing the flag... we also get -- and fix -- a misrecognition of exclamation-point as capital-i. and a misrecognition of a period as a comma, so we've now fixed three lines. 
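the expanded check -- with the colon and the em-dash admitted as legitimate terminators, per the hits above -- might look like this in python (a sketch; real o.c.r. text has more wrinkles than one regex can handle):

```python
import re

# a paragraph may legitimately end with . ! or ? (optionally followed
# by single- and/or double-quote marks), or with a colon or an em-dash;
# anything else gets summoned for human attention
GOOD_END = re.compile(r'([.!?][\'"]*|:|--)$')

def flag_bad_terminations(paragraphs):
    return [p for p in paragraphs if not GOOD_END.search(p.rstrip())]

paras = [
    'She said promptly:',                  # colon: legitimate
    'his desire, his--',                   # em-dash: legitimate
    'radiant content settled upon her,',   # comma: flagged
]
print(flag_bad_terminations(paras))
```

as with the other routines, the flagged lines are candidates, not verdicts; the comma here happened to be a misrecognized period.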
we've also located a poem, consisting of 4 lines, currently split apart but needing to be rejoined, so we eliminate the 3 blank lines separating them. we found one paragraph that ended in garbage, so we corrected that glitch. we also had a broken paragraph that we fixed, so we're up to 8 lines now... we drop 4 one-character lines, so we're at 12. we eliminate some drop-cap garbage, and an excess runhead, so 14... all in all, an interesting hodge-podge of lines popping up from that search, and 14 lines corrected, so well worth the effort of sorting through the stuff... 14 more lines corrected, for a grand total of 169, on 17 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird 17. paragraphs terminated incorrectly *** colon (so a legitimate termination) and familiar figures: past its banks. Then: solemn eyes: spoke in a strain of querulous sweetness: audible in broken phrases: She said promptly: toward the lower level. Then: distance away: formed words: composure with a struggle: personality. He heard Simmons say: she drew him to her: hand: words: formed it soundlessly, he even spoke it aloud: only by Gordon's breathing: Then he went to the door: exactly the same manner: She turned her face from him. He said: be a fortune." Silence fell upon them. Then: : He grew silent, enveloped in thought. Then: the sitting room, where he stood lost in thought: dangerous murmur rose: said: him: *** em-dash (so a legitimate termination) she had kissed him for a pair of silk stockings-- his desire, his-- printed a deliberate--a deliberate-- *** misrecognition of an em-dash (which is legitimate) as an en-dash (which is not) of a thing to go and do! ... off horse ... 
willing-" *** misrecognition of exclamation-point as a capital-i I'm no sheep to drive into their lot and shear I" *** misrecognition of a period as a comma He heard a murmur from the back of the throng, ~~ "Give it to him, we didn't come here to talk.@ *** poem (which should be joined into a single block) ... was a lucky man, ~~ Rip van Winkle ... grummmble Rip van Winkle ... grummmble ~~ ... never saw the women ... never saw the women ~~ At Coney Island *** garbage the trap. _{t} ~~ The bitter irony of it rose in a wave of black mirth *** broken paragraph (i.e., garbage) served in two glasses and a cracked toothbrush mug ~~ _Mr. Ottinger elected to imbibe his "straight" *** single-character lines (i.e., garbage) O ~~ "I got it," he interrupted her tersely, "and I V ~~ BUT, curiously, sitting alone, he gave little \ ~~ he wouldn't have gone, anyway.@ T ~~ XI *** drop-cap garbage ' 'TT'VE got something for you," Gordon said sud- ~~ I denly.@ *** runhead (i.e., garbage) MOUNTAIN BLOOD ~~ "I don't choose to be," Meta Beggs retorted. "I ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080718/0f388ae6/attachment.htm From Bowerbird at aol.com Fri Jul 18 13:48:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 18 Jul 2008 16:48:29 EDT Subject: [gutvol-d] lots of good stuff next week Message-ID: there'll be lots of good stuff next week... first, "the crevice" -- another rfrank experiment at d.p. -- has made its way rapidly through both p1 and p2 now... i actually re-did the o.c.r. on this book -- as part of my ongoing series on "how to digitize a book, step by step", so i'll be continuing that series using this live example... 
also, "the cabin on the prairie" -- another rfrank test -- has finished p1, so i'll be able to make comments on it... neither of these books got good preprocessing on them -- which is why i re-did the o.c.r. -- so i won't examine the (hundreds and hundreds) of _unnecessary_changes_ that the proofers had to do (e.g., rejoining hyphenates), because if that b.s. doesn't already _stink_badly_ to you, your nose isn't working correctly. what i _will_ do is show -- like i did on "mountain blood" -- that if you do the preprocessing correctly, you transform the "proofing" job into something where the proofers can concentrate on the job of _perfecting_the_book_ instead of just removing the obvious crap on all the individual pages and leaving the "perfecting" task to the next person in line... i'll also continue my "how to clean up the o.c.r." series, so you know _exactly_ how to _do_ that good preprocessing. lots of sleeves-up fun here in the lobby of the p.g. library... have a good weekend... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080718/fb216785/attachment.htm From russellbell at gmail.com Fri Jul 18 14:52:18 2008 From: russellbell at gmail.com (Russell Bell) Date: Fri, 18 Jul 2008 15:52:18 -0600 Subject: [gutvol-d] submitting a text Message-ID: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> I downloaded Bartleby's copies of 'Maxims for Revolutionists' and 'Revolutionists' Handbook and Pocket Guide', formatted them in accord with the rules, gutchecked them, made iso8859 & ASCII copies. Downloaded a copy of the image of an original edition from googlebooks for comparison. Now where do I send them? 
The FAQ tells me to e-mail them to any member of the posting team but gives no addresses for any of them. From ajhaines at shaw.ca Fri Jul 18 15:53:01 2008 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Fri, 18 Jul 2008 15:53:01 -0700 Subject: [gutvol-d] submitting a text References: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> Message-ID: <000801c8e929$08661d60$6401a8c0@ahainesp2400> Russell, before you can submit your files to PG, you'll have to get copyright clearances for both books. You can do that at http://upload.pglaf.org/. - click the "New username" link to set yourself up with an account. - click the "Login" link to log in with your new username and password. - click the "welcome" link at the top of the page. - in the News section, click on "new copyright clearance system" - for each book, click on "submit a new clearance request", and fill in the form. You'll have to have scans of each book's title and copyright page, the latter usually being on the back (verso) of the title page. - click the "logout" link at the top of the form when you're finished. It may take several days, but you'll get back an e-mail for each book saying whether the clearance is "OK" or "Not OK". (The reasons for a "Not OK" are beyond the scope of this message.) Zip all the files (ASCII, Latin1, HTML, etc) for a given book into a single file for uploading. Log into the upload page above, and click the "Get status of my prior clearance requests" link. Click the Cleared link to see your clearances, then on whichever book you want to upload to the Whitewashers. Click on the book's Clearance OK link, and fill in the upload form. In step 2 of the form, select ASCII, Latin1 (ISO8859), or whatever's appropriate for that book. (Don't select Other.) In steps 3 and 4, check that the info is correct. Fill in Steps 5 and 6 as you see fit. 
At step 7, if the submission includes an HTML file (with or without illustrations), it's recommended that you do a Preview submission to check the HTML's validity. If it's OK, fill in step 1 again, then click the Submit eBook button. The Whitewashers (one of whom is me) will get an email notifying us that a new submission has come in. Depending on the volume of new submissions, it may take us a day or two to handle yours. FYI - you don't necessarily have to generate your own ASCII files. The Whitewashers routinely generate ASCII files from ISO8859 files as part of the posting process. It's only when that conversion proves difficult, for whatever reason, that we ask the submitter to prepare and submit their own ASCII file, along with the Latin1 file. Al ----- Original Message ----- From: "Russell Bell" To: Sent: Friday, July 18, 2008 2:52 PM Subject: [gutvol-d] submitting a text >I downloaded Bartleby's copies of 'Maxims for Revolutionists' and > 'Revolutionists' Handbook and Pocket Guide', > formatted them in accord with the rules, gutchecked them, made iso8859 > & ASCII copies. Downloaded a > copy of the image of an original edition from googlebooks for > comparison. Now where do I send them? The > FAQ tells me to e-mail them to any member of the posting team but > gives no addresses for any of them. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Sat Jul 19 11:20:11 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 19 Jul 2008 14:20:11 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 018 Message-ID: saturday, so let's take a big bite for the weekend... the paragraphs should be all set now... a quick visual scan through the entire book will inform you of any remaining errors in the flow... we will be slightly modifying the level of description we've been using. 
so far, we've talked about _lines_. now, we'll talk about _paragraphs_. we _define_ a paragraph as "any line or lines that occur _between_ empty lines", so there has been a _correspondence_ of the two, and you can still program the paragraph routines _in_terms_of_ "lines", but it's easier to say what needs to be said if we call 'em "paragraphs". today, we're not really gonna bring up "hits" and evaluate what to do. we're just gonna describe in plain english words what a tool will do... specifically, this tool will "fix" the spacey-quotes in our o.c.r. output. and this is how it does it. first, it examines paragraph-by-paragraph. within each paragraph, it counts double-quote-marks. (single later.) it goes on to evaluate each quotemark, to determine whether it is: 1. a spacey-quotemark. (one that has whitespace on both sides.) 2. an open-quotemark. (whitespace to the left, letters to the right.) 3. a close-quotemark. (letters to the left, whitespace to the right.) (things get a little more complicated when you have nested quotes, and markup characters, but the basics work well most of the time.) then lastly, the routine evaluates whether all _odd-numbered_ quotes are open-quotes and all _even-numbered_ quotes are close-quotes; if so, then it assigns the spacey-quotemarks their respective status... if not, then it summons these questionable quotes to your attention. in addition, when there is an odd number of quotes in the paragraph, it checks to make sure the next paragraph starts with a quote-mark, and -- if not -- summons that paragraph to your attention as well... because of the redundancy built into this routine, it is _very_ robust. it will basically _find_ more mistakes in the text than it will _cost_ you in erroneous assignments. and it can auto-fix _hundreds_ of errors, -- hundreds and hundreds and hundreds of errors -- in no time flat. assuming, that is, that there _are_ some spacey-quotes in your o.c.r. "mountain blood", however, had no spacey-quotes. 
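in python, the core of the quote-classification step described above might look like this (the names are mine; the nested-quote and markup complications are left out, as noted):

```python
def classify_quotes(paragraph):
    # label each double-quote as open, close, or spacey, based on
    # whether its neighbors are whitespace (a quote with letters on
    # both sides is lumped in with spacey here, as "undecided")
    kinds = []
    for i, ch in enumerate(paragraph):
        if ch != '"':
            continue
        solid_left = i > 0 and not paragraph[i - 1].isspace()
        solid_right = i + 1 < len(paragraph) and not paragraph[i + 1].isspace()
        if solid_right and not solid_left:
            kinds.append('open')
        elif solid_left and not solid_right:
            kinds.append('close')
        else:
            kinds.append('spacey')
    return kinds

def resolve(kinds):
    # odd-numbered quotes should be opens, even-numbered ones closes;
    # assign each spacey quote the role its position demands, and
    # return None (summon a human) if the fixed quotes break the pattern
    resolved = []
    for n, kind in enumerate(kinds):
        want = 'open' if n % 2 == 0 else 'close'
        if kind not in ('spacey', want):
            return None
        resolved.append(want)
    return resolved

para = 'he said " hello," and walked away'
print(resolve(classify_quotes(para)))
```

the redundancy lives in `resolve`: a spacey quote is only auto-fixed when every solid quote around it already alternates correctly, which is what makes the routine safe to run over hundreds of paragraphs.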
so none were fixed. 18. fix the spacey quotes. 0 more lines corrected, for a grand total of 169, on 18 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080719/c1c2cab7/attachment.htm From Bowerbird at aol.com Sun Jul 20 21:52:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 00:52:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 019 Message-ID: today we talk about paragraph-breaks that occur on page-breaks... the preprocessing done on "mountain blood" followed the rule that any page that started with a capital letter or a double-quotemark got a line inserted above it, the assumption being it was a new paragraph. that's a pretty good rule, but it's not the best one you can use, since it creates a number of unnecessary false-alarms. it's far better to check the last line of the _previous_ page to see if it ends with a paragraph-termination. if it does, and the current page starts with a capital letter or a double-quotemark, add the line then. if the previous page is paragraph-terminated, but the current one doesn't look like the start of a paragraph, then summon a human... of course, it's entirely possible for a _sentence_ to end on one page, with the next sentence then beginning on the top of the next page, with both sentences within the same paragraph. you can do a check for these exceptions by checking the _length_ of the terminating line; if it's a long line, then there's a good chance that it is mid-paragraph. however, there are exceptions to these exceptions as well, meaning it's good to do one last visual confirmation of each page in the book. 
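(a bare-bones python sketch of that rule -- the 60-character threshold is just a stand-in for whatever counts as a "long line" in your particular book:)

```python
import re

# a line "paragraph-terminates" if it ends in sentence-final
# punctuation, optionally followed by a closing quote
TERMINATED = re.compile(r'[.!?]["\']?\s*$')

def breaks_at_page_break(prev_page, cur_page, full_line_len=60):
    """Decide what to do at one page boundary, given each page as a
    list of lines. Returns 'insert-blank-line', 'no-break', or
    'summon-a-human'."""
    last = prev_page[-1].rstrip()
    first = cur_page[0].lstrip()
    starts_like_par = first[:1].isupper() or first[:1] == '"'
    if not TERMINATED.search(last):
        return 'no-break'            # previous page ends mid-sentence
    if not starts_like_par:
        return 'summon-a-human'      # terminated, but an odd page start
    if len(last) >= full_line_len:
        return 'summon-a-human'      # long line: the sentence may end
                                     # but the paragraph may continue
    return 'insert-blank-line'
```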
but other than this set of wrinkles, _most_ of the paragraph-breaks (or non-breaks) that occur on a page-break can be auto-detected... 19. check the paragraph-breaks that occur on page-breaks. this turned up a number of hits, which i've appended... in the first group, we find 2 lines where the paragraph-terminating period was misrecognized as a comma, so we will correct those two. in the second group, we have 7 lines that were clearly _false-alarms_, since the previous page was not paragraph-terminated. i fixed them. next, by searching for lines that _are_ paragraph-terminated, but which are also _long_lines_, we find and fix another 8 false-alarms. as exceptions to exceptions, not _all_ the long lines are false-alarms; the last group shows 2 long lines that did indeed end the paragraph, even though examination of the preceding page won't tell you that, you can only see it by going to the following page where you observe that yes, indeed, there is a new paragraph-start at the top of the page, as indicated by indentation of the paragraph. (there might have been more than these two examples, but i didn't think to save earlier ones.) 17 more lines corrected, for a grand total of 186, on 19 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird *** a period-termination was misrecognized as a comma > radiant content settled upon her, > > http://z-m-l.com/go/mount/mountp142.html > > "Thank you," she told him seriously; "it will > "We've never been storekeepers," > > http://z-m-l.com/go/mount/mountp321.html > > "Never kept much of anything, have you, any of *** non-termination, incorrectly coded as paragraph-breaks... > it toward him. "In Greenstream," he continued, > http://z-m-l.com/go/mount/mountp121.html > "men don't like me, they are afraid of me; but the > "Are you going to the camp meeting on South > http://z-m-l.com/go/mount/mountp179.html > Fork next week?" she demanded. 
"I have never > a man will murder me," she replied in level tones; > http://z-m-l.com/go/mount/mountp210.html > "perhaps I'll get a thrill from that." Her voice > "You could study a life on women," Rutherford > http://z-m-l.com/go/mount/mountp225.html > Berry pronounced, "and never come to any satisfaction. > "And you go right around, Alec," his wife added, > http://z-m-l.com/go/mount/mountp301.html > "and twist the head off that dominicker chicken. > shade of minute, variously-colored silks the effigy of > http://z-m-l.com/go/mount/mountp304.html > Mrs. Hollidew dead. Undisturbed in the film of > "I say I wanted to see you," the voice persisted; > http://z-m-l.com/go/mount/mountp366.html > "it's Edgar Crandall. You'll take pleasure from *** not a paragraph-break, just a long mid-paragraph line > She started toward him in an excess of tender pity. > http://z-m-l.com/go/mount/mountp143.html > "Do you care as much as that?" She laid her > "You're getting on the money now, are you? > http://z-m-l.com/go/mount/mountp172.html > Going to start that song? That'll come natural to > He entered the room. It was, he divined, hers. > http://z-m-l.com/go/mount/mountp176.html > His foot struck against a chair, and his hand caught > a lithe, wicked hatred in any other human being. > http://z-m-l.com/go/mount/mountp209.html > "You are a gentle object," he satirized her, loosening > she sought his lips. "Soon again," she murmured. > http://z-m-l.com/go/mount/mountp214.html > "Don't desert me; I am entirely alone except for > The postmaster laid it on top of the glass case. > http://z-m-l.com/go/mount/mountp238.html > "The jobber sent it up by accident," he explained; > hanging limply, breathing in sharp inspirations. > http://z-m-l.com/go/mount/mountp260.html > She gazed about at the valley, the half-distant maple > endeavor to instil into her some of his warmth. 
> http://z-m-l.com/go/mount/mountp276.html > He gazed at her for a moment, at the shadows like *** long lines, but ones that were actually paragraph-breaks. > pale orange paper, pinched between withered fingers. > > http://z-m-l.com/go/mount/mountp329.html > > Suddenly he was in a hurry to get away; he drew > was the power, the unconquerable godhead, of gold. > > http://z-m-l.com/go/mount/mountp339.html > > The thought of the storekeeper was lost in the From Bowerbird at aol.com Mon Jul 21 09:55:57 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 12:55:57 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 Message-ID: we shouldn't have a line that starts with a comma, should we? 20. search for all lines that start with a comma. > philosophy underlying them, any ruthless strength, > , escaped him entirely. They appealed solely to him one of them, fixed. (a speck on the page caused the glitch.) 1 more line corrected, for a grand total of 187, on 20 routines... i'll be back tomorrow with the next tip in this series... -bowerbird From hart at pglaf.org Mon Jul 21 10:05:37 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 21 Jul 2008 10:05:37 -0700 (PDT) Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r.
for a book -- 020 In-Reply-To: References: Message-ID: I'm still hoping you will send me a list of all 20. Thanks!!! Michael From hart at pglaf.org Mon Jul 21 10:39:40 2008 From: hart at pglaf.org (Michael Hart) Date: Mon, 21 Jul 2008 10:39:40 -0700 (PDT) Subject: [gutvol-d] submitting a text In-Reply-To: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> References: <688269960807181452q78a8cf5al54a2372d61ae05a8@mail.gmail.com> Message-ID: When in doubt, you can always send them to me. Michael S. Hart Founder Project Gutenberg hart at pglaf.org On Fri, 18 Jul 2008, Russell Bell wrote: > I downloaded Bartleby's copies of 'Maxims for Revolutionists' and > 'Revolutionists' Handbook and Pocket Guide', > formatted them in accord with the rules, gutchecked them, made iso8859 > & ASCII copies. Downloaded a > copy of the image of an original edition from googlebooks for > comparison. Now where do I send them? The > FAQ tells me to e-mail them to any member of the posting team but > gives no addresses for any of them. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Mon Jul 21 15:46:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 18:46:42 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 Message-ID: michael said: > I'm still hoping you will send me a list of all 20. for the founder, i can do that, yes. we will end up with more than 20. for everyone else, collecting them is a matter of having the dedication. i did the work to define the set for this particular "mountain blood" test; that's the hard part; the mere act of collecting my posts is the easy part. but there's little reason for anyone to collect these, unless they plan on programming their own tool... 
i'll be releasing a version of my tool, which incorporates these routines (and more), which anyone can use to clean up an o.c.r. text, so use that. -bowerbird p.s. if anyone else does want to capture all of the posts in this series, i'd suggest the july digest in the archives... p.p.s. for more routines, you can check out gutcheck and guiguts: > http://gutcheck.sourceforge.net/index.html > http://home.comcast.net/~thundergnat/guiguts.html From Bowerbird at aol.com Mon Jul 21 20:14:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 21 Jul 2008 23:14:19 EDT Subject: [gutvol-d] a few more references Message-ID: i said: > p.p.s. for more routines, you can check out gutcheck and guiguts: > http://gutcheck.sourceforge.net/index.html > http://home.comcast.net/~thundergnat/guiguts.html you can also look here, on the distributed proofreader forums: > http://www.pgdp.net/phpBB2/viewtopic.php?p=331320 > http://www.pgdp.net/phpBB2/viewtopic.php?p=332044 as you can see, it was over a year ago i was bringing this up directly to d.p., back when they "allowed" me to do it directly -- and roger frank was "working on it" even way back then -- with various people chipping in offering several good routines, but somehow a whole year has gone by with nothing being done. actually, i've been "bringing this up" to d.p. for many years now, with "nothing being done" being precisely what was (not) done. come back in a year from now and see if they've done anything... -bowerbird p.s. by the way, it's _not_ the case that "more" routines is "better", because you start to run into the "false alarm" problem before long.
but, you know, have at it with all the routines... From schultzk at uni-trier.de Mon Jul 21 23:14:15 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 22 Jul 2008 08:14:15 +0200 Subject: [gutvol-d] a few more references In-Reply-To: References: Message-ID: <1539A2F1-DC7E-413C-B203-930D442D5017@uni-trier.de> Hi Bowerbird, On 22.07.2008 at 05:14, Bowerbird at aol.com wrote: > > p.s. by the way, it's _not_ the case that "more" routines is > "better", > because you start to run into the "false alarm" problem before long. > but, you know, have at it with all the routines... As you mentioned before, the routines you have mentioned are the easy part! The hard part is building the improved mousetrap so that you do not have those false alarms! Furthermore, if one cannot automatically distinguish between a true case and a false alarm, then it is a true case for human intervention and should be flagged. True, it is annoying and it slows down the process, yet it adds to the quality of the result. regards Keith. From Bowerbird at aol.com Tue Jul 22 01:24:25 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 04:24:25 EDT Subject: [gutvol-d] a few more references Message-ID: keith said: > if one cannot automatically distinguish between > a true case and a false alarm, then it is a true case > for human intervention and should be flagged.
well, actually, aside from the spacey-quote corrections, _all_ of these fixes are done in a human-mediated way. the tool will take you to each glitch, show you the scan, and position the cursor in the field for you to do an edit. so each decision on these -- to edit or not -- is _informed_; you have examined text and scan, so you know the score... these glitches are treated as "false alarms" until confirmed. (even though the number of _real_ false alarms is very low.) and even the auto-spacey-quote corrections are _verified_, by colorizing the quotes so you can assess their correctness. for the double-quotes, you will step through each page, but -- as a demonstration of what i mean -- here's a _list_ of the colorized passages that were inside _single-quotes_: > http://z-m-l.com/go/mount/mount-singlequotes.png (single-quotes are actually much more difficult to check, because you need to control for apostrophe contractions.) this colorized verification ensures auto-changes are right... *** essentially, what makes this clean-up so bloody efficient is that the _computer_ is _finding_ the errors for you, and then making it _as_simple_as_possible_ for you to fix 'em. i can do this by locating the error-locating routines inside the tool that juxtaposes a text-editor with a scan-viewer, such that all three of these elements are working together. -bowerbird From hart at pglaf.org Tue Jul 22 08:45:13 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 22 Jul 2008 08:45:13 -0700 (PDT) Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r.
for a book -- 020 In-Reply-To: References: Message-ID: On Mon, 21 Jul 2008, Bowerbird at aol.com wrote: > michael said: >> I'm still hoping you will send me a list of all 20. > > for the founder, i can do that, yes. we will end up with more > than 20. Any idea what the expected total might be? > for everyone else, collecting them is a matter of having the > dedication. i did the work to define the set for this particular > "mountain blood" test; that's the hard part; the mere act of > collecting my posts is the easy part. Does that mean you would mind if I passed them on? > but there's little reason for anyone to collect these, unless they > plan on programming their own tool... _I_ collect all possible error hunting tools. . .period. I can't speak for others, but I'm willing to share. > i'll be releasing a version of my tool, which incorporates these > routines (and more), which anyone can use to clean up an o.c.r. > text, so use that. And it will run on what OS's? Thanks!!! Michael > > -bowerbird > > p.s. if anyone else does want to capture all of the posts in this > series, i'd suggest the july digest in the archives... > > p.p.s. for more routines, you can check out gutcheck and guiguts: >> http://gutcheck.sourceforge.net/index.html >> http://home.comcast.net/~thundergnat/guiguts.html > > > ************** > Get > fantasy football with free live scoring. Sign up for FanHouse Fantasy Football > today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > From Bowerbird at aol.com Tue Jul 22 09:05:42 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 12:05:42 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 021 Message-ID: 21. search for all lines with 11, or small-l-capital-i, or capital-i-small-l... 
> -----File: 011.png > [11] > -----File: 110.png > [110] > -----File: 111.png > [111] > -----File: 112.png > [112] > -----File: 113.png > [113] > -----File: 114.png > [114] > -----File: 115.png > [115] > -----File: 116.png > [116] > -----File: 117.png > [117] > -----File: 118.png > [118] > -----File: 119.png > [119] > -----File: 211.png > [211] > [3011 > -----File: 311.png > [311] > South Fork; Nickles'11 do it and glad. It will wipe > Ill (17) > Ill (164) > Ill (289) well, first, the "[3011" pagenumber was corrected to read "[301]". the 3 instances of "chapter iii" that were misrecognized as "ill" were corrected. and the "'11" which should have been an "'ll" was also corrected... 5 more lines corrected, for a grand total of 192, on 21 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird From Bowerbird at aol.com Tue Jul 22 09:15:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 12:15:03 EDT Subject: [gutvol-d] one tenth of one percent Message-ID: good news! planet strappers has finished with its _10th_iteration_ through the "perpetual proofing" experiment. but, um... no, sorry, this iteration did not catch the error on page 33. maybe the _11th_iteration_ will catch it... more data from this iteration later... *** also, "mountain blood" -- a test which had parallel p1 proofings -- has now finished with p2. this is the book which i've been treating with my "how to do preprocessing clean-up" series, so it will be fun to see how my clean-up compares with _three_ rounds of proofing... stay tuned for that...
*** and yes, i _do_ know that you're probably sick of the data from these d.p. "experiments", especially since they all show the same old thing, which is that the human proofers are doing an outstanding job, while the d.p. bureaucracy and workflow are immensely stupid and wasteful. believe me, i'm as tired of the minutiae of mistakes as you. likely more. but let's put this into perspective, ok? i've probably analyzed the data from a _dozen_ of these experiments... distributed proofreaders claims it has now digitized over 13,000 books. so i've analyzed less than _one_tenth_of_one_percent_ of their books. if we extrapolate, then where i have pointed to _thousands_ of changes in each book that were needlessly imposed on the volunteer proofers, what we realize is that -- over the course of their entire output so far -- d.p. has forced its proofers to make _millions_ of unnecessary changes... millions of unnecessary changes... chew on that fat... any volunteer with half a brain can easily and clearly see the inefficiencies. changing the same scanno on page after page after page, when they know it would be so much faster and easier to make the change _once_, globally. what a waste. how stupid. the inefficiencies, vast and deep, are staggering. how many bright people have walked away, refusing to be abused like that? i don't know. but it makes me very sad to think about it... -bowerbird From hart at pglaf.org Tue Jul 22 09:33:45 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 22 Jul 2008 09:33:45 -0700 (PDT) Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r.
for a book -- 021 In-Reply-To: References: Message-ID: how about searching for all lines with l1 ??? damn! those look so much alike I'm wondering if that IS what you wrote below, small-l-numeral-1??? the font i am using makes them look nearly exactly the same, but I think one is slightly dimmer than the other. lllllllllll vs 11111111111 Hmmm, even THEY don't all look the same. Oh, well. . .I guess it would be worth spelling them out, eh? mh On Tue, 22 Jul 2008, Bowerbird at aol.com wrote: > 21. search for all lines with 11, or small-l-capital-i, or > capital-i-small-l... > >> -----File: 011.png >> [11] >> -----File: 110.png >> [110] >> -----File: 111.png >> [111] >> -----File: 112.png >> [112] >> -----File: 113.png >> [113] >> -----File: 114.png >> [114] >> -----File: 115.png >> [115] >> -----File: 116.png >> [116] >> -----File: 117.png >> [117] >> -----File: 118.png >> [118] >> -----File: 119.png >> [119] >> -----File: 211.png >> [211] >> [3011 >> -----File: 311.png >> [311] >> South Fork; Nickles'11 do it and glad. It will wipe > >> Ill (17) >> Ill (164) >> Ill (289) > > well, first, the "[3011" pagenumber was corrected to read "[301]". > the 3 versions of "chapter iii" were corrected, misrecognized as "ill". > and the "'11" which should have been an "'ll" was also corrected... > > 5 more lines corrected, for a grand total of 192, on 21 routines... > > i'll be back tomorrow with the next suggestion in this series... > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for > FanHouse Fantasy Football today. > > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > From Bowerbird at aol.com Tue Jul 22 10:24:50 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 13:24:50 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 021 Message-ID: michael said: > how about searching for all lines with l1 ??? 
we already did a search where we looked for a letter-number combo, or a number-letter combo. many of these searches are redundant... one of the next searches i'll recommend is a search for any line which has a number in it (which will disregard the pagenumbers, of course). in a sense, that one general search could have been done instead of all of these more-specific searches. however, i'm listing the hits of all of these routines when run against the _original_o.c.r._, as if each was the _first_ such routine to be run. that's not how it actually happens when you do this in the real world. when you run the first routine, you _fix_ the errors the routine flags; and that means they won't come up when the later routines are run... so by running the routine that finds the "11" for "ll" misrecognitions, and fixing those first, you eliminate them from being hits for the later "any number" routine. this gives you a good focus when you're handling the specific routines -- you're looking at the same type of error, so the fixes are the same -- and means that the general routines (where you have to work harder to figure out "what is the nature of the error here?") return fewer hits. *** and, in a larger sense, that's even why we do such search routines first. many -- if not most -- of these glitches would be flagged with generic _spellcheck_, so we could just spellcheck and not bother with routines. but these routines give us a _focus_ that a general spellcheck does not, and that focus makes us more efficient. when we're done with all these, we will run a regular spellcheck, but by that time the text will be refined. *** > Oh, well. . .I guess it would be worth spelling them out, eh? this routine searched for "11" -- the number after 10, and it searched for small-l-capital-i (small ell and capital eye), and it searched for capital-i-small-l (capital eye and small ell). three errors came up on capital-i-small-l. 
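(for concreteness, here is roughly what such a search routine looks like in python -- the pattern list is illustrative, not exhaustive:)

```python
import re

# patterns in the spirit of routine 21: "11" where "ll" was meant,
# capital-eye next to small-ell, and "Ill" standing in for roman "III";
# yes, "Il" will also flag words like "Illustration" -- that is the
# false-alarm trade-off discussed in this thread
SUSPECTS = [
    re.compile(r"[a-z]'?11"),   # e.g. Nickles'11 for Nickles'll
    re.compile(r'Il|lI'),       # capital-eye / small-ell mixups
    re.compile(r'^Ill\b'),      # "chapter III" misread as "Ill"
]

def suspicious_lines(lines):
    """Yield (line_number, line) for every line that trips a pattern."""
    for n, line in enumerate(lines, 1):
        if any(p.search(line) for p in SUSPECTS):
            yield n, line
```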
this is a common glitch, where the "iii" (in all-capitals) that is the roman-numeral for "three" -- as in "chapter 3" -- is (mis)recognized as the word "ill" (capitalized). it's because the o.c.r. is trying to make it _into_ a real word -- ill -- which we can tell because the same type of glitch does _not_ occur on other instances with three capital-i in a row, such as with "xviii". this is why earlier i had a search for a capital-i followed by a bunch of letters, including l (i.e., ell). it's not that such words are _impossible_ -- or even _unknown_ or even _rare_ -- just consider "illustration" -- but it's worth those false-alarms to catch these subtle misrecognitions. when you've immersed yourself in this data, you can make judgments about the cost-benefit ratio of those false-alarms to those subtle hits... -bowerbird From Bowerbird at aol.com Tue Jul 22 10:34:30 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 22 Jul 2008 13:34:30 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 020 Message-ID: michael said: > Any idea what the expected total might be? 25-30 routines; then we resort to the generic spellcheck. some of the routines need to control for _names_ as well, so there is one routine that will _gather_up_ those names. > Does that mean you would mind if I passed them on? do whatever you like with them. > And it will run on what OS's? mac, p.c., and various flavors of linux. -bowerbird From Bowerbird at aol.com Wed Jul 23 00:55:17 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 03:55:17 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 022 Message-ID: 22. search for lines with numbers, excluding well-formatted pagenumbers. 7 lines presented, only 1 of which was correct. 3 of the incorrect lines involved pagenumbers. > COPYRIGHT, 1915, 1919, BY > PHWTED IN THE T7NITED STATES OlfAMEEIOA > 12Q4J > "You forget, unfortunately, that.1 am forced to > P25J > *?330} > South Fork; Nickles'11 do it and glad. It will wipe 6 more lines corrected, for a grand total of 198, on 22 routines... i'll be back tomorrow with the next tip in this series... -bowerbird From rfrank at pobox.com Wed Jul 23 01:01:04 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:01:04 -0600 Subject: [gutvol-d] algorithm description correction Message-ID: <4886E540.9050103@pobox.com> I've had some time to look back over recent posts here on gutvol-d. I read that: the preprocessing done on "mountain blood" followed the rule that any page that started with a capital letter or a double-quotemark got a line inserted above it, the assumption being it was a new paragraph. This is not correct. The preprocessing code looks at the line ending characteristics and line length on the preceding page to decide if a blank line should be inserted.
The readily available cpprep source code shows this, so I'm not sure what the misstatement above is based upon. --Roger Frank From rfrank at pobox.com Wed Jul 23 01:04:13 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:04:13 -0600 Subject: [gutvol-d] preprocessing definition Message-ID: <4886E5FD.20506@pobox.com> Here's another interesting statement I see in a recent post: neither of these books got good preprocessing on them -- which is why i re-did the o.c.r. -- There are different definitions of "preprocessing" here. My preprocessing code is only code that analyzes the text and makes corrections it is confident are warranted. If it's not sure, it either flags it or leaves it for the proofers. If preprocessing is defined to include a person making a decision on whether a correction is needed, then to me that's proofing, not preprocessing. Doing proofing at the start of a project doesn't make it preprocessing. I have scanned and content-provided well over 300 books over at Distributed Proofreaders. With the help of a lot of good people there, those books have been proofed and formatted effectively. If I did the first round of proofing myself (what some here call preprocessing), I would not have had the time to post-process the 240 books that I've uploaded to Project Gutenberg. It's that experience and the feedback and advice of active proofers that helps me to continue to develop what I hope is useful software. --Roger Frank From rfrank at pobox.com Wed Jul 23 01:07:57 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:07:57 -0600 Subject: [gutvol-d] rejoining hyphens Message-ID: <4886E6DD.7050004@pobox.com> This was posted recently about preprocessing: so i won't examine the (hundreds and hundreds) of _unnecessary_changes_ that the proofers had to do (e.g., rejoining hyphenates), because if that b.s. doesn't already _stink_badly_ to you, your nose isn't working correctly.
I'm not sure what that means, but it sounds very negative to me. I think it implies that the preprocessing code doesn't attempt to rejoin hyphenated words. That is not how it works at all. The code for resolving hyphenation is fairly involved. What really happens is: when a hyphen appears at the end of a line separating a word pair, the software looks for the hyphenated form including the hyphen throughout the entire text completely within a line. If it's convinced the author meant it to be hyphenated, the word is brought up and the hyphen is retained. If that wasn't conclusive, it looks for the concatenated word pair without the hyphen throughout the text to try to resolve it as non-hyphenated. If that isn't conclusive, a check is made of a contemporary word list of hyphenated words. And if that isn't conclusive, the word is left hyphenated and separated. What will usually happen in that case is the proofer will mark it with -* and I'll make a final decision in post-processing. Every attempt is made to match the then-current conventions of the particular book. As the developer of cpprep, I just wanted to speak authoritatively on how the preprocessing code works since it is being repeatedly mischaracterized on gutvol-d. --Roger Frank From rfrank at pobox.com Wed Jul 23 01:40:59 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 02:40:59 -0600 Subject: [gutvol-d] banana cream program Message-ID: <4886EE9B.4050706@pobox.com> This sounds very promising: i programmed such a tool years ago -- called "banana cream" -- and i've decided that in the light of recent improvements, i will be releasing a stripped-down version of it to the public very soon...
Except for ppvtxt and ppvhtml (programs that are used by post-processing verifiers to do final checks on submitted text and HTML files before submitting them to the whitewashers), none of my programs use a UI.

i could've released this (banana cream) program years ago -- and intended to -- but since there were a several d.p. people among my antagonists here on this listserve, i decided to hold it back instead. in view of their silence recently, there's no need for continued punishment...

My position is that the people that read this list are serious, dedicated people who genuinely want to learn and contribute to making the process better. Holding something back for personal reasons--as if to punish misbehaving children--isn't something I would do, but that's just me. All of my source codes are readily available to study, improve, constructively criticize and I hope, for many, to use.

I'm looking forward to when we see an announcement "very soon" that the "banana cream" code is available.

--Roger Frank

From schultzk at uni-trier.de Wed Jul 23 02:10:24 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Wed, 23 Jul 2008 11:10:24 +0200
Subject: [gutvol-d] banana cream program
In-Reply-To: <4886EE9B.4050706@pobox.com>
References: <4886EE9B.4050706@pobox.com>
Message-ID: <03D43930-4CBF-4A3C-B8C6-1B221111F0E8@uni-trier.de>

Hi All,

I hope the person involved will reconsider offering the stripped-down version. A stripped-down version almost always has drawbacks, which invites unneeded criticism. Maybe a modularized version with nice interfaces (no, not GUI -- functional ones) would be nice. It helps in the integration process.

regards
Keith.

On 23.07.2008, at 10:40, Roger Frank wrote:

> This sounds very promising:
>
> i programmed such a tool years ago -- called "banana cream" --
> and i've decided that in the light of recent improvements,
> i will be releasing a stripped-down version of it to the
> public very soon...
>
> I would certainly like to see the code for the "banana cream"
> program, with or without the newest changes. Except for ppvtxt
> and ppvhtml (programs that are used by post-processing verifiers
> to do final checks on submitted text and HTML files before submitting
> them to the whitewashers), none of my programs use a UI.
>
> i could've released this (banana cream) program years ago -- and
> intended to -- but since there were a several d.p. people among
> my antagonists here on this listserve, i decided to hold it back
> instead. in view of their silence recently, there's no need for
> continued punishment...
>
> My position is that the people that read this list are serious,
> dedicated people who genuinely want to learn and contribute to
> making the process better. Holding something back for personal
> reasons--as if to punish misbehaving children--isn't something I
> would do, but that's just me. All of my source codes are readily
> available to study, improve, constructively criticize and I hope,
> for many, to use.
>
> I'm looking forward to when we see an announcement "very soon"
> that the "banana cream" code is available.
>
> --Roger Frank
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d

From schultzk at uni-trier.de Wed Jul 23 02:04:39 2008
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Wed, 23 Jul 2008 11:04:39 +0200
Subject: [gutvol-d] preprocessing definition
In-Reply-To: <4886E5FD.20506@pobox.com>
References: <4886E5FD.20506@pobox.com>
Message-ID: <71E8D8DC-E426-4D36-A5EA-D7D95C265DAB@uni-trier.de>

Hi Roger,

I understand your point, but your preprocessing should be post processing of the o.c.r. The way you have described it here, I would say have the process run automatically. You won't need to intervene yourself. That is what computers are there for in the first place.

regards
Keith.
On 23.07.2008, at 10:04, Roger Frank wrote:

> Here's another interesting statement I see in a recent post:
>
> neither of these books got good preprocessing on them
> -- which is why i re-did the o.c.r. --
>
> There are different definitions of "preprocessing" here. My
> preprocessing code is only code that analyzes the text and makes
> corrections it is confident are warranted. If it's not sure, it
> either flags it or leaves it for the proofers. If preprocessing is
> defined to include a person making a decision on whether a
> correction is needed, then to me that's proofing, not preprocessing.
> Doing proofing at the start of a project doesn't make it
> preprocessing.
>
> I have scanned and content-provided well over 300 books over at
> Distributed Proofreaders. With the help of a lot of good people
> there, those books have been proofed and formatted effectively.
> If I did the first round of proofing myself (what some here call
> preprocessing), I would not have had the time to post-process the
> 240 books that I've uploaded to Project Gutenberg. It's that
> experience and the feedback and advice of active proofers that
> help me to continue to develop what I hope is useful software.
>
> --Roger Frank
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d

From Bowerbird at aol.com Wed Jul 23 02:37:15 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Jul 2008 05:37:15 EDT
Subject: [gutvol-d] is this a dialog?
Message-ID:

ok, i guess rfrank wants to have a dialog?

or maybe not, i dunno.

anyway, fine, roger, if you do, and fine if you don't.

if you do, then you've got lots and lots of posts to read before you get up to speed on where i'm coming from, so i'll give you time to do all that reading if you want to.
and if you just want to pop up here and accuse me of "mischaracterizing" what you're doing, and then burrow back into noncommunicative mode, that's all right too... really, roger, whatever you want to do. if you want to ignore everything i've written, and just engage me in general friendly conversation, that's fine, i'll be happy to share with you exactly what's on my mind. *** i'll just sit back and wait until the dust settles and make my replies to whatever you've said. *** but let's look at your 4 posts from tonight: 1. "algorithm description correction" you say that i've mischaracterized your algorithm, and say we can look at the code to see how it works. my reply is that my observation was based on _direct_ well... observation... of the actual o.c.r. text itself, and i stand by my observation. i could post the text itself, so that everyone can see that what i said is correct... or you can post the text, roger. i checked. i'm right. *** 2. "preprocessing definition" you seem to assume i've said that _you_ should have done the "preprocessing". i've never maintained that. i've said that _someone_ should. not necessarily you. in fact, i've explicitly said that it does _not_ have to be done by the content provider, that it could be done by a regular proofer as a book-wide process that occurs _after_ "content providing" and _before_ p1 proofing... just carve out a special round -- call it "p-zero" -- and have a designated "preprocessor" do the job... it might as well be the person who will post-process, since the jobs overlap so much. indeed, what i suggest is you do many of the tasks one does in post-processing _before_ the text goes to the p1 proofers. better that way. this concept is not strange. dkretz is already doing it, and he'll tell you that it's working out very well for him. and it's not all that hard to carve out a special round. d.p. did it big time when it went from 2 to 4 rounds, and again switching from 4 to 5 when they added p3. 
plus there's been a shift toward making smoothreading a quasi-official round too, this one gradual and informal, but still a demonstration that rounds can be carved out. > There are different definitions of "preprocessing" here. i use my definition. consistently. and i always have. > My preprocessing code is only code that analyzes the text > and makes corrections it is confident are warranted. i've found there are _very_ few of those type of "corrections". almost every darn routine will hit on a false-alarm _sometime_. so where your "definition" ends up is to do no preprocessing... whereas _my_ definition says that preprocessing is _efficient_. sure you have to _look_at_ the changes you make, but so what? while _you_ might define such "examination" as _proofing_ rather than _preprocessing_, i define "proofing" instead as the word-by-word examination of the text compared to the scan. my _preprocessing_ doesn't involve that word-by-word mode. you only fix the glitches that the computer can find _instantly_. but if a tool finds it instantly, why make a human search for it? that's counterproductive. i mean, after your proofers have spent, oh, maybe 12 hours proofing all the pages of a book, and then i drop it in a tool and the errors pop out _instantly_, well, it just makes me kind of shake my head and laugh a bit. > I have scanned and content-provided well over 300 books > over at Distributed Proofreaders. yeah, i know. you're doing _great_ work as a content provider... and you're doing _fantastic_ work as a post-processor as well... where you -- and d.p. in general, in a long-lived shortcoming -- are coming up short is in your _preprocessing_, which sucks... which means you're wasting a _lot_ of the time of your proofers, because they must make unnecessary changes. i've documented these unnecessary changes over _many_ of your "experiments"... if you've analyzed your data anywhere near as closely as i have, please share your results and conclusions here. 
because i have. > I have scanned and content-provided well over 300 books > over at Distributed Proofreaders. > With the help of a lot of good people there > those books have been proofed and formatted effectively effectively? yes. the proofers rock. efficiently? no. not by a long shot. the absence of good preprocessing has wasted many resources. if you had done that, you could have finished over 600 by now... *** 3. "rejoining hyphens" it's getting late. let's talk about rejoining hyphenates later, ok? but i can assure you that i know _all_ about how it's done... :+) *** 4. "banana cream" > I would certainly like to see the code for the "banana cream" the code isn't available. only the compiled app. the code wouldn't help you much anyway, since it's basic (as in "beginners all-purpose symbolic etc.), and besides, it's the design and operation of the tool which is what you _really_ need to be interested in... > Except for ppvtxt and ppvhtml > (programs that are used by post-processing verifiers > to do final checks on submitted text and HTML files > before submitting them to the whitewashers), > none of my programs use a UI. i don't believe i've seen any programs from you that _do_ have a g.u.i., so i didn't know you even had any. i'm glad. because yeah, you really must have a g.u.i. to make it work. if you've read one of the latest posts here from me, i've said that you need to have _three_ elements all working together, the first one being the text-editor, the second a scan-viewer, and the third being the routines that find the lines with errors. of course, you also need good old "find" functionality, and auto-generated tables of contents, and lots of other stuff, but those big three are the basic heart-and-soul of the tool. until you have that, you don't really have much at all... > My position is that the people that read this list are serious, > dedicated people who genuinely want to learn and contribute > to making the process better. 
yeah, well, i guess you haven't been reading along for 5 years. :+) if you want to know the truth, you can always read the archives. > Holding something back for personal reasons-- > as if to punish misbehaving children-- that's exactly right, i was punishing the misbehaving children. i haven't ever said it that way, but that's a very apt description. and moreover, i was punishing the _friends_ of those children by denying them a tool that could've been very useful to them, simply because those children were misbehaving so very badly. so yes, a few bad children here meant everyone at d.p. had to go without a useful tool, for years. hey, no skin off my nose... (of course, these friends also had some culpability, because they _could_ have reined in the ones who were misbehaving so badly.) > isn't something I would do, but that's just me. it is something i would do. something i've done. and would do again. and again and again. i don't let bullies pick on me or treat me badly... i give 'em their own medicine, and i make sure my rocks hit the target. however, if you wanna be nice and friendly, i'm nice and friendly back. > All of my source codes are readily available to study, improve, > constructively criticize and I hope, for many, to use. that's nice. but usually i'm more interested in the tool than its source. i once offered to lead an effort here to create some open-source stuff, including full-on open-source replacements for my close-source tools. but i was the little red hen who was doing all the work, so i ended that. but, you know, if you want me to re-start that effort and you'll do work, i will be happy to guide you in an effort to create tools similar to mine... feel free to pick my brain. i'll tell you anything you want to know. but no, i won't hand over my code. you can buy it. but it's not free. you can even buy it and (because you'll own it) turn it open source. but i'm not gonna give it to you. besides, wouldn't help you anyway. 
> I'm looking forward to when we see an announcement "very soon"
> that the "banana cream" code is available.

the code won't be available. but the tool will be. could be right now, if i wanted it to be. if you'd like to see a copy right now, and you agree to write some criticism of it (constructive or not, doesn't matter to me) and send it as a post to this listserve, i can send you a copy right now... i got other stuff available too, some of it web-based, if you're curious...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/0a80742c/attachment-0001.htm

From marcello at perathoner.de Wed Jul 23 04:54:49 2008
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 23 Jul 2008 13:54:49 +0200
Subject: [gutvol-d] is this a dialog?
In-Reply-To:
References:
Message-ID: <48871C09.4030501@perathoner.de>

Bowerbird at aol.com wrote:
> ok, i guess rfrank wants to have a dialog?
>
> if you do, then you've got lots and lots of posts to read
> before you get up to speed on where i'm coming from,

Or he could go here http://www.gnutenberg.de/bowerbird/ and get the gist in a few minutes.

-- Marcello Perathoner webmaster at gutenberg.org

From jayvdb at gmail.com Wed Jul 23 04:55:27 2008
From: jayvdb at gmail.com (John Vandenberg)
Date: Wed, 23 Jul 2008 21:55:27 +1000
Subject: [gutvol-d] is this a dialog?
In-Reply-To:
References:
Message-ID:

Hi,

On Wed, Jul 23, 2008 at 7:37 PM, wrote:
> ok, i guess rfrank wants to have a dialog?
>
> or maybe not, i dunno.
>
> anyway, fine, roger, if you do, and fine if you don't.

I have only recently started paying attention to the inner workings of Project Gutenberg, for reasons explained below, so I have very little understanding of the backstory.
This level of confrontation seems very unhealthy, for all involved and for the project as a whole; my first impression on joining this list was of an atmosphere I was not expecting. It wouldn't be such a problem if the majority of discussion were more collegial and it was only the occasional spat.

I've started looking at PG because a project I am heavily involved in, Wikisource, is coming to the stage where it is completing proofreading projects on a regular basis, and these texts are suitable to be pushed into PG, as Distributed Proofreaders currently does. The German Wikisource has long had a policy of rejecting contributions that are not accompanied by pagescans, so they have many texts which are suitable and verified. The English and French projects are not so stringent, but are increasingly focusing on proofing based on pagescans.

Here are our stats: http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics

The works are available here:
http://de.wikisource.org/wiki/Kategorie:Index
http://fr.wikisource.org/wiki/Cat%C3%A9gorie:Index
http://en.wikisource.org/wiki/Category:Index

We may never be as big a contributor as the fine people at DP, but we can help. More importantly, we can compete. Even if we are not in the same league as DP, Wikisource can compete in processes, methods, etc. Efficiency, which bowerbird talks about, is orthogonal to output volume. If DP is inefficient, the most effective way of demonstrating this is to build a better mousetrap.

Personally, I think Wikisource has a decent mousetrap: we have developers regularly improving the software, and because it is a wiki, we have a userbase that is constantly improving the interface and social fabric. Also because it is a wiki, _anyone_ can run a bot to automate parts of the process. I use the "pywikipediabot" codebase, but there are many other frameworks that connect the coder to the wiki interface.
I would like to publicly encourage bowerbird to come to Wikisource, evaluate it, and either tell us what is wrong with it, or better yet, demonstrate your code in action doing the pre-processing. Obviously your time is precious, and your input into our growth will be valued, so I will make myself available to help you ramp up quickly. If your tool can't be bolted onto the wiki framework with minimal enhancement to your tool, we can work out an interface that will suit, or I will sign an NDA and help you code it, releasing all copyright to you.

> if you've read one of the latest posts here from me, i've said
> that you need to have _three_ elements all working together,
> the first one being the text-editor, the second a scan-viewer,
> and the third being the routines that find the lines with errors.

and if I understand your prior posts correctly, you want a fourth element, the user. The user approves or rejects the suggested change?

> but, you know, if you want me to re-start that effort and you'll do work,
> i will be happy to guide you in an effort to create tools similar to mine...

I am keen. Pick me.

> you can even buy it and (because you'll own it) turn it open source.
> but i'm not gonna give it to you. besides, wouldn't help you anyway.

If your GUI tool can bolt onto Wikisource as the backend, and you price licenses sensibly, I know a few people who will buy it. And I'd like to hear what sort of figure you have in mind, as an "open source bounty" might be a way to make everyone happy.

-- John

From hyphen at hyphenologist.co.uk Wed Jul 23 06:28:03 2008
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Wed, 23 Jul 2008 14:28:03 +0100
Subject: [gutvol-d] banana cream program
In-Reply-To: <4886EE9B.4050706@pobox.com>
References: <4886EE9B.4050706@pobox.com>
Message-ID: <001701c8ecc7$f47c1a00$dd744e00$@co.uk>

Roger Frank wrote:

> I'm looking forward to when we see an announcement "very soon"
> that the "banana cream" code is available.
Hope you rename it as something more descriptive of its function. I would not expect a program with such a name to help me in any way. Dave Fawthrop From marcello at perathoner.de Wed Jul 23 07:50:39 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 23 Jul 2008 16:50:39 +0200 Subject: [gutvol-d] banana cream program In-Reply-To: <4886EE9B.4050706@pobox.com> References: <4886EE9B.4050706@pobox.com> Message-ID: <4887453F.4020702@perathoner.de> Roger Frank wrote: > I'm looking forward to when we see an announcement "very soon" > that the "banana cream" code is available. We already saw that announcement 3 years ago on 27 Aug 2005: > i'll be uploading banana-cream to the web next week; > but anyone who would like to use it before then can > backchannel me for a preview copy... What we will never see is the working tool, because writing code is harder than writing announcements. BB said more than once that he will not release his code, and, believe me, from the code snippets he *did* publish, you wouldn't want to see it anyway. This is how he writes regular expressions: BB wrote on 8 Jul 2008: > it's very good to do an early search for garbage characters... > the reg-ex is something along these lines: [\&\*\<\>\\\/\|\*\{\}\_] More BB code at: http://www.gnutenberg.de/bowerbird/#reality BB wrote on 11 Jul 2008: > i could've released this program years ago -- and intended to -- > but since there were a several d.p. people among my antagonists > here on this listserve, i decided to hold it back instead. in view of > their silence recently, there's no need for continued punishment... History repeats itself. And some children never grow. 
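For what it's worth, the character class Marcello quotes is valid but redundant: `*` is listed twice, and most of the backslashes are unnecessary inside `[...]`. A functionally equivalent garbage-character check, sketched in Python (the name `garbage_lines` is illustrative, not from any of the tools discussed here):

```python
import re

# The class quoted above, minus the redundancy: inside [...] only the
# backslash itself needs escaping here, and '*' need only appear once.
GARBAGE = re.compile(r"[&*<>\\/|{}_]")

def garbage_lines(text):
    """Return (line_number, line) pairs for lines with suspect characters."""
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if GARBAGE.search(line)]
```

Note that `_` and `*` do double duty as emphasis markers in finished PG texts, so a check like this is only meaningful on raw OCR output, before any such markup is added.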
Film at: http://www.gnutenberg.de/bowerbird/#toc31

-- Marcello Perathoner webmaster at gutenberg.org

From rfrank at pobox.com Wed Jul 23 09:30:14 2008
From: rfrank at pobox.com (Roger Frank)
Date: Wed, 23 Jul 2008 10:30:14 -0600
Subject: [gutvol-d] preprocessing definition
In-Reply-To: <71E8D8DC-E426-4D36-A5EA-D7D95C265DAB@uni-trier.de>
References: <4886E5FD.20506@pobox.com> <71E8D8DC-E426-4D36-A5EA-D7D95C265DAB@uni-trier.de>
Message-ID: <48875C96.9050300@pobox.com>

Schultz Keith J. wrote:
> I understand your point, but your preprocessing
> should be post processing of the o.c.r.
>
> The way you have described it here I would say
> have the process run automatically. You won't
> need to intervene yourself. That is what computers
> are there for in the first place.
>
> regards
> Keith.

Somehow I mis-explained it, Keith. My cpprep.rb code runs on the output of Abbyy, which I have set to save each page as UTF. I don't intervene or look at the pages myself. I do look at the log because of a few really rare cases that happen only once in many books. For example, it is possible that a single word both starts and ends with an apostrophe, and the smart quote routines will get that wrong every time if it isn't in the exceptions list.

So it does run against the text from the OCR. All I get is a summary, like this excerpt from a recent book just uploaded to the proofers (In Her Own Right):

2253 start of line spaced double quote
835 double quote spacing, type 1
403 double spaces
397 end of line spaced double quote
261 double quote spacing, type 2
236 spaced punctuation
80 too many dashes
32 false paragraph break suspect
25 spaced exclamation
20 single quote spacing, type 2
15 end page asterisk added
15 start page asterisk added
12 spaced double-punctation
8 suspect start of line
5 single quote spacing, type 1
4 three dashes
3 end of line spaced punctation
2 spaced out 'll
2 numeric 11 for letters ll
2 missing space?
1 suspect l1
1 spaced out 's

I hope what cpprep does is more clear from this.

--Roger Frank

From prosfilaes at gmail.com Wed Jul 23 09:39:34 2008
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 23 Jul 2008 12:39:34 -0400
Subject: [gutvol-d] is this a dialog?
In-Reply-To:
References:
Message-ID: <6d99d1fd0807230939o1d765cd1g9ae51783488c52c5@mail.gmail.com>

On Wed, Jul 23, 2008 at 7:55 AM, John Vandenberg wrote:
> I've started looking at PG as a project I am heavily involved in,
> Wikisource, is coming to the stage where it is completing proofreading
> projects on a regular basis, and these texts are suitable to be pushed
> into PG, as Distributed Proofreaders currently does.

That's interesting; I was thinking that we could push many of the texts that DP does directly from DP to Wikisource from the pre-PPer stage, which would provide a public archive of the scans with a page-by-page proofed version. A filter should be able to translate DP's formatting to Wikisource's pretty easily. The two ideas aren't mutually exclusive, of course.

From Bowerbird at aol.com Wed Jul 23 10:49:53 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Jul 2008 13:49:53 EDT
Subject: [gutvol-d] is this a dialog?
Message-ID:

john-

wow. i certainly wasn't expecting anything like _that_. what a nice surprise. i'm bowled over.

i will be happy to go over and take a look at wikisource. it would be a pleasure to offer my constructive criticism to an entity smart enough to actually treasure such input. and i'd be honored to help improve your infrastructure...

right from the get-go -- with the wiki structure and your ability to run bot-based error-finding routines -- i'd say you have some fantastic potential there. really fantastic.

my apps are written in basic, so my code won't help you, but i'm skilled at expressing them in pseudo-code, so if you've got web-programmers to implement my routines, we'll be able to work together.
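Roger Frank's cpprep log excerpt above implies a simple overall shape for such a preprocessor: a table of named, high-confidence corrections applied page by page, with a per-rule count kept for the summary. A hypothetical Python sketch follows (cpprep itself is Ruby and its source is not reproduced here; the rule names echo the log, but the patterns are guesses):

```python
import re
from collections import Counter

# Illustrative rules only: (name, pattern, replacement).  Real rules
# would be far more numerous and carefully vetted against false alarms.
RULES = [
    ("double spaces", re.compile(r"(?<=\S)  +(?=\S)"), " "),
    ("start of line spaced double quote", re.compile(r'^" (?=\S)', re.M), '"'),
    ("spaced punctuation", re.compile(r" (?=[,.;:!?])"), ""),
]

def preprocess_page(page, counts):
    """Apply each confident correction, tallying how often each rule fired."""
    for name, pattern, repl in RULES:
        page, n = pattern.subn(repl, page)
        counts[name] += n
    return page

def summarize(counts):
    # Mirrors the log format: count, then rule name, most frequent first.
    return [f"{n:5d} {name}" for name, n in counts.most_common()]
```

Running `preprocess_page` over every page and printing `summarize(counts)` at the end would yield a report of the same shape as the excerpt in the post above.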
and if it wasn't clear, my offline tools are cross-plat apps that are available at zero cost. (i'd guess that people still are more efficient doing this work offline, but i'm willing to let some crafty web-programmers prove me wrong...)

i will respond here when i've taken a look at wikisource, just to show the kind of interaction p.g. could have had, but if we continue on for long, we can take it elsewhere...

-bowerbird

From Bowerbird at aol.com Wed Jul 23 10:51:18 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Jul 2008 13:51:18 EDT
Subject: [gutvol-d] banana cream program
Message-ID:

dave said:
> Hope you rename it as something more descriptive of its function.

it's a one-file application, dave. name it whatever you want. :+)

-bowerbird

From Bowerbird at aol.com Wed Jul 23 11:09:39 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Jul 2008 14:09:39 EDT
Subject: [gutvol-d] preprocessing definition
Message-ID:

roger said:
> I don't intervene or look at the pages myself.

my experience is that preprocessing can't do all of what needs to be done if the methodology does not involve a human who will look at the pages...

-bowerbird
From rfrank at pobox.com Wed Jul 23 11:23:21 2008
From: rfrank at pobox.com (Roger Frank)
Date: Wed, 23 Jul 2008 12:23:21 -0600
Subject: [gutvol-d] not a dialog
Message-ID: <48877719.2060505@pobox.com>

Bowerbird wrote:

ok, i guess rfrank wants to have a dialog?
or maybe not, i dunno.

Let me reply directly and say No, I am not interested in a dialog.

In that same post, I read that I might "burrow back into noncommunicative mode". I am communicative to people whose opinions I value. The proofers and formatters and smoothies and yes, the management at Distributed Proofreaders are all on that list. That's why I spend the time doing the documentation of my code, as best as I can, given that I'd rather code than document code. Follow the "software" link at http://pgdp.rfrank.net or http://www.fadedpage.com/ppgen-main.html for the more popular programs or for the alpha versions of the documentation on the post-processing generator.

I value Greg Newby's interest in preprocessing, too. That's why I posted clarification of the way the code actually works once it was mischaracterized on the list. Before Greg's post, I didn't think anyone was paying that much attention, figuring that the list was mostly poisoned, that you were on the multiple kill lists, and that creative discussions were finding other more fertile places to grow.

Nobody is on my kill list. I just look at everything and if it's a personal attack or a diatribe or some form of competition as to who is right, then I just click delete. There is some technically worthwhile material in your posts, though as of yet it's not news to me, so I look at it at least until the post gets personal, and it almost always does.
Even in this same posting of yours I read: i'll just sit back and wait until the dust settles and make my replies to whatever you've said. *** but let's look at your 4 posts from tonight: You didn't sit back very long--perhaps just the time to type the three asterisks? Then you go on to say you observe an algorithm (that is not in the code). You say everyone can see that what i said is correct... or you can post the text, roger. i checked. i'm right. I don't need to post the text. If anyone is really interested in who is "right" here, they can look at the text as you did or look at the code or whatever they want. I simply don't care who is "right" here. To me, it's not a competition. I'll always try to do my best to make this process better, and no part of that is in trying to make anyone else look worse. People have already made their own judgement about that, it seems. Later it says > There are different definitions of "preprocessing" here. i use my definition. consistently. and i always have. That's fine, of course, as long as we understand that it's a manual process. That "P0" round is great if there are proofers available who would work in that round, and if they are sufficiently talented with regexs to find the trouble spots, or if a tool were available to take them to the trouble spots. From what I read, the tool you announced as imminent may not be coming soon--I hope I'm wrong about that. I hope you don't decide to "punish" us and not release it as you have apparently for the last two years since you last announced it. It's unfortunate that no part of your code is available in source. That keeps others from taking it and improving it. I would have liked to see whatever it is, in whatever shape it is in, on the chance that it is licensed appropriately and that I could use at least the display portion of the code. Someone posted that perhaps that premature availability would lead to unneeded criticism. 
I'd rather see if it's going to be useful now than wait an indefinite time for it to be polished. If it's worthwhile, I could wrap it in an MVC framework and we could all benefit from it. You've decided to withhold the source, which makes me wonder if it exists at all.

Back to the main reason for this posting. You wondered if I wanted a dialog with you here. I do not. You wrote:

where you ... are coming up short is in your _preprocessing_, which sucks...

Actually, that's not particularly helpful or appropriate, to me. The sad part of it all is that when the list gets personal, good people who might otherwise post interesting or even exciting topics simply stay away. My observation is that since you were banned from DP, the forums there have been much more productive, many more people are willing to share their ideas there, and overall it's just a good place to work and contribute. It's sad that, given you feel you have good ideas, you can't find a way to present them without making it a competition and without making it personal.

There is no "I" in Project Gutenberg.

--Roger Frank

From joshua at hutchinson.net Wed Jul 23 12:11:58 2008
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed, 23 Jul 2008 19:11:58 +0000 (GMT)
Subject: [gutvol-d] not a dialog
Message-ID: <238672185.93911216840318819.JavaMail.mail@webmail09>

On Jul 23, 2008, rfrank at pobox.com wrote:

I value Greg Newby's interest in preprocessing, too. That's why I posted clarification of the way the code actually works once it was mischaracterized on the list. Before Greg's post, I didn't think anyone was paying that much attention, figuring that the list was mostly poisoned, that you were on the multiple kill lists, and that creative discussions were finding other more fertile places to grow.

*******

Just wanted to pop in to ask if you (or anyone else) has looked into incorporating these checks into the proofing interface at DP?
What I mean is, it would be useful to have a "spell-check" like utility within the DP proofing window that highlights "possible" problems and let the proofer check them and change as necessary. You could really dial up the automated check code since final say would be by the proofer (human) and you aren't as reliant on the system being 100% correct before making a change. Josh From dakretz at gmail.com Wed Jul 23 12:38:27 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 12:38:27 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 26 In-Reply-To: References: Message-ID: <627d59b80807231238o624ab523od907849b98b91d3d@mail.gmail.com> Someone suggested I should check this thread, and I'm glad I have. It looks a bit like spring-time in Prague. :) I have a couple things I can chip in, if anyone might find them useful. First, I have several dozen regular expressions I've honed over a couple years that I use to preprocess text for the ongoing Encyclopedia Britannica project. The two strongest areas it deals with are numbers (intentional or otherwise from the OCR), and double-quotes. I suspect either of the tools from rfrank or the Bird are stronger on the quotes; but my number handling goes further than what B has shown so far, anyway (based on a quick perusal of the thread archive.) In addition, Michael Lockey (vasa) and I have an alpha-test-level reimplementation of the DP process that eliminates the infamous Rounds, and supports independent tracking of text units (pages or otherwise). Since I'm already a known preprocessing fanatic on the dp site, it's intentionally friendly to that type of work. It's written (as is the pg site) in php and mysql. The main question mark I have is how to build the UI. I've been working with Adobe Flex quite a bit recently, and find it useful but not compelling. But somehow I think we need a highly portable UI that shows a. text analysis by location, b. quickly jumps through the locations of interest, c. 
synchronously pulls up the matching image automatically, d. provides a configurable workflow checklist with all the features of gutcheck plus new ones. I haven't seen anything yet that's not pretty clumsy and slow in one way or another. Bird, I think you've traditionally used BBasic or some such, right? I like your checklist so far; but what's your policy wrt text-interrupters like footnotes/sidenotes, tables, math expressions, etc.? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/5345da6f/attachment.htm

From Bowerbird at aol.com Wed Jul 23 13:21:41 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 16:21:41 EDT Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 26 Message-ID: dakretz (a.k.a. dkretz) said: > It looks a bit like spring-time in Prague. :) smileys are good. i like smileys. :+) > First, I have several dozen regular expressions > I've honed over a couple years that I use to preprocess text > for the ongoing Encyclopedia Britannica project. i tried to locate the d.p. forum threads where you listed a ton of reg-ex. i found some of 'em, but i distinctly think that i missed some other ones. so if you could point to them, it would be good. > The two strongest areas it deals with are numbers > (intentional or otherwise from the OCR), and double-quotes. > I suspect either of the tools from rfrank or the Bird are stronger > on the quotes; but my number handling goes further than what B has > shown so far, anyway (based on a quick perusal of the thread archive.) i remember you describing those routines, yes. i haven't analyzed a lot of texts that have those types of problems, so i don't have many routines to check for them. this work is always tremendously iterative, in that one finds errors after-the-fact and tries to come up with a way that one could have found them automatically.
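The kind of number-and-quote cleanup don describes, and the "find an error after the fact, then automate a check for it" loop mentioned just above, can be sketched roughly like this. This is a minimal illustration in Python; the patterns here are my own hypothetical examples, not don's actual Encyclopedia Britannica regex set.

```python
import re

# Hypothetical OCR-cleanup checks, loosely in the spirit of the
# number and double-quote regexes discussed in the thread.
# These patterns are illustrative examples, not the actual EB set.
CHECKS = [
    # a digit wedged inside an alphabetic word, e.g. "m0untain"
    ("digit inside word", re.compile(r"[A-Za-z]+[0-9]+[A-Za-z]+")),
    # "&" misread as garbage between names, e.g. "Royster $ Axtell"
    ("suspicious ampersand", re.compile(r"\w+ [$#] \w+")),
]

def flag_lines(text):
    """Return (line_number, check_name, line) for every suspicious line."""
    hits = []
    for num, line in enumerate(text.splitlines(), start=1):
        for name, pattern in CHECKS:
            if pattern.search(line):
                hits.append((num, name, line))
        # an odd count of straight double-quotes is worth a human look
        if line.count('"') % 2 == 1:
            hits.append((num, "unbalanced quotes", line))
    return hits

sample = 'the m0untain road\n"Hello, he said.\nall clean here\n'
for num, name, line in flag_lines(sample):
    print(num, name)
```

Run against a whole book file before the proofing rounds, a report like this would take each flagged line to a human (or to a global fix), which is the iterative workflow being argued for here.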
i'm sure that, besides numbers, there are many other idiosyncrasies of the encyclopedia britannica where your routines are equally unique... > Since I'm already a known preprocessing fanatic on the dp site, > it's intentionally friendly to that type of work. It's written > (as is the pg site) in php and mysql. you have been _fantastic_ as the "known preprocessing fanatic" at d.p., ever since i was banned from there. thanks for fighting the good fight! and your proofing interface is quite good. and the roundless methodology is what i have _always_ suggested, as you know. so i think you're the smartest guy over at d.p. :+) > The main question mark I have is how to build the UI. well, the crucial decision-point in that matter is "offline or online". preprocessing really is much better using a "whole-book" method, so i believe it lends itself to an _offline_ approach, not an online one. that is, it's best when it's done by _one_person_, who has the _text_ and the _scans_ on their hard-drive. a huge part of the methodology is that you're jumping all around in the book, based on the error-type that you are specifically seeking right now, so it doesn't make sense to work in the browser, not relative to the cost-benefit of working offline. and, for me, it's easier to program offline apps, because i know how. > I've been working with Adobe Flex quite a bit recently, yes, i know that. and i've been encouraged by that, precisely because it offers (promise of) a chance to bridge the offline/online distinction. > and find it useful but not compelling. really? i'd have thought your reaction would be a bit more positive. because you've started to come to the interface you'll need to build. your new interface for "twister", the one that displays the page-image automatically depending on the selection in the listbox, is the _key_. (now you just have to load the text into an editfield on the left side.) > But somehow I think we need a highly portable UI that shows > a. 
text analysis by location, > b. quickly jumps through the locations of interest, > c. synchronously pulls up the matching image automatically, that's the interface i've laid out before, the interface of banana cream. (but you need to add, for item (a), that the text is there, and editable.) > d. provides a configurable workflow checklist > with all the features of gutcheck plus new ones. you don't really need this. (it's largely subsumed in item (b).) > I haven't seen anything yet that's not pretty clumsy > and slow in one way or another. banana cream is neither clumsy nor slow, in any way... > Bird, I think you've traditionally used BBasic or some such, right? realbasic. http://www.realsoftware.com > I like your checklist so far; but what's your policy wrt text-interrupters > like footnotes/sidenotes, tables, math expressions, etc.? i'll explain those later. for now, just let them go in the flow... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/37400ce7/attachment.htm From Bowerbird at aol.com Wed Jul 23 14:00:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 17:00:28 EDT Subject: [gutvol-d] the dust Message-ID: well, roger has kicked up some dust, so i'll let that settle... 
in the meantime, since roger mentioned "a woman in her own right", i've modified that project -- i refuse to work with badly-named files -- and uploaded it to my site so that we can take a closer look at it: > http://z-m-l.com/go/wihorp001.html > http://z-m-l.com/go/wihor.zml i don't know if i've mentioned that i've also done "the crevice": > http://z-m-l.com/go/crvicp001.html > http://z-m-l.com/go/crvic.zml and "cabin on the prairie": > http://z-m-l.com/go/cabinp001.html > http://z-m-l.com/go/cabin.zml so we will go through the whole data-analysis exercise on these books. and, of course, finish up the books that we've already discussed so far... but i _swear_, i'm not taking on any more d.p. "experiments" after these. i have demonstrated -- clearly and unequivocally -- that i am correct when i praise the work of the proofers and condemn the d.p. workflow, and there's no sense in continuing to prove something so transparent... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
> > What I mean is, it would be useful to have a "spell-check" like > utility within the DP proofing window that highlights "possible" > problems and let the proofer check them and change as necessary. You > could really dial up the automated check code since final say would > be by the proofer (human) and you aren't as reliant on the system > being 100% correct before making a change. I gathered from the DP fora that Jeroen Hellingman is working on a 'heat map' that colours such potential problems. But since he is on this list as well, he may be able to fill you in on the details. Regards, Walter

From Bowerbird at aol.com Wed Jul 23 14:42:39 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 17:42:39 EDT Subject: [gutvol-d] not a dialog Message-ID: gee, interesting to see -- because walter quoted his post -- that josh has made a good suggestion. (josh is one of those "bad children" who has been relegated to my "spam" folder.) oh, wait, that's the exact same suggestion i made over at d.p., when they were developing wordcheck. before i got banned... these guys throw mud at me for saying stuff, and then turn around and say the same thing. it's humorous to me... but, no, walter, the "heat map" idea that jeroen is working on -- an idea that he lifted directly from my post on this list -- is not intended to be integrated in the proofing environment. and the problem with using _colors_ is that you can't use them in a web-based textfield -- at least not in the past, although new w.y.s.i.w.y.g. textfields might allow you to do it now, but d.p. hasn't seemed willing to do the work to incorporate those. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/2a97292c/attachment-0001.htm

From Bowerbird at aol.com Wed Jul 23 14:47:18 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 17:47:18 EDT Subject: [gutvol-d] woman in her own right -- 001 Message-ID: ok, let's take a first look at "woman in her own right"... right off, a search for garbage characters reveals 55 lines, appended... just as a sign of _respect_ for the proofers, i would not send them lines that included "<" or "+" or "^" in them. just a sign of respect. i want them to spend their time finding stuff that's _hard_ to find... further, as with "mountain blood", the runheads and pagenumbers are in the text. what's up with that? d.p. policy is to remove them, and it's something the computer can do quite easily, so why have the _proofers_ spend their time and energy doing it instead? that's b.s. of course, since i don't think the runheads and pagenumbers should be removed, not at this time, i do it _automatically_ at a later stage, i'm _glad_ that i can get a copy of the text that still includes them. but my heavens, why make the proofers do the deletion grunt-work? the response might be that "this is a newcomers-only project, so we leave the runheads in so newcomers learn to delete them." sorry, i don't buy that. if you have the machine delete them (and, of course, check that the operation was done correctly), there is no need for anyone to "learn" they need to be deleted. so this is just another waste of the proofers' time and energy. moreover, it creates a _diff_ for each page, so -- when you look back -- you have no way of easily discerning how clean the project really was until you do a _second_ round of proofing, and _then_ find all no-diffs. a meaningless diff on every page can mask important information... -bowerbird > VIII.--STOLEN.....^................................ 120 > ~AND STEPPED TWO^pUNDHED AND FIFTY PACES.......
112 > "Royster & Axtell have been thrown into bankruptcy. > of it with Royster & Axtell, who knows?" > "Well, it's come," he remarked: ~" Royster & > ass----(U+00BB) > "Tell me of Royster & Axtell," he said. > sudden resolve only the failure of Royster & > "I'll speak to Fra^ois," said Macloud, arising. > "I see Royster & Axtell went up to-day. I > again,--an ROYSTER & AXTELL FAIL! > & Axtell failure," and, with that, he would pass > languid: ~" Been away, somewhere, haven't you ? > He took the night's express on the N. Y., P. & > "Colonel Duval is dead, however," she added^- > Tery satisfactory, indeed. And he was a competent (U+201E) > "Sut'n'y, seh," returned the dark}'. "Dat's > if Gaspard, his particular waiter, missed him ? > / see you not again, Farewell. I am, sir, with > "Y'r humb'l # obed't Serv'nt > Croyden nodded. Then proceeded^ Urith. much ap-* > CONFIDENCE AND SCRUTI^S > "Your recent experience with Royster & Axtell > the Duvals didn't keep an eye on Greenberry Point ? > persisted. "Has Royster & AxtelPs failure anything > "They're safe-- "I'm glad to make your acquaint---->" began > self----(U+00BB) > & Axtell's loan," she said. "Oh, don't be alarmed! > spent, by his own fireside, alone! Alone! Alone ] > "You are determined ?--Very well, then, come > Once upon a time--->--" and laughed, softly. > possession recently, you, with two companions^ > "But you're not quite sure ?--oh! modest man!" > "Nothing! "s^id Croyden. "You're a good > "Humph! Blaxham & Company! "he grunted. > the Bonds and the stock of Royster $ Axtell, > & Company bought them at the public sale." > "I could refuse to sell unless Blaxham & Company > Macl^ud observed; ~" though, it's a pity to tilt at > moment, will you ?--you're hipped on it!" > "Than your Southern ancestors ?--isn't that > "/ sent him! I don't know the man." > \ > else--won't you tell me where you are? (/ don't > will be: ~' Come over and see us, won't you ?'" > millionaire. 
We've got >our share of fools, but we > \ > 280 IN HER OWN RIGHT / > (U+00AB)No!~----" > THE LONE HOUSE BY THE BAY (U+00A3)87 > \permanent residence." > \ > the Treasure--/ have lifted the Iron box, from the -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/643af6dc/attachment.htm

From rfrank at pobox.com Wed Jul 23 15:02:28 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 16:02:28 -0600 Subject: [gutvol-d] not a dialog In-Reply-To: <238672185.93911216840318819.JavaMail.mail@webmail09> References: <238672185.93911216840318819.JavaMail.mail@webmail09> Message-ID: <4887AA74.6000601@pobox.com> Joshua Hutchinson wrote: | Just wanted to pop in to ask if you (or anyone else) has | looked into incorporating these checks into the proofing | interface at DP? That would be a big boost to productivity. The difficulty for me is that I'm comfortable with Ruby and Perl but uncomfortable with PHP, and I think that's an important deficiency for anyone wanting to integrate it at DP. That's why for me it's a standalone utility, like guiprep, only written in Ruby--it's just my limitation in being able to put it inside a wrapper with something stronger than a textbox widget. If I could find the equivalent of guiguts' built-in editor/presentation manager, only written in Ruby, I would certainly use it. That would at least make it interactive in a "proofing round 0" sense. So bottom line, for me the answer is that it's only an "I wish I was smart enough to do that" kind of thing. As a proofer myself at DP, I agree it would be a big win.
--Roger Frank From rfrank at pobox.com Wed Jul 23 15:16:45 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 16:16:45 -0600 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 26 In-Reply-To: <627d59b80807231238o624ab523od907849b98b91d3d@mail.gmail.com> References: <627d59b80807231238o624ab523od907849b98b91d3d@mail.gmail.com> Message-ID: <4887ADCD.8080508@pobox.com> don kretz wrote: > First, I have several dozen regular expressions I've honed over a couple > years that I use to preprocess text for the ongoing Encyclopedia > Britannica project. That sounded familiar. I baked something specific you suggested into the post-processing verification code (ppvtxt.pl) back on 3/19/2007. The changelog shows I incorporated special consideration for &c "as in EB articles." That said, I'm pretty sure I haven't seen the complete list of several dozen and I sure would like to. How can that happen? --Roger Frank From dakretz at gmail.com Wed Jul 23 15:18:23 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 15:18:23 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 28 In-Reply-To: References: Message-ID: <627d59b80807231518u65474dbey521f4858e3fc19fd@mail.gmail.com> > > > From: Joshua Hutchinson > To: gutvol-d at lists.pglaf.org > Date: Wed, 23 Jul 2008 19:11:58 +0000 (GMT) > Subject: Re: [gutvol-d] not a dialog > > > Just wanted to pop in to ask if you (or anyone else) has looked into > incorporating these checks into the proofing interface at DP? > > What I mean is, it would be useful to have a "spell-check" like utility > within the DP proofing window that highlights "possible" problems and let > the proofer check them and change as necessary. You could really dial up > the automated check code since final say would be by the proofer (human) and > you aren't as reliant on the system being 100% correct before making a > change. 
> > Josh > > You may remember that I implemented a new proofing interface a year or two ago, which provided a "preview" mode showing real italics, etc. That has since added a quote-matching display, and a punctuation reasonability-checker. They may still be on the dev server - I haven't checked for a long time. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/718d13d8/attachment.htm From Bowerbird at aol.com Wed Jul 23 15:24:54 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 18:24:54 EDT Subject: [gutvol-d] not a dialog Message-ID: roger said: > Let me reply directly and say > No, I am not interested in a dialog. fine. no problem. > Before Greg's post, I didn't think anyone was > paying that much attention, figuring that > the list was mostly poisoned, that you were > on the multiple kill lists, and that creative discussions > were finding other more fertile places to grow. when i talked about "the dust settling", it was because i anticipated this reply from you, roger... this is the post where you kick up dust. and once that dust has settled, i'll still be here, examining things... in the meantime, for someone who doesn't want a dialog, you sure have brought up a lot of issues. so while i wait for the dust to settle, and resume with my monolog mode, let me tackle this newest crop of 'em... *** the first issue seems to be about the blank lines at the top of some pages -- and not other -- in the "mountain blood" book. i've posted the o.c.r. text. people can find it here: > http://z-m-l.com/go/mount/mount.ocr.txt as it shows, clearly, i was right when i described what pages _do_ and _do_not_ have a blank line at the top of them... 
i've also appended a list of the pages that _do_ have a blank line -- it's all the pages that start with a capital letter or a double-quote -- and a list of the pages that _do_not_ have a blank line at the top -- it's all the pages that start with a lower-case letter or a dash -- which is _exactly_ what i said. i have already listed the pages where this blank line was incorrect, so i won't bother to repeat that. the list of pages that _do_ have a blank line is here: > http://z-m-l.com/go/mount/mount-roger_is_wrong-upper.txt the list of pages that _do_not_ have a blank line is here: > http://z-m-l.com/go/mount/mount-roger_is_wrong-lower.txt so, if people look at what i said, and look at the facts, i was right. > I simply don't care who is "right" here. well, gee, roger, i don't know what to say in response to that. if i say something, and i'm right, and you say i "mischaracterized" what was done, when i actually didn't, then perhaps it is _best_ that you "simply don't care" who is "right" and who is "wrong"... stick your head in the sand. refuse to look at the actual data. i, on the other hand, _do_ care what is right and who is wrong -- not so much on this particular question but in questions in general -- because i want to know what the facts are, and what the truth is, because a general ignorance of _the_truth_ doesn't do me much good. (and an outright rejection of it -- as you've done here -- is dangerous.) i care very deeply about which _position_ is "right", because i want to _adopt_ the right position, because it's silly to cling to the wrong one. so i don't put my head in the sand. i look at the actual data... 
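The blank-line rule described above (pages whose first line starts with a capital letter or a double-quote open a new paragraph and so get a blank line; a lower-case letter or a dash means the page continues the previous paragraph) can be sketched as a minimal Python heuristic. This is my own rendering of that description; as the page lists in this thread show, the heuristic misfires on some pages and its output still wants a human check.

```python
def needs_blank_line(first_line):
    """Heuristic from the thread: a page whose first line starts with
    a capital letter or a double-quote likely opens a new paragraph,
    so a blank line goes above it; a lower-case letter or a dash means
    the page continues the previous paragraph."""
    ch = first_line.lstrip()[:1]
    return ch.isupper() or ch == '"'

# first lines taken from the page lists quoted in this thread
pages = [
    'The fiery disk of the sun was just lifting above',    # new paragraph
    'night before, evading such indirect query as',        # continuation
    '"Four. They\'re real buck, and a topnotch article.',  # new paragraph
]
print([needs_blank_line(p) for p in pages])  # [True, False, True]
```

A check like this is cheap to run over a whole book of OCR output, and the pages it gets wrong (mid-sentence capitals, proper names at a page top) are exactly the ones worth listing for review.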
> That "P0" round is great if there are proofers available who > would work in that round if there are volunteers who will make changes individually, over and over and over again on page after page after page, i am _quite_sure_ there are volunteers who would _love_ to have the power to make _global_corrections_ in one fell swoop, since it's a lot more efficient. and a sense of agency and efficacy are _important_ to volunteers... > and if they are sufficiently talented with regexs > to find the trouble spots they don't need to be talented with reg-ex. they just need a tool. > or if a tool were available to take them to the trouble spots. right. which is what i've been saying for about 5 years now. welcome to the party. > From what I read, the tool you announced as imminent > may not be coming soon--I hope I'm wrong about that who told you that? i told you that you could have it right now, if you wanted it right now, all you had to do was ask me for it and agree to write up a report on it and send it to this listserve. so, do you want to see it? or not? > It's unfortunate that no part of your code is available in source. yet another case where you're not attending to what's important... here's how you set up an editable field using the realbasic compiler: you drag such a field onto the window. boom, you've got an editfield. it's a styled editfield, meaning you can have italicized and/or bold text. colors, and even some other stuff, like alignment and super/subscripts. plus, of course, the normal stuff, like choice of font and fontsize... does that help you set up an editable field in perl or ruby or java? i wouldn't think so. or, you place a "canvas" using realbasic by dragging it onto the window. then you load the image of choice into the canvas. does that knowledge help tell you how you would do the same thing in perl or ruby or java? i wouldn't think so. there is some code that would help you. 
for instance, here's how i split the overall book text-file into pages, for a page-based display: > pages=split(book," {{") and here's how i split it into paragraphs, to do that kind of analysis: > paragraphs=split(book,chr(10)+chr(10)) that's pretty close to the command that you would use, in perl anyway. so i guess, in some sense, it would give you an advantage to see that... but, really, knowing how to do a "split" is something you already know. isn't it? > Someone posted that perhaps that premature availability would > lead to unneeded criticism. personally, i ain't afraid of criticism. like any software that you build for yourself, there's kinks in this thing. but i expect that another programmer like you cuts some slack for that. and even if you don't, who cares? if someone decides not to even look at my program because of the criticism you have leveled against it -- whether it was correct or not -- then that's their loss. no skin off my nose. > You've decided to withhold the source, > which makes me wonder if it exists at all. you know, roger, that just makes you sound stupid. i mean, really stupid. i told you point blank that if you wanted to see the app, i'd send it to you. so, if you "wonder if it exists at all", why don't you just ask to see a copy? i just can't fathom the stupidity of that. perhaps you think that, instead of writing the source-code, i created the app by waving my hands in the air magically? a good thing you don't want a dialog, because you're not holding up your end of the bargain anyway, not when you say stupid stuff like this. > Actually, that's not particularly helpful or appropriate, to me. well, gee, roger, i'm up to item #23 on a list of routines that could have been used in a good preprocessing methodology that would have found roughly 150 lines that contained errors in the "mountain blood" book, errors that could have been quickly and easily corrected before the text went in front of any proofers. 
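The two REALbasic one-liners quoted above translate almost directly into other languages, as the post suggests. A rough Python equivalent, assuming (as in the example) that " {{" marks a page boundary in the book file and that chr(10)+chr(10), i.e. a blank line, separates paragraphs:

```python
# Rough Python equivalents of the two REALbasic one-liners quoted above.
# The " {{" page separator is taken from the example; chr(10) is "\n",
# so chr(10)+chr(10) is simply a blank line between paragraphs.
def split_book(book):
    pages = book.split(" {{")          # pages=split(book," {{")
    paragraphs = book.split("\n\n")    # paragraphs=split(book,chr(10)+chr(10))
    return pages, paragraphs

book = "Page one text. {{Page two text.\n\nSecond paragraph."
pages, paragraphs = split_book(book)
print(len(pages), len(paragraphs))  # 2 2
```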
if that's not helpful or appropriate to you, then i don't really know what more i can do to explain it to you... > My observation is that since you were banned from DP that > the forums there have been much more productive, they returned to the groupthink that prevailed before i was there, yes. if you like that type of thing, fine... but when the future looks back on that archive, it will tell you that my posts there were the most insightful ones in the entire bunch. and they will laugh at you for your stupidity in drumming me out... (there you go, marcello, another juicy quote for your fansite.) > It's sad that, given you feel you have good ideas, you can't find > a way to present them without making it a competition and > without making it personal. There is no "I" in Project Gutenberg. oh please... you attack me, personally, and then say that _i_ am the one who is making it "personal"? and you think people don't see through that ploy? i've written dozens of posts here where i examined your work, and never cast anything personal on you... *** anyway, like i said, whatever you want. if you don't want a dialog, then i'll be happy to return to the monolog i was having before... -bowerbird these are the pages that had a blank line at the top of them. i believe they all start with a capital letter or a double-quote. if i'm wrong about that, please let me know... 007.png // ONE 009.png // The fiery disk of the sun was just lifting above 010.png // From the vantage point of the back porches of 011.png // II 015.png // Gordon Makimmon, with secret dissatisfaction, 017.png // Ill 018.png // With a sharp flourish of his whip Gordon urged 019.png // IV 023.png // They rose steadily, crossing the roof of a 027.png // Gordon Makimmon gazed with newly-awakened 032.png // VI 035.png // "Four. They're real buck, and a topnotch article. 037.png // VII 042.png // The other consulted the book. 
"Two years, a 043.png // "I can give you fifty dollars," Gordon told him, 046.png // "By God!" he exclaimed, suddenly prescient, 047.png // "I'm not like that," Gordon informed him; "it's 048.png // VIII 052.png // IX 056.png // X 059.png // A preliminary drink was indispensable; and, 062.png // XI 064.png // "You got enough, all right," Em agreed. "Now, 069.png // XII 071.png // XIII 078.png // XIV 080.png // XV 081.png // "That's not correct," Simmons informed him 082.png // XVI 086.png // XVII 087.png // "Clare's dead," Gordon replied involuntarily. 088.png // XVIII 090.png // Gordon Makimmon stood at the end of the porch, 092.png // XIX 095.png // "It was certainly nice-hearted of you to come to 096.png // "I wanted to tell you," she said finally, with palpable 097.png // Then, when Zebener Hull's corn failed, 'I'll trouble 100.png // XX 101.png // The following morning found him on the front 103.png // XXI 105.png // XXII 109.png // Inan instinctive need for human support, the reassurance 110.png // XXIII 117.png // XXIV 120.png // This, he told himself complacently, was but a description 121.png // I might write there, but I'd lose time and money. 122.png // "men don't like me, they are afraid of me; but the 125.png // "What does that matter? don't you love me? 127.png // XXV 128.png // The large, suave figure of the Universalist minister, 132.png // The minister's wife inserted in the door from the 133.png // XXVI 135.png // The woman's face was bitter, her body tense. 138.png // XXVII 143.png // "Thank you," she told him seriously; "it will 144.png // "Do you care as much as that?" 
She laid her 147.png // TWO 149.png // In the clear glow of a lengthening twilight of 156.png // Don't disturb yourself; yours is the time for pleasures, 159.png // "Kick him again, Buck," he said; "kick him 160.png // II 161.png // Lattice, in white, with a dark shawl drawn about 164.png // Ill 165.png // "Here, General, here," Gordon commanded, and 169.png // After he had spent a limited amount, the principal 171.png // IV 173.png // Going to start that song? That'll come natural to 175.png // It was late when they returned from the farm. 176.png // The form above him leaned forward over the railing. 177.png // His foot struck against a chair, and his hand caught 178.png // VI 180.png // Fork next week?" she demanded. "I have never 181.png // He rose to leave, and she held out her hand. At 182.png // VII 183.png // Lettice was--superior; he recognized it pride-fully. 184.png // VIII 191.png // IX 192.png // "I do! Idol" He turned and left them, striding 193.png // He was fascinated by her naked, shapely arm; it was 194.png // "Why I--I got some money; that is, my wife 197.png // "Haven't you got enough at home," Buckley demanded, 199.png // X 205.png // XI 209.png // "Five years ago," he told her, "if you had tried 210.png // "You are a gentle object," he satirized her, loosening 211.png // "perhaps I'll get a thrill from that." Her voice 212.png // "Back to this wilderness," she scoffed; "any one 215.png // "Don't desert me; I am entirely alone except for 216.png // XII 218.png // "Whatever I say is good enough for Lettice," Gordon 220.png // "There's no good," he resumed, "in you and me getting 221.png // XIII 223.png // "Barnwell might cross him," she answered; and, 225.png // "You're not a camel," she truthfully observed, 226.png // Berry pronounced, "and never come to any satisfaction. 232.png // "I want a cheerful wife, one with a song to her, 234.png // XIV 235.png // "I threw the stone that hit Buck, didn't I! 
I 236.png // "Well, you don't have to stand and talk like I 239.png // "The jobber sent it up by accident," he explained; 243.png // XV 244.png // Hedescended, beyond the ridge, into the fact of 247.png // He was now, he realized dimly, at the crucial 249.png // XVI 250.png // "All lace and webby pink silk and ribbands underneath," 252.png // XVII 256.png // XVIII 260.png // MOUNTAIN BLOOD 261.png // She gazed about at the valley, the half-distant maple 264.png // XIX 266.png // Lettice was so young, he realized suddenly. [ 269.png // XX 271.png // It had been wonderfully comfortable in the evening 275.png // XXI 277.png // He gazed at her for a moment, at the shadows like 279.png // THREE 281.png // Lettice's death Gordon was fetching 284.png // II 286.png // Her voice, too, was like Lettice's--sweet with 289.png // Ill 291.png // IV 294.png // The couple grasped avidly at the opportunity 295.png // He was a youth of large, palpable bones, joints 296.png // VI 297.png // "Why should you?" Gordon interrupted 300.png // VII 302.png // "and twist the head off that dominicker chicken. 304.png // VIII 305.png // Mrs. Hollidew dead. Undisturbed in the film of 309.png // IX 311.png // X 312.png // Itwas the priest, Merlier. 314.png // T 315.png // "I heard her, but I'd ruther sit right where I am." 316.png // "Why, William Vibard! what an awful thing to 317.png // "They're in the stable," William Vibard answered 319.png // XII 320.png // "As an old friend," he declared, "an old Presbyterian 321.png // "I'm not aiming at anything," Gordon answered, 322.png // "Never kept much of anything, have you, any of 323.png // "I intended to come to you about that." 
324.png // "You forget, unfortunately, that.1 am forced to 326.png // XIII 329.png // XIV 330.png // Suddenly he was in a hurry to get away; he drew 331.png // XV 332.png // The sales made to Valentine Simmons were, invariably, 335.png // XVI 336.png // A number of horses were already hitched along 340.png // The thought of the storekeeper was lost in the 341.png // XVII 342.png // He had not been in the house since, together with 343.png // "Get out of here!" Gordon shouted in a sudden 344.png // XVIII 347.png // General Jackson moved forward over the porcK. 349.png // XIX 351.png // XX 354.png // XXI 356.png // XXII 357.png // His wife, Lettice, how young she was smiling at 358.png // Gordon, doubting whether the horses' shoes had 361.png // XXIII 362.png // He swayed, but preserved himself from falling, 363.png // They were, Gordon knew, not half way up Buck 366.png // XXIV 367.png // "it's Edgar Crandall. You'll take pleasure from

these are the pages that did not have a blank line at the top. i believe they all start with a lower-case letter (or a bracket). if i'm wrong about that, please let me know...

012.png / night before, evading such indirect query as Makimmon 013.png / lips, they had subsided into an unintelligible mutter, 014.png / or as she gravely thanked him at the end of the day's 016.png / half-heard conversation behind him; he spoke to 020.png / youth. He lounged over the road in a careless manner 021.png / this chance to the utmost with Morley's Raiders 022.png / married sister, completing the tale, lived at the opposite 024.png / forest swept down in an unbroken tide to the porch 025.png / with his mind pleasantly vacant, lulled by the monotonous 026.png / and in the end persuaded her. The stranger continued 028.png / one will know." He could not resist adding, 029.png / a sibilant exclamation, and Lattice Hollidew covered 030.png / them. The stranger consulted a small map. 031.png / hidden space, the village lay along its white highway.
033.png / and greenish, with an incomplete mustering of buttons, 034.png / girls," he pronounced coolly; "but he'll be after them 036.png / before had crippled their resources; his last Christmas 038.png / thin, sensitive nose, and a colorless mouth set in a 039.png / on the kitchen wall, where, in the watery light of a 040.png / again?" he asked solicitously; "shall I get you the 041.png / ing or benevolent sentences; these, with appropriate 044.png / out the ability to pay for a bag of Green Goose 045.png / marked precisely, over his shoulders, "the white 049.png / lying, indomitable determination, asserted itself--he 050.png / idea.--He would pay the customary substitute to 051.png / mance of his sister's courtship; the high, strident 053.png / he had limited himself in thought, but his entire 054.png / raw clay and narrow, wood sidewalks; they were, 055.png / wild. Could he afford to lose that amount from his 057.png / fat, and oddly damp and lifeless. He could see her 058.png / high shoulders, the long, pale face, the long, pale 060.png / visible money. "A dollar a go?" Jake queried, 061.png / slowly and rolled like a flash over her plastered 063.png / cern now was to get away, to take the money with 065.png / _{cr}ippling, the other. A chair fell, sliding across the 066.png / rapidly losing power. The woman threw herself on 067.png / up in front of his head; and an intolerable pain shot 068.png / lost. He clung to it; pressed his breast against it; 070.png / tination. His coat, soiled and torn, was buttoned 072.png / don knew, a sovereign and inevitable remedy for all 073.png / -Clare dangerously ill ... a question of dying, 074.png / dark, sliding water, and the mountainous wall 075.png / didn't hear ... oh, there's nothing in it if you 076.png / supper," she worried, when he had told her; "and 077.png / getically, "it will cost a heap of money; how will you 079.png / directed; "and I'll be down to see you ... 
yes, 083.png / and the small assemblage of merely idle or interested 084.png / nip?" he asked, in a solemn, guarded fashion. 085.png / hat drawn over his eyes, a piece of pasteboard in 089.png / ners, a subdued, red riot of the summer, the sun 091.png / but her eyes were unwavering--they held an appeal 093.png / tured her attention and interest; he had not thought 094.png / garment of stars drawn from wall to wall. There 098.png / nuto passage of a symphony; "but it's all one to me 099.png / dressed the other, "don't walk back here, don't come 102.png / pockets in search of the proof of his assertion. In 104.png / ished, leaving the countryside sparkling and serene 106.png / printed upon the otherwise spotless board floor, 107.png / but he belted them into baggy folds. The other appeared 108.png / sorry that he had obeyed the fleeting impulse to enter. 111.png / going to die two or three times the year, and bother 112.png / so utterly, so disastrously, so swiftly upon his complacency, 113.png / the whole worthless lot?" Bartamon demurred: 114.png / turn of the wrist, skilfully avoiding the high underbrush, 115.png / upon a curtain of old blue velvet. He cast once 116.png / ers, from the farm. As he approached he saw that 118.png / turned Lettice Hollidew stood with her tiresome 119.png / fleeted in the warmer tones of his replies; a new 123.png / girl until--until Buckley.,. until to-night, now. 124.png / luctant eagerness. He kissed her again and again, 126.png / ment, and opened them with an effort. The whippoorwills 129.png / of the "small occupations," the minister's reputation 130.png / less suns of the August valleys. He was as seasoned, 131.png / in a rapid sing-song by the circuit rider, Gordon saw 134.png / the sky. He recognized the sharply-cut silhouette 136.png / was smoothly rounded, provocative; its graceful proportion 137.png / across her face, and she turned and disappeared. 
139.png / passed, Gordon gathered the impression of a dark 140.png / he might have had it all. He gazed cautiously, but 141.png / garden patch beyond, Mrs. Caley said. Gordon 142.png / ing that it must be a messenger from the village, dispatched 145.png / soft contours of her virginal breasts, the bracelets 146.png / [Blank Page] 148.png / [Blank Page] 150.png / as I've a mind to?" Gordon demanded belligerently. 151.png / and white, with an occasional red thread drawn 152.png / recorded, the elbow crooked. "Don't forget his 153.png / the church all regular and highly fashionable. He 154.png / examined the details of your late father-in-law's 155.png / came grave at the contemplation of the amount involved. 157.png / legs were ludicrously, inappropriately, long and 158.png / precariously rewarded labor with the stubborn, inimical 162.png / he's as gritty as--why, yes, I do, I'll call him General 163.png / center occupied by a large silver-plated castor, its 166.png / throaty voice, "I'm afraid.... Tell me it will be 167.png / the voice, as it were, of a sinister, tin manikin galvanized 168.png / you'd like to hear General Jackson sing; he's got 170.png / had precipitated this rebellion, this strife in which 172.png / and playing him out. Come here, General JacK-son." 174.png / ing General Jackson at his heels, he picked the dog 179.png / stage-driving days, of the younger years. Her manner 185.png / were close-lipped, somber. The men were sparely 186.png / gathered in the noisome shadows, bottles were 187.png / aside, and a woman walked stiffly out, her hands 188.png / flood. "At last it's about your hearts, your hearts 189.png / duced a small jug. He wiped the mouth on his 190.png / in search of Meta Beggs; perhaps, after all, she had 195.png / for it. Almost everybody wants the same thing--plenty 196.png / hind, as the former made his way toward them. 198.png / surface of blood and hair and dirt. Buckley's eyelids 200.png / off me. 
I was a year and a half there, when--when 201.png / shafts of the trees on the lawn. Supper was in progress 202.png / laid it on the indistinct bed, and moved to the mirror 203.png / by the aid of a hand lantern. He was reluctant to 204.png / women since the dawning of consciousness, that it 206.png / slightest opening; and he continued uncomfortably 207.png / eluded him. "Please," she protested coolly, "don't 208.png / 'but at night--satin gowns with trains and bare 213.png / surprising fore-knowledge of the County, who had 214.png / atory position. He would extract the last penny of 217.png / always seem to be around, to get talked about, when 219.png / existing conditions. "Your wife's estate controls 222.png / with a scant black ribband about her waist, her sole 224.png / rested at Lettice's hand, and, before Gordon, a portentous 227.png / nolia flowers, would never thicken and grow rough. 228.png / rose in Berry's pallid countenance, Sim's portion 229.png / shaggy horses as they lay clumsily down to rest, on 230.png / a long drink. He drank mechanically, without any 231.png / untrimmed. A chair by the bed bore Lettice's 233.png / ing her elbow, shook her. She was as rigid, as unyielding, 237.png / the pleasantest body you'd meet in a day on a horse. 238.png / accomplished fact; Lattice's wishes, her quality of 240.png / bag, he had lamed a horse--a satisfactory driver 241.png / ing, in the sooty shadows. With the necklace of 242.png / night. It was late afternoon of the day on which he 245.png / lace. Finally he found her; or, rather, she slipped 246.png / you. Some people even like it. A man who came 248.png / deepened to its darkest hour; the moon, in obedience 251.png / sible act of cowardice--Lattice, a girl, blinded by 253.png / that gaping, insatiable chasm. He was conscious 254.png / lighter sky. The foliage of the maples, stripped of 255.png / was driving, and by her side ... Lettice! Lettice 257.png / casual subterfuges. 
The evasion which he summoned 258.png / plicity, the weakness, the sensual and egotistical desires, 259.png / her youth haggard with apprehension and pain, the 262.png / it was worse. The buggy, badly hitched, bumped 263.png / silence of months, dispelling the accumulating ill-will 265.png / cfeeded from ... it wasn't as though he had gone 267.png / -what man had not?--but this was different; this 268.png / which totally misrepresented him.... He would 270.png / men; it tampered ferociously with the beauty, the 272.png / horse's hoofs on the road above; the sun moved 273.png / esty on the bed; "there was a good bit I didn't get 274.png / over the uneven boards of the porch. At this hour 276.png / coldness seemed to come through the cover to his fingers. 278.png / a box, the lantern at his feet casting a pale flicker 280.png / [Blank Page] 282.png / scraped of mud, bore long cuts across the heels, 283.png / after him. Then, as he turned, he saw that there 285.png / since he had "called out" Gordon's home; the 287.png / waist had been crisply ironed, her shoes were rubbed 288.png / his way. But, almost immediately, he stopped. 290.png / ing which even the auctioneer grew apathetic. He 292.png / now, suddenly detached from the aimless procession 293.png / in rigid rows on the dresser; the pots were scoured 298.png / the same thing in the Bottom. Ask anybody who 299.png / he pulled his hat over the flaming helmet of hair. 301.png / at either side of the large, uncut stone at the threshold; 303.png / he admitted; "but I haven't got to. It's enough to 306.png / shoulders of men momentarily forgetful or caught in 307.png / the touch of a magic wand. He had never realized 308.png / say, three per cent, grant extensions of time wherever 310.png / a man with a round, freshly-colored countenance, 313.png / that know tell you," Merlier paused at the door, "the 318.png / like that ... 
delicate--" He knelt, with an expression 325.png / an estate estimated at--" he stopped from sheer 327.png / municate with the Tennessee and Northern Company, 328.png / them to take or leave. But, if they delayed, watch 333.png / hardly more alive than the photographed clay of 334.png / inhabited desolation, in a black chasm filled with 337.png / called, in Greenstream, the Portugee; every crop he 338.png / who paid for and removed the bodies of dead animals. 339.png / controlled the Bugle in addition to countless other 345.png / window, saw that the sweep by the stream was filling 346.png / lie cried out of his bitterness of spirit, "but I'd ruther 348.png / light, blurred, mingled, in his vision. He put out 350.png / the iron-like earth. In the pale circle of the lantern 352.png / him. He thought, in sudden approbation of a part, 353.png / stage; formerly Clare had attended to the house for 355.png / ance, over the obscured way. The stage mounted, 359.png / able,--the glassy road enormously increased the labor 360.png / other's arm sweep up.--The switch fell viciously 364.png / village soon's I can; and here you drag and hang 365.png / ing, dead planet. Gleams of light shot like quicksilver 368.png / chamber of the safe. A flickering desire to see led
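[editor's note: the page-top check described in the list above -- flagging pages whose first line starts with a lower-case letter or a bracket, i.e. pages that continue a paragraph and need a blank line added at the top -- is easy to automate. a minimal python sketch; the page-file layout is a hypothetical stand-in, not d.p.'s actual tooling:]

```python
import os
import re


def continues_paragraph(page_text):
    """True if the page's first non-blank line looks like a mid-paragraph
    continuation: it starts with a lower-case letter or a bracket."""
    lines = page_text.strip().splitlines()
    if not lines:
        return False
    return bool(re.match(r"[a-z\[]", lines[0].lstrip()))


def classify_pages(page_dir):
    """Split page files into new-paragraph pages and continuation pages,
    mirroring the two lists above.  The directory layout is hypothetical."""
    new_para, continuation = [], []
    for name in sorted(os.listdir(page_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(page_dir, name), encoding="utf-8") as f:
            text = f.read()
        (continuation if continues_paragraph(text) else new_para).append(name)
    return new_para, continuation
```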
From vze3rknp at verizon.net Wed Jul 23 15:33:00 2008 From: vze3rknp at verizon.net (Juliet Sutherland) Date: Wed, 23 Jul 2008 18:33:00 -0400 Subject: [gutvol-d] not a dialog In-Reply-To: <4887AA74.6000601@pobox.com> References: <238672185.93911216840318819.JavaMail.mail@webmail09> <4887AA74.6000601@pobox.com> Message-ID: <4887B19C.3070108@verizon.net>

Adding these checks to the proofing interface at DP is something that I've wanted for years now. Wordcheck is the first step in that direction, and I keep hoping that some developer will take on the task of writing a companion tool that will make use of regexes and other useful things for pointing out potential errors. This is such an obvious tool that we'd even thought of it before bowerbird mentioned it in the forums. We don't have it yet simply because none of our volunteer developers has been willing to tackle it. If I could wave a magic wand and make it happen, we'd have had it years ago.

JulietS

Roger Frank wrote: > Joshua Hutchinson wrote: > > | Just wanted to pop in to ask if you (or anyone else) has > | looked into incorporating these checks into the proofing > | interface at DP? > > That would be a big boost to productivity. The difficulty > for me is that I'm comfortable with Ruby and Perl but > uncomfortable with PHP, and I think that's an important > deficiency for anyone wanting to integrate it at DP. > That's why for me it's a standalone utility, like guiprep, > only written in Ruby--it's just my limitation in being able > to put it inside a wrapper with something stronger than a > textbox widget. If I could find the equivalent of guiguts' > built in editor/presentation manager, only written in Ruby, > I would certainly use it. That would at least make it > interactive in a "proofing round 0" sense.
> > So bottom line, for me the answer is that it's only a > "I wish I was smart enough to do that" kind of thing. As > a proofer myself at DP, I agree it would be a big win. > > --Roger Frank > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d

From rfrank at pobox.com Wed Jul 23 15:38:52 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 16:38:52 -0600 Subject: [gutvol-d] woman in her own right -- 001 In-Reply-To: References: Message-ID: <4887B2FC.2060607@pobox.com>

Bowerbird at aol.com wrote: > ... the runheads and pagenumbers > are in the text. what's up with that? d.p. policy is to remove them, > and it's something the computer can do quite easily, so why have the > _proofers_ spend their time and energy doing it instead? that's b.s.

Just a point of clarification: yes the page headers have been retained on that (and two other) newcomers-only projects. I have started over a hundred books for newcomers and have provided individual personalized feedback to every P1 proofer on each of those books. That's well over a thousand PMs to the new proofers. What I've learned is that many of the corrections on books that I preprocessed and removed the page headers on were in the P1 not getting the page breaks right, even with the top-of-page code in place. On these last three books, I did the equivalent of leaving the "remove page headers" step out of a traditional guiprep run. Because they are Newcomers Only/Rapid Review projects, I'll know right away if forcing a decision by the P1 regarding what to do at the top of a page is worthwhile. By the way, when the project was created (see the Project Discussion), I announced "In this project, all page headers have been retained.
Follow the standard guidelines to remove them, including adding a blank line if the top of the page starts a new paragraph." Proofers don't have to work on the project if that is onerous to them, nor should they be surprised if they choose to participate. That's what the project thread is for. These three books are an experiment, and since I for one don't already know all the right answers, it is how I discover more about proofers, proofing, and the price of getting better output. --Roger Frank

From Bowerbird at aol.com Wed Jul 23 15:58:37 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 18:58:37 EDT Subject: [gutvol-d] woman in her own right -- 001 Message-ID:

roger said: > yes the page headers have been retained > on that (and two other) newcomers-only projects

in "mountain blood", you left the pagenumbers, even though it would've been quite simple to write code to remove them... in "the crevice", there were hundreds and hundreds of cases where "blaine" was misrecognized as "elaine", which could've been fixed book-wide with a global change. examples abound.

> What I've learned is that many of the corrections on books > that I preprocessed and removed the page headers on > were in the P1 not getting the page breaks right, > even with the top-of-page code in place.

why don't you "get the page breaks right" in _preprocessing_? and then, when a diff comes up, and you see a proofer made a change to what was _already_ correct, you can inform them? this is one of the biggest values of preprocessing, that when a change is made, it can be an indicator of a misinformed proofer.

> These three books are an experiment, and > since I for one don't already know all the right answers, > it is how I discover more about proofers, proofing, > and the price of getting better output.
well, i don't know all the right answers either, but i know that making humans do something that the computer could do faster and easier is something that makes my stomach queasy... runheads are such an _obvious_ example of this, it would be irresponsible of me -- when analyzing this experiment -- to fail to mention such a thing. so just because you don't know "all the right answers" doesn't mean you stop using your brain to intuit them. and really, if we can't agree on the _obvious_ things, then there isn't much sense in having a dialog, or even bothering to type posts back and forth to each other... -bowerbird

From Bowerbird at aol.com Wed Jul 23 16:03:40 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 19:03:40 EDT Subject: [gutvol-d] not a dialog Message-ID:

dkretz said: > You may remember that I implemented a new proofing interface > a year or two ago, which provided a "preview" mode showing > real italics, etc. That has since added a quote-matching display, > and a punctuation reasonability-checker. They may still be on > the dev server - I haven't checked for a long time.

juliet said: > We don't have it yet simply because > none of our volunteer developers > has been willing to tackle it.

if somebody can sort this all out, do please explain it to me, ok? thanks. -bowerbird
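[editor's note: the quote-matching display and punctuation reasonability-checker dkretz mentions can be approximated in a few lines. a rough python sketch, not dkretz's actual implementation; the specific check patterns are illustrative guesses:]

```python
import re


def punctuation_report(page_text):
    """Rough per-page punctuation checks in the spirit of a quote-matching
    display plus a punctuation reasonability-checker."""
    problems = []
    # straight double quotes should pair off within a page
    if page_text.count('"') % 2 != 0:
        problems.append("unbalanced double quotes")
    # a few 'reasonability' patterns; real rules would need to be more careful
    checks = [
        (r"\s[,;:.!?]", "space before punctuation"),
        (r",,", "doubled comma"),
        (r"[a-z]\.[a-z]", "missing space after period"),
    ]
    for pattern, label in checks:
        if re.search(pattern, page_text):
            problems.append(label)
    return problems
```

such a checker would need per-book tuning (abbreviations like "e.g." trip the last rule), which is presumably why nobody has shipped one yet.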
From jayvdb at gmail.com Wed Jul 23 16:23:12 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 24 Jul 2008 09:23:12 +1000 Subject: [gutvol-d] is this a dialog? In-Reply-To: References: Message-ID:

On Thu, Jul 24, 2008 at 3:49 AM, wrote: > john- > > wow. i certainly wasn't expecting anything like _that_. > what a nice surprise. i'm bowled over. > > i will be happy to go over and take a look at wikisource. > > it would be a pleasure to offer my constructive criticism > to an entity smart enough to actually treasure such input. > > and i'd be honored to help improve your infrastructure...

Great; if you have questions, ask at either: http://en.wikisource.org/wiki/Wikisource:Scriptorium http://en.wikisource.org/wiki/User_talk:Jayvdb Don't worry about form or format; if you mess up we'll fix it and let you know for next time, and thank you for trying.

> right from the get-go -- with the wiki structure and your > ability to run bot-based error-finding routines -- i'd say > you have some fantastic potential there. really fantastic. > > my apps are written in basic, so my code won't help you, > but i'm skilled at expressing them in pseudo-code, so if > you've got web-programmers to implement my routines, > we'll be able to work together.

We have programmers of all flavours. Opportunity cost will mean your ideas will have to be as good as, or better than, the constant stream of user-requested enhancements, but it sounds like that won't be a problem.

> and if it wasn't clear, my offline tools are cross-plat apps > that are available at zero cost. (i'd guess that people still > are more efficient doing this work offline, but i'm willing > to let some crafty web-programmers prove me wrong...)

We haven't done much offline processing, however I can see what you are saying. I converted the following work from PG etext to wiki structure and format using a once-off script.
Once converted, a bot uploaded it using "pagefromfile.py" http://en.wikisource.org/wiki/A_Short_Biographical_Dictionary_of_English_Literature http://meta.wikimedia.org/wiki/Pagefromfile.py

Most of our tools are designed to work on the user's machine, interfaced to the _live_ wiki database. i.e. our software is decentralised. Our development is decentralised. The software that runs the Wikisource (and Wikipedia) system _must_ be open source, however we want people to do whatever pleases them.

A bot edit is essentially just a human edit; the wiki system doesn't care how the edit is made. We have _social_ rules around bots, some of them unwritten, because they can make a mess very quickly and it will take a long time to fix the mess. Essentially, the rules boil down to: don't make a mess.

We have frameworks in python, perl, php, etc., so if there are existing error checking/fixing programs or routines already available, _anyone_ can integrate those into a bot, automated or requiring human intervention, that processes any page that is currently marked as "not proofread". I suggest that the bot should run without human intervention, because humans will be proofreading the text anyway. If the bot is consistently making good changes, it will go through an approval process which brings two benefits: 1) it is then approved to run at full speed, and 2) it is hidden from the "Recent changes" view humans watch to review ongoing changes.

If a bot finds errors that it can't fix, it could report the details of those on the "talk" page that accompanies every transcription page. If the bot is regularly bringing pages up to "proofread" quality, it could even be approved to mark a page as such if the bot can find no further issues with the page. Those pages still need to be verified by a human, so any error can still be identified and fixed.
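[editor's note: the bot workflow john outlines -- clean up pages marked "not proofread", report what can't be fixed on the talk page, and (once approved) promote clean pages -- can be sketched as a simple loop. everything in this sketch is hypothetical: the wiki and page objects stand in for a real bot framework's interface, and each check is assumed to take text and return (possibly fixed text, list of problems):]

```python
def bot_pass(wiki, checks):
    """One pass of the proofreading bot described above.  The wiki/page
    API here is a hypothetical stand-in, not a real framework's interface."""
    for page in wiki.pages_with_status("not proofread"):
        original = page.text()
        text, unresolved = original, []
        for check in checks:
            text, problems = check(text)   # a check may rewrite the text...
            unresolved.extend(problems)    # ...and report what it couldn't fix
        if text != original:
            page.save(text, summary="bot: routine cleanup")
        if unresolved:
            # leave the details for the human proofreader on the talk page
            page.talk_page().add_section("bot report", "\n".join(unresolved))
        else:
            # an approved bot may promote the page; a human still verifies it
            page.set_status("proofread")
```

pages the bot does not promote keep their "not proofread" status, which is what produces the side benefit described in the next paragraph.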
A side benefit of this is that pages which the bot doesn't mark as "proofread" will be _more carefully_ inspected by a human, because in the back of everyone's mind will be: there is something about this page that the bot didn't like.

> i will respond here when i've taken a look at wikisource, > just to show the kind of interaction p.g. could have had, > but if we continue on for long, we can take it elsewhere...

Looking forward to some fresh ideas and criticism. -- John

From Bowerbird at aol.com Wed Jul 23 16:40:58 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 19:40:58 EDT Subject: [gutvol-d] is this a dialog? Message-ID:

john said: > Don't worry about form or format; if you mess up we'll fix it > and let you know for next time, and thank you for trying.

i like that.

> We have programmers of all flavours.

good to hear.

> Opportunity cost will mean your ideas will have to be as good as, > or better than, the constant stream of user-requested enhancements

wouldn't have it any other way.

> We haven't done much offline processing, > however I can see what you are saying.

ok.

> Most of our tools are designed to work on the user's machine, > interfaced to the _live_ wiki database.

that sounds like a good approach, best of both worlds. i'm accustomed to thinking along the lines of what p.g. does, where a book is done offline and then uploaded to the project. but banana cream downloads scans if they're not on your machine, so it shows the start of an effort to bridge the offline/online chasm.

> i.e. our software is decentralised. Our development is decentralised.

ok, good.

> If the bot is regularly bringing pages up to "proofread" quality, > it could even be approved to mark a page as such if the bot > can find no further issues with the page.

right. that's the nature of what i meant when i talked about "respect" that is due to the human volunteer proofer.
don't give them a page which is still marred by deficiencies that even the machine can find. that's the kind of thing that should be done in _preprocessing_...

> Those pages still need to be verified by a human, > so any error can still be identified and fixed.

excellent.

> A side benefit of this is that pages which the bot doesn't mark > as "proofread" will be _more carefully_ inspected by a human, > because in the back of everyone's mind will be: there is something > about this page that the bot didn't like.

better if the bot can say exactly what it is that it didn't like...

> Looking forward to some fresh ideas and criticism.

sounds like your head is on straight. that's very refreshing. i've carved out time tomorrow to take a good look at it... :+) -bowerbird

From dakretz at gmail.com Wed Jul 23 16:59:25 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 16:59:25 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 28 In-Reply-To: References: Message-ID: <627d59b80807231659n75615f6r4469f55d8e5460d4@mail.gmail.com>

rfrank, Bird, here are the EB regexes (in the re.vim file). Have at 'em! :) Note that, in some cases, I've been tracking how many changes they incurred. That's dependent on the order in which they were invoked, however, since they overlap. Also, several can be (should be) used recursively.
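[editor's note: "used recursively" here just means re-applying a rule until it stops matching, since one pass of a substitution can leave (or recreate) text that still matches. a small python sketch of that fixpoint idea; the space-collapsing rule in the test is an illustrative example, not one of the EB regexes:]

```python
import re


def apply_until_stable(pattern, repl, text, max_passes=20):
    """Apply one regex substitution repeatedly until the text stops
    changing, for rules whose output can still match the pattern."""
    for _ in range(max_passes):
        text, n = re.subn(pattern, repl, text)
        if n == 0:
            break
    return text
```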
From rfrank at pobox.com Wed Jul 23 17:39:33 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 18:39:33 -0600 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 28 In-Reply-To: <627d59b80807231659n75615f6r4469f55d8e5460d4@mail.gmail.com> References: <627d59b80807231659n75615f6r4469f55d8e5460d4@mail.gmail.com> Message-ID: <4887CF45.4040807@pobox.com>

don kretz wrote: > rfrank, Bird, here are the EB regexes > (in the re.vim file). > Have at 'em! :)

Got 'em. Looks like really good work. Thanks! --Roger Frank

From rfrank at pobox.com Wed Jul 23 18:15:12 2008 From: rfrank at pobox.com (Roger Frank) Date: Wed, 23 Jul 2008 19:15:12 -0600 Subject: [gutvol-d] BB | /dev/null Message-ID: <4887D7A0.9030405@pobox.com>

Bowerbird wrote: | and really, if we can't agree on the _obvious_ things, | then there isn't much sense in having a dialog, or even | bothering to type posts back and forth to each other...

Perfect! This gutvol-d forum will be better for it. Sometimes it's difficult not to respond to some of your posts, which I'm beginning to think is intentional. To help me not be tempted, I'll just put you in my kill file and suggest you do the same for me. I do what I do for fun. Being insulted, being called stupid, having the effort of hundreds of hours of coding be denigrated isn't fun. I want to live with intention, and too much time spent on gutvol-d is not on my chosen path. I want to continue to learn, and lucky for me there are many places and many people outside this list that can provide that opportunity. I want to appreciate my friends, and over the years I have made many friends at DP where relationships are based on mutual respect and the bonds that grow when experiences--real work, projects, code development and such--are shared. And mostly I want to do what I love, and so I'll go and do that.
--Roger Frank

From Bowerbird at aol.com Wed Jul 23 19:16:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 23 Jul 2008 22:16:09 EDT Subject: [gutvol-d] BB | /dev/null Message-ID:

roger said: > To help me not be tempted, I'll just put you in my kill file > and suggest you do the same for me.

oh no, not so fast, pilgrim. i don't just put people in my kill file. they have to _earn_ that distinction! you've said a couple stupid things, and dipped into the ad hominem, but you haven't quite earned it yet. :+)

> I do what I do for fun.

hey, me too. :+)

> Being insulted, being called stupid, > having the effort of hundreds of hours of coding be denigrated > isn't fun.

i've been the victim of insults here countless times. and i agree with you, it isn't fun... but i'm tough... i've also had my "hundreds of hours of coding" be denigrated. some even "wonder if it exists", believe it or not. that strikes me as humorous, but i still can't stretch to call it "fun". so i agree. but again, i'm tough, so it doesn't bother me... oh, and i never "called you stupid"... that would be ad hominem, and i stay away from that type of thing. what i said was that "what you just said was stupid". do you get the difference? one way is saying that the _person_ is stupid, the other way is saying that the _statement_ is stupid. statements can be stupid. they can be really stupid. and i suppose if you made _enough_ stupid statements, i would feel confident saying that _you_ were stupid. but in the meantime, it's enough to label the _statements_ as being stupid. because in a search for _the_truth_, it is _imperative_ that you call out stupid statements as _being_ stupid, or the fact that they hang around gives 'em credence. if you want to discuss whether it was or was not stupid, that can be done. i stand by my claim that it was stupid. i didn't make it "personal". if you take it that way, it's you exhibiting a behavior that you have _chosen_ to exhibit...
> I want to live with intention, and too much time > spent on gutvol-d is not on my chosen path. that's cool. my intention is to make the lobby of the project gutenberg library a lively place to be. i spend as much time as necessary to do that. > I want to continue to learn, and lucky for me > there are many places and many people > outside this list that can provide that opportunity. that's cool. me, i have the opportunity to hang with other performance poets; it's our job to comfort the afflicted, and to afflict the comfortable... > I want to appreciate my friends, and over the years > I have made many friends at DP that's cool. i'd say you fit right in with that crowd. mind meld. > where relationships are based on mutual respect and > the bonds that grow when experiences--real work, > projects, code development and such--are shared. sounds a bit like disneyland, the _happiest_ place on earth. > And mostly I want to do what I love, > and so I'll go and do that. that's cool. i'll stay here and continue to do what i love. :+) *** and there you have it, folks. the dust has settled. so we can go back to what we were doing before, which is to clearly document the inefficiency which is a direct result of the terrible workflow over at d.p. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/78c720d6/attachment.htm From dakretz at gmail.com Wed Jul 23 21:43:38 2008 From: dakretz at gmail.com (don kretz) Date: Wed, 23 Jul 2008 21:43:38 -0700 Subject: [gutvol-d] wikisource Message-ID: <627d59b80807232143q7f3d936aj71f5aa8e34d7baab@mail.gmail.com> John Vandenberg, Have you ever loaded any of the Encyclopedia Britannica projects into wikimedia? Does it seem like a fit to you? 
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080723/65de2333/attachment.htm From jayvdb at gmail.com Wed Jul 23 22:23:55 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 24 Jul 2008 15:23:55 +1000 Subject: [gutvol-d] wikisource In-Reply-To: <627d59b80807232143q7f3d936aj71f5aa8e34d7baab@mail.gmail.com> References: <627d59b80807232143q7f3d936aj71f5aa8e34d7baab@mail.gmail.com> Message-ID: On Thu, Jul 24, 2008 at 2:43 PM, don kretz wrote: > John Vandenberg, > > Have you ever loaded any of the Encyclopedia Britannica projects into > wikimedia? Does it seem like a fit to you? It is a very good fit, but hasn't been driven by anyone due to priorities. Wikisource has a complete set of Catholic Encyclopedia 1913, uploaded by a dedicated soul by scraping the content from newadvent.org (iirc), converting it to wiki syntax, and pushing it into the database by a bot. We have three people who have the complete work at home to check any anonymous improvements that are made, and one person who has been slowly going through and actively improving the pages. A slow and lonely job. Recently oce.catholic.com was launched with a complete set of pagescans, which has helped us distribute this task a little, however sadly that site is claiming copyright and has added an atrocious watermark to the images. We are not yet bold enough to import their images. Going back to EB1911, almost all of EB1911 is in Wikipedia, however their aim was to incorporate it as a basis for new articles. http://en.wikipedia.org/wiki/WP:EB1911 To find the EB1911 text that was imported into Wikipedia, you need to squirrel down to the bottom of the history of a Wikipedia page, or close to it. So far I have found that the text is _better_ than jrank, but nowhere near as good as the etexts that DP/PG has produced.
I would like to find the Wikipedians who created these pages in Wikipedia; maybe they still have the raw text which was imported. There has been a recent discussion about EB1911 here (it is intermingled throughout the wild discussion; scan for the tables): http://en.wikipedia.org/wiki/Wikipedia_talk:Plagiarism (where I argue that it is improper for Wikipedia to be so vague about what Wikipedia text comes from the PD; clear attribution and accessibility of the original is the solution) On Wikisource, we have slowly been building a verifiable reconstruction of EB1911: http://en.wikisource.org/wiki/EB1911 and the "project page" for that effort is at http://en.wikisource.org/wiki/WS:EB1911 We have a complete set of scans in TIFF and PNG at: http://en.wikisource.org/wiki/User:Tim_Starling Enjoy, John From hart at pglaf.org Thu Jul 24 07:05:53 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 24 Jul 2008 07:05:53 -0700 (PDT) Subject: [gutvol-d] language counts 2008-07-24 (fwd) Message-ID: From our automated count. I presume there are hundreds more, as when we hit 25,000. . . .
Michael

Grand total for today: 26000

22197 English en
1204 French fr
539 German de
451 Finnish fi
344 Dutch nl
319 Chinese zh
246 Portuguese pt
194 Spanish es
150 Italian it
56 Latin la
54 Tagalog tl
50 Esperanto eo
40 Swedish sv
20 Danish da
20 Catalan ca
10 Welsh cy
10 Norwegian no
7 Russian ru
7 Icelandic is
7 Hungarian hu
6 Middle English enm
6 Greek el
6 Bulgarian bg
4 Serbian sr
4 Polish pl
4 Hebrew he
4 Friulano fur
3 Old English ang
3 Nahuatl nah
3 Japanese ja
3 Iloko ilo
3 Czech cs
3 Afrikaans af
2 Mayan Languages myn
1 Yiddish yi
1 Slovak sk
1 Sanskrit sa
1 Romanian ro
1 North American Indian nai
1 Napoletano-Calabrese nap
1 Maori mi
1 Lithuanian lt
1 Korean ko
1 Khasi kha
1 Iroquoian iro
1 Irish ga
1 Interlingua ia
1 Gascon gsc
1 Gamilaraay kld
1 Galician gl
1 Frisian fy
1 Cebuano ceb
1 Breton br
1 Arapaho arp
1 Aleut ale

From hart at pglaf.org Thu Jul 24 10:21:33 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 24 Jul 2008 10:21:33 -0700 (PDT) Subject: [gutvol-d] eBook Milestones Message-ID: On July 24, 2008, the original Project Gutenberg eLibrary reached 26,000 titles, which should be considered with an assortment of other titles; 1653 from PG of Australia and 509 from PG of Europe, as well as 138 from our latest PG, Project Gutenberg of Canada, and 377 in PrePrints. In addition, there were several dozen titles our programs have not seemed to manage to count as we have posted book numbers up to 26,119, with only 30-40 reserved numbers.
Thus, the possible grand totals could be as high as:

26,089 from original Project Gutenberg [US copyright]
1,653 from Project Gutenberg of Australia
508 from Project Gutenberg of Europe
377 from Project Gutenberg PrePrints
======
-30
28,622 Grand total [presuming 30 numbers reserved]

On this same date, The world eBook Fair made it to totals of 1 1/4 million total entries:

500,000+ The World Public Library
468,000+ The Internet Archive
160,000+ eBooksAboutEverything.com
17,000+ International Music Score Library Project
28,000+ Original Project Gutenberg eBooks
75,000+ Project Gutenberg Consortia Center
==========
1,250,000+ Grand Total

From hart at pglaf.org Thu Jul 24 10:24:29 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 24 Jul 2008 10:24:29 -0700 (PDT) Subject: [gutvol-d] FIXED: eBooks Milestones Message-ID: A few little things adjusted. Please let me know of more suggestions, comments, etc. On July 24, 2008, the original Project Gutenberg eLibrary reached 26,000 titles, which should be considered with an assortment of other titles; 1653 from PG of Australia and 509 from PG of Europe, as well as 138 from our latest PG, Project Gutenberg of Canada, and 377 in PrePrints. In addition, there were several dozen titles our programs have not seemed to manage to count as we have posted book numbers up to 26,119, with only 30-40 reserved numbers.
Thus, the possible grand totals could be as high as:

26,084 from original Project Gutenberg [US copyright]
1,653 from Project Gutenberg of Australia
508 from Project Gutenberg of Europe
377 from Project Gutenberg PrePrints
======
-35
28,622 Grand total [presuming 35 numbers reserved]

On this same date, The world eBook Fair made it to totals of 1 1/4 million total entries:

500,000+ The World Public Library
468,000+ The Internet Archive
160,000+ eBooksAboutEverything.com
17,000+ International Music Score Library Project
28,000+ Original Project Gutenberg eBooks
75,000+ Project Gutenberg Consortia Center
==========
1,250,000+ Grand Total

From Morasch at aol.com Thu Jul 24 12:43:24 2008 From: Morasch at aol.com (Morasch at aol.com) Date: Thu, 24 Jul 2008 15:43:24 EDT Subject: [gutvol-d] woman in her own right -- 002 Message-ID: here's some more data from "woman in her own right"... appended is a list of 31 more questionable lines, raising our total from our first 2 passes to 86 hits. -bowerbird

> ~"TELL ME ALL ABOUT YOURSELF," HE SAID. .Frontispiece
> land--open the shutters, Mose, so we can see. . . .
> 'traits. . . . There, -sir, is a set of twelve
> in a comfortable chair, lit a cigarette. . . .
> whom he paid, would miss him. . . .
> order--and then tell me what you think of it." . . .
> sailing vessel, or a motor boat, obtainable? . . .
> what's that you say? . . . Miles Casey?--on Fleet
> Street, near the wharf? . . . Thank you!--He
> it mild. . . . Betty Whitridge and Nancy Wellesly
> of something over seventy-five. . . . That is about
> across on the other. . . . Now," as they wound up
> any one reads that letter, the jig is up for us. . . .
> letter and the money were gone. ....
> Lie low. . . . He's not coming this way--he's going
> to inspect the big trees, on our left. . . . They won't
> lines drawn from them intersect? ~" . . .
> side. . . . "Now, sir, what is it?" as the flaps
> equal! . . . Now, if you'll be quiet a moment, like
> you'll not be averse to hear. . . . So, that's better.
> ~. . . Thank you! Now, you may arise and shake
> he not? . . .
> fallen, by adversity, from better things. . . .
> A little of all three, he concluded. . . . But,
> -of the stocks and bonds, from the Trust
> himself. . . .
> ..........What do you make of it? ~" he
> be found. .*. . It makes everything seem very real
> "Better be a little careful, Bill! "he said. . "I
> fell to thinking. . . . Presently, worn out by
> "We'll have the full effulgence, if you please." . . .

-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080724/7598deeb/attachment.htm From Bowerbird at aol.com Thu Jul 24 12:46:30 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Jul 2008 15:46:30 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 023 Message-ID: 23. search for all lines with a dash followed by a capital letter. a small number of lines (2) _starting_ with a dash were wrong, since that dash was a misrecognized em-dash; they were fixed.

> -Clare dangerously ill ... a question of dying,
> -Why! what's the matter with you, Makimmon?

another line had a word misrecognized, including a dash-capital-i:

> face, with its heavy, good features and slow-Idndling

and there were two other lines where the dash-capital was correct:

> and she sat on the bed with a "G-G-God!" Jake
> On an afternoon of mid-August Gordon was

3 lines were corrected, for a grand total of 201, on 23 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today.
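[the dash-plus-capital scan bowerbird describes as "routine 23" is easy to reproduce; a minimal sketch, using sample lines quoted in his post -- the function name and the exact regex are our own guesses at what such a routine looks like, not his actual code:]

```python
import re

# Flag any line containing a dash immediately followed by a capital letter.
# This catches both line-initial hits ("-Clare", a misrecognized em-dash)
# and mid-line hits ("slow-Idndling"); correct dash-capitals like
# "mid-August" are also flagged, which is why a human reviews each hit.
DASH_CAP = re.compile(r"-[A-Z]")

def flag_dash_capital(lines):
    """Return (line_number, line) pairs that need a human look."""
    return [(n, line) for n, line in enumerate(lines, 1) if DASH_CAP.search(line)]

sample = [
    "-Clare dangerously ill ... a question of dying,",
    "face, with its heavy, good features and slow-Idndling",
    "On an afternoon of mid-August Gordon was",
    "a perfectly ordinary line",
]
for n, line in flag_dash_capital(sample):
    print(n, line)
```

[as in the post, the scan only nominates lines; deciding which dash-capitals are genuine errors stays a manual step.]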
(http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080724/15246302/attachment.htm From Bowerbird at aol.com Thu Jul 24 17:12:20 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 24 Jul 2008 20:12:20 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john- well, i went to wikisource, and poked around for a little bit. just some general questions and observations for now... i'm not sure i grok the structure of the place quite yet, but i accidentally managed to get to some proofreading interface: > http://en.wikisource.org/wiki/Page:Robertson_Scottish_Gaelic_Dialects0328.png i used the up-arrow on that page to go here: > http://en.wikisource.org/wiki/Index:Scottish_Gaelic_Dialects but i'm not sure how i would navigate to that page otherwise, and i don't see where i can overview the books being-proofed? and do you track how many times a page has been proofed? i'm not talking just about how many times the page was edited, but also how many times it was proofed and no errors found. my methodology for deciding a page is "done" is when a certain number of proofers (e.g., 2) examined it and found no errors... (choose a higher number for a greater probability of no errors.) i believe it's best if this data is exposed to proofers, so those who seek a bigger challenge can choose the more-well-proofed pages (that might harbor the "elusive error"), while proofers who prefer to do lots of easy corrections will choose the "raw o.c.r." pages... i like the "magnifier" you've got on your images, it's very useful. the images themselves are bandwidth-huge (2.5 megs each!). that's the standard wiki editor, isn't it? -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. 
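[bowerbird's "done" rule above -- a page is finished once some number of proofers have examined it and found no errors -- can be sketched in a few lines; the restart-on-error behavior is our assumption, since the post doesn't spell out what a correcting pass does to the count:]

```python
def page_done(pass_results, k=2):
    """bowerbird's stopping rule: a page is 'done' once the k most recent
    proofing passes each found zero errors.  pass_results is a list of
    error counts per pass, oldest first; a pass that made corrections
    restarts the streak (an assumption, not stated in the post)."""
    streak = 0
    for errors in pass_results:
        streak = streak + 1 if errors == 0 else 0
    return streak >= k

print(page_done([3, 0, 0]))        # True: two clean passes after a fix
print(page_done([0, 1, 0], k=2))   # False: streak broken by the middle pass
```

[raising k buys a higher probability that no error remains, exactly as the post says, at the cost of more passes per page.]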
(http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080724/1166a315/attachment.htm From jayvdb at gmail.com Thu Jul 24 20:31:23 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Fri, 25 Jul 2008 13:31:23 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Fri, Jul 25, 2008 at 10:12 AM, wrote: > john- > > well, i went to wikisource, and poked around for a little bit. Great, I've been anxiously watching for a user "Bowerbird" to be created. If you have used a different username, please put me out of my misery and let me know, privately if you would prefer. > just some general questions and observations for now... > > i'm not sure i grok the structure of the place quite yet, but > i accidentally managed to get to some proofreading interface: >> >> http://en.wikisource.org/wiki/Page:Robertson_Scottish_Gaelic_Dialects0328.png Note that the page begins with "Page:" - more on that to follow. That page is proofread, and waiting for validation. It is color coded as yellow. If it is correct, click edit at the top, and click the radio button that is color coded green, down the bottom. Then click "Save page". No need to worry about the edit summary field; it will be automatically filled in. > i used the up-arrow on that page to go here: >> http://en.wikisource.org/wiki/Index:Scottish_Gaelic_Dialects > but i'm not sure how i would navigate to that page otherwise, > and i don't see where i can overview the books being-proofed? It has only recently been added, and has so far been only worked on by a single user. 
That user hasn't added it to the list of transcription projects; I have corrected that now: http://en.wikisource.org/wiki/Wikisource:Transcription_Projects#Projects_needing_to_be_proofread All English transcription projects are automatically categorised into this dynamic page: http://en.wikisource.org/wiki/Category:Index All pages that are yet to be proofread can be found in another dynamic category: http://en.wikisource.org/wiki/Category:Not_proofread > and do you track how many times a page has been proofed? > i'm not talking just about how many times the page was edited, > but also how many times it was proofed and no errors found. We have a two stage process. The page is marked as "proofread" (yellow) when one person has proofed it, and we expect that they have found and fixed all transcription issues. It isn't necessary that layout issues have been resolved yet. A second person comes along and validates (green) that it is indeed an accurate transcription. At the moment, these are tracked by categories: http://en.wikisource.org/wiki/Category:Proofread http://en.wikisource.org/wiki/Category:Validated At the same time, the work may be published. View this page, and click edit. http://en.wikisource.org/wiki/Scottish_Gaelic_Dialects Notice that it only contains references to the transcribed pages, and that it does _not_ contain any prefix (i.e. no "Page:"). This page is a logical layer that has been created on top of the transcription pages. We measure the size of our wiki in number of published pages of this kind. We add markup into the transcription pages, often with much complexity. For example, this set of pagescans... http://en.wikisource.org/wiki/Index:H.R._Rep._No._94-1476 ... is presented as a single authentic page ... http://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476) ... and then as an annotated edition, with corrections which can be found at the bottom.
http://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476)/Annotated > my methodology for deciding a page is "done" is when a certain > number of proofers (e.g., 2) examined it and found no errors... > (choose a higher number for a greater probability of no errors.) our methodology for deciding when a page is done is far more fluid. It stops being a transcription project once the pages are all green. We hope that the proofreaders have worked together, documented their choices, and the result is consistent. For example here is our "proofreading project of the month" (I believe the text came from Project Gutenberg, so this is primarily a small project to reunite the text with a set of pagescans of an identifiable edition). http://en.wikisource.org/wiki/Index:Wind_in_the_Willows_(1913).djvu On the accompanying talk page are notes: http://en.wikisource.org/wiki/Index_talk:Wind_in_the_Willows_(1913).djvu Once we have finished that project, we will then reconstruct an earlier edition (I had hoped we would do this earlier edition first, but ... we've got nothing but time..). http://en.wikisource.org/wiki/Index:Wind_in_the_Willows.djvu > i believe it's best if this data is exposed to proofers, so those who > seek a bigger challenge can choose the more-well-proofed pages > (that might harbor the "elusive error"), while proofers who prefer > to do lots of easy corrections will choose the "raw o.c.r." pages... We believe it is best if the images and associated data are exposed to the _reader_. Every reader is a potential contributor. It is readers who find the most difficult transcription errors, simply because there are more of them, and they are the ones that are shocked. A proofreader is looking for problems, and can easily miss them for looking at them. A reader isn't expecting problems, and so is rudely awakened from their enjoyable reading when they see an error. Once a transcription project has been completed, the text is still editable so anyone can "value add". For fiction, we discourage semantic markup as it is distracting, however for other types of works, e.g. scientific, biographical, etc., transcription is just the beginning. The process is also not strict. Here is a work that has been semantically marked up prior to being proofread, because the person doing the work is actually more interested in creating Wikipedia biographies for the people named in it, and extracting the images therein. The transcription project is basically just a way for him to keep track of where he is up to: http://en.wikisource.org/wiki/Index:A_Concise_History_of_the_U.S._Air_Force.djvu I don't recall finding any errors in that transcription as yet, so the red pages should probably have been marked yellow on creation. Not that it matters, because two people verifying it doesn't hurt, and it is interesting to boot. As an example of the fun part of Wikisource proofreading, notice on pagescan 7, there is a link underneath "wreck of bloodstained wood, wire, and canvas". Clicking on that takes the reader to the NYT article the quote comes from, complete with pagescans. http://en.wikisource.org/wiki/Page:A_Concise_History_of_the_U.S._Air_Force.djvu/7 On page 8, we are trying to get our hands on something tangible for the quote "Why all this fuss about airplanes for the Army? I thought we already had one." and have located the pagescans for the mentioned appropriations act of March 3, 1911. http://en.wikisource.org/wiki/Page_talk:A_Concise_History_of_the_U.S._Air_Force.djvu/8 Just now I have added this to our todo list. > i like the "magnifier" you've got on your images, it's very useful. > > the images themselves are bandwidth-huge (2.5 megs each!). Our images vary in size, depending on where they came from. We don't have any rules on quality.
One of our recent "featured texts" started from a poor res gif, which I uploaded because a website was given a take down order for a collection of PD obituaries. http://en.wikisource.org/wiki/Image:Charles_Babbage_(Obituary%2C_The_Times).gif This text was not transcribed on a "Page:" , as the distinction between physical pages and the logical overlay was not in common use at the time. Here is the published page: http://en.wikisource.org/wiki/The_Times/The_Late_Mr._Charles_Babbage%2C_F.R.S. Others like the idea of featuring this interesting obit, so they hit the stacks and scanned a much higher res image: http://en.wikisource.org/wiki/Image:Obituary_for_Charles_Babbage.png While it was featured, which means it was on the front page, the pages are protected so that only site administrators may edit it. This is mostly to prevent vandalism. As it turned out, the transcription was not 100% correct .. and an anonymous person fixed the error after we removed the protection: http://en.wikisource.org/w/index.php?title=The_Times%2FThe_Late_Mr._Charles_Babbage%2C_F.R.S.&diff=619858&oldid=617551 Any contributor can upload new images, djvu files, ogg files, or even pdf files. If a pdf or a djvu file turns up without an OCR layer, high res. images are very useful to allow someone else to grab them and push them through OCR, often kicking and screaming in the case of unconventional scripts and layouts. This DjVu file, with embedded OCR, can then be uploaded over the top of the original file (the old file is still accessible), and then our bots automatically extract the text layer from the djvu file and create the pages to be proofread. > that's the standard wiki editor, isn't it? It has been developed for Wikisource. 
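[John's bot pipeline above -- a DjVu file arrives with an embedded OCR layer, the bots pull out the text layer and create the "Page:" pages to be proofread -- can be sketched roughly as follows; the actual Wikisource bot code is not shown in this thread, so the djvutxt call (from DjVuLibre, assumed installed) and the page-title pattern are our assumptions:]

```python
import subprocess

def djvu_page_text(djvu_path, page):
    """Pull the OCR text layer for one page with DjVuLibre's djvutxt.
    This is a guess at the kind of call the bots make, not their code."""
    out = subprocess.run(
        ["djvutxt", f"--page={page}", djvu_path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

def page_title(index_title, page):
    """Build the 'Page:' title a bot would create for one scan page,
    mirroring names like Page:Wind_in_the_Willows_(1913).djvu/7."""
    return f"Page:{index_title}/{page}"

print(page_title("Wind_in_the_Willows_(1913).djvu", 7))
```

[each extracted page of text would then be saved under its `Page:` title, ready for a human to proofread against the scan.]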
-- John From Bowerbird at aol.com Thu Jul 24 23:35:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 25 Jul 2008 02:35:19 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > I've been anxiously watching for a user "Bowerbird" to be created. i just did, so you could stop being anxious... :+) > All English transcription projects are automatically > categorised into this dynamic page: > http://en.wikisource.org/wiki/Category:Index ok, great, thank you. > All pages that are yet to be proofread can be found > in another dynamic category: > http://en.wikisource.org/wiki/Category:Not_proofread got it... > We have a two stage process. > The page is marked as "proofread" (yellow) > when one person has proofed it, and we expect > that they have found and fixed all transcription issues. yes, i read up on that since. thanks for all the other information you provided. if i have any more questions, i will let you know... i expect it will take me several weeks to get up to speed with enough knowledge of your system to ask questions that might be a reflection on the quality of your workflow, but if anything's up in the interim, i'll let you know that too. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/7e013d29/attachment-0001.htm From ralf at ark.in-berlin.de Fri Jul 25 01:46:06 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Fri, 25 Jul 2008 10:46:06 +0200 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: <20080725084606.GB1589@ark.in-berlin.de> You wrote > ... It is readers > who find the most difficult transcription errors, simply because there > are more of them, and they are the ones that are shocked.
A > proofreader is looking for problems, and can easily miss them for > looking at them. A reader isn't expecting problems, and so is rudely > awakened from their enjoyable reading when they see an error. The only time I've seen such cynicism was with companies releasing beta software to minimize the costs of testing. The fundamental difference, however, between proofing transcriptions and software testing is that software is never finished. Of course, the two-step proofing of Wikisource will leave lots of errors, partly because of its two-stepness but also because of the clumsiness of the proofing interface when compared to PGDP. Result is significantly lower quality than DP-released books. Sorely missing are also print and other format versions of the texts. Regards, ralf From jayvdb at gmail.com Fri Jul 25 04:43:15 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Fri, 25 Jul 2008 21:43:15 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080725084606.GB1589@ark.in-berlin.de> References: <20080725084606.GB1589@ark.in-berlin.de> Message-ID: On Fri, Jul 25, 2008 at 6:46 PM, Ralf Stephan wrote: > You wrote >> ... It is readers >> who find the most difficult transcription errors, simply because there >> are more of them, and they are the ones that are shocked. A >> proofreader is looking for problems, and can easily miss them for >> looking at them. A reader isn't expecting problems, and so is rudely >> awakened from their enjoyable reading when they see an error. > > The only time I've seen such cynicism was with companies releasing > beta software to minimize the costs of testing. The fundamental > difference, however, between proofing transcriptions and software > testing is that software is never finished. It isn't cynicism to be pragmatic. I am talking about opportunity cost and diminishing returns. Many early Project Gutenberg etexts are riddled with errors, have paragraphs missing, don't include images that are integral to the work, etc, etc. Thank goodness PG put them out anyway, and either readers or other reviewers identified problems and provided corrections. There are many errors in those etexts to this day, but they are still useful. English Wikisource works on the same premise as PG, open source, Wikipedia, etc. Near enough is usually good enough, and if someone really wants better, they'll have the motivation and/or funding to fix it. This is often inappropriate where high availability is required, or where the client of the output is paying big bucks. Then you can put a good QA team onto the task, and find every problem, in order to decide whether it needs to be fixed. When proofreaders are volunteers, it is best to let them go only as far as they are motivated to go. This is at odds with the running to the "finish" line mentality, which forces many sets of eyes to review a work in order to be 100% sure it is finished. The problem with this approach is that those reviewers aren't having fun. They are looking for errors, but if they have found none in the last 10 pages, in the back of their mind they are pretty sure there are none - or they know that it is going to be damn hard to find one, so they are not happy to be looking for the needle in the haystack. The English Wikisource model differs slightly from the "finish line" approach because the text is immediately tossed into the public eye on the web, with quality indicators to alert the reader to the status of the text. Being a wiki, it is hard wired into the system that we expect gradual improvements to occur. As a result, the English Wikisource is not pushing texts across the finish line quite so hard, and readers are pointing out the odd error they find. This is pragmatic because we are small, and pragmatic because readers are not being told it is finished - the reader is hopefully fully aware that a wiki _can_ be full of rubbish - Wikipedia should have taught everyone that by now. Our approach in this may change in time, and the German Wikisource project is run much more like PGDP, so maybe there is hope yet for the English project. (as an aside, I only know the English and Latin Wikisource projects well; I know the practises of the French or German Wikisource projects to be quite different, and I am sure that other smaller language projects have their own policies and models.) Personally, I like the instant gratification of putting online a 99% accurate transcription. I also like to verify the work of others, but I can't stomach much of it - the eyes blur over and I start thinking about the weekend. My hat is off to those that have been doing this painstaking work. > Of course, the two-step proofing of Wikisource will leave lots of > errors, partly because of its two-stepness but also because of the clumsiness of > the proofing interface when compared to PGDP. Result is significantly > lower quality than DP-released books. Sorely missing are also > print and other format versions of the texts. Please explain where our interface is clumsy in comparison to PGDP. It might just be that I haven't explained something, or we have come at the problem from a different angle. Also if you consider the two-step process to be inadequate, I'd like to hear more. We can add as many steps as are necessary. The workflow diagram is currently very simple, but we have found it effective so far. A "pre-processed" stage could be a good addition, but it doesn't really say a lot, as further pre-processing might be possible. Automations can mark a page as "problematic" if they can detect any errors that can't be fixed. We do publish immediately to web, we prefer to present works as a single page if possible so print is easier, and we haven't tackled other formats, yet. At present we are focusing on the transcription interface. On English Wikisource, we haven't "released" very many works, so it is a bit too early to do a comparison on that front.
Here is our list of "featured texts", which are increasing in quality over time, a process which has caused us to not proceed with the monthly cycle at times due to priorities. http://en.wikisource.org/wiki/Wikisource:Featured_texts -- John From ralf at ark.in-berlin.de Fri Jul 25 08:45:05 2008 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Fri, 25 Jul 2008 17:45:05 +0200 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: <20080725084606.GB1589@ark.in-berlin.de> Message-ID: <20080725154505.GA2340@ark.in-berlin.de> You wrote > It isn't cynicism to be pragmatic. I am talking about opportunity cost > and diminishing returns. Many early Project Gutenberg etexts are > riddled with errors, have paragraphs missing, don't include images that > are integral to the work, etc, etc. Thank goodness PG put them out > anyway, Thank goodness? It's those texts that frequently draw criticism with reviews on manybooks, for example. Also, since there are so many important works without transcription, why don't you concentrate on volume and just publish the OCR like UMich with their scans accompanied by the OCR? > Our approach in this may change in time, and the German Wikisource > project is run much more like PGDP, so maybe there is hope yet for the > English project. I don't think so when I look at the results. > Please explain where our interface is clumsy in comparison to PGDP. With DP, you have both the scan and the text on screen. The spellcheck needs one click and presents errors as marked and ready for correction; good/bad word lists per language exist. Different fonts are available at once. No risk of correction conflict. Listing of your diffs *per project* possible. Browsing of scans with one click (prev/next). Possibility of choosing difficulty of work (easy/normal/hard projects as well as P123/F12). Support of TEI master producing all other formats (TXT, HTML, PDF) automatically. Possibility of LaTeX-only projects.
> Also if you consider the two-step
> process to be inadequate, I'd like to hear more.

With P123/F12/PP/PPV, DP has a seven-step process. Although later rounds aren't required to do proofing, glaring errors are certainly corrected in later rounds, too.

> We can add as many steps as are necessary.

But you won't.

> We do publish immediately to web

But what people now call 'the usual Gutenberg quality' will be achieved only much later, when eBook producers will have forgotten about the link. At that time, say ten years from now, you'll still be flogging the old horse of the lower quality of the <10k etexts. We'll be at 50k(?) by then.

ralf

From hart at pglaf.org Fri Jul 25 08:46:23 2008
From: hart at pglaf.org (Michael Hart)
Date: Fri, 25 Jul 2008 08:46:23 -0700 (PDT)
Subject: [gutvol-d] Who Needs a Library When You Have an iPhone
Message-ID:

Who Needs a Library When You Have an iPhone? BNET's Rick Broida on how to turn your iPhone into the ultimate e-book reader.
http://ct.bnet.com/clicks?t=70496868-2e256fe64ecacf4b8cba0b6fdb65369a-bf&brand=BNET&s=5

From walter.van.holst at xs4all.nl Fri Jul 25 08:50:49 2008
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Fri, 25 Jul 2008 17:50:49 +0200
Subject: [gutvol-d] Who Needs a Library When You Have an iPhone
In-Reply-To:
References:
Message-ID: <4889F659.2040500@xs4all.nl>

Michael Hart wrote:
> Who Needs a Library When You Have an iPhone? BNET's Rick Broida on how
> to turn your iPhone into the ultimate e-book reader.
> http://ct.bnet.com/clicks?t=70496868-2e256fe64ecacf4b8cba0b6fdb65369a-bf&brand=BNET&s=5

Having both an iRex iLiad and an iPhone, I prefer the iLiad for my e-book purposes. To each his own, I guess.
Regards,

Walter (and yes, I am a bit of a gadget-addict)

From Bowerbird at aol.com Fri Jul 25 10:33:26 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Jul 2008 13:33:26 EDT
Subject: [gutvol-d] Who Needs a Library When You Have an iPhone
Message-ID:

walter said:
> Having both an iRex iLiad and an iPhone,
> I prefer the iLiad for my e-book purposes.
> To each his own, I guess.

do you carry your iphone all the time? and do you carry your iliad all the time? which do you prefer for "e-book purposes" if/when you are only carrying your iphone?

finally, which do you prefer for "phone purposes"? :+)

there's a good reason generalized devices usually win in the marketplace long-term...

don't get me wrong. i'm sure the huge screen on the iliad is _a_sheer_pleasure_ to read from. but unless you carry a purse, it's also a bother... (not to mention what the darn thing _costs_!)

-bowerbird

************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/f0f9b5f4/attachment.htm

From Bowerbird at aol.com Fri Jul 25 10:56:02 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Jul 2008 13:56:02 EDT
Subject: [gutvol-d] getting my wikisource bearings
Message-ID:

john- please ignore the ralf-baiting. the d.p. echo-chamber has convinced itself that its product is quite superior...

***

i said:
> if anything's up in the interim, i'll let you know that too.

it did just so happen that something jumped out right away.

you're rewrapping the text before it goes in front of people -- removing the original linebreaks from the print book -- which makes the text _exceedingly_ more difficult to proof...

if you wanna know the one thing you're doing wrong, this is it.
i would estimate it cuts proofing efficiency in half, or _more_. moreover, it makes it much less pleasurable to do. lose-lose.

for a look at a system that retains the p-book linebreaks, see:
> http://z-m-l.com/go/mabie/mabiep123.html
> http://z-m-l.com/go/myant/myantp123.html
> http://z-m-l.com/go/sgfhb/sgfhbp123.html

any improvements i'd suggest after this will pale in comparison to the increase you will get by bringing back original linebreaks.

-bowerbird

From sly at victoria.tc.ca Fri Jul 25 12:12:44 2008
From: sly at victoria.tc.ca (Andrew Sly)
Date: Fri, 25 Jul 2008 12:12:44 -0700 (PDT)
Subject: [gutvol-d] getting my wikisource bearings
In-Reply-To: <20080725154505.GA2340@ark.in-berlin.de>
References: <20080725084606.GB1589@ark.in-berlin.de> <20080725154505.GA2340@ark.in-berlin.de>
Message-ID:

_Pace_ Ralf and John.

As this is a Project Gutenberg mailing list, it might be good to remember that Greg Newby and Michael Hart have affirmed many times that PG wishes to encourage anyone and everyone interested in the goals of digitizing and preserving texts, regardless of methods used. (Kind of a "the more the merrier" approach.)

PGDP and Wikisource have come from different backgrounds, and I'm sure there is room for improvement in both. (There always is.) I'm also sure that participants in each could learn something from exploring the other.

Right away, a difference that I see is that Wikisource appears to deal more with shorter texts, i.e., single poems, historically important letters, patents, codes of law, etc.
Andrew

From Bowerbird at aol.com Fri Jul 25 15:41:03 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Jul 2008 18:41:03 EDT
Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 024
Message-ID:

24. search for all lines with a comma followed by non-whitespace, except the following cases, which are accepted as allowed:
> comma-doublequote-whitespace
> comma-singlequote-whitespace
> comma-emdash
> comma-doublequote-emdash
> comma-singlequote-emdash

this routine returned these 7 cases:
> Gordon's lips formed a silent exclamation.,.
> They glanced,-each at the other, swiftly; it
> girl until--until Buckley.,. until to-night, now.
> darkly, Gordon, stood still, Meta Beggs fe.ll be,-
> It enraged him that she was so collected; her body,*
> to its goal,., Gordon saw now that Mrs. Caley
> your wife. Miss Beggs oughtn't.,. she isn't anything

4 of them (#1, #3, #6, and #7) were misrecognized ellipses, #2 and #4 were specks that were misrecognized as a dash, and #5 had a speck that was misrecognized as an asterisk...

7 more lines corrected, for a grand total of 208, on 24 routines...

i'll be back tomorrow with the next tip in this series...

-bowerbird

From Bowerbird at aol.com Fri Jul 25 16:03:48 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Jul 2008 19:03:48 EDT
Subject: [gutvol-d] woman in her own right -- 003
Message-ID:

continuing on with the "woman in her own right" book...

today we observe that roger frank has learned from us. didn't bother to say "thanks", but we don't really need it.
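Tip 24 above (the comma check, with its five allowed exceptions) reduces to a single regular expression. A minimal Python sketch, assuming "--" stands in for the em-dash in the raw OCR; the pattern is a reconstruction from the post's description, not the actual routine:

```python
import re

# Flag a comma followed directly by non-whitespace, except the allowed cases:
#   ,"<space>   ,'<space>   ,--   ,"--   ,'--
BAD_COMMA = re.compile(
    r',(?=\S)'        # a comma glued to the next character
    r'(?!["\']\s)'    # ...unless it is comma, quote, whitespace
    r'(?!["\']?--)'   # ...or comma, optional quote, em-dash
)

def suspicious_lines(lines):
    """Return the lines that trip the comma check."""
    return [line for line in lines if BAD_COMMA.search(line)]
```

Run over a whole book's worth of lines, this surfaces exactly the kind of hits quoted above (misrecognized ellipses like ".,.", specks read as dashes or asterisks) while letting ordinary dialogue punctuation pass.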
appended we show 44 lines that roger's program flagged, and when we look at _why_ they were flagged, we see that roger has incorporated some of our checks into his app...

for instance, he is now flagging lines that start with a space.

he's also now flagging lines that start with a period, and also flagged are lines that start with a spacey quotemark...

and finally, he's now flagging "paragraphs" that start with a lowercase letter, which are typically broken paragraphs.

none of these types of glitches were flagged in "mountain blood", so that means roger has added the checks in recently, obviously as a result of my series. so i'm glad that he was paying attention.

evidently, he was able to pick up on the hints i offered via _posts_, even though he went on to "accuse" me of _not_sharing_. funny...

***

it's still the wrong approach, though, to _flag_ stuff for proofers. these easily-located errors should be fixed by a preprocessor, on a whole-book-wide basis, as a regular part of the workflow, using a dedicated tool that will help streamline the procedure...

inserting characters _in_ the text -- in the form of tilde flags -- which must then be _removed_ by the proofers is kind of stupid. both the insertion and the removal take _energy_ to accomplish, so it's just a make-work scenario. we want to _minimize_ work...

it's also the case that the "flag-removal" process _causes_errors_.

i forget which book i was looking at -- perhaps "the crevice" -- and the "incorrect corrections" on tilde-flagged spacey-quotes was astronomical. that is, instead of making it a close-quote (which is what it actually was) by attaching the quote to the word before it, the proofers made it an open-quote on the word after. (or vice versa, where they made an open-quote a close-quote.)

now, proofers almost _never_ get spacey-quotes wrong like that, not when the spacey-quotes aren't flagged. but these tilde-flags having to be removed made the proofers goof up the job big-time.
i've never seen more "incorrect corrections" done on spacey-quotes. indeed, i've never seen so many "incorrect corrections" on anything!

i'll be able to present you with some hard numbers on this later, when i cover whatever book it was that manifested this problem; for now, just store a little nugget about a problem with flagging.

appended are the 44 lines which _could_ have been preprocessed, bringing our grand total of found-but-not-fixed errors to _132_...

instead now the proofers will have to make those corrections...

-bowerbird

*** lines that start with a space
~ The remaining member of the party was Montecute
~ Many Prominent Persons Among the Creditors.

*** lines that start with a period
~.to his lips--and, then, without a word, swung
~.go by water to Baltimore (which was available on
~.PIRATE'S GOLD 151
~. . . Thank you! Now, you may arise and shake
~.said Croyden.
~.Kneeling, he quickly dug with a small trowel a hole

*** lines that start with a quotemark followed by a space
~" she asked presently. "He appeared perfectly
~" asked Croyden.
~" picking up a pearl stud from under the
~" said Macloud.
~" she asked.
~' until the women are safely returned. They

*** "paragraphs" starting with a lowercase letter
~s. w. c.
~there were a dozen white men, with slouch hats
~mean it isn't there? ~" he exclaimed.
~the noise of the team.
~whom have we here? ~" as a buggy emerged from
~has drawn her robes about her----"
~a very sweet girl, needs no proof--unless----"
~looking at her with a meaning smile.
~enigmatically. "I want you----" She put one
~slender foot on the fender, and gazed at it, meditatively,
~shall we go this very evening?"
~be correct. So, why? Why?----" She held up
~her hand. "Don't answer! I'm not asking for
~head----"
~years, and I have never before known him to exhibit
~now----" He walked across to the window. He
~would let that sink in.--" How's the Symphony in
~ended.
~her?"
~an expressive gesture, he resumed the ascent.
~is Elaine's," said he. "I recognize the monogram
~than ten thousand cents. I am only----" She
~stopped, staring.
~think----"
~tell you the entire story.............Is
~there anything I have missed? ~" he ended.
~at him the while.
~to my affection for Elaine, it's vanished, now.----
~his coat......" Oh! I forgot to say, I
~wired the Pinkerton man to recover the package

From prosfilaes at gmail.com Fri Jul 25 16:56:50 2008
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 25 Jul 2008 19:56:50 -0400
Subject: [gutvol-d] getting my wikisource bearings
In-Reply-To:
References:
Message-ID: <6d99d1fd0807251656y3ae300f8r73a5df0adb9f05de@mail.gmail.com>

On Thu, Jul 24, 2008 at 11:31 PM, John Vandenberg wrote:
> It is readers
> who find the most difficult transcription errors, simply because there
> are more of them, and they are the ones that are shocked. A
> proofreader is looking for problems, and can easily miss them for
> looking at them. A reader isn't expecting problems, and so is rudely
> awakened from their enjoyable reading when they see an error.

You state that as fact; any statistics? I think there's a lot of errors that a proofreader will almost invariably catch that readers won't. For example:

> There was mild laughter, but Foster went into paroxysms. He slapped his
> thighs and shook his head. He always laughed heartily at humor at his own
> expense. That established the point that he could "take it," and give him
> license to "dish it out."

What's the error in that paragraph? I don't think any reader would catch it. Any proofreader, with even the vaguest attention to the scan, would see that missing sentence in a heartbeat.
Readers miss subtle substitutions (did you catch the fact I changed jokes to humor in the above text?) and tend to hypercorrect things to match their own perception of right; how many changes at Wikisource are to "correct" spellings like humour and colour? Over the long run, there are sets of errors that an unlimited supply of readers will eventually catch; but I think that there are many errors that will never be caught except by comparing with scans or another etext, and that frequently, for minor texts, the number of readers a text accumulates will fail to do as good a job as the concerted series of proofreaders DP does. (In some cases, I think DP has more proofreaders than will ever read the book as an etext; at least it means that all of the books DP puts into PG pass basic quality standards.)

From jayvdb at gmail.com Fri Jul 25 20:20:34 2008
From: jayvdb at gmail.com (John Vandenberg)
Date: Sat, 26 Jul 2008 13:20:34 +1000
Subject: [gutvol-d] getting my wikisource bearings
In-Reply-To:
References:
Message-ID:

On Sat, Jul 26, 2008 at 3:56 AM, wrote:
> john-
>
> please ignore the ralf-baiting. the d.p. echo-chamber
> has convinced itself that its product is quite superior...
>
> ***
>
> i said:
>> if anything's up in the interim, i'll let you know that too.
>
> it did just so happen that something jumped out right away.
>
> you're rewrapping the text before it goes in front of people
> -- removing the original linebreaks from the print book --
> which makes the text _exceedingly_ more difficult to proof...
>
> if you wanna know the one thing you're doing wrong, this is it.

Interesting point. We do re-wrap the "view" display. Line breaks in the actual data are being dropped in the output, so that a paragraph is ready for reading. This is an example of how our imperative to "publish" (no wiki page is unpublished; it's live immediately, for good or ill) has meant that we have prioritised the final published view over the proofreading interface.
I think we can do better, with a minor tweak to the display engine. I've made the proposal here:

http://en.wikisource.org/wiki/WS:S#Adding_purge_link_to_Index_pages

In the actual data, the line breaks are not being intentionally dropped.

When we populate a transcription project from online transcriptions, the original line breaks are usually long gone. If the proofreading problem discussed above is fixed, then it would make sense that we would restore the line breaks before we finish a project.

When we load OCR into the system, the line breaks are retained in the editing window. If the browser window is too small, the web browser wraps the long lines:

http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit

> i would estimate it cuts proofing efficiency in half, or _more_.
> moreover, it makes it much less pleasurable to do. lose-lose.
>
> for a look at a system that retains the p-book linebreaks, see:
>> http://z-m-l.com/go/mabie/mabiep123.html
>> http://z-m-l.com/go/myant/myantp123.html
>> http://z-m-l.com/go/sgfhb/sgfhbp123.html

These look good, and the zml format looks easy to import into Wikisource, except that the images are separate rather than bundled into a djvu. I'd like to write a zml importer. Is there any specific work you would be interested in seeing on Wikisource? Perhaps one you have already pre-processed, but is not yet proofread?

Thanks for taking a look.

-- John

From Bowerbird at aol.com Fri Jul 25 21:21:42 2008
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 26 Jul 2008 00:21:42 EDT
Subject: [gutvol-d] getting my wikisource bearings
Message-ID:

john said:
> In the actual data, the line breaks are not being intentionally dropped.

i take it that you mean they are intentionally retained?... :+)

is there a way that a person can get to this "actual data"?

> When we populate a transcription project from online transcriptions,
> the original line breaks are usually long gone.
usually. but sometimes if you "view source" on the .html, you will find that they are still there, in the .html source...

> If the proofreading problem discussed above is fixed, then
> it would make sense that we would restore the line breaks
> before we finish a project.

i've tried that -- it's a lot of work... so much work that it's faster and simpler to re-do the o.c.r. and apply the corrections to that. that's why it's so darn troublesome when a digitizer rewraps text.

(although now that i've gone and said that, i suppose that i could write a tool that would simplify that specific task. i'd have to try.)

> When we load OCR into the system, the line breaks are
> retained in the editing window. If the browser window
> is too small, the web browser wraps the long lines:

but, if i've understood you correctly, doing proofreading in the editing window has problems too, including obtrusive markup.

***

we should tie this into some ramifications, too, to be complete.

i believe that it's important to tie an e-text to its "ur" paper-book. but it's not enough to simply _do_ that, you have to also make it _obvious_ and _easily_verifiable_ -- by anyone who cares to look.

retaining the linebreaks is the one thing you can do to make it easy. it's just too difficult to verify that the text is the same as the scan when the linebreaks in the text have been removed.

so, given two sources of an e-text, the future will gravitate toward the one that kept the original linebreaks, as more-easily verified...

there are other ramifications as well -- verisimilitude in printing, for instance -- but easy verifiability is _the_ most important one. (even more important than easier proofreading, in the long run, but since they both point to keeping the linebreaks, no conflict.)

***

> If the browser window is too small, the web browser wraps the long lines:

i've got a cinema-screen, so i can make the browser window "big". but if i couldn't, i'd probably start resenting that column on the left.
i'd want to get rid of as much chrome as possible. just text and scan.

> http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit

uueeww. i'm quite sure i didn't want to look at _that_ page of o.c.r. i'll just pretend i didn't see it for now... ;+)

> I'd like to write a zml importer.

if you want to. but i can probably write it more easily than you can. i've already written zml-to-html and zml-to-pdf conversion tools... but what i'd _most_ like to do is streamline wiki-markup to be more zen.

> Is there any specific work you would be interested
> in seeing on Wikisource? Perhaps one you have
> already pre-processed, but is not yet proofread?

none right now, no. but let me think a little bit on that.

-bowerbird

From dakretz at gmail.com Fri Jul 25 22:03:59 2008
From: dakretz at gmail.com (don kretz)
Date: Fri, 25 Jul 2008 22:03:59 -0700
Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34
In-Reply-To:
References:
Message-ID: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com>

John,

Let me attire myself with my technical naivete cap, and ask a question with a probably obvious answer. What's with this djvu? From googling around, it appears to be a pdf alternative. But it seems to be strongly preferred in wikiland. Why? And if it's equivalent, why isn't it supported interchangeably with pdf files, which we all know how to build and deconstruct?
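For background on the question: DjVu is an IFF-style container designed for scanned pages (compressed image layers plus an optional embedded OCR text layer), while PDF is a general page-description format, so the two are related but not interchangeable at the byte level. They are trivially distinguishable by their leading magic bytes, as this small sketch shows (the helper name is mine, not from any library):

```python
def sniff_doc_format(header: bytes) -> str:
    """Guess a scanned-book container from its leading magic bytes.

    DjVu files are IFF85 containers that begin with b"AT&TFORM";
    PDF files begin with b"%PDF-".
    """
    if header.startswith(b"AT&TFORM"):
        return "djvu"
    if header.startswith(b"%PDF-"):
        return "pdf"
    return "unknown"
```

In practice the formats are bridged by tools rather than supported interchangeably: DjVuLibre's `ddjvu` can render DjVu pages out to other formats and `djvutxt` extracts the text layer, while the separate `pdf2djvu` utility converts in the other direction. What DjVu buys for page scans is typically much smaller files plus that bundled text layer, which is presumably why wiki transcription projects favor it.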
On Fri, Jul 25, 2008 at 9:21 PM, wrote: > Send gutvol-d mailing list submissions to > gutvol-d at lists.pglaf.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.pglaf.org/listinfo.cgi/gutvol-d > or, via email, send a message with subject or body 'help' to > gutvol-d-request at lists.pglaf.org > > You can reach the person managing the list at > gutvol-d-owner at lists.pglaf.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gutvol-d digest..." > > Today's Topics: > > 1. Re: getting my wikisource bearings (Andrew Sly) > 2. how to clean up ("preprocess") the o.c.r. for a book -- 024 > (Bowerbird at aol.com) > 3. woman in her own right -- 003 (Bowerbird at aol.com) > 4. Re: getting my wikisource bearings (David Starner) > 5. Re: getting my wikisource bearings (John Vandenberg) > 6. Re: getting my wikisource bearings (Bowerbird at aol.com) > > > ---------- Forwarded message ---------- > From: Andrew Sly > To: Project Gutenberg Volunteer Discussion > Date: Fri, 25 Jul 2008 12:12:44 -0700 (PDT) > Subject: Re: [gutvol-d] getting my wikisource bearings > > _Pace_ Ralph and John. > > > As this is a project gutenberg mailing list, it might be > good to remember that Greb Newby and Michael Hart have > affirmed many times that PG wishes to encourage anyone > and everyone interested in the goals of digitizing > and preserving texts, regardless of methods used. > (Kind of "the more the merrier" approach.) > > > PGDP and wikisource have come from different backgrounds, > and I'm sure there is room for improvement in both. > (There always is.) I'm also sure that participants in > each could learn something from exploring the other. > > Right away a difference that I see is that wikisource > appears to deal more with shorter texts, ie, single > poems, historically important letters, patents, > codes of law, etc. 
> > Andrew > > > > ---------- Forwarded message ---------- > From: Bowerbird at aol.com > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > Date: Fri, 25 Jul 2008 18:41:03 EDT > Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- > 024 > 24. search for all lines with a comma followed by non-whitespace, > except the following cases, which are all accepted to be allowed: > > comma-doublequote-whitespace > > comma-singlequote-whitespace > > comma-emdash > > comma-doublequote-emdash > > comma-singlequote-emdash > > this routine returned these 7 cases: > > Gordon's lips formed a silent exclamation.,. > > They glanced,-each at the other, swiftly; it > > girl until--until Buckley.,. until to-night, now. > > darkly, Gordon, stood still, Meta Beggs fe.ll be,- > > It enraged him that she was so collected; her body,* > > to its goal,., Gordon saw now that Mrs. Caley > > your wife. Miss Beggs oughtn't.,. she isn't anything > > 4 of them (#1, #3, #6, and #7) were misrecognized ellipses, > #2 and #4 were specks that were misrecognized as a dash, > and #5 had a speck that was misrecognized as an asterisk... > > 7 more lines corrected, for a grand total of 208, on 24 routines... > > i'll be back tomorrow with the next tip in this series... > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse Fantasy > Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > > ---------- Forwarded message ---------- > From: Bowerbird at aol.com > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > Date: Fri, 25 Jul 2008 19:03:48 EDT > Subject: [gutvol-d] woman in her own right -- 003 > continuing on with the "woman in her own right" book... > > today we observe that roger frank has learned from us. > didn't bother to say "thanks", but we don't really need it. 
> > appended we show 44 lines that roger's program flagged, > and when we look at _why_ they were flagged, we see that > roger has incorporated some of our checks into his app... > > for instance, he is now flagging lines that start with a space. > > he's also now flagging lines that start with a period, and > also flagged are lines that start with a spacey quotemark... > > and finally, he's now flagging "paragraphs" that start with > a lowercase letter, which are typically broken paragraphs. > > none of these types of glitches were flagged in "mountain blood", > so that means roger has added the checks in recently, obviously > as a result of my series. so i'm glad that he was paying attention. > > evidently, he was able to pick up on the hints i offered via _posts_, > even though he went on to "accuse" me of _not_sharing_. funny... > > *** > > it's still the wrong approach, though, to _flag_ stuff for proofers. > these easily-located errors should be fixed by a preprocessor, > on a whole-book-wide basis, as a regular part of the workflow, > using a dedicated tool that will help streamline the procedure... > > inserting characters _in_ the text -- in the form of tilda flags -- > which must then be _removed_ by the proofers is kind of stupid. > both the insertion and the removal take _energy_ to accomplish, > so it's just a make-work scenario. we want to _minimize_ work... > > it's also the case that the "flag-removal" process _causes_errors_. > > i forget which book i was looking at -- perhaps "the crevice" -- > and the "incorrect corrections" on tilda-flagged spacey-quotes > was astronomical. that is, instead of making it a close-quote > (which is what it actually was) by attaching the quote to the word > before it, the proofers made it an open-quote on the word after. > (or vice versa, where they made an open-quote a close-quote.) > > now, proofers almost _never_ get spacey-quotes wrong like that, > not when the spacey-quotes aren't flagged. 
but these tilda-flags > having to be removed made the proofers goof up the job big-time. > > i've never seen more "incorrect corrections" done on spacey-quotes. > indeed, i've never seen so many "incorrect corrections" on anything! > > i'll be able to present you with some hard numbers on this later, > when i cover whatever book it was that manifested this problem; > for now, just store a little nugget about a problem with flagging. > > appended are the 44 lines which _could_ have been preprocessed, > bringing our grand total of found-but-not-fixed errors to _132_... > > instead now the proofers will have to make those corrections... > > -bowerbird > > > *** lines that start with a space > ~ The remaining member of the party was Montecute > ~ Many Prominent Persons Among the Creditors. > > > *** lines that start with a period > ~.to his lips--and, then, without a word, swung > ~.go by water to Baltimore (which was available on > ~.PIRATE'S GOLD 151 > ~. . . Thank you! Now, you may arise and shake > ~.said Croyden. > ~.Kneeling, he quickly dug with a small trowel a hole > > > *** lines that start with a quotemark followed by a space > ~" she asked presently. "He appeared perfectly > ~" asked Croyden. > ~" picking up a pearl stud from under the > ~" said Macloud. > ~" she asked. > ~' until the women are safely returned. They > > > *** "paragraphs" starting with a lowercase letter > ~s. w. c. > ~there were a dozen white men, with slouch hats > ~mean it isn't there? ~" he exclaimed. > ~the noise of the team. > ~whom have we here? ~" as a buggy emerged from > ~has drawn her robes about her----" > ~a very sweet girl, needs no proof--unless----" > ~looking at her with a meaning smile. > ~enigmatically. "I want you----" She put one > ~slender foot on the fender, and gazed at it, meditatively, > ~shall we go this very evening?" > ~be correct. So, why? Why?----" She held up > ~her hand. "Don't answer! 
I'm not asking for > ~head----" > ~years, and I have never before known him to exhibit > ~now----" He walked across to the window. He > ~would let that sink in.--" How's the Symphony in > ~ended. > ~her?" > ~an expressive gesture, he resumed the ascent. > ~is Elaine's," said he. "I recognize the monogram > ~than ten thousand cents. I am only----" She > ~stopped, staring. > ~think----" > ~tell you the entire story.............Is > ~there anything I have missed? ~" he ended. > ~at him the while. > ~to my affection for Elaine, it's vanished, now.---- > ~his coat......" Oh! I forgot to say, I > ~wired the Pinkerton man to recover the package > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse Fantasy > Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > > ---------- Forwarded message ---------- > From: "David Starner" > To: "Project Gutenberg Volunteer Discussion" > Date: Fri, 25 Jul 2008 19:56:50 -0400 > Subject: Re: [gutvol-d] getting my wikisource bearings > On Thu, Jul 24, 2008 at 11:31 PM, John Vandenberg > wrote: > > It is readers > > who find the most difficult transcription errors, simply because there > > are more of them, and they are the ones that are shocked. A > > proofreader is looking for problems, and can easily miss them for > > lookn at them. A reader isn't expecting problems, and so is rudely > > awakened from their enjoyable reading when they see an error. > > You state that as fact; any statistics? I think there's a lot of > errors that a proofreader will almost invariably catch that readers > won't. For example: > > > There was mild laughter, but Foster went into paroxysms. He slapped his > thighs and shook his > > head. He always laughed heartily at humor at his own expense. That > established the point that he > > could "take it," and give him license to "dish it out." > > What's the error in that paragraph? I don't think any reader would > catch it. 
Any proofreader, with even the vaguest attention to the > scan, would see that missing sentence in a heartbeat. Readers miss > subtle substitutions (did you catch the fact I changed jokes to humor > in the above text?) and tend to hypercorrect things to match their own > perception of right; how many changes at Wikisource are to correct > misspellings like humour and colour? Over the long run, there are sets > of errors that an unlimited supply of readers will eventually catch; > but I think that there are many errors that will never be caught > except by comparing with scans or another etext, and that frequently > for minor texts the number of readers a text accumulates will fail to > do as good a job as the concerted series of proofreaders DP does. (In > some cases, I think DP has more proofreaders than will ever read the > book as an etext; at least it means that all of the books DP puts into > PG pass basic quality standards.) > > > > ---------- Forwarded message ---------- > From: "John Vandenberg" > To: "Project Gutenberg Volunteer Discussion" > Date: Sat, 26 Jul 2008 13:20:34 +1000 > Subject: Re: [gutvol-d] getting my wikisource bearings > On Sat, Jul 26, 2008 at 3:56 AM, wrote: > > john- > > > > please ignore the ralf-baiting. the d.p. echo-chamber > > has convinced itself that its product is quite superior... > > > > *** > > > > i said: > >> if anything's up in the interim, i'll let you know that too. > > > > it did just so happen that something jumped out right away. > > > > you're rewrapping the text before it goes in front of people > > -- removing the original linebreaks from the print book -- > > which makes the text _exceedingly_ more difficult to proof... > > > > if you wanna know the one thing you're doing wrong, this is it. > > Interesting point. We do re-wrap the "view" display. Line breaks in > the actual data are being dropped in the output, so that a paragraph > is reading for reading. 
This is an example of how our imperative to > "publish" (no wiki page is unpublished; it's live immediately, for > good or ill) has meant that we have prioritised the final published > view over the proofreading interface. I think we can do better, with > a minor tweak to the display engine. I've made the proposal here: > > http://en.wikisource.org/wiki/WS:S#Adding_purge_link_to_Index_pages > > In the actual data, the line breaks are not being intentionally dropped. > > When we populate a transcription project from online transcriptions, > the original line breaks are usually long gone. If the proofreading > problem discussed above is fixed, then it would make sense that we > would restore the line breaks before we finish a project. > > When we load OCR into the system, the line breaks are retained in the > editing window. If the browser window is too small, the web browser > wraps the long lines: > > > http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > > > i would estimate it cuts proofing efficiency in half, or _more_. > > moreover, it makes it much less pleasurable to do. lose-lose. > > > > for a look at a system that retains the p-book linebreaks, see: > >> http://z-m-l.com/go/mabie/mabiep123.html > >> http://z-m-l.com/go/myant/myantp123.html > >> http://z-m-l.com/go/sgfhb/sgfhbp123.html > > These look good, and the zml format looks easy to import into > Wikisource, except that the images are separate rather than bundled > into a djvu. I'd like to write a zml importer. Is there any specific > work you would be interested in seeing on Wikisource? Perhaps one you > have already pre-processed, but is not yet proofread? > > Thanks for taking a look.
> > -- > John > > > > ---------- Forwarded message ---------- > From: Bowerbird at aol.com > To: gutvol-d at lists.pglaf.org, Bowerbird at aol.com > Date: Sat, 26 Jul 2008 00:21:42 EDT > Subject: Re: [gutvol-d] getting my wikisource bearings > john said: > > In the actual data, the line breaks are not being intentionally > dropped. > > i take it that you mean they are intentionally retained?... :+) > > is there a way that a person can get to this "actual data"? > > > > When we populate a transcription project from online transcriptions, > > the original line breaks are usually long gone. > > usually. but sometimes if you "view source" on the .html, > you will find that they are still there, in the .html source... > > > > If the proofreading problem discussed above is fixed, then > > it would make sense that we would restore the line breaks > > before we finish a project. > > i've tried that -- it's a lot of work... so much work that it's faster > and simpler to re-do the o.c.r. and apply the corrections to that. > that's why it's so darn troublesome when a digitizer rewraps text. > > (although now that i've gone and said that, i suppose that i could > write a tool that would simplify that specific task. i'd have to try.) > > > > When we load OCR into system, the line breaks are > > retained in the editing window. If the browser window > > is too small, the web browser wraps the long lines: > > but, if i've understood you correctly, doing proofreading in the > editing window has problems too, including obtrusive markup. > > *** > > we should tie this into some ramifications, too, to be complete. > > i believe that it's important to tie an e-text to its "ur" paper-book. > but it's not enough to simply _do_ that, you have to also make it > _obvious_ and _easily_verifiable_ -- by anyone who cares to look. > > retaining the linebreaks is the one thing you can do to make it easy. 
> > it's just too difficult to verify that the text is the same as the scan > when the linebreaks in the text have been removed. > > so, given two sources of an e-text, the future will gravitate toward > the one that kept the original linebreaks, as more-easily verified... > > there are other ramifications as well -- verisimilitude in printing, > for instance -- but easy verifiability is _the_ most important one. > (even more important than easier proofreading, in the long run, > but since they both point to keeping the linebreaks, no conflict.) > > *** > > > If the browser window is too small, the web browser wraps the long > lines: > > i've got a cinema-screen, so i can make the browser window "big". > but if i couldn't, i'd probably start resenting that column on the left. > i'd want to get rid of as much chrome as possible. just text and scan. > > > > > http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > > uueeww. i'm quite sure i didn't want to look at _that_ page of o.c.r. > i'll just pretend i didn't see it for now... ;+) > > > > I'd like to write a zml importer. > > if you want to. but i can probably write it more easily than you can. > i've already written zml-to-html and zml-to-pdf conversion tools... > > but what i'd _most_ like to do is streamline wiki-markup to be more zen. > > > > Is there any specific work you would be interested > > in seeing on Wikisource. Perhaps one you have > > already pre-processed, but is not yet proofread? > > none right now, no. but let me think a little bit on that. > > -bowerbird > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse Fantasy > Football today. 
> (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/e209656e/attachment-0001.htm From dakretz at gmail.com Fri Jul 25 22:12:55 2008 From: dakretz at gmail.com (don kretz) Date: Fri, 25 Jul 2008 22:12:55 -0700 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> References: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> Message-ID: <627d59b80807252212i15416ee5j6cb8fa2859a6adb2@mail.gmail.com> Bird, I've got an impending project for EB that's unusually light on the troublesome stuff - tables, math, chemistry... Let me know when you'd like to try something a little more substantial, and take a run at it after I've worked it over with my regexes. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080725/36a85105/attachment.htm From Bowerbird at aol.com Fri Jul 25 22:24:29 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 01:24:29 EDT Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 Message-ID: dakretz said: > Let me know when you'd like to try something a little more substantial, > and take a run at it after I've worked it over with my regexes. well... i'm _already_ working on "something a little more substantial" -- and am badly behind on it -- but feel free to point me at anything. although i doubt there will be anything left in it after you're finished. and congrats on getting e.b. out of the rounds! after what... 2 years? but change your darn subject line! :+) and trim those digests! 
;+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/09bf739b/attachment.htm From jayvdb at gmail.com Fri Jul 25 23:37:04 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sat, 26 Jul 2008 16:37:04 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Sat, Jul 26, 2008 at 2:21 PM, wrote: > john said: >> In the actual data, the line breaks are not being intentionally dropped. > > i take it that you mean they are intentionally retained?... :+) Not really. They are being retained where there is no good reason to drop them. For example, if a word is hyphenated over a line break, the line break is being dropped. > is there a way that a person can get to this "actual data"? Click edit. The bot frameworks always obtain this raw wiki text, rather than the html which is sent to the browser when a page is viewed. >> When we populate a transcription project from online transcriptions, >> the original line breaks are usually long gone. > > usually. but sometimes if you "view source" on the .html, > you will find that they are still there, in the .html source... > >> If the proofreading problem discussed above is fixed, then >> it would make sense that we would restore the line breaks >> before we finish a project. > > i've tried that -- it's a lot of work... so much work that it's faster > and simpler to re-do the o.c.r. and apply the corrections to that. > that's why it's so darn troublesome when a digitizer rewraps text. > > (although now that i've gone and said that, i suppose that i could > write a tool that would simplify that specific task. i'd have to try.) 
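A tool like the one mused about in the quote above -- re-breaking a corrected, rewrapped text onto the OCR's original line breaks -- could start as a greedy word-count walk. This is a minimal sketch (not an existing tool), and it assumes the corrections neither added nor dropped words; anything beyond that needs real sequence alignment:

```python
def restore_linebreaks(corrected_text, ocr_with_breaks):
    # The OCR still has the p-book line breaks; the corrected text has
    # been rewrapped.  Take one OCR line's worth of words at a time
    # from the corrected text.  Only safe for word-for-word corrections.
    words = corrected_text.split()
    out, i = [], 0
    for line in ocr_with_breaks.splitlines():
        n = len(line.split())
        out.append(" ".join(words[i:i + n]))
        i += n
    if i < len(words):                  # word counts drifted; dump the rest
        out.append(" ".join(words[i:]))
    return "\n".join(out)
```

On clean input this puts each corrected word back on the line where the scan shows it; the moment a correction splits or joins a word, the breaks drift, which is one reason re-doing the OCR is often less work.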
See http://en.wikisource.org/wiki/Index:Nietzsche_the_thinker.djvu This was populated by me copying and pasting the text from another website over a period of 10 hours; it was buried away in a forum somewhere, and had no line breaks. That is what I mean by the line breaks being long gone. I broke it up into pages for proofreading purposes, and we _could_ recommend that it is broken up into lines as a early stage in the process, if that is profitable. I doubt Wikisource would ever demand that people do this, but I guess that depends on the arguments for it. >> When we load OCR into system, the line breaks are >> retained in the editing window. If the browser window >> is too small, the web browser wraps the long lines: > > but, if i've understood you correctly, doing proofreading in the > editing window has problems too, including obtrusive markup. Yup. Hence I raised the problem on the Wikisource discussion board, as the real problem is the viewing interface should be the proofreading interface. > *** > > we should tie this into some ramifications, too, to be complete. > > i believe that it's important to tie an e-text to its "ur" paper-book. > but it's not enough to simply _do_ that, you have to also make it > _obvious_ and _easily_verifiable_ -- by anyone who cares to look. > > retaining the linebreaks is the one thing you can do to make it easy. Agreed. > it's just too difficult to verify that the text is the same as the scan > when the linebreaks in the text have been removed. Agreed. The eye needs visual markers to help it to return to the right spot as it flicks between text and image. > so, given two sources of an e-text, the future will gravitate toward > the one that kept the original linebreaks, as more-easily verified... > > there are other ramifications as well -- verisimilitude in printing, > for instance -- but easy verifiability is _the_ most important one. 
> (even more important than easier proofreading, in the long run, > but since they both point to keeping the linebreaks, no conflict.) > > *** > >> If the browser window is too small, the web browser wraps the long >> lines: > > i've got a cinema-screen, so i can make the browser window "big". > but if i couldn't, i'd probably start resenting that column on the left. > i'd want to get rid of as much chrome as possible. just text and scan. As you have created a user, you can alter your preferences to pick a different skin. Enjoy. >> http://en.wikisource.org/w/index.php?title=Page:United_States_Statutes_at_Large_Volume_1_-_p1-22.djvu/19&action=edit > > uueeww. i'm quite sure i didn't want to look at _that_ page of o.c.r. > i'll just pretend i didn't see it for now... ;+) But I picked it out especially for you! I'm heart broken. >> I'd like to write a zml importer. > > if you want to. but i can probably write it more easily than you can. > i've already written zml-to-html and zml-to-pdf conversion tools... I'm more interested in seeing the result than in writing the importer. > but what i'd _most_ like to do is streamline wiki-markup to be more zen. The wiki markup is the least malleable piece of the Wikisource system. The wiki syntax is in use in hundreds of gigs of data on the Wikimedia servers, so any improvements need to be backwards compatible, and be thoroughly tested. The devs dont like proposals to change the syntax, and we have recently been through a rewrite of the parser to speed it up and make it conform to an EBNF, so I think they will snarl at anyone who suggests they go back and tweak even the most obvious problem with it. 
-- John From hyphen at hyphenologist.co.uk Sat Jul 26 01:05:39 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sat, 26 Jul 2008 09:05:39 +0100 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: <627d59b80807252212i15416ee5j6cb8fa2859a6adb2@mail.gmail.com> References: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> <627d59b80807252212i15416ee5j6cb8fa2859a6adb2@mail.gmail.com> Message-ID: <000001c8eef6$653e5b10$2fbb1130$@co.uk> don kretz wrote >Bird, >I've got an impending project for EB that's unusually light on the >troublesome stuff - tables, math, chemistry... Let me know when >you'd like to try something a little more substantial, and take a run >at it after I've worked it over with my regexes. The typesetting "standards" of pre-1922 books were non-existent. The typesetting even in books by a single publisher varies wildly. In the old days typesetting depended on which of the experienced old men actually did the job, or in my case the job was given to the new apprentice, who made a mess of it :-(. Worse, the OCR errors produced depend massively on how much ink the printer put on the plates, which varies massively throughout a single book. Worse, "my" books contain both prose and poetry dispersed within prose. Not to mention subtitles in crazy fonts which OCR will never get right. Having played with regexes on another job, it is my opinion that any regexes for PG will not work on anything other than the book for which they were written. Dave Fawthrop -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/7cbdcca5/attachment.htm From tb at baechler.net Sat Jul 26 02:26:24 2008 From: tb at baechler.net (Tony Baechler) Date: Sat, 26 Jul 2008 02:26:24 -0700 Subject: [gutvol-d] Tor books giveaway Message-ID: <488AEDC0.50807@baechler.net> All, The dozen books or so that Tor is giving away are now available on one page for download.
This page apparently expires on July 27th, or 27 July. Get them while you can! There is no plain text, but all are in pdf, html, and html zip. Some are in other formats. The wallpapers are also available for download. http://tor.com/index.php?option=com_content&view=blog&id=577 -- ---------- To reply, change slash to dot and remove example from the address. It's left as an exercise to put the rest of the address in the right order. From jayvdb at gmail.com Sat Jul 26 08:51:19 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sun, 27 Jul 2008 01:51:19 +1000 Subject: [gutvol-d] gutvol-d Digest, Vol 48, Issue 34 In-Reply-To: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> References: <627d59b80807252203k2be68603sefab0fa777c6cc97@mail.gmail.com> Message-ID: On Sat, Jul 26, 2008 at 3:03 PM, don kretz wrote: > John, > > Let me attire myself with my technical naivete cap, and ask a question with > a probably obvious answer. > > What's with this djvu? http://en.wikipedia.org/wiki/DjVu Some tips on how to construct them are here: http://en.wikisource.org/wiki/Help:Djvu > From googling around, it appears to be a pdf alternative. But it seems to be > strongly preferred in wikiland. Why? And if it's equivalent, why isn't it > supported interchangeably with pdf files, which we all know how to build and > deconstruct? DjVu is a free file format, and the files are smaller for texts. If you look at books on archive.org, the djvu file is always smaller than the PDF. This may be due to the compression routines being used in their PDFs, as they may not be choosing the more sophisticated routines which are covered by patents. http://www.archive.org/details/harperscampingsc00grinrich PDFs are also complex buggers, and encumbered by patents held by Adobe (royalty-free use granted to software complying with PDF standard). We do have an Extension for the proofreading system that allows PDFs to be used, however it hasn't been installed.
PDFs can be uploaded to Wikisource, and someone will convert them to DjVu. We have a 20MB upload limit, but we are trying to get that lifted. -- John From Bowerbird at aol.com Sat Jul 26 11:29:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 14:29:28 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > They are being retained where there is no good reason to drop them. > For example, if a word is hyphenated over a line break, > the line break is being dropped. ok, well, that's sad. it means no print-out verisimilitude to the p-book. i'm convinced that's gonna be very important down the line, not just for its own sake (which will decrease over time), but for _verification_. already there are so many e-versions of some books floating around that we need a means of sorting 'em out, and this will be a prime one. if you can't be linked to a specific p-book, we'll assume you're bogus. > Click edit. hmm... when i do that for this page: > http://en.wikisource.org/w/index.php?title=Page:Wind_in_the_Willows_%281913%29.djvu/110&action=edit ...the edit-box i get does _not_ have the p-book linebreaks... am i misunderstanding you? > See http://en.wikisource.org/wiki/Index:Nietzsche_the_thinker.djvu > This was populated by me copying and pasting the text > from another website over a period of 10 hours; it was > buried away in a forum somewhere, and had no line breaks. > That is what I mean by the line breaks being long gone. right. i know. would've been faster to re-do the o.c.r. (and tell me next time, i can write you a scraping tool.) > I broke it up into pages for proofreading purposes, > and we _could_ recommend that it is broken up into lines > as a early stage in the process, if that is profitable. again, better to re-do the o.c.r., and use that cleaned text in a comparison-merge that makes corrections to the o.c.r. eventually, this is what you'll do with all of the p.g.
e-texts -- find the p-book on which they were based, re-do the o.c.r., and then use the proofed p.g. e-text to highlight differences -- this method is rads faster than finding them manually -- so you've got text that is accurate with the p-book linebreaks. then you can toss the p.g. e-text, and regenerate .html/.pdf... > we _could_ recommend that it is broken up into lines > as a early stage in the process, if that is profitable. > I doubt Wikisource would ever demand that people do this, > but I guess that depends on the arguments for it. anyone who has proofed both ways will demand the linebreaks. so eventually you will have no proofers willing to do the other... > the real problem is the viewing interface > should be the proofreading interface. well, i don't know if it'd violate some wikisource philosophy, but it would certainly be _possible_ to have both interfaces available, and let people choose which one they wanted to be in at any time. my stance on this has been that, whenever a book is newly posted, it would be in the proofreading interface only, so end-users _know_ that "this book cannot be considered to be _finished_ at this time", and that "your assistance in reporting errors would be highly valued." after a certain amount of time, or a specific number of read-throughs, the status would change such that people could view it in either mode. and, of course, anyone who had a doubt at any time could switch into proofreader mode to view the scan to determine if the text was correct. > you can alter your preferences to pick a different skin. Enjoy. ah yes, i need to get my wiki thinking-cap on, i'd not thought of that. (a kind person also pointed out i can find out what links to a wiki-page by clicking the "what links here" button in the toolbox. oh yeah. d'oh.) > But I picked it out especially for you! I'm heart broken. i never promised you a rose garden... ;+) > I'm more interested in seeing the result than in writing the importer.
lazy programmers are the best kind... > The wiki markup is the least malleable piece of the Wikisource system. oh yeah, i know that. just wishing i could float back in time and tell ward that "hey, there is a much better way to do what you're trying to do here." it started out with the best of intentions, but then it grew into something that is almost as complex as the markup that it was intended to replace. and now of course it has hardened into concrete. so i whine in protest... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/47bb5631/attachment-0001.htm From hyphen at hyphenologist.co.uk Sat Jul 26 11:51:43 2008 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Sat, 26 Jul 2008 19:51:43 +0100 Subject: [gutvol-d] Tor books giveaway In-Reply-To: <488AEDC0.50807@baechler.net> References: <488AEDC0.50807@baechler.net> Message-ID: <000601c8ef50$c3e424b0$4bac6e10$@co.uk> Tony Baechler wrote >All, >The dozen books or so that Tor is giving away are now available on one >page for download. This page apparently expires on July 27th, or 27 >July. Get them while you can! There is no plain text, but all are in >pdf, html, and html zip. Some are in other formats. The wallpapers are >also available for download. >http://tor.com/index.php?option=com_content&view=blog&id=577 Thanks for that, got them. As an SF fan I usually like Tor books Dave Fawthrop From Bowerbird at aol.com Sat Jul 26 13:24:05 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 16:24:05 EDT Subject: [gutvol-d] djvu Message-ID: john said: > http://en.wikipedia.org/wiki/DjVu aside from the fact that mac people are third-class citizens in djvu-land... 
i don't know how to display a specific page from a djvu on a web-page... is there an easy way? i assume that you're doing some back-end tricks to accomplish that? and i assume you can share that? -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/192e2ef2/attachment.htm From Bowerbird at aol.com Sat Jul 26 14:55:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 17:55:22 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 025 Message-ID: 25. search for doublequote-whitespace-doublequote as you can see from the 6 hits below, this turns up instances where dialog paragraphs (which _should_ be separate) were run together. > "They get more like rats every year." > "I thought about you, held against your will." > "I thought about you, held against your will." > "Don't tell lies; I went right out of your mind." > "Don't tell lies; I went right out of your mind." > "Not as quick as I went out of yours. I did > be made. Going--" > "Thirty-one hundred," Gordon pronounced > "Well," Gordon responded, "and if I did?" > "I studied over it at first," the other frankly admitted; > over the sere grass. "Scrabble for them in the dirt." > "You c'n throw them away now the railroad's left all of these were fixed by introducing the intervening blank line. 6 more lines corrected, for a grand total of 214, on 25 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/c0c8d618/attachment.htm From Bowerbird at aol.com Sat Jul 26 15:03:21 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 18:03:21 EDT Subject: [gutvol-d] woman in her own right -- 003 Message-ID: continuing on with the "woman in her own right" book... let's do the search i just suggested in my "cleanup" series, namely doublequote-linebreak-doublequote... sure enough, we've got a hit in this book too: > you wish to go?" > "At once!" so our total number of corrections is now 133. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/2932e096/attachment.htm From Bowerbird at aol.com Sat Jul 26 15:06:31 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 26 Jul 2008 18:06:31 EDT Subject: [gutvol-d] woman in her own right -- 004 Message-ID: oops! this is a re-send of a formerly-improperly-numbered post. *** continuing on with the "woman in her own right" book... let's do the search i just suggested in my "cleanup" series, namely doublequote-linebreak-doublequote... sure enough, we've got a hit in this book too: > you wish to go?" > "At once!" so our total number of corrections is now 133. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080726/de75f69d/attachment.htm From jayvdb at gmail.com Sat Jul 26 19:03:13 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Sun, 27 Jul 2008 12:03:13 +1000 Subject: [gutvol-d] djvu In-Reply-To: References: Message-ID: On Sun, Jul 27, 2008 at 6:24 AM, wrote: > john said: >> http://en.wikipedia.org/wiki/DjVu > > aside from the fact that mac people are third-class citizens in djvu-land... As far as I can see, Mac OS X is well supported by djvulibre-3, and djvulibre-4 is supposed to be portable - I've not tried it yet. > i don't know how to display a specific page from a djvu on a web-page... > > is there an easy way? the djvulibre package contains the tools to pull out an image from the bundle. > i assume that you're doing some back-end tricks to accomplish that? > and i assume you can share that? Our image host has a naming convention which allows these images to be obtained from the djvu, and to be scaled on the fly. e.g. this is the full image of a document I was working on today. http://en.wikisource.org/wiki/Index:GeorgeTCoker.djvu - the transcription "project" page http://en.wikisource.org/wiki/Image:GeorgeTCoker.djvu - the media description page This is the download URL: http://upload.wikimedia.org/wikipedia/commons/2/22/GeorgeTCoker.djvu Once I know it is in the "2/22" folder, I can then pull an image down by asking for page 10 at 250px http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/GeorgeTCoker.djvu/page10-250px-GeorgeTCoker.djvu.jpg or a png at 500px http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/GeorgeTCoker.djvu/page10-500px-GeorgeTCoker.djvu.png I'm not sure how to ask for a maximum resolution image, but I am guessing it is possible. If you are going to be doing offline work with these bundles of images, and need the maximum resolution, it would be better to install the djvu tools on your machines. 
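The naming convention John describes can be captured in a few lines. A sketch only: the function name is mine, and it assumes the "2/22" bucket folders are the first one and two hex digits of the MD5 of the file name, the convention Wikimedia's upload host is generally understood to use.

```python
import hashlib

def commons_page_thumb(filename, page, width, ext="jpg"):
    # Bucket folders ("2/22" in the GeorgeTCoker.djvu example) are
    # assumed here to be the first one/two hex digits of md5(filename).
    h = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return ("http://upload.wikimedia.org/wikipedia/commons/thumb/"
            "%s/%s/%s/page%d-%dpx-%s.%s"
            % (h[0], h[:2], filename, page, width, filename, ext))
```

If the bucket assumption holds, commons_page_thumb("GeorgeTCoker.djvu", 10, 250) reproduces the jpg URL above, and ext="png" gives the png variant.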
-- John From jayvdb at gmail.com Sun Jul 27 07:04:41 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Mon, 28 Jul 2008 00:04:41 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Sun, Jul 27, 2008 at 4:29 AM, wrote: > john said: >> They are being retained where there is no good reason to drop them. >> For example, if a word is hyphenated over a line break, >> the line break is being dropped. > > ok, well, that's sad. it means no print-out verisimilitude to the p-book. > > i'm convinced that's gonna be very important down the line, not just > for its own sake (which will decrease over time), but for _verification_. > > already there are so many e-versions of some books floating around > that we need a means of sorting 'em out, and this will be a prime one. > if you can't be linked to a specific p-book, we'll assume you're bogus. > > >> Click edit. > > hmm... > > when i do that for this page: >> >> http://en.wikisource.org/w/index.php?title=Page:Wind_in_the_Willows_%281913%29.djvu/110&action=edit > > ...the edit-box i get does _not_ have the p-book linebreaks... > > am i misunderstanding you? That page was populated from an online source; there were no line breaks. I was almost going to explain why word hyphenation is a bit pointless at the moment for Wikisource, but ... what the heck, I've updated pagescan 110 to split the lines as you would expect them, using two different methods to deal with hyphenation. In the first two cases of hyphenated words, I am using a brand new template "hw" which displays the first and third parameter as joined in the published version, and displays a "-" and a line break in the proofreading view: {{hw|con|- |tinued}} Template:hw points to this, which contains the logic explained above: http://en.wikisource.org/wiki/Template:Hyphenated_word In the second two cases, I am using wiki syntax "" to declare that the hyphen and new line are not desired in the published edition.
any- thing Neither mechanism has any effect on the published version of the page: http://en.wikisource.org/wiki/The_Wind_in_the_Willows/Chapter_4 >> See http://en.wikisource.org/wiki/Index:Nietzsche_the_thinker.djvu >> This was populated by me copying and pasting the text >> from another website over a period of 10 hours; it was >> buried away in a forum somewhere, and had no line breaks. >> That is what I mean by the line breaks being long gone. > > right. i know. would've been faster to re-do the o.c.r. > > (and tell me next time, i can write you a scraping tool.) Your response here shocked me. I don't think it would be faster going the OCR route, but I'm going to be paying more attention to this. The 10 hours mentioned above was elapsed time, rather than constant hand-breaking work. To give us both a better idea of what I've been doing, I did a little on this volume today: http://en.wikisource.org/wiki/Index:Sacred_Books_of_the_East_42.djvu The first 80 pagescans include an introduction and other front-matter, and there is no online edition, so I uploaded the OCR to be proofread a week or two ago. I then earmarked this as a "repopulate" project. Today, I copied the first few pages of Book 1 from sacred-texts.org, and corrected them. (pagescans 81-). I then googled for the corrected text, to see if there was a better transcription online. I found that ishwar.com had already fixed most of the errors. http://www.ishwar.com/hinduism/holy_atharva_veda/ I cross-checked this a few times, and then decided to go with ishwar.com instead of sacred-texts.org. Then for 2 hours I copied the text from ishwar.com to the wikisource pages, resulting in the whole of Book 1 done - text reunited with images. I marked each page as "Proofread" because I am guessing it has been through two sets of eyes prior to mine. I fixed one or two errors as I went. If there are errors the verification phase will have to pick them up.
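The two hyphenation mechanisms described earlier in this message -- the {{hw}} template and the markup workaround -- both reduce to one rendering rule: show the word joined to readers, keep the hyphen and the break for proofers. A toy sketch of that publishing step (not Wikisource's actual renderer, and note it would also wrongly swallow the hyphen of a genuine compound split across lines, e.g. "well-known"):

```python
import re

def publish_view(proof_text):
    # Join words hyphenated across a line break, then unwrap the
    # remaining soft breaks inside a paragraph; blank lines still
    # separate paragraphs.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", proof_text)  # any-\nthing -> anything
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)        # soft break -> space
    return text
```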
I then documented some post-processing needed to improve the wiki formatting: http://en.wikisource.org/wiki/Index_talk:Sacred_Books_of_the_East_42.djvu Ran a bot over the book 1 pages: python replace.py -file:../sbe42b1.list -regex "\n([IVXL]+), ([0-9]+)\. ([^\n]*)\n" "\n===\1, \2. \3===\n" "\n([0-9]+)\n" "\n\1. \n" "Varuna" "Varu''n''a" "kushtha" "kush''th''a" "vrishas" "v''ri''shas" "gavants" "''g''avants" I also ran one of my standard fixes over the book 1 pages to convert "--" to emdash: python replace.py -file:../sbe42b1.list -fix:doubledash You can see both changes here: http://en.wikisource.org/w/index.php?title=Page:Sacred_Books_of_the_East_42.djvu/91&action=history In short, 48 pages of good quality text in ~3 hours. >> I broke it up into pages for proofreading purposes, >> and we _could_ recommend that it is broken up into lines >> as an early stage in the process, if that is profitable. > > again, better to re-do the o.c.r., and use that cleaned text > in a comparison-merge that makes corrections to the o.c.r. Interesting idea. I've been considering using o.c.r. to re-paginate a proofread text. It sounds like you're suggesting the opposite would be more fruitful. > eventually, this is what you'll do with all of the p.g. e-texts -- > find the p-book on which they were based, re-do the o.c.r., > and then use the proofed p.g. e-text to highlight differences > -- this method is rads faster than finding them manually -- > so you've got text that is accurate with the p-book linebreaks. > then you can toss the p.g. e-text, and regenerate .html/.pdf... I'm not sure how often we will be re-populating works using PG etexts, but there are many other transcribed books floating across the internet without pagescans that we need to seriously consider how to make use of the transcription work already done. >> we _could_ recommend that it is broken up into lines >> as an early stage in the process, if that is profitable.
>> I doubt Wikisource would ever demand that people do this, >> but I guess that depends on the arguments for it. > > anyone who has proofed both ways will demand the linebreaks. > so eventually you will have no proofers willing to do the other... which means there is no need for rules; on a wiki, common practise cautiously follows best practise. :-) >> the real problem is the viewing interface >> should be the proofreading interface. > > well, i don't know if it'd violate some wikisource philosophy, but > it would certainly be _possible_ to have both interfaces available, > and let people choose which one they wanted to be in at any time. > > my stance on this has been that, whenever a book is newly posted, > it would be in the proofreading interface only, so end-users _know_ > that "this book cannot be considered to be _finished_ at this time", > and that "your assistance in reporting errors would be highly valued." > > after a certain amount of time, or a specific number of read-throughs, > the status would change such that people could view it in either mode. > > and, of course, anyone who had a doubt at any time could switch into > proofreader mode to view the scan to determine if the text was correct. We create a "logical" layout on top of the pagescans, so until someone creates that layer, the work can only be read page-by-page. The pagescan and logical views are linked to each other, so the reader can flick between them: links are provided for logical => pagescan ; the opposite direction requires use of "what links here". http://en.wikisource.org/wiki/Special:WhatLinksHere/Page:Wind_in_the_Willows_%281913%29.djvu/110 -- John From Bowerbird at aol.com Sun Jul 27 11:39:10 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 27 Jul 2008 14:39:10 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 026 Message-ID: 26. search for end-line-hyphenates because d.p. 
dehyphenates its texts, let's search for hits on lines that end with a single-dash, which should then be automatically dehyphenated with appropriate routines. 40 hits on pagebreaks, requiring 2 edits (one on the last line of the previous page, and one on the first line of the next page). > ghaven lips forever awry in the pronouncing of rally- // [40] > stage driver he was totally without resources, with- // [43] > picking up a pen. "When you bought," he re- // [44] > of vain, sick regrets, his combativeness, his deep- // [48] > elder, vanished brother bullying him; the brief ro- // [50] > of tension, increased. Gordon's only con- // [62] > upland hay to a point within a few miles of his des- // [69] > The doctor greeted him seriously. He had, Gor- / [71] > for the ... Gordon!" she exclaimed more ener- // [76] > throat. The odor of June roses that filled the cor- // [88] > The little affair with Buckley Simmons had cap- // [92] > harsh, lik_{t}e a discordant bell clashing in the soste- // [97] > the old man, through his daughter, ad- // [98] > passing, profound gloom. Then the cloud van- // [103] > stood sharply defined, and enclosed by a fence, flow- // [115] > grew from her palpable liking for him, and was re- // [118] > edge. She was silent, and clung to him with a re- // [123] > elements, to the bitter mountain winters, the ruth- // [129] > hoof-beats of a trotting horse, and he had the feel- / [141] > Ah--" in spite of himself, Valentine Simmons be- // [154] > was more than usually unpropitious; and, discover- // [173] > where, from under a horse blanket, Tol'able pro- // [188] > darkly, Gordon, stood still, Meta Beggs fe.ll be,- // [195] > the astute storekeeper into such a satisfactory, retail- // [213] > slimly rounded, graceful; her hands, like mag- // [226] > something." He leaned across the bed, and, grasp- // [232] > there? He would like to be with her at a sap-boil- // [240] > there. 
He felt in his pocket the cool, sinuous neck- // [244] > Suddenly it appeared to him in the light of a pos- / [250] > the feminine heart. And they summed up the du- // [257] > cry from within the house was too deep to have pro- // [264] > "I didn't do right," he acknowledged to the trav- / [272] > presented the same order, her white shirt- // [286] > There was a prolonged pause in the bidding, dur- / [289] > before investing such a paramount sum, to com- // [326] > task isolated in the midst of a vast, un- // [333] > beyond; the towering east range bathed in keen sun- // [347] > The horses walked swiftly, almost without guid- / [354] > Stenton stage this phenomenon was highly undesir- // [358] > forward, an uncouth, slipping bulk, under the soar- // [364] 2 of the hits were on broken paragraphs, meaning 3 lines will be edited here: the first line, the incorrect blank line, and the bottom line, which is brought up to the first line. > to the door; it said, "Gone fishing. Back to- // morrow." > ' 'TT'VE got something for you," Gordon said sud- // I denly. *** 86 more lines corrected, for a grand total of 300, on 26 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080727/cc030214/attachment-0001.htm From Bowerbird at aol.com Sun Jul 27 11:47:12 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 27 Jul 2008 14:47:12 EDT Subject: [gutvol-d] djvu Message-ID: john said: > As far as I can see, Mac OS X is well supported by djvulibre-3 yes, in the last year since i last checked, they have brought out some stuff. so one can view a djvu properly inside safari. (but not firefox or camino.) 
and the offline viewer, which had been flawed (e.g., no facing-pages view), has also been improved, significantly, on a number of different dimensions. > If you are going to be doing offline work with these bundles > of images, and need the maximum resolution, it would > be better to install the djvu tools on your machines. i'll experiment, but i think it'll still work better for my purposes to use the images as separate files. archive.org offers both djvu and the "flippy" images, which is what i've been using up to now. for instance, i still don't know how to pull a single page-image from a djvu for display as part of a webpage, which i need to do. nonetheless, it's great to see the djvu improvements for the mac. thanks for all the information. i'm glad djvu is alive and well... :+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080727/5f3fc1a1/attachment.htm From Bowerbird at aol.com Sun Jul 27 11:57:06 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 27 Jul 2008 14:57:06 EDT Subject: [gutvol-d] woman in her own right -- 005 Message-ID: ok, let's check "a woman in her own right" to see if there are any uncorrected end-line hyphenates. oh gee, yes, there are, and quite a few of 'em, too. 128 cases of unfixed end-line hyphenates... > Liabilities of twenty million, assets prob- > "It is good to have you forget yourself occa- > at the Heights, if he had not been Warwick Mat- > of Mattison--let us find something more interest- > possibility of being followed by means of his lug- > usually mean and little, the latter unctuously pre- > "Colonel Duval is dead, however," she added^- > Croyden turned into the walk--the black fol- > "Yass, seh! yass, seh!"
the darky answered, in- > hear torectly. An' heah comes Marster Dick, his- > country to you, sir, when compared to Northumber- > at the White Sulphur, where both spent their sum- > regimental guidons--and here his portrait in uni- > man ~" (with a wave of his hand toward the por- > want you to see the furniture, and the family por- > "Have you had any experience with negro ser- > sir," suddenly recollecting himself, "Miss Carring- > He turned, to find Moses in the doorway, wait- > "Dese things not pu'chased. No, seh! Dey's bor- > Duvals have remarked it, in making their endorse- > "I am very glad to meet you, Captain Carring- > to speak on every subject under the sun, Litera- > ture--Bridge--Teaching--Music. Oh, she is in- > "I'm very sorry; I'll try to remember in fu- > Miss Erskine frowned in disapproval and aston- > "Can I come down to-night? Answer to Belle- > "Hum--I see--the aristocracy of birth, not dol- > "So you like it--Hampton, I mean? ~" said Mac- > and pointing to the portraits. "I've got ances- > cars. And there isn't any boat sailing until day- > than the tonnage of the Port of Baltimore, to- > a little ways, now," he added, with the country- > us. Evidently there was none erected here, in Par- > may make me liable to my grantor for an account- > of Midshipmen contains muckers as well as gentle- > "Maybe you left it in Hampton?" said Mac- > "That's my fear," Macloud admitted. "Some- > renounce the opportunity for a half million dol- > "Yes--it has! "he said, after a moment's hesi- > "And is it true that you are seriously em- > Avenue; ~" for a supply of small arms and ammu- > "Is this Senator Rickrose? ~" the Lieutenant in- > Then they took several drinks, and the aide de- > only occasionally, and Greenberry Point seemed un- > North under the needle, ran his eye North-by- > "Then your supposition is that, since Par- > "Mr. Smith, this is Mr. Croyden!" said Hook- > have them all, so I can decide--I want no after- > could be identified. 
He hoped this was satisfac- > Croyden assured him it was more than satis- > "But we wanted to prove that it couldn't suc- > "To Hampton! "Croyden exclaimed, incredu- > her knees, in the reckless fashion women have now- > "I, naturally, don't ask you to violate any con- > "Not exactly--he is not proclaiming him- > positively pathetic. However, Croyden is not suf- > non-essentials, and does the essentials economic- > speak of your own knowledge, not from his infer- > when contrasted with the brightness of North- > "Yass, seh! her am home, seh, I seed she her- > the world--I repeat it--up to the minute in every- > cobblestones, its drains-in-the-gutters, its how much- > the cost of living, and clog the avenue with auto- > them. And then, when the spectators had de- > "Croyden is my name?" he replied, interro- > entire Point, dragged the waters immediately ad- > "We want an equal divide. We will take Par- > "I was endeavoring to state the matter suc- > quarter of a million and we will forget every- > deceased!" said he, and gave her the let- > isn't the slightest danger of any one being tor- > "I shall believe you, when I see him!" incredu- > "I'd sooner be the present one than all the has- > Presently Croyden came to a large, white en- > "It will put me on ~' easy street,' ~" Croyden ob- > "I don't care to inform them as to my where- > mean that you don't intend to return to North- > "Or of being bound, and gagged, and ill- > Croyden. 
"I could make a fortune writing fic- > She flung him a look that was delightfully allur- > (observing smiles on Croyden and Miss Carring- > want to see him, either to-night or in the morn- > "And if they are proficient, they go--some- > "After a fashion--we went to Dobbs Ferry to- > The second morning after, when Elaine Caven- > dish's maid brought her breakfast, Miss Carring- > very succinct, very informing, and very satis- > "And it's just as delightful to be able to re- > "----to your going along with me--I'm ex- > known him to have even an affair. He is armor- > doesn't please me, I'm going to talk to Miss Car- > her money,' and shows me scant regard in con- > "It seems so!--even Elaine isn't to be consid- > so humble--you're rather proud of your inter- > "What are you responsible for? ~" asked Mac- > "Nothing! Nothing!--not even for my resolu- > me back, again--and so on, and so on--and so- > "You're in a bad way!" laughed Macloud- > Then he continued with the story he was re- > "Very singular," said the Captain. "Half- > house. They discovered nothing which would ex- > "I do not know--if you will come in, I'll in- > "Something like it? ~" he replied, after a mo- > "I don't know! I'm too angry to know any- > "Hasn't Mr. Croyden told you--or Mr. Mac- > "Then maybe I shouldn't--but I will. Par- > and a glove a short distance from Hamp- > "Hence, a proper choice for our temporary resi- > it supplied the deficiency as best pleased the in- > "It's Parmenter again!" said Croyden, sud- > Parmenter's treasure, but they refuse to be con- > "They have been rather persistent," Macloud re- > solely because of us--to force us to dis- > But why? why? Who are Robert Parmenter's Suc- > "Thank you," he said. "I've been sort of un- > possible handicaps due to ignorance or inad- > to the Parmenter jewels, and all that it con- > as much concerned for success as I am," said Croy- > released. 
We are going to pay the amount de- > "Going to pay the two hundred thousand dol- > "You sent for me, Miss Cavendish?" he in- > "That remains to be seen, as I have also in- > "You're thinking of paying it? ~" he asked, in- > "I always carry a few blank checks in my hand- > "Thank God! you're not Elaine! "Croyden re- > Carrington, and her love for you," Croyden com- > half smothered. "My hair, dear,--do be care- > "In remembrance of your release, and of Par- so our total number of undone corrections is now 261. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080727/c9ed4253/attachment.htm From schultzk at uni-trier.de Sun Jul 27 23:13:02 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 28 Jul 2008 08:13:02 +0200 Subject: [gutvol-d] preprocessing definition In-Reply-To: References: Message-ID: <6E3D069A-3FCD-47F6-A7A3-D71BBDBF5F8C@uni-trier.de> Hi BB, I personally do not see the need for any human interaction. It is a matter of specifying what you want done! Also, on the side of computing every process contains three basic steps: 1) preprocessing (preparing data structures, getting data, conversions, etc) 2) the main task 3) post-processing (returning data, clean-up, etc) These three steps are true of preprocessing. Yes, if you define the task during preprocessing to require human interaction, then you do need human interaction, otherwise not. regards Keith. Am 23.07.2008 um 20:09 schrieb Bowerbird at aol.com: > roger said: > > I don't intervene or look at the pages myself. > > my experience is that preprocessing can't do all of what needs to > be done > if the methodology does not involve a human who will look at the > pages...
> > -bowerbird > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/03681eb6/attachment.htm From schultzk at uni-trier.de Sun Jul 27 23:25:33 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 28 Jul 2008 08:25:33 +0200 Subject: [gutvol-d] not a dialog In-Reply-To: <4887AA74.6000601@pobox.com> References: <238672185.93911216840318819.JavaMail.mail@webmail09> <4887AA74.6000601@pobox.com> Message-ID: Hi Roger, I do not quite see the problem. Of course that is probably because I can program in some 15 languages fluently and thereby pick up a new one easily. At least I can understand what is going on. So may I suggest that you look at the PHP code and 1) jot down what the code is doing. 2) do a little restructuring to afford Ruby or Perl 3) write the Ruby or Perl code Voila. You can now integrate. Yes, some programmers have awful style (not meaning necessarily the ones involved, I have not seen the code). Hope this helps Keith. Am 24.07.2008 um 00:02 schrieb Roger Frank: > Joshua Hutchinson wrote: > > | Just wanted to pop in to ask if you (or anyone else) has > | looked into incorporating these checks into the proofing > | interface at DP? > > That would be a big boost to productivity. The difficulty > for me is that I'm comfortable with Ruby and Perl but > uncomfortable with PHP, and I think that's an important > deficiency for anyone wanting to integrate it at DP.
> That's why for me it's a standalone utility, like guiprep, > only written in Ruby--it's just my limitation in being able > to put it inside a wrapper with something stronger than a > textbox widget. If I could find the equivalent of guiguts' > built in editor/presentation manager, only written in Ruby, > I would certainly use it. That would at least make it > interactive in a "proofing round 0" sense. > > So bottom line, for me the answer is that it's only a > "I wish I was smart enough to do that" kind of thing. As > a proofer myself at DP, I agree it would be a big win. > > --Roger Frank > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From schultzk at uni-trier.de Sun Jul 27 23:38:22 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Mon, 28 Jul 2008 08:38:22 +0200 Subject: [gutvol-d] not a dialog In-Reply-To: References: Message-ID: Hi All, Well, I actually do not know. Though I would say a lack of APIs and lack of modularity. I will go as far and say missing documentation. regards Keith. Am 24.07.2008 um 01:03 schrieb Bowerbird at aol.com: > dkretz said: > > You may remember that I implemented a new proofing interface > > a year or two ago, which provided a "preview" mode showing > > real italics, etc. That has since added a quote-matching display, > > and a punctuation reasonability-checker. They may still be on > > the dev server - I haven't checked for a long time. > > juliet said: > > We don't have it yet simply because > > none of our volunteer developers > > has been willing to tackle it. > > if somebody can sort this all out, do please explain it to me, ok? > > thanks. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/d05b4a81/attachment.htm From Bowerbird at aol.com Mon Jul 28 14:20:28 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jul 2008 17:20:28 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > I was almost going to explain why word hyphenation > is a bit pointless at the moment for Wikisource, but ... actually, i'd like to hear that explanation, so i'm well-rounded. > what the heck, I've updated pagescan 110 to split the lines > as you would expect them, using two different methods > to deal with hyphenation. the markup is rather obtrusive, and i think it would probably interfere with proofing. so i'm not sure it's an improvement. perhaps i need to step back and consider the context... > Your response here shocked me. i hope it was a tongue-on-a-9-volt-battery type of shock, not a "clear!"-and-slap-the-pads-against-his-chest shock. :+) > I don't think it would be faster going the OCR route, > but I'm going to be paying more attention to this. it's pretty fast to do o.c.r. you drag the files into abbyy, to make a "batch", and then you turn the program loose. > I cross checked this a few times, and then decided to go with > ishwar.com instead of sacred-texts.org. a merge of the two versions based on comparing them probably would have given you the most accurate text. > Interesting idea. I've been considering > using o.c.r. to re-paginate a proofread text. > It sounds like you're suggesting the opposite > would be more fruitful. well, whether you use the already-proofed text to bring the o.c.r. version up to final-quality, or (vice-versa-like) use the o.c.r. version to bring the already-proofed text to final-stage, the effect is the same either way. you're comparing the two and implementing whatever changes are necessary to finalize.
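(to make the comparison-merge idea concrete, here's a minimal sketch in python, using the standard-library difflib module -- word-level only, the function name is invented for illustration, and it assumes both versions have already been reduced to plain text:)

```python
import difflib

def compare_versions(ocr_text, proofed_text):
    """Word-level comparison of two transcriptions of the same book.

    Returns (ocr_reading, proofed_reading) pairs wherever the two
    versions disagree, so a human can pick the correct reading.
    """
    a, b = ocr_text.split(), proofed_text.split()
    diffs = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != 'equal':  # 'replace', 'delete', or 'insert'
            diffs.append((' '.join(a[i1:i2]), ' '.join(b[j1:j2])))
    return diffs

# example: one word misrecognized by the o.c.r.
print(compare_versions("the darky answered torectly",
                       "the darky answered directly"))
# -> [('torectly', 'directly')]
```

(on a whole book you'd align page-by-page or line-by-line first, but the principle is the same: the machine finds the disagreements, and a human picks the correct reading.)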
> I'm not sure how often we will be re-populating works > using PG etexts, but there are many other transcribed books > floating across the internet without pagescans that we need to > seriously consider how to make use of the transcription work > already done. that's a good cause. doing the comparisons that i've mentioned is -- to my mind -- the best way to leverage the value of that work... so more on that later... > which means there is no need for rules; on a wiki, > common practise cautiously follows best practise. :-) "best practice" is good enough for me, i don't need rules. :+) (of course, it always helps if you spell "best practice" in the _correct_ way, instead of the _british_ way...) ;+) *** ...more thoughts later this week... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/f0c9dfd5/attachment.htm From Bowerbird at aol.com Mon Jul 28 14:32:04 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jul 2008 17:32:04 EDT Subject: [gutvol-d] woman in her own right -- 006 Message-ID: let's check "in her own right" to see if there are any floating question-marks. yep, 8 of 'em... > languid: ~" Been away, somewhere, haven't you ? > if Gaspard, his particular waiter, missed him ? > the Duvals didn't keep an eye on Greenberry Point ? > "You are determined ?--Very well, then, come > "But you're not quite sure ?--oh! modest man!" > moment, will you ?--you're hipped on it!" > "Than your Southern ancestors ?--isn't that > will be: ~' Come over and see us, won't you ?'" so our total number of undone corrections is now 269. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today.
(http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/473b358d/attachment.htm From Bowerbird at aol.com Mon Jul 28 15:28:03 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 28 Jul 2008 18:28:03 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 Message-ID: now we switch gears, ever so slightly, and do some housecleaning... our task is to get the _names_ in the book. first we will _check_ them; later, we will use the names as a _control_ during some further checks. (for instance, names will be an allowable exception when we check for words that are inappropriately capitalized in the middle of a sentence.) 27. get the names, and make corrections as needed... i've written several different routines for pulling names out of a text, but perhaps the simplest one is to pull out all capitalized words and then cull out the ones which are present in my standard dictionary... that routine gives the list i've appended. the top group -- with consecutive caps -- will be checked closely. and indeed 10 of the 13 were incorrect, and were fixed... because words that are in the dictionary are rejected from this list, the last name of "berry" was deleted, but that had no consequence. the first name of "lattice" -- where "lettice" had been misrecognized -- was also deleted from this list, so the 10 instances of that error would have become invisible, which is the most striking problem in this book, from the standpoint of the preprocessing. as for other errors that _would_ have been revealed, they were: > Al (was a misrecognition of "a1"), with 1 occurrence... > Erne (was a misrecognition of "effie"), with 2 occurrences... > Inan (was a misrecognition of "in an"), with 1 occurrence... > Itwas (was a misrecognition of "it was"), with 1 occurrence... 
> Kenny (was a misrecognition of "henny"), with 1 occurrence... > Malummon (misrecognition of "makimmon"), with 1 occurrence... > Mm (was a misrecognition of "him"), with 1 occurrence... > Tompey (was a misrecognition of "pompey"), with 1 occurrence... 18 more lines corrected, for a grand total of 318, on 27 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird > ALFRED'A'KNOPF > CONDON > HERGESHEIMER > JBeggs > MTMOTHER > Nickles'11 > OlfAMEEIOA > PENNYS > RUTHERFORD > T > T7NITED > TOPable > TT'VE > Adelaide > Al > Albermarle > Alexander > Arkansas > Barnwell > Bartamon > Beggs > Berrys > Buckley > Caley > Caleys > Chicago > Christ > Christmas > Clare > Condons > Crandall > Cri > Effie > Elias > Entriken > Erne > Eytalian > Fiesole > French > Goddy > Greenstream > Hagan > Hollidew > Hollidews > Inan > Indian > Itwas > Jackson > Jake > Jesuit > Jesus > June > Kenny > Khufu > Lettice > London > Loyola > MacKimmon > Makimmon > Makimmons > Malummon > Matthew > Memphis > Merlier > Meta > Methodist > Mm > Morley > Nickles > Nile > Ottinger > Otty > Paphian > Paris > Pelliter > Persia > Peterman > Pompey > Presbyterian > Saturday > Sim > Simeon > Simmons > Sprucesap > Stenton > Sunday > Tennessee > Themeny > Thursday > Tol'able > Tompey > Universalist > Vibard > Vibards > Wednesday > Wellbogast > Zebener ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080728/b683f40f/attachment.htm From jayvdb at gmail.com Mon Jul 28 18:02:05 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Tue, 29 Jul 2008 11:02:05 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: On Tue, Jul 29, 2008 at 7:20 AM, wrote: > john said: >> I was almost going to explain why word hyphenation >> is a bit pointless at the moment for Wikisource, but ... > > actually, i'd like to hear that explanation, so i'm well-rounded. At present, the proofreading view doesn't _show_ line breaks that exist in the raw text. e.g. http://en.wikisource.org/wiki/Page:Wind_in_the_Willows_(1913).djvu/110 So, without standard line breaks being visible, there is no incentive to break lines. Without incentives or a feedback loop, breaking words isn't likely to happen. >> what the heck, I've updated pagescan 110 to split the lines >> as you would expect them, using two different methods >> to deal with hyphenation. > > the markup is rather obtrusive, and i think it would probably > interfere with proofing. so i'm not sure it's an improvement. > > perhaps i need to step back and consider the context... The important element to grapple with is that we use the same raw text to proofread and publish. If we break a word across two lines for proofreading purposes, we need to join it back together in the published view. The markup does that. It isn't ideal, but it is a start. We _could_ enhance the parser to understand that a trailing '-' means the word is broken and needs to be merged in the published view. That seems like a very simple and incomplete solution, as compound words can also be broken by a hyphen at the end of the line, and that hyphen needs to be retained. We could use the double hyphen (=) where a compound word is broken across a line, in which case a single hyphen should be placed into the published view.
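As a sketch of what that parser rule would do (illustrative Python only, not the actual MediaWiki parser, and it assumes the trailing '-' / trailing '=' convention just described):

```python
import re

def published_view(raw_page):
    """Collapse proofreading line breaks into the published form.

    Assumed convention (from the discussion above):
      'con-' at end of line -> plain hyphenated word: join, drop the '-'
      'sap=' at end of line -> broken compound word: join, keep a '-'
    """
    text = re.sub(r'=\n', '-', raw_page)  # compound word: keep the hyphen
    text = re.sub(r'-\n', '', text)       # ordinary hyphenation: just join
    return text.replace('\n', ' ')        # other line breaks become spaces

print(published_view("it con-\ntinued past the sap=\nboiling"))
# -> it continued past the sap-boiling
```

(Because both patterns anchor on the newline, hyphens in the middle of a line are left alone; a real implementation would still have to decide what to do about a genuine '=' at eol.)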
There are very few cases where a published work will use a double hyphen at the end of the line, and _not_ mean that the word is a compound word. (The Japanese use of double hyphen is encoded as U+30A0) Another option is to use U+2027 (hyphenation point) at eol to encode that a line merge is required, allowing compound words to be encoded as "-". But then there will be times that a hyphenation point does actually exist at the eol in some works. More thought required; ideas welcome. >> Your response here shocked me. > > i hope it was a tongue-on-a-9-volt-battery type of shock, > not a "clear!"-and-slap-the-pads-against-his-chest shock. :+) Whichever it was, I'm still kickin. >> I don't think it would be faster going the OCR route, >> but I'm going to be paying more attention to this. > > it's pretty fast to do o.c.r. you drag the files into abbyy, > to make a "batch", and then you turn the program loose. OCR is the easy part. In this case, the archive.org DJVU file has an OCR layer, and I did use that for pagescans 1-80 because there was no existing transcription online for those pages. http://en.wikisource.org/wiki/Index:Sacred_Books_of_the_East_42.djvu >> I cross checked this a few times, and then decided to go with >> ishwar.com instead of sacred-texts.org. > > a merge of the two versions based on comparing them probably > would have given you the most accurate text. The ishwar.com etext is the sacred-texts.org etext with improvements. I've no interest in trying to wrangle both into alignment in order to do a comparison on them. If there were two disparate transcriptions both having significant errors, it might be worth it. >> Interesting idea. I've been considering >> using o.c.r. to re-paginate a proofread text. >> It sounds like you're suggesting the opposite >> would be more fruitful. > > well, whether you use the already-proofed text to bring the > o.c.r.
version to bring the already-proofed text to final-stage, > the effect is the same either way. you're comparing the two > and implementing whatever changes are necessary to finalize. Any existing code around to do something like this? > (of course, it always helps if you spell "best practice" > in the _correct_ way, instead of the _british_ way...) ;+) You mean the "British" way surely? They have not been demoted from being the proper thing yet. -- John From schultzk at uni-trier.de Tue Jul 29 01:16:56 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue, 29 Jul 2008 10:16:56 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 In-Reply-To: References: Message-ID: Hi All, BB, I have mentioned this many times. Have any of you tried thinking about parsing? That is taking an analytic approach to the texts. The way I see it you are basically using pattern matching. Quite inefficient. Using parsing you pull in the text at the same time you are catching all those errors. An added feature is you have context information. Tools for doing this kind of work would be flex and bison. Any kind of flagging the text can be incorporated into the parser. On a side note, BB, your routine for names only works for English. But you are not interested in German anyway. How do you handle chapter titles and the like? regards Keith. Am 29.07.2008 um 00:28 schrieb Bowerbird at aol.com: > now we switch gears, ever so slightly, and do some housecleaning... > > our task is to get the _names_ in the book. first we will _check_ > them; > later, we will use the names as a _control_ during some further > checks. > (for instance, names will be an allowable exception when we check for > words that are inappropriately capitalized in the middle of a > sentence.) > > 27. get the names, and make corrections as needed...
> > i've written several different routines for pulling names out of a > text, > but perhaps the simplest one is to pull out all capitalized words and > then cull out the ones which are present in my standard dictionary... > > that routine gives the list i've appended. > > the top group -- with consecutive caps -- will be checked closely. > and indeed 10 of the 13 were incorrect, and were fixed... > > because words that are in the dictionary are rejected from this list, > the last name of "berry" was deleted, but that had no consequence. > > the first name of "lattice" -- where "lettice" had been > misrecognized -- > was also deleted from this list, so the 10 instances of that error > would > have become invisible, which is the most striking problem in this > book, > from the standpoint of the preprocessing. > > as for other errors that _would_ have been revealed, they were: > > > Al (was a misrecognition of "a1"), with 1 occurrence... > > Erne (was a misrecognition of "effie"), with 2 occurrences... > > Inan (was a misrecognition of "in an"), with 1 occurrence... > > Itwas (was a misrecognition of "it was"), with 1 occurrence... > > Kenny (was a misrecognition of "henny"), with 1 occurrence... > > Malummon (misrecognition of "makimmon"), with 1 occurrence... > > Mm (was a misrecognition of "him"), with 1 occurrence... > > Tompey (was a misrecognition of "pompey"), with 1 occurrence... > > 18 more lines corrected, for a grand total of 318, on 27 routines... > > i'll be back tomorrow with the next suggestion in this series... 
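The routine described above (pull out every capitalized word, then cull the ones present in a standard dictionary) can be sketched in a few lines of Python. This is an illustrative reconstruction, not bowerbird's actual tool; the sample text and dictionary are made up:

```python
import re

def candidate_names(text, dictionary):
    """Pull capitalized words, then cull those whose lowercase form
    appears in the dictionary -- what survives is a list of possible
    proper names (plus OCR junk worth inspecting by hand)."""
    caps = set(re.findall(r"\b[A-Z][A-Za-z']*\b", text))
    return sorted(w for w in caps if w.lower() not in dictionary)

# A toy dictionary and a line of sample text: "She" and "Her" are
# culled as ordinary words; "Gordon" survives as a possible name.
dictionary = {"she", "looked", "her", "red"}
text = "She looked up at him tantalizingly, Her red lips. Gordon nodded."
print(candidate_names(text, dictionary))  # → ['Gordon']
```

A real run would use a full wordlist, and would separately flag words with consecutive capitals (as the appended list in this message does).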
> > -bowerbird > > > > ALFRED'A'KNOPF > > CONDON > > HERGESHEIMER > > JBeggs > > MTMOTHER > > Nickles'11 > > OlfAMEEIOA > > PENNYS > > RUTHERFORD > > T > > T7NITED > > TOPable > > TT'VE > > > > Adelaide > > Al > > Albermarle > > Alexander > > Arkansas > > Barnwell > > Bartamon > > Beggs > > Berrys > > Buckley > > Caley > > Caleys > > Chicago > > Christ > > Christmas > > Clare > > Condons > > Crandall > > Cri > > Effie > > Elias > > Entriken > > Erne > > Eytalian > > Fiesole > > French > > Goddy > > Greenstream > > Hagan > > Hollidew > > Hollidews > > Inan > > Indian > > Itwas > > Jackson > > Jake > > Jesuit > > Jesus > > June > > Kenny > > Khufu > > Lettice > > London > > Loyola > > MacKimmon > > Makimmon > > Makimmons > > Malummon > > Matthew > > Memphis > > Merlier > > Meta > > Methodist > > Mm > > Morley > > Nickles > > Nile > > Ottinger > > Otty > > Paphian > > Paris > > Pelliter > > Persia > > Peterman > > Pompey > > Presbyterian > > Saturday > > Sim > > Simeon > > Simmons > > Sprucesap > > Stenton > > Sunday > > Tennessee > > Themeny > > Thursday > > Tol'able > > Tompey > > Universalist > > Vibard > > Vibards > > Wednesday > > Wellbogast > > Zebener > > > > ************** > Get fantasy football with free live scoring. Sign up for FanHouse > Fantasy Football today. > (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080729/bfa20553/attachment-0001.htm From hart at pglaf.org Tue Jul 29 08:50:07 2008 From: hart at pglaf.org (Michael Hart) Date: Tue, 29 Jul 2008 08:50:07 -0700 (PDT) Subject: [gutvol-d] !@!ACTA trade agreement brief for July 29-31 Washington Message-ID: Feedback Wanted!!! 
WIKILEAKS URGENT DOCUMENT RELEASE Tue Jul 29 10:53:25 BST 2008 ACTA trade agreement industry negotiating brief on Border Measures and Civil Enforcement The ACTA negotiations are scheduled for 29 to 31 July 2008 in Washington DC. In 2007 a select handful of the wealthiest countries began a treaty-making process to create a new global standard for copyright, trademark and patent enforcement, which was called, in a piece of brilliant marketing, the "Anti-Counterfeiting Trade Agreement". ACTA is spearheaded by the United States, and includes the European Commission, Japan, and Switzerland -- which have large copyright and patent industries. Other countries invited to participate in ACTA's negotiation process are Canada, Australia, Korea, Mexico and New Zealand. Noticeably absent from ACTA's negotiations are leaders from developing countries who hold national policy priorities that differ from the international copyright and patent industry. This document is the ACTA negotiating brief dated July 29, 2008, provided by the copyright/patent/trademark industry to negotiating countries; pages concerning customs enforcement and civil enforcement. Under customs enforcement, for example, it proposes: * Increased inspection of goods to detect potential shipments * Customs to provide rights holders all relevant information for the purposes of their own private investigations and court action; they are to be given a minimum of 20 working days to commence such actions. * Seized counterfeit goods are to be destroyed or disposed of at the rights holder's pleasure. Removing a trademark will not cut it. * Under civil enforcement, rights holders will have more say on the damages involved as well as more compensation to cover their legal enforcement costs, including "reasonable attorney's fees". * Rights holders to get the right to obtain information regarding an infringer, their identities, means of production or distribution and relevant third parties.
The exact composition of the business "side" is not known, which reflects the lack of transparency afflicting the ACTA process. Whether trade representatives can be forced to reveal the make-up to the press or policy groups remains to be seen. See http://wikileaks.org/wiki/S4 From Bowerbird at aol.com Tue Jul 29 09:06:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 12:06:22 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 Message-ID: keith said: > Have any of you tried thinking about parsing? i never did like that word -- "parsing". whenever i say that what i am doing is "parsing", everything seems to fall apart. but as long as i don't _call_ it that, it all works quite nicely... > The way I see it you are basically using pattern matching. i don't know the distinctions in your terminology. > Using parsing you pull in the text at the same time > you are catching all those errors. my tool _does_ "pull in the text" in order to "catch the errors". i don't know how it would do it otherwise. > An added feature is you have context information. well, i have two reactions to that. one, when using my tool, you have the entire book as "context", with an entire page of text on-screen in front of you at all times, and the rest of the book available to you, and to "find" operations. two, as i have tried to have people infer from all of my examples, most of the time the _line_ is all the context that you really need. indeed, that's one true beauty when using a line-based approach. that's why 90% of what my tool does is based at the level of the line. > Tools for doing this kind of work would be flex and bison. > Any kind of flagging of the text can be incorporated into the parser. i'm not familiar with those, but if you can repurpose them for use in this type of analysis, the so-called open-source proponents here would probably throw flowers your way...
> On a side note, BB, your routine for names only works for English. > But you are not interested in German anyway. well, because german capitalizes all nouns (or something like that), you have more words that go into the hopper in the first place, yes, but since most of those nouns will be in the dictionary, they will be eliminated from the name-list, so that's no reason it wouldn't work. but you're right, i'm not really interested in german. not yet, anyway. some of these routines will have to be customized for each language, but i'll leave that task to people who actually _know_ each language. my motivation here is simply to show people how the task is done... > How do you handle chapter titles and the like? what's to handle? roman numerals are in my dictionary, thus recognized as not-names. any word in a chapter header and not in the dictionary goes on the list of possible names, so it's really no different than any other structure... -bowerbird From Bowerbird at aol.com Tue Jul 29 09:07:36 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 12:07:36 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 028 Message-ID: 28. do a search for comma-space-uppercase, controlling for names. this returns 36 hits, 33 of which are correct. the list is appended... 3 lines were fixed, where a period was misrecognized as a comma: > away, leaving her pale.
Her lips trembled, A palpable, > night, Her eyes made liquid gleams in the wavering > the hand; and, leaning forward, touched it, A

of those hits that were correct, some were _titles_, and some were names that double as valid words and thus were in the dictionary (like "buck" and "rose" and "valentine"), and lastly some were names that -- for one reason or another -- our name routines had missed, so we will add them now to our list of names. 3 more lines corrected, for a grand total of 321, on 28 routines... i'll be back tomorrow with the next suggestion in this series... -bowerbird > *** 3 lines were fixed (period was misrecognized as a comma): > away, leaving her pale. Her lips trembled, A palpable, > night, Her eyes made liquid gleams in the wavering > the hand; and, leaning forward, touched it, A > *** titles > *** dr. > He passed the Presbyterian Church, Dr. Pelliter's > *** friday > that night, and return the following day, Friday. > *** god > the whole, God forsaken place was worth a thousand," > *** general > regarding Gordon. "Here, here, General Jackson." > thank you for a panful of supper. Come on, General, > "Here, General, here," Gordon commanded, and > and playing him out. Come here, General JacK-son." > oath, but, before he could reach the ground, General > Jackson. C'm here, General." > "C'm here, General," Gordon called, suddenly > *** ginral > "C'm on in, doggy," he called; "c'm in, Ginral. > *** miss > rare in Greenstream. "Why, no, Miss Beggs," he > *** mr. > on the poles. Go in, Mr. Makimmon." > *** mrs. > garden patch beyond, Mrs. Caley said. Gordon > come on in the kitchen. No, Mrs. Caley won't > toward the peacefully grazing horse, Mrs. Caley sitting > before, Mrs.
Caley left the room as he entered; and > *** names to be added to the list > *** alec > "And you go right around, Alec," his wife added, > *** augustus > say, Augustus," he demanded in eager, tremulous > *** buck > "Kick him again, Buck," he said; "kick him > "You oughtn't to have done that, Buck," Gordon > *** cannon > got this and that. Then, suddenly, Cannon wanted > -we'll say, Cannon does, with a note in my hand > *** gordon > The doctor greeted him seriously. He had, Gor- > *** mcginty > the throes of a new piece, Mc*Ginty, and Gordon > *** rose > he was intent upon some papers, Rose's husband > ain't, Rose." > *** sampson > "Chalk them up, Sampson," Gordon carelessly > *** tol'able > "Shut up, Tol'able," Buckley Simmons interposed, > where, from under a horse blanket, Tol'able pro- > *** valentine > Ah--" in spite of himself, Valentine Simmons be- > with the pink fox, Valentine Simmons. He thought > "By God, Valentine!" Gordon exclaimed, "I'll
From Bowerbird at aol.com Tue Jul 29 15:48:23 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 29 Jul 2008 18:48:23 EDT Subject: [gutvol-d] getting my wikisource bearings Message-ID: john said: > The important element to grapple with is that > we use the same raw text to proofread and publish. yes, i understand. that's part of what i called the "philosophical" reasons that might stand in the way of accomplishing this matter. either way, you _could_ -- if you wanted to -- give users the ability to _show_ the original linebreaks _or_ to rewrap. my default is to show the linebreaks as they occurred originally, and then to let the user call for a rewrap if they actually want it... > If we break a word across two lines for proofreading purposes, > we need to join it back together in the published view. again, not necessarily. you could make it a user option... > The markup does that. > It isn't ideal, but it is a start. agreed. on all counts. > We _could_ enhance the parser to understand that a trailing '-' means > the word is broken and needs to be merged in the published view. some end-line dashes are meant to be maintained in the joined word, while others are not, so a distinction needs to be made to differentiate. my rule goes something like this: > if a line ends with a dash or a tilde, the rightmost character is removed. that way, to preserve a dash, i simply add a tilde to the end of the line, so the rule will remove the tilde. > That seems like a very simple and incomplete solution, > as compound words can also be broken by a hyphen > at the end of the line, and that hyphen needs to be retained. oops. i should have read ahead, and put that last bit here instead.
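The dash/tilde rule described above (if a line ends with a dash or a tilde, drop the rightmost character before joining) can be written out in a few lines of Python. This is a hypothetical reconstruction of the rule as stated, not the actual tool:

```python
def join_lines(lines):
    """Rejoin wrapped lines: a trailing '-' or '~' is removed before
    the join.  To keep a real hyphen in the joined word, the
    transcriber appends a tilde ('well-~'), so only the tilde is
    stripped and the dash survives."""
    out = ""
    for line in lines:
        if line.endswith(("-", "~")):
            out += line[:-1]       # drop the dash or tilde, join with no space
        else:
            out += line + " "      # ordinary wrap: join with a space
    return out.strip()

print(join_lines(["the pro-", "cession wound on"]))  # → the procession wound on
print(join_lines(["a well-~", "known fact"]))        # → a well-known fact
```

The second call shows the tilde convention: the tilde is removed, leaving the compound-word hyphen in place.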
> We could use the double hyphen (=) where a compound word is > broken across a line, in which case a single hyphen should be > placed into the published view. There are very few cases where > a published work will use a double hyphen at the end of the line, > and _not_ mean that the word is a compound word. not sure i've ever seen such a double-hyphen. i try to live as much of my life as i can in the lower ascii characters... > The ishwar.com etext is the sacred-texts.org etext with improvements. > I've no interest in trying to wrangle both into alignment in order to > do a comparison on them. If there were two disparate transcriptions > both having significant errors, it might be worth it. ok, i didn't understand that specific situation... > Any existing code around to do something like this? not any that's being released. :+) but there are some free-of-monetary-cost apps that could be released. they're too raw for that now, but they certainly could get a little polish... anyone interested in being a beta-tester should backchannel me now... -bowerbird From lee at novomail.net Tue Jul 29 17:18:31 2008 From: lee at novomail.net (Lee Passey) Date: Tue, 29 Jul 2008 18:18:31 -0600 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: Message-ID: <488FB357.1080307@novomail.net> John Vandenberg wrote: [snip] > The important element to grapple with is that we use the same raw > text to proofread and publish. > > If we break a word across two lines for proofreading purposes, we > need to join it back together in the published view. The markup does > that. It isn't ideal, but it is a start.
> > We _could_ enhance the parser to understand that a trailing '-' means > the word is broken and needs to be merged in the published view. > > That seems like a very simple and incomplete solution, as compound > words can also be broken by a hyphen at the end of the line, and that > hyphen needs to be retained. We could use the double hyphen (=) > where a compound word is broken across a line, in which case a single > hyphen should be placed into the published view. There are very few > cases where a published work will use a double hyphen at the end of > the line, and _not_ mean that the word is a compound word. (The > Japanese use of the double hyphen is encoded as U+30A0.) > > Another option is the use of U+2027 (hyphenation point) at eol to encode > that a line merge is required, allowing compound words to be encoded > as "-". But then there will be times that a > hyphenation point does actually exist at the eol in some works. > > More thought required; ideas welcome. In cases like this, my default approach is to try and figure out a solution using CSS. It seems to me that you have two problems here: 1. how to preserve line breaks in such a way that they are required in one context (proofreading) and optional in a different context (smooth reading); and 2. how to distinguish between hyphens used for hyphenation, and hyphens used for compound words. Assuming XHTML markup, one way to preserve line breaks would be to use a CRLF plus the <br /> tag wherever there was a line break in the original. Then, when non-proofing you would include a CSS rule that does not display the <br /> line breaks. The problem with this approach is that in some instances you may want line breaks even when smooth reading. In this case, you would want to create some sort of classification where you can indicate "this line break is optional, but that line break is mandatory." Thus, you could have soft breaks, and hard breaks; an unclassified <br /> element would be presumed to be a soft break, whereas a break classed as hard (say, <br class="hard" />) would be mandatory, and a CSS rule would be in place to enforce the line break. An alternative would be to use an invented element which has no meaning in XHTML. For example, select an element such as <lb /> to indicate a line break in the original text; when proofing you would use a CSS rule that causes a line break whenever this particular element is encountered. When there is no CSS, the HTML spec says that unknown elements should simply be ignored, so having <lb /> sprinkled throughout the text will be inconsequential when viewed in a standard HTML User Agent. When it comes to line-ending hyphenation, you are entering a realm fraught with controversy. While many, even on this list, would disagree with me, let's assume for the moment that "hard" hyphens are used for creating compound words, and "soft" hyphens are used for splitting words across lines for typographical purposes. The majority seems to believe that, in XML, the hard hyphen is indicated by the '-' character, whereas soft hyphens are indicated by the &shy; entity. The minority, however, is very strident, and firmly convinced that /it/ is the majority. Decide for yourself. Whichever position is "right," we need a method to distinguish between hard (mandatory) hyphens, and soft (optional) hyphens, when they exist at the end of a typographical line (I assume that optional hyphens will never exist /inside/ a line). One solution is to simply use the &shy; entity whenever appropriate, and reserve '-' for use as a mandatory hyphen. Another is to combine hyphenation with the invented element we saw above. If we make the element a non-empty element, we could encapsulate the hyphen character inside the element; e.g. '<lb>-</lb>'. Now, when the display attribute for the element is set to "none", not only will there be no line break, the hyphen will disappear as well. Of course, in the source it will be important that there not be any white space surrounding the element, or you may get unanticipated wrapping, and will certainly see a space inside a "word."
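The scheme sketched in this message can be simulated without a browser: treat the invented element as plain text and apply the two CSS behaviors (render the break, or hide the element entirely) as string transforms. The element name lb and the class name hyphen are assumptions, since the list archive stripped the literal markup from the message:

```python
import re

def proofing_view(marked):
    """CSS on: every lb element renders as a line break; a
    class="hyphen" break also shows its soft hyphen."""
    marked = re.sub(r'<lb class="hyphen"\s*/>', "-\n", marked)
    return re.sub(r"<lb\s*/>", "\n", marked)

def published_view(marked):
    """display: none -- the elements (and the soft hyphens they
    carry) vanish, rejoining each hyphenated word."""
    return re.sub(r'<lb(?: class="hyphen")?\s*/>', "", marked)

sample = 'a hope<lb class="hyphen"/>lessly bad driver, quite<lb/> regardless of law'
print(published_view(sample))  # → a hopelessly bad driver, quite regardless of law
```

The proofing view of the same string keeps the original line breaks and shows the soft hyphen, which is exactly what the proofreader needs to compare against the page scan.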
It is also true that in any User Agent which is...CSS-challenged...the hyphen will appear in all cases. On the other hand, if you want to make the hyphen disappear in all cases in these CSS-challenged UA's you could create a special class of line-breaks which only add a hyphen when the line break is of a certain class, and line breaks are enabled, e.g. '<lb class="hyphen" />' (I tend to prefer this latter solution). Thus, you would encode your sample text from The Wind in the Willows as:

"Yes, and that's part of the trouble," continued the Rat. "Toad's rich, we all know; but he's not a millionaire. And he's a hopelessly bad driver, and quite regardless of law and order. Killed or ruined—it's got to be one of the two things, sooner or later. Badger! we're his friends—oughtn't we to do something?"

A hard hyphen would simply exist outside of the element, but it would still be important to not allow white space surrounding the element. This is important when using a CSS-challenged browser. [snip] >>> I don't think it would be faster going the OCR route, but I'm >>> going to be paying more attention to this. Like bowerbird, I have found that starting over with OCR is faster than trying to re-insert lost data into existing files, which is why I think that Project Gutenberg "plain vanilla text" files will have limited utility in the future. PG has lost so much, both in terms of markup and provenance, that it's faster to recreate the texts from scratch than it is to try and fix the old stuff. [snip] >>> Interesting idea. I've been considering using o.c.r. to >>> re-paginate a proofread text. It sounds like you're suggesting >>> the opposite would be more fruitful. >> >> well, whether you use the already-proofed text to bring the o.c.r. >> version up to final-quality, or (vice-versa-like) use the o.c.r. >> version to bring the already-proofed text to final-stage, the >> effect is the same either way. you're comparing the two and >> implementing whatever changes are necessary to finalize. > > Any existing code around to do something like this? Yes. I have created some code to do this, which I would be happy to share with you, but I'm hoping someone else has done it better. I'm currently checking out HTML Match (http://www.htmlmatch.com/) which claims that it is able to "ignore the source code and compare only the text content of the web pages." If you're interested, I'll report back on what I find.
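The compare-and-finalize step discussed in this thread (an already-proofed text against a fresh OCR of the same pages) can be sketched with Python's standard difflib, doing roughly what a word-level diff tool does. The sample strings are illustrative:

```python
import difflib

def word_diff(a, b):
    """Word-level diff of two transcriptions of the same text:
    returns (old, new) pairs wherever the two disagree."""
    aw, bw = a.split(), b.split()
    sm = difflib.SequenceMatcher(None, aw, bw)
    return [(" ".join(aw[i1:i2]), " ".join(bw[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

ocr     = "She looked up at him tantalizingly, Her red lips"
proofed = "She looked up at him tantalizingly, her red lips"
print(word_diff(ocr, proofed))  # → [('Her', 'her')]
```

Each pair is a point where a human (or a further rule) decides which reading to keep; everything the two versions agree on can be accepted without review.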
From traverso at posso.dm.unipi.it Tue Jul 29 21:33:13 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Wed, 30 Jul 2008 06:33:13 +0200 (CEST) Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <488FB357.1080307@novomail.net> (message from Lee Passey on Tue, 29 Jul 2008 18:18:31 -0600) References: <488FB357.1080307@novomail.net> Message-ID: <20080730043313.CAEE91035B@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> John Vandenberg wrote: [snip] >>>> Interesting idea. I've been considering using o.c.r. to >>>> re-paginate a proofread text. It sounds like you're >>>> suggesting the opposite would be more fruitful. >>> well, whether you use the already-proofed text to bring the >>> o.c.r. version up to final-quality, or (vice-versa-like) use >>> the o.c.r. version to bring the already-proofed text to >>> final-stage, the effect is the same either way. you're >>> comparing the two and implementing whatever changes are >>> necessary to finalize. >> Any existing code around to do something like this? Lee> Yes. Lee> I have created some code to do this, which I would be happy Lee> to share with you, but I'm hoping someone else has done it Lee> better. http://www.gnu.org/software/wdiff or emacs ediff in word mode do that excellently. Carlo From schultzk at uni-trier.de Wed Jul 30 02:10:33 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Wed, 30 Jul 2008 11:10:33 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 In-Reply-To: References: Message-ID: Hi All, BB, Am 29.07.2008 um 18:06 schrieb Bowerbird at aol.com: > keith said: > > Have any of you tried thinking about parsing? > > i never did like that word -- "parsing". whenever i say that > what i am doing is "parsing", everything seems to fall apart. > but as long as i don't _call_ it that, it all works quite nicely... > > > > The way I see it you are basically using pattern matching.
> > i don't know the distinctions in your terminology. Let's see if I can get this into a nutshell and not get bashed! ;-)) Parsing does use pattern matching, but not the reverse. Pattern matching basically just finds something and does not analyze the structure of a text. Parsing will analyze the text and find its structure based on a so-called grammar. The grammar contains the rules for well-formedness. In its simplest form, a parser will work only on what is well formed. A good parser will have rules for handling errors in the text and still give the structure. A parser can put the text into a particular structure based on the grammar. So what can the parser do for us? A parser will pull in the text and identify the words, sentences, quotes, chapter headers, footnotes, or whatever entities one defines. A parser can have context modes so that exception handling and the identification of structures is aided. Using pattern matching you have to go through the patterns one after another. A parser will handle everything in one pass if you wish, by using look-ahead and/or look-back. In a way you may say parsing is overkill, but it has advantages because you can incorporate dictionary lookup in the process. If you insist, your entire preprocess is a multi-pass parser based on patterns. A parser is harder to develop because it is far more complex. > > > Using parsing you pull in the text at the same time > > you are catching all those errors. > > my tool _does_ "pull in the text" in order to "catch the errors". > i don't know how it would do it otherwise. > > > > An added feature is you have context information. > > well, i have two reactions to that. > > one, when using my tool, you have the entire book as "context", > with an entire page of text on-screen in front of you at all times, > and the rest of the book available to you, and to "find" operations. To me context information concerns the structure being analyzed. Not so much the co-text.
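To make the distinction concrete: below is a toy single-pass "parser" in Python (standing in for what flex/bison would generate) that labels each line's structure and uses that context to flag a suspect line. The label set and the header rule are invented for the example:

```python
import re

def parse_book(lines):
    """Single pass: label each line (header / blank / body) and use
    the label of the preceding line as context when flagging suspect
    lines -- e.g. a body line opening lowercase right after a chapter
    header, which often signals a misrecognized capital."""
    labelled, context = [], "start"
    for line in lines:
        stripped = line.strip()
        if not stripped:
            kind = "blank"
        elif re.fullmatch(r"(CHAPTER\b.*|[IVXLC]+\.?)", stripped):
            kind = "header"   # "CHAPTER ..." or a bare roman numeral
        else:
            kind = "body"
        flag = kind == "body" and context == "header" and line[:1].islower()
        labelled.append((kind, line, flag))
        if kind != "blank":   # blanks don't change the context
            context = kind
    return labelled

sample = ["CHAPTER I", "tt was a dark night.", "", "The rain fell."]
for kind, line, flag in parse_book(sample):
    print(kind, repr(line), "FLAG" if flag else "")
```

The point is not the tiny grammar, but that the structural label travels along with the text, so error checks can consult it, which is the context information being discussed here.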
> > two, as i have tried to have people infer from all of my examples, > most of the time the _line_ is all the context that you really need. > indeed, that's one true beauty when using a line-based approach. > that's why 90% of what my tool does is based at the level of the line. Agreed, mostly. > > > > > Tools for doing this kind of work would be flex and bison. > > Any kind of flagging of the text can be incorporated into the parser. > > i'm not familiar with those, but if you can repurpose them for use > in this type of analysis, the so-called open-source proponents here > would probably throw flowers your way... I do admit I am starting to get an itch. Have to think about it and see what rules and processing and interfaces they have. > > > > > On a side note, BB, your routine for names only works for English. > > But you are not interested in German anyway. > > well, because german capitalizes all nouns (or something like that), > you have more words that go into the hopper in the first place, yes, > but since most of those nouns will be in the dictionary, they will be > eliminated from the name-list, so that's no reason it wouldn't work. Just a pet tease ;-))) regards Keith. From tb at baechler.net Wed Jul 30 02:53:11 2008 From: tb at baechler.net (Tony Baechler) Date: Wed, 30 Jul 2008 02:53:11 -0700 Subject: [gutvol-d] Logging hits to PG files Message-ID: <48903A07.3010307@baechler.net> Hello, Obviously, PG has no control over what their mirrors do, so this is mostly about gutenberg.org and readingroo.ms. My question is this: How is PG logging and using information regarding what books are downloaded and by whom? I know gutenberg.org keeps Apache logs because they've been mentioned here before. What I'm wondering is what PG uses this information for and how it's used.
I've never seen any mention of readingroo.ms logs even though it is an official PG server and is owned by Greg Newby. What prompted this question, besides the usual concerns about privacy and security, is that I've recently been setting up and using TOR. While I like the general concept and I very much like anonymous browsing, it is very, very slow and is not good for file downloads. I'm not too worried on one hand whether PG knows what I download or not, but on the other hand, an official published statement from PG would be nice. I'm also thinking of people outside of the US who either may not legally use this material (but do anyway, obviously) or who may not read it because of the restrictions placed on them by their governments. In my case, PG wouldn't get much out of my downloads anyway because I download everything in English with a plain text edition, but I would still be happier knowing that PG isn't going to use, sell, track, or otherwise make use of information like my IP address, browser, etc. I do trust PG to a point, but the philosophy of TOR is to trust no one and I'm starting to see more and more how easy it is to track someone's browsing habits. The only reason why I don't switch to TOR for almost everything is that it is very slow and is very short on relays. [TOR https://www.torproject.org/] is the link for more information on TOR. There are versions for Windows, Mac, Linux, etc. Thanks very much for providing clarification about this. If this is in the FAQ somewhere, sorry for not finding it, but other than on this list and Pre-prints, I've not seen mention of readingroo.ms before. While I'm here, a similar statement about worldebookfair.com would be helpful. I don't trust worldebookfair.com or PGCC because they aren't directly under the control of PG, Newby or Hart.
From marcello at perathoner.de Wed Jul 30 04:32:38 2008 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed, 30 Jul 2008 13:32:38 +0200 Subject: [gutvol-d] Logging hits to PG files In-Reply-To: <48903A07.3010307@baechler.net> References: <48903A07.3010307@baechler.net> Message-ID: <48905156.5090002@perathoner.de> Tony Baechler wrote: > Obviously, PG has no control over what their mirrors do, so this is > mostly about gutenberg.org and readingroo.ms. My question is this: How > is PG logging and using information regarding what books are downloaded > and by whom? gutenberg.org keeps Apache, Squid and ProFTP logs for one month in our private file area. I use them only for log analysis and for the top-100 list. A cron job deletes all logs older than one month. These files contain IP addresses, times of access and filenames. Only your provider can link the IP address back to you (if you are surfing from cable/DSL). Furthermore we keep the (Analog) statistics forever. These statistics contain IPs with more than 1,000 hits/day or 10,000 hits/month. They don't say which files those IPs accessed. I cannot say how long ibiblio keeps backups of our private files, nor if ibiblio keeps its own copy of the log files somewhere else and for how long. This is not an official statement. -- Marcello Perathoner webmaster at gutenberg.org From Bowerbird at aol.com Wed Jul 30 09:14:19 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 12:14:19 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 Message-ID: keith said: > So what can the parser do for us? A parser will pull in the text and identify > the words, sentences, quotes, chapter headers, footnotes or whatever entities i don't see what that buys us, in terms of the job at hand -- correcting errors. > Using pattern matching you have to go through the patterns one after another. well, i've explained a while back that this is the way we _want_ to do this.
generally, a certain "pattern" will be treated similarly whenever it occurs, so it's fastest to treat each pattern in sequence, rather than mixing them. preprocessing is typically better-executed as a _book-wide_ methodology, rather than a _page-by-page_ task, so much so that it's part of the definition... > A parser will handle everything in one pass if you wish > by using look-ahead and/or look-back. i can handle everything in one pass too, if i write the code that way. > To me context information concerns the structure being analyzed. > Not so much the co-text. except people don't need that to check the text against the image. you're overcomplicating the actual task at hand. it's a simple task. > Just a pet tease ;-))) oh, ok... ;+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/20475e12/attachment.htm From Bowerbird at aol.com Wed Jul 30 12:58:22 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 15:58:22 EDT Subject: [gutvol-d] woman in her own right -- 008 (and final) Message-ID: the proofers have finished "in her own right" in p1 and p2. the p1 proofers changed well over 1200 lines: > http://z-m-l.com/go/wihor/wihor-c-ocr-to-p1.html the p2 proofers changed about 300 lines: > http://z-m-l.com/go/wihor/wihor-c-p1-to-p2.html those are the kind of numbers you get from _crappy_preprocessing_. it's an _insult_ to put that kind of awful text in front of volunteers, yet this text was _reserved_ for _newcomers_ at distributed proofreaders! the only justification would be if this was some kind of _hazing_ritual_.
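[To make the recurring "preprocessing" argument concrete: what these threads describe is a stack of find-and-replace routines run pattern-by-pattern over the whole book before any page reaches a proofer. A minimal Python sketch follows; the three patterns are invented illustrations, not actual DP or bowerbird routines.]

```python
import re

# Each "routine" is one (pattern, replacement) pair, applied to every
# page of the book before proofing starts. These three patterns are
# hypothetical examples, not DP's or bowerbird's actual routines.
ROUTINES = [
    (re.compile(r"\btbe\b"), "the"),        # OCR misread: "tbe" for "the"
    (re.compile(r"\s+([,.;:!?])"), r"\1"),  # stray space before punctuation
    (re.compile(r"''"), '"'),               # two apostrophes scanned for a quote
]

def preprocess_book(pages):
    """One pattern at a time, across the whole book -- not page by page."""
    for pattern, repl in ROUTINES:
        pages = [pattern.sub(repl, page) for page in pages]
    return pages

pages = [
    "tbe road was dark ; tbe moon had set.",
    "''Who goes there ?'' he said.",
]
clean = preprocess_book(pages)
```

[Each routine sweeps the entire book before the next one runs, which is the point of calling this a book-wide job rather than a page-by-page one.]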
since roger frank has shown he will _take_ things personally, even if i don't _make_ them personal in the first place, let's get a little personal: "c'mon roger frank, you can do _much_ better than you've done here... the proofers give you 100%; how about if you give them at least 50%?" roger was quick to tell us how many books he has submitted to p.g. well, sure, it's easy when you put the work on the backs of the proofers. he gives them straw, and they spin it into gold and give it back to him... i would be embarrassed to ask people to help me with this kind of slop -- embarrassed or ashamed, one of the two, and most probably both -- if i ever did, which i wouldn't... as it is, this series is finished. i'm not gonna waste any more of my time analyzing "data" from an "experiment" as ill-conceived as this one was... i slap a big old "f" -- for "flunk!" -- on this paper and i'm sendin' it back. -bowerbird From sly at victoria.tc.ca Wed Jul 30 13:44:17 2008 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed, 30 Jul 2008 13:44:17 -0700 (PDT) Subject: [gutvol-d] woman in her own right -- 008 (and final) In-Reply-To: References: Message-ID: I usually just skim past or delete BB posts. But I must say this one actually made me angry. So, I'll try to contain that, and post a civil response. Just a quick reminder for any newcomers around, that our friend BB has shown over the last few years a habit of telling others what they should do, but has still not (that I am aware of) contributed anything of measurable substance towards PG.
My experience is that once in a while he does give you an idea to make you think, but overall his inflammatory comments and apparent inability to work with others at all have resulted in his being banned from three different message areas that I know of. His previous ban on this mailing list was only temporary for reasons of wanting to remain fair and open, which Greg Newby described quite well at the time. Andrew On Wed, 30 Jul 2008 Bowerbird at aol.com wrote: > the proofers have finished "in her own right" in p1 and p2. From Bowerbird at aol.com Wed Jul 30 14:29:07 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 17:29:07 EDT Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 029 Message-ID: alright, now we're working on the final few clean-up phases... today we will do a collection of all of the mid-line-hyphenates, those words which contain a single-dash inside of themselves. 29. search for mid-line-hyphenates the list is appended... these mid-line-hyphenates will help the computer _resolve_ questionable end-line-hyphenates, if we wanted it to do that. but for now, we're just looking for o.c.r. misrecognitions... most of these words were correctly recognized; some were not. mistakes typically involve a speck misrecognized as an en-dash or -- conversely -- an em-dash misrecognized as an en-dash... the other class of error here involves end-line hyphenates that were rejoined by the proofer without a removal of the hyphen... looking through them, it's fairly easy to pick out the ones which will probably need corrections. those are the only ones i'll check. of the 41 i pulled out for examination, 25 were indeed incorrect... *** some inconsistencies usually appear involving hyphenates... this book was no exception. consider the following cases: 6 occurrences of "to-day", versus 2 occurrences of "today"... *** to-day (6) > from school to-day, and at least provided an emergency > sharper than usual to-day."
Above the stained > and to-day," the doctor replied evasively, "you > informed him concisely, "to-day." > "Right away! now! to-day!" > Hollidew's in Greenstream to-day. I don't know *** today (2) > shoes for a lady today--a generous present for some > from Lettice; and, today, he had recognized a note 9 occurrences of "to-morrow", versus 4 for "tomorrow"... *** to-morrow (9) > "I can give you something day after to-morrow, > that the latter wanted, must have, to-morrow. But > "To-morrow, about seven. Everything will be > to the door; it said, "Gone fishing. Back to-morrow." > Greenstream, be ready to-morrow--" > to-morrow I will feel different." > over on the western mountain, to-morrow night, at > the obscurity of the maples to-morrow night ... > "The stage goes out from Greenstream to-morrow; *** tomorrow (4) > tomorrow, or when I go to church." > he would have his wages tomorrow; however, if > back with you tomorrow. He's been down to > would be dangerous tomorrow. given these inconsistencies, nobody should hesitate to convert all of the archaic forms to today's spelling. (i do that regardless.) *** 25 more lines corrected, for a grand total of 346, on 29 routines... i'll be back tomorrow with the next tip in this series... -bowerbird -------------------------------------------------------------- here is the list of __ mid-line-hyphenates... the 7 most weird cases are listed first, with the rest alphabetical. look through this list and mark the ones that look "weird" to you. then scroll down and see if it matches the list _i_ think are weird. 
(WENTY-SEVEN eye- willing- G-G-God brother-in-law father-in-law long-drawn-out a-calling a-talking air-tight all-over animal-like any-thing ash-pit bag-like bare-necked barely-furnished beady-eyed beat-of black-clad blood-guiltiness blood-money blue-black blue-green blue-white broken-a bull-dog business-like canvas-covered carelessly-garbed chocolate-colored claret-colored clay-cold clear-eyed close-cropped close-cut close-haired close-lipped co-operation cold-blooded conscience-stricken cross-grained deep-shaded deeply-bitten deeply-grassed deeply-lined deeply-scrolled dining-room eighty-nine ever-trimmed ex-stage far-reaching fool-hearted fore-knowledge four-legged four-square freshly-colored freshly-flushed full-lidded gaily-attired gaily-patched gas-eously gaunt-jawed glanced,-each green-sickness greenish-black greenish-gold grey-green half-absently half-buried half-calculated half-closed half-distant half-grown half-heard half-heartedly half-hid half-mechanically half-way hammer-like heart-breaking heavily-built heavy-sweet high-power high-rolling highly-colored highly-simplified hollow-sounding home-knitted hoof-beats ill-concealed ill-considered ill-defined ill-directed ill-proportioned ill-will in-those inwardly-gratifying iron-like JacK-son know;-but leaden-faced leaden-grey leaden-hued leather-like left-hand life-long life-time lightly-struck long-accumulated long-drawn long-familiar long-lost loose-jointed loose-living low-drifting Mac-Kimmon Mai-son machine-like mahogany-colored mid-August milk-white moon-blanched mud-coated naked-seeming nephews-a newly-augmented newly-awakened newly-minted nice-hearted night-like nine-tenths off-hand old-time olive-colored one-time open-handed out-flung paper-shavers pasty-white pellu-cidly plum-colored plush-lined post-office prayer-meeting pride-fully public-spirited raw-boned re-departed re-entered red-clad red-headed robustly-witted rock-like rough-coated roughly-cleared Sim-mons' salt-raised sap-boil sap-boiling 
sap-boilings school-girl school-teacher self-approval self-assertion self-consciousness self-esteem self-headed self-sufficient semi-obscurity semi-ruin semi-surreptitiously seventy-five sharp-like sharp-witted sharply-cut shed-like sheep-cots silver-plated sing-song skull-like sleep-walker slightly-built slightly-varying slow-Idndling slowly-formulating so-so softly-swelling stage-driving stiffly-extended stone-bound store-keepers' straw-colored sulphur-yellow sun-heated swiftly-falling tar-paper thirty-eight thirty-one thousand-fold thread-bare throat.-"There tightly-folded to-day to-morrow to-night tobacco-stained toil-hardened too-a toy-like twenty-eight twenty-five twenty-seven two-seated ultra-blue under-garment under-turning undi-vined unformu-lated unhap-piness up-flung up-rolled vain-longing variously-colored violent-handed weather-beaten weather-proof weather-worn well-fed well-known white-hot white-powdered wild-eyed wind-herded wire-hound wooden-soled world-old yellow-red yellowish-white scroll down for the cases that i considered to be the "problem" cases... scroll down for the cases that i considered to be the "problem" cases... scroll down for the cases that i considered to be the "problem" cases... *** the problem cases (the ones marked with asterisks were errors) (WENTY-SEVEN * eye- * willing- * G-G-God brother-in-law father-in-law long-drawn-out all-over any-thing * beat-of * blood-guiltiness broken-a * co-operation ex-stage gas-eously * glanced,-each * green-sickness high-power ill-will in-those * JacK-son * know;-but * life-time * Mac-Kimmon * Mai-son * naked-seeming nephews-a * night-like pellu-cidly * pride-fully * Sim-mons' * slow-Idndling * store-keepers' * throat.-"There * too-a * under-turning undi-vined * unformu-lated * unhap-piness * vain-longing world-old ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. 
(http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080730/c94b2887/attachment.htm From Bowerbird at aol.com Wed Jul 30 14:53:25 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 17:53:25 EDT Subject: [gutvol-d] more revealing data as "mountain blood" exits p2 Message-ID: "mountain blood" was a test of parallel p1 over at d.p., by rfrank... after two parallel proofings in p1, their outputs were merged for p2. p2 did 50 edits, making 61 lines change. they're appended, and here: > http://z-m-l.com/go/mount/mount-c-p2results.html 3/4 of these edits -- 37 -- were "bureaucratic", necessary only due to stupid policies on ellipses, end-line-hyphenates, em-dashes, etc. basically, p2 just piddled through the book making needless changes. another _7_ errors were _mistakes_in_merging_. that is, _one_ of the parallel proofings got the line correct, but the person merging them chose the line from the _incorrect_ proofing instead of the correct one. p2 also caught 3 p-book errors that neither parallel proofing found. as it's not within the purview of proofers to catch these p-book errors, i don't count this against the p1 proofings. but i've mentioned it here to acknowledge this positive result from p2's word-by-word checking. finally, p2 did catch _2_ o.c.r. errors missed by both p1 proofings... offsetting that ever-so-slightly, p2 _introduced_ 1 error of its own. *** so, let me review. the parallel p1 proofings created a composite text with _2_errors_. that's pretty fantastic. ironically, the person who _merged_ those two parallel p1 proofings made _7_ errors during the merge process, thereby _dwarfing_ the 2 errors that were actually made. restating that, the person who did the merge made _3_times_ _more_mistakes_ in the merge process than the proofers made.
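[The merge errors tallied above suggest the merge step itself could be automated. A rough sketch of one plausible rule set (hypothetical; this is not DP's actual merge procedure): accept lines where the two proofings agree; when exactly one proofing changed a line away from the raw OCR, prefer that version, on the theory that a deliberate change is more likely a fix than a new error; and flag lines where both proofers changed the text differently.]

```python
# Sketch of automatically merging two parallel P1 proofings against the
# raw OCR. The tie-breaking heuristic is an assumption for illustration,
# not an actual Distributed Proofreaders procedure.
def merge_parallel(ocr_lines, p1a_lines, p1b_lines):
    merged, conflicts = [], []
    for i, (ocr, a, b) in enumerate(zip(ocr_lines, p1a_lines, p1b_lines)):
        if a == b:              # both proofers agree: take the line
            merged.append(a)
        elif a == ocr:          # only b changed it: prefer b's fix
            merged.append(b)
        elif b == ocr:          # only a changed it: prefer a's fix
            merged.append(a)
        else:                   # both changed it, differently:
            merged.append(a)    # keep one, but flag it for a human
            conflicts.append(i)
    return merged, conflicts
```

[On the "he wag tired" kind of case from this thread, the proofing that corrected the OCR wins automatically, which is exactly the decision the human merger got wrong 7 times here.]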
*** i will leave it up to you to decide if all these proofings were "worth it". at 2 minutes average per page, every round took about _12_hours_... i _will_ remind you that aggressive preprocessing of this text corrected all but _3_ of the o.c.r. errors, and both p1 proofings caught all 3 of 'em. for the record, here are those 3 errors: > If they took away the chair, Gordon knew, he wag > If they took away the chair, Gordon knew, he was > "Why, damn it fell, Gord!" exclaimed an individual, > "Why, damn it t'ell, Gord!" exclaimed an individual, > la wed bank; a rate of interest a man can carry without > lawed bank; a rate of interest a man can carry without heck, even though i call that "aggressive preprocessing", the fact is that i'm documenting the routines to do it, and all of them are quite obvious. i've now described 29 routines, and none of them were esoteric at all... yet _none_ of these 29 routines were run against this "mountain" text. not a one. why not? i think that's absolutely _terrible_. it's a disgrace. had good preprocessing been done, all but 3 pages would've "no diffed" in both p1 proofings, and been perfect after either one, and these results would've clearly reflected the excellent state of this text at the end of this. as it was, this text received only a handful of "no diff" pages in either p1, and then in p2 another 44 of those pages recorded _yet_another_ "diff"... it is now waiting for p3, and how would anyone know it's almost perfect? once again, as we've seen time after time with all of these "experiments", the proofers _rock_, while the d.p. bureaucracy is _staggeringly_stupid_. but it will now be clear, to you and to rfrank, that parallel proofing "works". -bowerbird ********************* results from the "mountain blood" test-book... ********************* the 50 "errors" found by p2 after the p1 merge... 
********************* bureaucracy ellipsis (20) ********************* bureaucracy end-line-hyphenate (12) ********************* bureaucracy em-dash (2) ********************* bureaucracy 8-bit (2) ********************* bureaucracy begin-chapter-caps (1) ********************* merge error (7) ********************* p-book error (3) ********************* real error that was caught (2) ********************* incorrectly changed by p2 (1) 1> -- p1merged 2> -- p2 proofing #> http://z-m-l.com/go/mount/mountp003.html 1 ********************* bureaucracy 8-bit 1> ALFRED A KNOPF 2> ALFRED ? A ? KNOPF #> http://z-m-l.com/go/mount/mountp009.html 2 ********************* bureaucracy begin-chapter-caps 1> The fiery disk of the sun was just lifting above 2> THE fiery disk of the sun was just lifting above #> http://z-m-l.com/go/mount/mountp023.html 3 ********************* merge error 1> V. 2> V #> http://z-m-l.com/go/mount/mountp031.html 4 ********************* bureaucracy end-line-hyphenate 1> hidden space, the village lay along its white highway. 2> hidden space, the village lay along its white high-*way. #> http://z-m-l.com/go/mount/mountp034.html 5 ********************* bureaucracy end-line-hyphenate 1> opposite side the mellow brick face of the Court-*house 2> opposite side the mellow brick face of the Courthouse #> http://z-m-l.com/go/mount/mountp041.html 6 ********************* bureaucracy ellipsis 1> "our link with the outer world, our faithful messenger....I 2> "our link with the outer world, our faithful messenger.... ********************* bureaucracy ellipsis (2nd line) 1> wanted to see you; ah, yes." He 2> I wanted to see you; ah, yes." 
He #> http://z-m-l.com/go/mount/mountp048.html 7 ********************* bureaucracy end-line-hyphenate 1> the lush grass, the greenish-gold sparks of the fireflies 2> the lush grass, the greenish-gold sparks of the fire-*flies #> http://z-m-l.com/go/mount/mountp052.html 8 ********************* bureaucracy ellipsis 1> the secretiveness, of night ... Greenstream village 2> the secretiveness, of night.... Greenstream village #> http://z-m-l.com/go/mount/mountp058.html 9 ********************* bureaucracy end-line-hyphenate 1> "Give me the man from the woods for an open-handed 2> "Give me the man from the woods for an openhanded #> http://z-m-l.com/go/mount/mountp087.html 10 ********************* bureaucracy ellipsis 1> injustice and injury. He hardened, grew defiant ... the 2> injustice and injury. He hardened, grew defiant ********************* bureaucracy ellipsis (2nd line) 1> strain of lawlessness brought so many years 2> ... the strain of lawlessness brought so many years #> http://z-m-l.com/go/mount/mountp098.html 11 ********************* bureaucracy 8-bit 1> They stood before the dark, porchless facade 2> They stood before the dark, porchless façade #> http://z-m-l.com/go/mount/mountp100.html 12 ********************* bureaucracy ellipsis 1> for frugality, for independence, as a reserve ... 2> for frugality, for independence, as a reserve ********************* bureaucracy ellipsis (2nd line) 1> or for pleasure. It was the hottest hour of the 2> ... or for pleasure. It was the hottest hour of the ********************* bureaucracy ellipsis 1> in that banal setting, suddenly grew unbearable. 2> in that banal setting, suddenly grew unbearable.... #> http://z-m-l.com/go/mount/mountp111.html 14 ********************* bureaucracy end-line-hyphenate 1> would." He turned with a sigh to the log. A crosscut 2> would." He turned with a sigh to the log.
A cross-*cut #> http://z-m-l.com/go/mount/mountp125.html 15 ********************* bureaucracy end-line-hyphenate 1> the dark house.... He shut his eyes for a mo* 2> the dark house.... He shut his eyes for a mo-* #> http://z-m-l.com/go/mount/mountp128.html 16 ********************* bureaucracy em-dash 1> stirring--three souls redeemed from everlasting torment 2> stirring---three souls redeemed from everlasting torment #> http://z-m-l.com/go/mount/mountp131.html 17 ********************* incorrectly changed by p2 1> say, Augustus," he demanded in eager, tremulous 2> say, Augustus,"he demanded in eager, tremulous #> http://z-m-l.com/go/mount/mountp134.html 18 ********************* bureaucracy ellipsis 1> "Teacher, kin I be excused? Teacher! ... Teacher--!'" 2> "'Teacher, kin I be excused? Teacher! ... Teacher--!'" #> http://z-m-l.com/go/mount/mountp151.html 19 ********************* merge error 1> in silkaleen and back in Al mohair, it'll stand you 2> in silkaleen and back in A1 mohair, it'll stand you #> http://z-m-l.com/go/mount/mountp153.html 20 ********************* bureaucracy ellipsis 1> charming little wife, large fortune at your disposal. 2> charming little wife, large fortune at your disposal.... ********************* bureaucracy ellipsis (2nd line) 1> ... Pompey left one of the solidest estates in this 2> Pompey left one of the solidest estates in this #> http://z-m-l.com/go/mount/mountp154.html 21 ********************* bureaucracy ellipsis 1> in the far future, perhaps in another generation. 2> in the far future, perhaps in another generation.... ********************* bureaucracy ellipsis (2nd line) 1> ... What would you say to a flat eight dollars an 2> What would you say to a flat eight dollars an #> http://z-m-l.com/go/mount/mountp174.html 22 ********************* p-book error 1> to the rod. General Jackson's head hung panting, 2> to the road. 
General Jackson's head hung panting, #> http://z-m-l.com/go/mount/mountp176.html 23 ********************* bureaucracy ellipsis 1> "Yes," she assented, "there was nothing else open.... Won't 2> "Yes," she assented, "there was nothing else open.... ********************* bureaucracy ellipsis (2nd line) 1> you come up and smoke a cigarette? 2> Won't you come up and smoke a cigarette? #> http://z-m-l.com/go/mount/mountp196.html 24 ********************* bureaucracy ellipsis 1> longer. You can't tread on me. It's going to stop 2> longer. You can't tread on me. It's going to stop ... now." ********************* bureaucracy ellipsis (2nd line) 1> ... now." 2> ??synchlineadded #> http://z-m-l.com/go/mount/mountp213.html 25 ********************* merge error 1> the astute storekeeper into such a satisfactory, retail-* 2> the astute storekeeper into such a satisfactory, retali-* #> http://z-m-l.com/go/mount/mountp217.html 26 ********************* merge error 1> of Lattice's." 2> of Lettice's." #> http://z-m-l.com/go/mount/mountp219.html 27 ********************* p-book error 1> hundred per cent. increase." 2> hundred per cent increase." #> http://z-m-l.com/go/mount/mountp246.html 28 ********************* bureaucracy end-line-hyphenate 1> the best clothes. I'll tell him I'm a poor schoolteacher 2> the best clothes. I'll tell him I'm a poor school-teacher #> http://z-m-l.com/go/mount/mountp256.html 29 ********************* bureaucracy ellipsis 1> had gone to the sap-boiling.... I sat up all night 2> had gone to the sap-boiling ... I sat up all night ********************* bureaucracy ellipsis 1> ... waiting.... I couldn't wait any longer, Gordon, 2> ... waiting ... I couldn't wait any longer, Gordon, #> http://z-m-l.com/go/mount/mountp281.html 31 ********************* merge error 1> ??line-was-missing?? 2> On an afternoon of the second autumn following #> http://z-m-l.com/go/mount/mountp286.html 32 ********************* bureaucracy ellipsis 1> now...." 
her blue gaze blurred with slow tears. 2> now ..." her blue gaze blurred with slow tears. #> http://z-m-l.com/go/mount/mountp308.html 33 ********************* bureaucracy end-line-hyphenate 1> necessary, and knock the bottom out of the store-keepers' 2> necessary, and knock the bottom out of the storekeepers' #> http://z-m-l.com/go/mount/mountp310.html 34 ********************* bureaucracy em-dash 1> make her, but it would certainly be accommodating--" 2> make her, but it would certainly be accommodating--" he ********************* bureaucracy em-dash (2nd line) 1> he paused interrogatively. 2> paused interrogatively. ********************* bureaucracy ellipsis 1> for the note, if it comes to that. But the fact is 2> for the note, if it comes to that. But the fact is ... I've ********************* bureaucracy ellipsis (2nd line) 1> ... I've got a lot of money laid out. What's been 2> got a lot of money laid out. What's been #> http://z-m-l.com/go/mount/mountp321.html 36 ********************* bureaucracy end-line-hyphenate 1> you are the son of your father. I knew your grandfather 2> you are the son of your father. I knew your grand-*father ********************* real error that was caught 1> "We've never been storekeepers," 2> "We've never been storekeepers." #> http://z-m-l.com/go/mount/mountp323.html 38 ********************* bureaucracy ellipsis 1> "I don't want to make! I don't want to take anything 2> "I don't want to make! I don't want to take anything ... never ********************* bureaucracy ellipsis (2nd line) 1> ... never again! I want--" 2> again! I want--" #> http://z-m-l.com/go/mount/mountp324.html 39 ********************* bureaucracy end-line-hyphenate 1> met the Company's agents, heard the agreement outlined; 2> met the Company's agents, heard the agreement out-*lined; #> http://z-m-l.com/go/mount/mountp331.html 40 ********************* merge error 1> but not Kenny's for nineteen years." Another bore, 2> but not Henny's for nineteen years." 
Another bore, #> http://z-m-l.com/go/mount/mountp337.html 41 ********************* bureaucracy ellipsis 1> read ... Why! ... Why, damn it! they had it 2> read ... Why!... Why, damn it! they had it #> http://z-m-l.com/go/mount/mountp338.html 42 ********************* bureaucracy ellipsis 1> the motives of such ill-considered...." 2> the motives of such ill-considered ..." #> http://z-m-l.com/go/mount/mountp343.html 43 ********************* bureaucracy ellipsis 1> I only stopped now to warn you away.... I'll hitch 2> I only stopped now to warn you away ... I'll hitch ********************* real error that was caught 1> of wrath, his arm rose, with a finger indicating the* 2> of wrath, his arm rose, with a finger indicating the #> http://z-m-l.com/go/mount/mountp346.html 45 ********************* bureaucracy end-line-hyphenate 1> Why, you've been the mutton of every little storekeeper 2> Why, you've been the mutton of every little store-*keeper #> http://z-m-l.com/go/mount/mountp362.html 46 ********************* bureaucracy ellipsis 1> done a thing like that. Why, just see...." Gordon 2> done a thing like that. Why, just see ..." Gordon #> http://z-m-l.com/go/mount/mountp365.html 47 ********************* bureaucracy end-line-hyphenate 1> *ing, dead planet. Gleams of light shot like quicksilver 2> *ing, dead planet. Gleams of light shot like quick-*silver #> http://z-m-l.com/go/mount/mountp367.html 48 ********************* p-book error 1> in his brain, was dulled. He place his foot upon the 2> in his brain, was dulled. He placed his foot upon the ********************* bureaucracy ellipsis 1> "If it hadn't been for you, what you did for me 2> "If it hadn't been for you, what you did for me ... others ... new ********************* bureaucracy ellipsis (2nd line) 1> ... others ... new courage, example of bigness--Why! 2> courage, example of bigness--Why! #> http://z-m-l.com/go/mount/mountp368.html 50 ********************* merge error 1> him to where, on. 
the bureau, a lamp had been left. 2> him to where, on the bureau, a lamp had been left. From Bowerbird at aol.com Wed Jul 30 15:31:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 30 Jul 2008 18:31:52 EDT Subject: [gutvol-d] planet strappers -- iteration #10 Message-ID: "planet strappers" -- the _perpetual_ p1 experiment over at d.p. -- has now finished up with its 10th iteration of the proofing process. as reported earlier, this iteration also missed the error on page 33... better luck next iteration... that's not to say that this iteration didn't make lots of changes though! 140 lines were changed during this iteration! it's very hard to believe that so many o.c.r. errors survived through so many previous iterations. and indeed, your instincts are correct. these 140 changes were almost _entirely_ "bureaucratic" changes, on ellipses, em-dashes, and so on. some of them were because the newbies who were doing this proofing simply don't understand the rules, and are applying them erroneously... but a good number of the changes are because the rules are _slippery_, and -- in order to exert a sense of agency -- proofers take advantage... for instance, how would you handle this end-of-line-hyphenate? > than the tonnage of the Port of Baltimore, to- > day." (that's an actual example, taken from "in her own right".) the d.p. rules state that a proofer should eliminate an unneeded hyphen. if the word is "today", then you would join it together as "today". great. but if the word is "to-day", then you would join it together as "to-day". and if you're not sure, the d.p. convention is to mark it as "to-*day". ok.
but remember how i just showed you that some old books are inconsistent and they can use "today" and "to-day" in the same book? so a proofer who saw a case of "today" used on another page would eliminate the hyphen... another proofer who saw a case of "to-day" on another page would keep it. and a proofer who saw neither (or both) might mark the word as "to-*day"... so what you get is a proofer marking it one way in one iteration, and then another proofer marking it another way in another iteration, and then yet another proofer marking it the third way in another iteration. and then we go back to the first kind of proofer, who changes it in another iteration. it's rock-paper-scissors, and no one is ever "right", and it can go on forever. this is the kind of indeterminacy that poorly-thought-out policies buy you... *** there were a few cases (3) where a non-bureaucratic change was made. in 2 of these, iteration#10 was correcting an error made by iteration#9, but earlier rounds had had the line correct, so this doesn't really "count". in the third case, iteration#10 introduced a new error. #9 had it right. how's that? for every 2 errors you fix, you introduce 1 new one... talk about a perfect-glove fit for "two steps forward, one step back". *** to its credit, though, iteration#10 found _3_ more p-book errors. these really don't matter much, but it's amazing that nobody had noticed these errors before. (at least i don't _recall_ that they had; it's just not important enough to me that i go back and confirm it.) anyway, here are those 3 cases: > http://z-m-l.com/go/plans/plansp060.html > the mountain wall, imbedded in the dust of the mare. There > the mountain wall, embedded in the dust of the mare. There > http://z-m-l.com/go/plans/plansp139.html > theirs. I'll cover the rest of this batch: You'll be better than I > theirs. I'll cover the rest of this batch.
You'll be better than I > http://z-m-l.com/go/plans/plansp141.html > first task that Nelsen had ever performed in space--the jockying > first task that Nelsen had ever performed in space--the jockeying again, no big deal, but surprising that, 10 rounds in, we still find stuff. -bowerbird From jayvdb at gmail.com Wed Jul 30 17:22:18 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 31 Jul 2008 10:22:18 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080730043313.CAEE91035B@posso.dm.unipi.it> References: <488FB357.1080307@novomail.net> <20080730043313.CAEE91035B@posso.dm.unipi.it> Message-ID: On Wed, Jul 30, 2008 at 2:33 PM, Carlo Traverso wrote: >>>>>> "Lee" == Lee Passey writes: > > Lee> John Vandenberg wrote: [snip] > > >>>> Interesting idea. I've been considering using o.c.r. to > >>>> re-paginate a proofread text. It sounds like you're > >>>> suggesting the opposite would be more fruitful. > >>> well, whether you use the already-proofed text to bring the > >>> o.c.r. version up to final-quality, or (vice-versa-like) use > >>> the o.c.r. version to bring the already-proofed text to > >>> final-stage, the effect is the same either way. you're > >>> comparing the two and implementing whatever changes are > >>> necessary to finalize. > >> Any existing code around to do something like this ? > > Lee> Yes. > > Lee> I have created some code to do this, which I would be happy > Lee> to share with you, but I'm hoping someone else has done it > Lee> better. > > http://www.gnu.org/software/wdiff or emacs ediff in word mode are > doing that excellently.
I gave wdiff a whirl yesterday, first comparing the two online editions of SBE v42 Book 2, which worked like a treat, and then attempting to merge the corrected text into the OCR text. Pagescan: http://en.wikisource.org/wiki/Page:Sacred_Books_of_the_East_42.djvu/130 ----Raw OCR---- 3. Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be born : may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay him ! 4. May father Dyaus (sky) and mother Pr/thivi (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and out- breathing ! 5. Lead this dear child to life and vigour, O Agni, ----Clean text---- 3. Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be born: may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay him! 4. May father Dyaus (sky) and mother Prithivi (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and outbreathing! 5. Lead this dear child to life and vigour, O Agni, Varuna, and king Mitra! As a mother afford him protection, O Aditi, and all ye gods, that he may attain to old age! ----wdiff output---- [-3.-] {+3.+} Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be [-born :-] {+born:+} may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay [-him ! 4.-] {+him! 4.+} May father Dyaus (sky) and mother [-Pr/thivi-] {+Prithivi+} (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and [-out- breathing ! 5.-] {+outbreathing! 5.+} Lead this dear child to life and vigour, O Agni, [--] {+Varuna, and king Mitra! 
As a mother afford him protection, O Aditi, and all ye gods, that he may attain to old age!+} ---- end ---- In the above wdiff output, I've lost my end-of-lines that were in the original OCR text. I've looked at the wdiff options, and can't see which would do the trick. -- John Vandenberg From jayvdb at gmail.com Wed Jul 30 18:02:38 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 31 Jul 2008 11:02:38 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: References: <488FB357.1080307@novomail.net> <20080730043313.CAEE91035B@posso.dm.unipi.it> Message-ID: On Thu, Jul 31, 2008 at 10:22 AM, John Vandenberg wrote: > On Wed, Jul 30, 2008 at 2:33 PM, Carlo Traverso > wrote: >>>>>>> "Lee" == Lee Passey writes: >> >> Lee> John Vandenberg wrote: [snip] >> >> >>>> Interesting idea. I've been considering using o.c.r. to >> >>>> re-paginate a proofread text. It sounds like you're >> >>>> suggesting the opposite would be more fruitful. >> >>> well, whether you use the already-proofed text to bring the >> >>> o.c.r. version up to final-quality, or (vice-versa-like) use >> >>> the o.c.r. version to bring the already-proofed text to >> >>> final-stage, the effect is the same either way. you're >> >>> comparing the two and implementing whatever changes are >> >>> necessary to finalize. >> >> Any existing code around to do something like this ? >> >> Lee> Yes. >> >> Lee> I have created some code to do this, which I would be happy >> Lee> to share with you, but I'm hoping someone else has done it >> Lee> better. >> >> http//www.gnu.org/software/wdiff or emacs ediff in word mode are >> doing that excellently. > > I gave wdiff a whirl yesterday, first comparing the two online > editions of SBE v42 Book 2, which worked like a treat, and then > attempting to merge the corrected text into the OCR text. > > Pagescan: > > http://en.wikisource.org/wiki/Page:Sacred_Books_of_the_East_42.djvu/130 > > ----Raw OCR---- > > 3.
Thou, (O Agni), rulest over all the animals of > the earth, those which have been born, and those > which are to be born : may not in-breathing leave > this one, nor yet out-breathing, may neither friends > nor foes slay him ! > 4. May father Dyaus (sky) and mother Pr/thivi > (earth), co-operating, grant thee death from old > age, that thou mayest live in the lap of Aditi a > hundred winters, guarded by in-breathing and out- > breathing ! > 5. Lead this dear child to life and vigour, O Agni, > > > ----Clean text---- > > 3. > Thou, (O Agni), rulest over all the animals of the earth, those which have been > born, and those which are to be born: may not in-breathing leave this one, nor > yet out-breathing, may neither friends nor foes slay him! > > > > 4. May father Dyaus > (sky) and mother Prithivi (earth), co-operating, grant thee death from old age, > that thou mayest live in the lap of Aditi a hundred winters, guarded by > in-breathing and outbreathing! > > > > 5. Lead this dear child to life and vigour, O > Agni, Varuna, and king Mitra! As a mother afford him protection, O Aditi, and > all ye gods, that he may attain to old age! > > > ----wdiff output---- > > > [-3.-] > {+3.+} > Thou, (O Agni), rulest over all the animals of the earth, those which have been > born, and those which are to be [-born :-] {+born:+} may not > in-breathing leave this one, nor > yet out-breathing, may neither friends nor foes slay [-him ! > 4.-] {+him! > > > > 4.+} May father Dyaus > (sky) and mother [-Pr/thivi-] {+Prithivi+} (earth), co-operating, > grant thee death from old age, > that thou mayest live in the lap of Aditi a hundred winters, guarded by > in-breathing and [-out- > breathing ! > 5.-] {+outbreathing! > > > > 5.+} Lead this dear child to life and vigour, O > Agni, > [--] {+Varuna, and king Mitra! 
As a mother afford him protection, O Aditi, and > all ye gods, that he may attain to old age!+} > > ---- end ---- > > In the above wdiff ouput, I've lost my end-of-lines that were in the > original OCR text. I've looked at the wdiff options, and cant see > which would do the trick. The solution came to me: call it as "wdiff clean-text ocr-text", resulting in: ---- [-3.-]{+3.+} Thou, (O Agni), rulest over all the animals of the earth, those which have been born, and those which are to be [-born:-] {+born :+} may not in-breathing leave this one, nor yet out-breathing, may neither friends nor foes slay [-him! 4.-] {+him ! 4.+} May father Dyaus (sky) and mother [-Prithivi-] {+Pr/thivi+} (earth), co-operating, grant thee death from old age, that thou mayest live in the lap of Aditi a hundred winters, guarded by in-breathing and [-outbreathing! 5.-] {+out- breathing ! 5.+} Lead this dear child to life and vigour, O Agni, [-Varuna, and king Mitra! As a mother afford him protection, O Aditi, and all ye gods, that he may attain to old age!-] {++} --- Is there a GUI or command line front-end for wdiff, to allow interactive accept/reject of each change? -- John Vandenberg From schultzk at uni-trier.de Thu Jul 31 00:13:54 2008 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 31 Jul 2008 09:13:54 +0200 Subject: [gutvol-d] how to clean up ("preprocess") the o.c.r. for a book -- 027 In-Reply-To: References: Message-ID: Am 30.07.2008 um 18:14 schrieb Bowerbird at aol.com: > keith said: > > So what can the parser do for us. A parser will pull in the > text and identify, > > the words sentences, quotes, chapter-headers, footenotes or > whatever enties > > i don't see what that buys us, in terms of the job at hand -- > correcting errors. Well, what I said about would belong more in helping mark up the text. YET, identifing words, sentences, and quotes and making sure they are balanced can help in finding the errors. 
> > > > > Using pattern matching you have to go through the pattern one > after another. > > well, i've explained a while back that this is the way we _want_ to > do this. > > generally, a certain "pattern" will be treated similarly whenever > it occurs, > so it's fastest to treat each pattern in sequence, rather than > mixing them. Well, parsing will not mix them up, and the parser can be written so as to tag the error (or correct it) as being such and such an error. > > > preprocessing is typically better-executed as a _book-wide_ > methodology, > rather than a _page-by-page_ task, so much that it's part of the > definition... So what is the problem? A parser couldn't care less. > > > > > A parser will handle everthing in one pass if you wish > > by using look ahead and or look back. > > i can handle everything in one pass too, if i write the code that way. > > > > To me context information concerns the structure being analyzed. > > Not so much the co-text. > > except people don't need that to check the text against the image. > you're overcomplicating the actual task at hand. it's a simple task. I never talked about using the image! regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/0a1dc32e/attachment.htm From jayvdb at gmail.com Thu Jul 31 01:18:50 2008 From: jayvdb at gmail.com (John Vandenberg) Date: Thu, 31 Jul 2008 18:18:50 +1000 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <20080730043313.CAEE91035B@posso.dm.unipi.it> References: <488FB357.1080307@novomail.net> <20080730043313.CAEE91035B@posso.dm.unipi.it> Message-ID: On Wed, Jul 30, 2008 at 2:33 PM, Carlo Traverso wrote: >>>>>> "Lee" == Lee Passey writes: > > Lee> John Vandenberg wrote: [snip] > > >>>> Interesting idea. I've been considering using o.c.r. to > >>>> re-paginate a proofread text. It sounds like you're > >>>> suggesting the opposite would be more fruitful.
> >>> well, whether you use the already-proofed text to bring the > >>> o.c.r. version up to final-quality, or (vice-versa-like) use > >>> the o.c.r. version to bring the already-proofed text to > >>> final-stage, the effect is the same either way. you're > >>> comparing the two and implementing whatever changes are > >>> necessary to finalize. > >> Any existing code around to do something like this ? > > Lee> Yes. > > Lee> I have created some code to do this, which I would be happy > Lee> to share with you, but I'm hoping someone else has done it > Lee> better. > > http//www.gnu.org/software/wdiff or emacs ediff in word mode are > doing that excellently. In my investigation, I have found another simple program called dwdiff, which is mostly commandline compatible with wdiff. Only the --autopager, --terminal and --avoid-wraps options are not supported. Here is the full set: -C, --copyright print Copyright then exit -V, --version print program version then exit -1, --no-deleted inhibit output of deleted words -2, --no-inserted inhibit output of inserted words -3, --no-common inhibit output of common words -a, --auto-pager automatically calls a pager -h, --help print this help -i, --ignore-case fold character case while comparing -l, --less-mode variation of printer mode for "less" -n, --avoid-wraps do not extend fields through newlines -p, --printer overstrike as for printers -s, --statistics say how many words deleted, inserted etc. -t, --terminal use termcap as for terminal displays -w, --start-delete=STRING string to mark beginning of delete region -x, --end-delete=STRING string to mark end of delete region -y, --start-insert=STRING string to mark beginning of insert region -z, --end-insert=STRING string to mark end of insert region http://os.ghalkes.nl/dwdiff.html http://www.linux.com/articles/114176 In my wandering of the web, I found the attached file mentioned here http://mail.python.org/pipermail/tutor/2002-April/013928.html ... 
hidden away in the archive ... http://web.archive.org/web/20020313231458/mike-labs.com/wd2h/wd2h.html wd2h.pl shows the rough algorithm that wdiff is performing. -- John Vandenberg -------------- next part -------------- A non-text attachment was scrubbed... Name: wd2h.pl Type: application/octet-stream Size: 10431 bytes Desc: not available Url : http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/6e2ac5b8/attachment-0001.obj From rfrank at pobox.com Thu Jul 31 06:43:51 2008 From: rfrank at pobox.com (Roger Frank) Date: Thu, 31 Jul 2008 09:43:51 -0400 Subject: [gutvol-d] woman in her own right -- 008 (and final) In-Reply-To: References: Message-ID: On Wed, 30 Jul 2008 13:44:17 -0700 (PDT), Andrew Sly wrote: > > I usually just skim past or delete BB posts. > But I must say this one actually made me angry. After Andrew's post, I couldn't resist digging the referenced post out of the trash to see what it said. ("the proofers have finished 'in her own right' in p1 and p2.") > as it is, this series is finished. i'm not gonna waste any more of my > time analyzing "data" from an "experiment" as ill-conceived as this > one was... "i'm not gonna waste any more of my time...." means to me that BB felt the analysis he had already done was a waste of time. I wonder how many would agree. Since many of BB's posts that I saw before I put the kill filter on were all about me and my work, I wonder what crusade BB will start next? So what else is in here? Will it be amateurish, contrived, selective, inflammatory, ad hominem, or a combination of these? Let's see.... > the p1 proofers changed well over 1200 lines What does that mean? Of course it sounds bad, because it is meant to sound bad. Well, the page headers were left in for this book (and noted in the project thread) to see if forcing attention to the top of the page would make the proofers more accurate regarding top-of-page paragraph delineation.
So out of the 1200 lines, about half of them were adjustments to the top of the page markup on most of the 337 pages. Of those that remain, it averages to just about two lines with corrections per page. I'm comfortable with that. > since roger frank has shown he will _take_ things personally, even if > i don't _make_ them personal in the first place, let's get a little > personal.... An interesting, if specious, justification for a personal attack. > roger was quick to tell us how many books that he has submitted to > p.g. well, sure, it's easy when you put the work on the backs of the > proofers. he gives them straw, and they spin it into gold and give it > back to him... Absolutely the proofers, formatters, smoothies and the posting team all play a big part in any book's journey from scanner to posting. I've made that comment many times. But the quoted passage misses the point entirely (as usual, as it's meant to do, for effect.) The reason I mentioned that I had posted several books to PG was not to compare or minimize anyone who hasn't but to point out that with postprocessing-time comes experience. That experience can only help a contributor who is working to make the DP/PG process better. I think back to when I became a high-school teacher. I was working at my engineering job in the day and teaching computer systems engineering at a local college at night. I decided that I liked teaching better and decided to become a high-school math teacher. I went to night school for a few years to get my teaching license and an M.A. in Education. I knew the math because of my engineering career. I knew about teaching from my studies, or thought I did. But it was all put in perspective by a wise, experienced teacher who told me right at the start of my first math teaching assignment: "You won't know this material and you won't know how to teach it until you've taught it for three years." He was absolutely right.
I did mention that I had posted several books to DP and that I felt the experience made me a better contributor, but it may have been better stated that I have been logging in and working at DP every day for over three years and with that has come an understanding that I don't believe anyone can get without that experience. I'm not saying anyone without experience should not speak up. Anyone can have a good idea, and sometimes the freshest, most interesting ideas come from people who are distanced from the current process. I personally wish this list were moderated, so those fresh ideas and "better ways" could be presented and discussed without the personal attacks and without getting a letter grade from the self-appointed evaluator who trolls this list and permeates it with negative posts. Also, from that same quote above, I consider it an insult to the many hard-working post-processors to state that "it's easy" under any circumstances to do that work. --Roger Frank From hart at pglaf.org Thu Jul 31 09:48:48 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 31 Jul 2008 09:48:48 -0700 (PDT) Subject: [gutvol-d] woman in her own right -- 008 (and final) In-Reply-To: References: Message-ID: Since there is no substantive content to this purported "civil response" it still falls into the FLAME category. Please refrain from sending messages that only contain a personal attack, no matter how "civil" you think you may have stated your attack. This puts you on the list for being banned if and when a situation arises if/when this list supports censorship.
Not that I have any particular interest in defending all the posts by bowerbird, but I feel I should point out to the concerned parties that he has been set up to quite a significant degree by "tag team flamers" who alternate a series of messages carefully gauged to antagonize him but are couched in these "civil" terms and, when confronted, the "tag team flamers" each say they only send a minimal number of messages each while bowerbird sent many. . .an equal number perhaps to those from the tag team flamers. I have also seen this on a number of other listservers-- and I hope it makes it into some sort of list manual. Michael Hart Founder Project Gutenberg On Wed, 30 Jul 2008, Andrew Sly wrote: > > I usually just skim past or delete BB posts. > But I must say this one actually made me angry. > > So, I'll try to contain that, and post a civil response. > > Just a quick reminder for any newcomers around, > that our friend BB has shown over the last few years > a habit of telling others what they should do, > but has still not (that I am aware of) contributed > anything of measurable substance towards PG. > > My experience is that once in a while he does give you > an idea to make you think, but overall his inflammatory > comments and apparent inability to work with others > at all have resulted in his being banned from three > different message areas that I know of. > > His previous ban on this mailing list was only temporary > for reasons of wanting to remain fair and open, > which Greg Newby described quite well at the time. > > Andrew > > On Wed, 30 Jul 2008 Bowerbird at aol.com wrote: > >> the proofers have finished "in her own right" in p1 and p2. > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From hart at pglaf.org Thu Jul 31 10:50:14 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 31 Jul 2008 10:50:14 -0700 (PDT) Subject: [gutvol-d] !@!
Re: woman in her own right -- 008 (and final) In-Reply-To: References: <4891EF2C.7020309@xs4all.nl> Message-ID: I have received several notes, mostly private, that seem to have ignored my statement that I am not interested in types of defenses for bowerbird, and I should add that he will be the first, and has been in the past, to say no need for it. However, flames are flames, and should be pointed out by an internet listserv moderator, which I have done. If you are all so adamant about toasting bowerbird I now do a repeat of a theme we have discussed often. You are ALL welcome to start your own listservers at expense to be 100% defrayed by Project Gutenberg. So, once again, I simply point out that if you don't want a contact with bowerbird. . .which you all SAY. . .all you do is start your own listserver and don't let him in, or use a heavy hand on "moderation" if you do let him in. These solutions are simple. Always have been available. The fact that you don't use them belies the claim that your real interest is not hearing from bowerbird. I agree with many of the comments I have received that this ongoing flame war is NOT a good thing. If you just ignored bowerbird, it would not be there. Once again I have been pilloried for NOT killing him off in a situation you could have eliminated several easy ways. If you think you can force me into using moderation weapons then I suggest you think again. . . . If I ever use such weapons, those who wanted it will be the first to go. . . . This is an open list, and will remain so as long as you are on it. . .so start your own list if you want otherwise. We will gladly pay all the expenses, provide the hardware & software necessary. The way things are you are just mountaining a molehill. Please stop.
Hoping to thank you for consideration in the near future, Michael Hart Founder Project Gutenberg From Bowerbird at aol.com Thu Jul 31 11:22:52 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jul 2008 14:22:52 EDT Subject: [gutvol-d] woman in her own right -- 008 (and final) Message-ID: roger said: > "i'm not gonna waste any more of my time...." means to me that > BB felt the analysis he had already done was a waste of time. some of the analyses i've done have been worthwhile. but i've already adequately demonstrated that shoddy preprocessing wastes the time of the proofers, so there's no use proving that again. do you really think you can continue running crappy "experiments" and that i'm gonna continue spending my time "analyzing" all of 'em? > Well, the page headers were left in for this book (and noted in > the project thread) to see if forcing attention to the top of the page > would make the proofers more accurate regarding top-of-page > paragraph delineation. this is cute. "we're gonna leave in some errors on purpose, to see if it helps the proofers be more accurate when it comes to other errors." however, as repugnant as that sounds, if it was supported by _data_, it might be interesting. but notice that roger has given us no data... and even if he had, it wouldn't be all that meaningful, because the top-of-page new-paragraph blank-lines are easy enough to insert _automatically_, in preprocessing. which makes the phrase go like: "we're gonna leave in some errors that we can find automatically to see if it helps the proofers find other errors we can find automatically." it'd be better to just find all the errors automatically, and _fix_them_... > So out of the 1200 lines, about half of them were adjustments to > the top of the page markup on most of the 337 pages. um, no. at most that would be 337, which is hardly "about half".
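[The automatic top-of-page handling claimed above can be sketched with a simple heuristic. This is a hypothetical illustration only — not bowerbird's tool and not actual DP preprocessing code: if a page's last line ends without sentence-final punctuation, assume the paragraph continues across the page break; otherwise insert the blank line that marks a new paragraph.]

```python
# Hypothetical heuristic for page joins in preprocessing (not actual
# DP tooling): join pages without a blank line when the previous page
# appears to end mid-sentence, and with one when it ends a paragraph.

SENTENCE_END = ('.', '!', '?', '."', ".'", '!"', '?"')

def join_pages(pages):
    """Concatenate page texts, adding the top-of-page blank line
    only where the previous page appears to end a paragraph."""
    out = pages[0].rstrip("\n")
    for page in pages[1:]:
        last_line = out.splitlines()[-1].rstrip()
        sep = "\n\n" if last_line.endswith(SENTENCE_END) else "\n"
        out += sep + page.rstrip("\n")
    return out
```

[Real pages are messier — abbreviations like "Mr." would trigger a false paragraph break, and words hyphenated across pages need their own handling — but the sketch suggests how much of the blank-line bookkeeping is mechanical.]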
i don't even count the addition or deletion of blank lines, at the top of the page or anywhere on a page, as that's a "bureaucratic" change, since the paragraphing can (and should) be done in preprocessing... > Of those that remain, it averages to > just about two lines with corrections per page. > I'm comfortable with that. well, i'm glad roger is "comfortable with that". but since the preprocessing that i did on this book -- using nothing but obvious cleaning routines -- left just 3 errors in the _entire_book_, my own feeling is that i wouldn't be comfortable unless i did that well. two corrections per page means "this book needs another round". 3 corrections for the entire book means "this one can go out now". that's a _huge_ difference. > An interesting, if specious, justification for a personal attack. no. you _clearly_ demonstrated last week that you _did_indeed_ take what i said personally, even though it wasn't written that way. i just gave you a sample of what it looks like when i do get personal. and, by the way, saying "you can do a lot better" is hardly an insult... you, on the other hand, seem to feel quite comfortable calling me "a troll" and making all kinds of other uncomplimentary accusations. it seems that your brand of "politeness" has an escape-clause in it. (which is true of most "win friends/influence people" proponents; evidently if someone is not susceptible to your charms, it gives you a full and complete license to abuse them in any possible manner, which just goes to show how thin the veneer is on that philosophy.) > But the quoted passage misses the point entirely > (as usual, as it's meant to do, for effect.) no, the "quoted passage" said that you are giving proofers something that is _shoddy_ -- because it was subjected to bad preprocessing -- when you could be giving them something that is clearly much better, as i am showing in the separate series on "how to do preprocessing". i took the error-rate in the book down to _3_errors_. 
_three_, roger. your so-called preprocessing left 1200 errors in the book, or more... if you don't see the difference, _you_ have "missed the point entirely". just do the job right. that's all i'm asking. is it really all that hard? > I think back to when I became a high-school teacher. well, thanks for the folksy anecdote... now, will you please go back and improve your preprocessing tool? because this has _nothing_ to do with you -- at all, in the slightest -- and _everything_ to do with how the workflow at d.p. is constituted... thank you. -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/fd1b6f06/attachment.htm From Bowerbird at aol.com Thu Jul 31 11:33:35 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jul 2008 14:33:35 EDT Subject: [gutvol-d] woman in her own right -- 008 (and final) Message-ID: andrew said: > But I must say this one actually made me angry. andrew, if you can't refute the evidence i've offered -- or at least address it in a substantive manner -- then it will be better if you don't say anything at all, because this just shows how flimsy your reaction is. > but has still not (that I am aware of) contributed > anything of measurable substance towards PG. maybe you don't see the value of the many analyses that i have done on the various d.p. experiments, or my constructive criticism and frequent suggestions... but i can assure you the future _will_ see the value -- what with hindsight being 20/20 and all that -- and then it will mock you for your shortsightedness. wouldn't be the first time the outsider was right and the insiders -- all of 'em -- were wrong wrong wrong. -bowerbird ************** Get fantasy football with free live scoring. 
Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/4cb577ed/attachment.htm From richfield at telkomsa.net Thu Jul 31 11:11:28 2008 From: richfield at telkomsa.net (Jon Richfield) Date: Thu, 31 Jul 2008 20:11:28 +0200 Subject: [gutvol-d] !@!ACTA trade agreement brief for July 29-31 Washington Message-ID: <48920050.6070004@telkomsa.net> Sorry Michael, this feedback is probably late and certainly plaintive rather than constructive. But you DID ask! so, fwiw: My son, who has strong views on the subject said, in reply to my passing on your message: ====================== This sort of thing is going on with nauseating regularity, alas. The trouble behind it is that: a) a lot of politicians have accepted without review the claims of the intellectual property brigade b) a lot of companies have discovered that while outright bribery is illegal, politically sensitive gifts work wonders from a policy standpoint. This is a classic protectionist move, really. All the old brands of the first world want protection from knockoffs from elsewhere, and they're trying to get the government to do their work for them. ====================== Not that I think that he should be so uncharitably cynical of course, but a bad upbringing will out. Anyway, the sum is that some people think that in this way they might make more money by depriving the public of things that they had been entitled to; who are we to disillusion them by frustration? Obscurely I am reminded of something Bierce said: "...I must take the liberty to remind him that the law of supply and demand is not imperative; it is not a statute but a phenomenon. He may reply: "It is imperative; the penalty for disobedience is failure. 
If I pay more in salaries and wages than I need to, my competitor will not; and with that advantage he will drive me from the field." If his margin of profit is so small that he must eke it out by coining the sweat of his workwomen into nickels I've nothing to say to him. Let him adopt in peace the motto, "I cheat to eat." I do not know why he should eat, but Nature, who has provided sustenance for the worming sparrow, the sparrowing owl and the owling eagle, approves the needy man of prey and makes a place for him at table." It might strike anyone that I draw a strained comparison between the exploiter of women in employment and exploiters of "intellectual property" that they largely had no hand in creating, and far more largely doom to oblivion simply by keeping them out of circulation, rather than the lesser crime of profiting from printing what is neither logically nor honestly theirs. True no doubt, but the motto "I cheat to eat." springs nimbly to mind. That they should compound parasitism with dog-in-the-mangerism is distasteful rather than astonishing. People who have noted some of the titles that I have provided either for Gutenberg US or AU, might be slightly puzzled at my choices, but they embody a strong trend towards worthwhile books that are little known and out of print. Such legislation tends to the total loss of such books. That loss is a loss to society. It adds nothing to the material gain for parasites who couldn't give a damn either way if it means no money for them one way or the other, so I don't waste my breath on them. Some books are of direct value because of their content, and their loss may be loss as such. Others are losses because they have value in their relevance to the study of ideas in their times and communities. This too is a loss, so pleading that only worthless materials will fail to get published is unworthy. What we have here is the veto of public good in the interests of greed and sloth. 
Maybe what we need is some sort of register of titles to which parties might submit lists of material that they desire to publish on a not-for-profit basis. Then if someone objects because they have both the right and intention to publish it commercially instead, their right prevails. Otherwise we innocents could publish those titles for the benefit of readers who are not in a position to inflate the coffers of that good Mr. Munniglut that Bierce referred to in the work I quoted: "...contentedly smoothing the folds out of the superior slope of his paunch, exuding the peculiar aroma of his oleaginous personality and larding the new roadway with the overflow of a righteousness stimulated to action by relish of his own identity. And ever thereafter the subtle suggestion of a fat philistinism lingers along that path of progress like an assertion of a possessory right." Sorry about that, but I always had a weakness for Bierce's finer efforts. Some of them are already very hard to obtain through normal channels. How many other books are vanishing beyond the mandibles of silverfish as we discuss all this? And some of the material isn't even as melodramatically rhetorical as my diatribe. Cheers, Jon From Bowerbird at aol.com Thu Jul 31 15:57:09 2008 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 31 Jul 2008 18:57:09 EDT Subject: [gutvol-d] !@!ACTA trade agreement brief for July 29-31 Washington Message-ID: jon richfield said: > Not that I think that he should be so uncharitably cynical of course, > but a bad upbringing will out. you funny... :+) -bowerbird ************** Get fantasy football with free live scoring. Sign up for FanHouse Fantasy Football today. (http://www.fanhouse.com/fantasyaffair?ncid=aolspr00050000000020) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20080731/d7364838/attachment.htm From hart at pglaf.org Thu Jul 31 16:47:02 2008 From: hart at pglaf.org (Michael Hart) Date: Thu, 31 Jul 2008 16:47:02 -0700 (PDT) Subject: [gutvol-d] Re: !@!ACTA trade agreement brief for July 29-31 Washington In-Reply-To: <48920050.6070004@telkomsa.net> References: <48920050.6070004@telkomsa.net> Message-ID: On Thu, 31 Jul 2008, Jon Richfield wrote: > Sorry Michael, this feedback is probably late and certainly plaintive > rather than constructive. But you DID ask! so, fwiw: I reply with the below still intact because so much of it is worthy of looking at, and may have been overlooked before. "Intellectual Property Brigade". . .about as good as WIPO, or "World Intellectual Property Organization." You might pass on to your son that WIPO is descended from "The Stationers Guild" and then "The Stationers Company," who drafted the original of copyright law as we know it-- directly in response to The Gutenberg Press, which ruined, from The Stationers POV, the monopoly they had had since time immemorial. . . . I have more to pass on to him, if he wants, and I am hoping he will take a look at my blog as well. BTW, if a file is found there that did NOT pass the spellcheck I am hoping you or he will point it out. I lost it, and would replace it, if I could only remember which one it was. /// As far as Ambrose Bierce goes, I am as great a fan as any I know, but I cannot agree that Mr. Munniglut has a right to mistreat anyone, cheat anyone, defraud anyone, etc. in his effort to pay for the upkeep of his family, put those kids through college, etc., etc., etc. The Munnigluts make it sound so peaceful and proper, when they say they "owe it to their shareholders" to screw the world at large. No sire, I do not agree, no matter how royal you may be. 
It is obvious to anyone who looks that the result of each of the various copyrights and copyright extensions has to be considered MORE as the destruction of a public domain, and LESS the actual increased profit to the booksellers. After all, the ONLY things still selling all that well in the extension periods are the best of the best sellers; a law that removes the public domain from everyone, just for a few percent more profits to those who have already made the greatest profit, seems all too much Reverse Robin Hood And His Merry Men. . .if you take my meaning. This is what happens when you let business be government. It seems to be just the opposite of The Magna Carta, which was Project Gutenberg's eBook #10,000 for a good reason. Well, enough for now, but I hope you will encourage a son and/or other family members to further the conversation. Michael > My son, who has strong views on the subject said, in reply to my passing > on your message: > ====================== > > This sort of thing is going on with nauseating regularity, alas. > > The trouble behind it is that: > a) a lot of politicians have accepted without review the claims of the > intellectual property brigade > b) a lot of companies have discovered that while outright bribery is > illegal, politically sensitive gifts work wonders from a policy > standpoint. > > This is a classic protectionist move, really. All the old brands of > the first world want protection from knockoffs from elsewhere, and > they're trying to get the government to do their work for them. > > > ====================== > > Not that I think that he should be so uncharitably cynical of course, > but a bad upbringing will out. > Anyway, the sum is that some people think that in this way they might > make more money by depriving the public of things that they had been > entitled to; who are we to disillusion them by frustration? 
Obscurely > I am reminded of something Bierce said: > "...I must take the liberty to remind him that the law > of supply and demand is not imperative; it is not a statute but a > phenomenon. He may reply: "It is imperative; the penalty for > disobedience is failure. If I pay more in salaries and wages than I need > to, my competitor will not; and with that advantage he will drive me > from the field." If his margin of profit is so small that he must eke it > out by coining the sweat of his workwomen into nickels I've nothing to > say to him. Let him adopt in peace the motto, "I cheat to eat." I do not > know why he should eat, but Nature, who has provided sustenance for the > worming sparrow, the sparrowing owl and the owling eagle, approves the > needy man of prey and makes a place for him at table." > > It might strike anyone that I draw a strained comparison between the > exploiter of women in employment and exploiters of "intellectual > property" that they largely had no hand in creating, and far more > largely doom to oblivion simply by keeping them out of circulation, > rather than the lesser crime of profiting from printing what is neither > logically nor honestly theirs. > True no doubt, but the motto "I cheat to eat." springs nimbly to mind. > That they should compound parasitism with dog-in-the-mangerism is > distasteful rather than astonishing. People who have noted some of the > titles that I have provided either for Gutenberg US or AU, might be > slightly puzzled at my choices, but they embody a strong trend towards > worthwhile books that are little known and out of print. > Such legislation tends to the total loss of such books. That loss is a > loss to society. It adds nothing to the material gain for parasites who > couldn't give a damn either way if it means no money for them one way or > the other, so I don't waste my breath on them. Some books are of direct > value because of their content, and their loss may be loss as such. 
> Others are losses because they have value in their relevance to the > study of ideas in their times and communities. This too is a loss, so > pleading that only worthless materials will fail to get published is > unworthy. What we have here is the veto of public good in the interests > of greed and sloth. > > Maybe what we need is some sort of register of titles to which parties > might submit lists of material that they desire to publish on a > not-for-profit basis. Then if someone objects because they have both > the right and intention to publish it commercially instead, their right > prevails. Otherwise we innocents could publish those titles for the > benefit of readers who are not in a position to inflate the coffers of > that good Mr. Munniglut that Bierce referred to in the work I quoted: > "...contentedly smoothing the folds out of the superior slope of his > paunch, exuding the peculiar > aroma of his oleaginous personality and larding the new roadway with > the overflow of a righteousness stimulated to action by relish of his > own identity. And ever thereafter the subtle suggestion of a fat > philistinism lingers along that path of progress like an assertion of a > possessory right." > > Sorry about that, but I always had a weakness for Bierce's finer > efforts. Some of them are already very hard to obtain through normal > channels. How many other books are vanishing beyond the mandibles of > silverfish as we discuss all this? > > And some of the material isn't even as melodramatically rhetorical as my > diatribe. 
> > Cheers, > > Jon > > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From lee at novomail.net Thu Jul 31 18:32:28 2008 From: lee at novomail.net (Lee Passey) Date: Thu, 31 Jul 2008 19:32:28 -0600 Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <488FB357.1080307@novomail.net> References: <488FB357.1080307@novomail.net> Message-ID: <489267AC.3060105@novomail.net> Lee Passey wrote: > John Vandenberg wrote: [snip] >> Any existing code around to do something like this ? > > Yes. > > I have created some code to do this, which I would be happy to share > with you, but I'm hoping someone else has done it better. I'm currently > checking out HTML Match (http://www.htmlmatch.com/) which claims that it > is able to "ignore the source code and compare only the text content of > the web pages." If you're interested, I'll report back on what I find. OK, for what I am trying to accomplish, I must report that HTML Match sucks. So far, I still haven't found any FOSS program, other than GNU diff, which I can leverage. So Carlo, have you successfully used wdiff to merge presumably clean text into an XML file? To be honest, I'm thinking that dwdiff, with its ability to set characters which should be word delimiters may be the answer. From traverso at posso.dm.unipi.it Thu Jul 31 22:52:49 2008 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Fri, 1 Aug 2008 07:52:49 +0200 (CEST) Subject: [gutvol-d] getting my wikisource bearings In-Reply-To: <489267AC.3060105@novomail.net> (message from Lee Passey on Thu, 31 Jul 2008 19:32:28 -0600) References: <488FB357.1080307@novomail.net> <489267AC.3060105@novomail.net> Message-ID: <20080801055249.75FCB93B61@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> So Carlo, have you successfully used wdiff to merge Lee> presumably clean text into an XML file? 
To be honest, I'm Lee> thinking that dwdiff, with its ability to set characters Lee> which should be word delimiters, may be the answer. I haven't merged text into XML recently, but in these cases I filter out the markup and compare the resulting text. I know how to modify a tool that I wrote a few years ago to find and merge differences at the character level to allow merging text corrections back into marked-up text, but I have never finished it. Carlo From bzg at altern.org Thu Jul 31 20:10:59 2008 From: bzg at altern.org (Bastien Guerry) Date: Fri, 01 Aug 2008 05:10:59 +0200 Subject: [gutvol-d] !@! Re: woman in her own right -- 008 (and final) In-Reply-To: (Michael Hart's message of "Thu, 31 Jul 2008 10:50:14 -0700 (PDT)") References: <4891EF2C.7020309@xs4all.nl> Message-ID: Michael Hart writes: > So, once again, I simply point out that if you don't want a > contact with bowerbird. . .which you all SAY. . .all you do > is start your own listserver and don't let him in, or use a > heavy hand on "moderation" if you do let him in. > > These solutions are simple. I'm not in favor of moderation. But it's not that easy to build another list. If I build another list, I want people to know about it, and I will surely send an email here, because I believe the gutvol-d list has attracted many interesting people. How then can I be sure that the one making so much noise on this list will not join the new list under another name? Ignoring noise is always possible, but it requires a lot of energy. I think people would prefer to spend this energy on discussing things in a more constructive way. Anyway. -- Bastien
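
[Editor's note: the approach Carlo describes above, filtering out the markup and then comparing the remaining text word by word, can be sketched roughly as below. This is only an illustration, not anyone's actual tool: it substitutes Python's standard difflib for wdiff/dwdiff, the tag-stripping regex is deliberately naive, and the sample strings are invented. It does not attempt the harder step Carlo says he never finished, merging the corrections back into the marked-up file.]

```python
# Hypothetical sketch of the "filter markup, then word-diff" approach.
# difflib stands in for wdiff/dwdiff; the regex is a crude tag stripper
# and would need refinement for real SGML/XML (comments, CDATA, etc.).
import difflib
import re

def strip_markup(text):
    """Replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def word_diff(marked_up, clean):
    """Yield (op, old_words, new_words) for each word-level difference
    between the markup-stripped text and the clean proofed text."""
    a = strip_markup(marked_up).split()
    b = strip_markup(clean).split()
    matcher = difflib.SequenceMatcher(None, a, b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield op, a[i1:i2], b[j1:j2]

# Invented example: locate where the proofed text differs from the XML.
xml_text = "<p>It seem to be the <i>opposite</i> of the Magna Carta.</p>"
clean_text = "It seems to be the opposite of the Magna Carta."
for op, old, new in word_diff(xml_text, clean_text):
    print(op, old, "->", new)   # replace ['seem'] -> ['seems']
```

Applying the reported substitutions back inside the tags is where the real difficulty lies, which is presumably why dwdiff's configurable word delimiters looked attractive to Lee.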