From richfield at telkomsa.net Sun Jul 1 03:29:22 2007 From: richfield at telkomsa.net (Jon Richfield) Date: Sun, 01 Jul 2007 12:29:22 +0200 Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. Message-ID: <46878202.4080201@telkomsa.net> Re: Sorry folks, but I seem to have missed the obvious. Thanks to all who replied. If I don't sound overwhelmed, that is purely because I am by now inured to your standards of helpfulness, and therefore am not surprised at your patience. In order of convenience: First, BB. > you should e-mail a whitewasher. i'll backchannel you an e-address.< Thanks, but I'll take a rain check on that one. My problem is not so much that I want to stir up the antnest, but that I was not sure that I had got my prepared material through to the ants at all, and if not, why not. Next Josh. >1st - Did you get a copyright clearance on the books before you started?< Well, sorta-kinda. As I understand it, I do not need to get clearance for items that have already been cleared. (e.g. Bindle, fragments of science, and Just William). Secondly, it seems nutty to get clearance on e.g. Practical taxidermy, which seems to date from the 1880s, though its TP&V show no date (I had to deduce it from the text. Big novelty, hm?) Thirdly, I included the TP&Vs for the others in the ZIP files in which I sent them. Also, when I despatched them, using the web page allocated for the purpose (can't remember the details, but it was all very proper, with my nice new password etc) it only complained about one of the books I tried to send, and the way it did that was by insisting on getting the clearance code before letting me send it. (I think that was the entomological dictionary, or possibly Practical taxidermy.) So I decided (which I have just discovered by finger trouble, to be nearly the same spelling as decoded, which would have confused the issue) to call it a night and wait to see what happened to the others. So far nothing. Hence my screaming for the better business bureau. >2nd - Word97 isn't a format we support. If it is a simple text, it is fairly simple to convert it to a standard text file, but you may want to do that in the future yourself so that you can make sure it "looks right" in its final form. Especially if you're going to be doing a lot of books (which it sounds like you are), you'll want to do that (as well as use tools like GutCheck) so that your text is as close to "finished" as possible when you upload it.< Yesss... well, it isn't so simple. (It never is, is it? (Had to slip that one in before you said it!)) Bindle and William and a lot of the vanilla fiction and philosophy do very well in TXT form (except that the TOC doesn't add much value, but that does not matter much in machine readable form, given that most reading software permits a search function of some form.) Unfortunately, much mathematical and other scientific material is simply incomprehensible in TXT format. Pictures are a trifle itchy too. I don't mind omitting, say, the illustrations to the William books or "Child of the Deep", though a purist might object, and other purists (including myself, nearly) can hardly imagine Carroll without Tenniel, but books on science, such as Lubbock's "Senses of Insects" (also 19th century, amazingly) are almost useless without their illustrations, but invaluable with them. And, please note, some of these are truly great, nowadays badly neglected, books.
Now, all that is obvious to most of us, but less obvious is a book such as Fowler's "The King's English". My copy is in good condition (actually, it is my wife's, which partly explains that) and the scanner loved it, so I thought that preparing the final text from the scanned material would be a doddle. That despairing moan that you were wondering about a few weeks ago came from this end of the planet. It was the hardest book I have worked on yet, and that is saying something (though, heaven help me, I am casting wistful eyes at some that bid fair to be worse!) Firstly, it is the first book where I really do need the TOC and the index, which mean that the pagination matters. In TXT files that is a nuisance at best, though it is not a show-stopper. However, the Fowlers' text formatting is fairly parsimonious, but highly significant in semantic terms, which means that any re-formatting would be prohibitive. When those blighters used italics, they meant it! The text would be nearly useless and completely maddening if the italics were not visible. Checking on that took me weeks, for something hardly larger than a booklet! This is one book that I did not even *bother* to convert to TXT format, even though it contained neither formulae nor illustrations, just a very little Greek, which I entered manually, and could have managed in a TXT file. >3rd - Where did you send the file? There are specific steps and locations to go to upload a new etext, but it's possible you got turned around and sent it somewhere that rarely (or even never) gets checked by anyone that could help things along.< Well, I cannot remember the details, but I got myself an ID and a password through the PG channels, and submitted the files that it would accept, as I described above. Years ago I simply emailed stuff directly to MH, but I see that things have changed since then. >Normally, a "finished" etext usually gets posted within a couple days of uploading it, so it sounds like there is something else going on here. Finding out what exactly you've done so far ought to help us track down where the pipe got clogged!< Right, hence my coming forward with cries of peccavi! :-S >PS If the final cleaning steps to get it ready are more work than you want to do, you may want to see about just scanning the books, then running them through Distributed Proofreaders (www.pgdp.net). They've go lots of folks willing to help out at all stages.< Thanks to you and them, and no doubt I shall make some use of them in future, but for some of the books I have been working on, I think it is unnecessary, while for others I prefer to do the whole thing. Simply scanning the visual material (say to the stage of the .OPD files) is much, much easier, but I get the impression that I would simply be adding more to a mountain of undone work. Conversely, since I have the source material I do not have to go to the lengths of precision of scanning that Jon Noring proposes, so there may be some sense to my taking work to at least near-completion. I am not yet certain how far to take all this, but none of it is as important as getting to a point of successful submission, and knowing when I have succeeded. Sankar wrote: >I do not see any of your books being uploaded or posted.< Thanks. That I needed to know. >You may remember that I had advised you the steps for uploading a book. Later on Joe and myself had advised you about the clearance line. < Yes, but I hope what I wrote above makes it clear (or a bit clearer anyway) where I have erred. 
>Service to Humanity is Service to God< That works for me. Or at any rate, it will when I can get it to work! :-) OK folks, thanks for your trouble so far. Suggestions and corrections welcome. For one thing, given that I have this problem with texts that are not adequately served by TXT files, and that there is an understandable distaste for Word, what can I do? I seem to remember that there is some dissatisfaction with the format in which Word produces HTM documents, though *for the most part* it looks reasonable to me. In short, can anyone prescribe the best way to get from Word to civilisation without investing in a lot of extra software? Cheers Jon From prosfilaes at gmail.com Sun Jul 1 06:46:00 2007 From: prosfilaes at gmail.com (David Starner) Date: Sun, 1 Jul 2007 08:46:00 -0500 Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. In-Reply-To: <46878202.4080201@telkomsa.net> References: <46878202.4080201@telkomsa.net> Message-ID: <6d99d1fd0707010646ve5a1831kcd73b9745d066e59@mail.gmail.com> On 7/1/07, Jon Richfield wrote: > Well, sorta-kinda. As I understand it, I do not need to get clearance > for items that have already been cleared. (e.g. Bindle, fragments of > science, and Just William). Secondly, it seems nutty to get clearance > on e.g. Practical taxidermy, which seems to date from the 1880s, though > its TP&V show no date (I had to deduce it from the text. Big novelty, > hm?) Thirdly, I included the TP&Vs for the others in the ZIP files in > which I sent them. No, you need to get clearance on everything. Clearance does two things: first, it's the way we coördinate who's working on what. Secondly, it's the way that PG verifies and can prove that the _editions_ (not just the books) are in the public domain; if someone claims that the Practical Taxidermy that you worked from was printed in 1925 and they own the copyright, we need more than just "seems to date from". From joshua at hutchinson.net Sun Jul 1 07:00:42 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Sun, 1 Jul 2007 14:00:42 +0000 (UTC) Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. Message-ID: <15708130.1183298442924.JavaMail.?@fh1037.dia.cp.net> As David pointed out in another reply, the clearances are absolutely essential. We have to have them to cover our butts, if for no other reason. And much of the stuff you talk about CAN be done in a txt file using some standard conventions. i.e., italics are set off with _underscores_ around the word. Chemical and mathematical formulae can be done with underscores and carets (H_{2}O or a^2 + b^2 = c^2). Illustrations can have placeholders in the text like this: [Illustration: This is the caption below one illustration] And if you do an HTML version (highly recommended) then you can put the images directly inline in the text. Avoid Word at every stage. It'll just cause you grief in the long run, because it doesn't really do ANYTHING the way it needs to be done for a PG text. Generally, you'll need to read up in the FAQ for formatting information, which is where I'm guessing you have the most work to do. Good luck, Josh
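To make the conventions Josh describes concrete, a made-up fragment (not taken from any actual PG text, and not an official template) might look like this in a plain-text file:

```text
This sentence shows a word in _italics_, a chemical formula such
as H_{2}O, and a mathematical expression such as a^2 + b^2 = c^2.

[Illustration: Caption of the figure that appeared here in the
printed book]
```

The plain-text version stays readable as it is, and an HTML version can later turn the underscores back into real italics and drop the image in where the placeholder sits.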
From Bowerbird at aol.com Sun Jul 1 14:06:29 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 1 Jul 2007 17:06:29 EDT Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. Message-ID: jon richfield said: > Thanks, but I'll take a rain check on that one. > My problem is not so much that I want to stir up the antnest, > but that I was not sure that I had got my prepared material > through to the ants at all, and if not, why not. i wasn't suggesting you should "stir up the antnest", merely telling you that they can answer your questions about whether your material got through, and we can't. :+) -bowerbird From brynnahlld at yahoo.com Sun Jul 1 15:06:45 2007 From: brynnahlld at yahoo.com (Elisa) Date: Sun, 1 Jul 2007 15:06:45 -0700 (PDT) Subject: [gutvol-d] Sorry folks, but I seem to have missed the obvious. In-Reply-To: Message-ID: <795752.17285.qm@web52603.mail.re2.yahoo.com> >As I understand it, I do not need to get clearance >for items that have already been cleared. (e.g. Bindle, fragments of >science, and Just William). Quite the contrary. When something's been cleared, that means that someone else has expressed the intention of working on it, so you need to make sure that effort isn't being duplicated. (Just William, for example, is in post-processing at Distributed Proofreaders, so most of the work on it has already been done.) However, quite often people will get a clearance, and then abandon the project, so it won't hurt to ask for clearance, just to make sure the other person still has an interest in the project. >This is one book that I did not even *bother* to >convert to TXT format My understanding was that all PG books *must* have a TXT format if that's at all possible. There are standard conventions for _italics_ and =bold=. I'd recommend hiding Word97 and checking out a basic freeware text editor, using none of the 'extras' like italics and font changes. EditPadLite works well for me for programming files, so I know it isn't adding unseen cruft to the files. From gbnewby at pglaf.org Sun Jul 1 20:57:22 2007 From: gbnewby at pglaf.org (Greg Newby) Date: Sun, 1 Jul 2007 20:57:22 -0700 Subject: [gutvol-d] Fwd: Educated Earth Website / Donation to PG (fwd) Message-ID: <20070702035722.GA16513@mail.pglaf.org> Has anyone seen the www.educatedearth.net site in action?
(We sent Ben info about sending us money) >[ben at educatedearth.net - Sun Jul 01 13:03:31 2007]: > >Hio. My name is Ben Lovatt, I'm the owner of a humanitarian >science/technology website called EducatedEarth ( >http://www.EducatedEarth.net ). We raise money in donations (in addition >to 10% of our profits) and give them to a different organization every >month. To decide which organization should receive the money, we have our >members give us suggestions on companies and we let viewers of our site >vote on where to donate it. Project Gutenberg has been nominated and is in >this month's poll. You're welcome to encourage your staff and website >visitors to vote for you. > >If your organization was to win, how would I send this money to you? > >Thanks, >Ben Lovatt From Bowerbird at aol.com Mon Jul 2 10:53:28 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 13:53:28 EDT Subject: [gutvol-d] z.m.l. examples reloaded Message-ID: i have had some z.m.l. example books up in various locations, at various times, but i've reloaded them at my z-m-l.com site... these samples are solid proof-of-concept of z.m.l. usefulness. (i relate _why_ i've adopted each of these books as an example; do note that none of the examples were cherry-picked by me.) they demonstrate z.m.l. at the stage of "continuous proofing", where the text for every page is displayed alongside its scan, so the general public can check and report possible errors... here's "books and culture", from hamilton wright mabie: > http://www.z-m-l.com/go/mabie/mabiep123.html this was google's first revealed public-domain book. here's "the secret garden", by frances hodgson burnett: > http://www.z-m-l.com/go/sgfhb/sgfhbp123.html this was a book from the p.g. library that was _redone_. here's "my antonia", written by willa cather: > http://www.z-m-l.com/go/myant/myantp123.html this was a book that jon noring used as his example. here's "a hacker manifesto", from mckenzie wark: > http://www.z-m-l.com/go/ahmmw/ahmmwp012.html this was a book that the if:book people recommended. here's "the open library", by brewster kahle: > http://www.z-m-l.com/go/tolbk/tolbkp012.html this book details the philosophy of the open content alliance. *** and, for another manifestion of z.m.l. in action, see this page: > http://www.z-m-l.com/go/vl3.pl this demo shows how the "no-markup" z.m.l. "master" files can be converted on-the-fly in real-time to .html versions. you can examine the z.m.l. file by clicking each link, and then generate its .html version by clicking each button... books included in this demo are: > a test-suite for project gutenberg e-texts > a document listing the 11 rules of zen markup language > a presentation given by cory doctorow at microsoft > "a christmas carol", by charles dickens > "lady clare", by alfred tennyson > "the lady of shalott", by alfred tennyson > "the mysteries of the caverns", by roger finlay > "fort amity", by arthur thomas quiller-couch > "the master-knot of human fate", by ellis meredith > "the tragedy of pudd'nhead wilson", by mark twain *** more samples will follow soon... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/393c6552/attachment.htm From Bowerbird at aol.com Mon Jul 2 11:14:57 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 14:14:57 EDT Subject: [gutvol-d] i decided to think about the future, not the past Message-ID: there was too much chatter for me to proceed on friday, especially with the iphone changing the world that day by putting the internet in your pocket, so i decided think about the _future_, not the past... but i will continue my series on d.p. efficiency with a post on item #3, preprocessing, later today. meanwhile, i had already written up this advance preview on item #4, coming next tuesday, for you to ponder. here's a revealing factoid... for literally _years_, until just _this_ spring, distributed proofreaders did not offer to its proofers -- who were correcting the o.c.r. text -- the basic spellcheck functionality of adding a word to the dictionary. you know, when you're doing your spellcheck, it tells you "word not found" and gives you one or more suggestions, as well as some other options, typified by this screenshot: > http://www.z-m-l.com/go/scinterface.png (this is the word-not-found dialog from the mac's textedit.) in this screenshot, it's telling me it can't find "bowerbird" in its english dictionary, and it makes one suggestion (bower-bird), and allows me to choose whether to (a) ignore it, or (b) guess (presumably generate a suggestion, i would guess, just a guess), or (c) find next, or (d) correct it as it was edited in the textbox... _plus_ at bottom right, two more buttons -- "forget" and "learn". the ability to "forget" a word easily via a mere button-click is nice -- you don't get that option up-front in very many spellcheckers -- but let's focus on what our focus has been on: adding a new word. the d.p. spellchecker just didn't have that "add" option on it. really! it couldn't "learn" a new word. not one. let alone forget it after that, it couldn't even learn it in the first place. pretty sad, don't you think? well, turns out this failure to be able to learn had some consequences. indeed, some rather serious and ugly consequences. for instance, it meant that the _names_ of a book's characters -- which will come up very frequently over the course of the story -- could not be added to the dictionary to remove their flags globally. so, for the example of "my antonia", the main character's name of "antonia" was flagged each and every time it appeared in the book. and not just that name, but a bunch of _other_ names in the book. ouch. that's a whole lotta flags you'd have to ignore. flag overload. makes it hard to pick out the real problems amidst all of the fakes... and hey, maybe you don't think that's too bad, because it's easy to see at a quick glance whether "antonia" was recognized correctly... but what about the names i've appended?, all from e-text #13603, "the hawaiian romance of laieikawai", chock full of hawaiian names... those are all the names you have to check, some with _very_many_ occurrences; after a while your head be swimming like it's in maui, with paris hilton recuperating from her recent trip to the county jail. how'd you like to proof those papayas? and once you had checked one of these nightmare names, you'd want to add it to the dictionary so -- at least if it came up _exactly_ the same -- it would _not_ be flagged again, so you'd know you didn't have to do the plow-through. i know _i_ would sure be appreciative... 
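a minimal sketch of what that "learn" button amounts to -- a per-book word list consulted alongside the base dictionary, so a name like "antonia", once added, stops being flagged for the rest of the project. (the file names and word lists here are invented for illustration.)

```python
import re

def load_words(path):
    # one word per line; a set gives fast lookup
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def flag_unknown(page_text, base_dict, project_dict):
    # anything not in either list gets flagged for the proofer
    words = re.findall(r"[A-Za-z']+", page_text)
    return [w for w in words
            if w.lower() not in base_dict and w not in project_dict]

def learn(word, project_dict, path="project_dict.txt"):
    # the "add to dictionary" button: remember the word for this book
    project_dict.add(word)
    with open(path, "a", encoding="utf-8") as f:
        f.write(word + "\n")

# hypothetical usage
base = load_words("english_words.txt")   # the stock dictionary
project = set()                          # the book's own additions
learn("Antonia", project)
print(flag_unknown("Antonia went to town with Lena", base, project))
```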
and i bet there were probably also all kinds of other hawaiian words that showed up with quite high frequencies in that book, and thus caused proofers unnecessary pain because they couldn't be added to the book's dictionary so that they wouldn't continue to be flagged. yet for _years_ d.p. proofers went without this _core_ functionality... it was a _bad_ situation... and it was allowed to drag on for _years_... to me, that simple fact communicates a _world_ of disrespect for the proofers who're volunteering their time and energy to help the cause. a spellchecker is probably _the_ most valuable tool there is to discover scannos made in the o.c.r. process, yet d.p. provided its proofers with a substandard version of the tool, one that was clearly inferior. shame. they kept saying it was "a shortage of programmers", but meanwhile they seemed to have enough "programming help" to make all kinds of modifications to their system, thoroughly up-ending the whole thing, going from two rounds to four, and then to five, separating proofing from formatting, etc. but yet, they couldn't fix a major broken tool... there's something wrong in the priorities there. and _badly_ wrong... oh, they've fixed this terrible shortcoming, finally, to get a little credit, but they put in its place an unwieldy version of the tool, such that only the project manager can "add" a word to the dictionary for each book; the proofers themselves -- who have to bear the brunt of false flags -- evidently cannot be "trusted" with such a decision. it's really very sad... i'll have more to say on this general topic. but keep these aloha names in your aloha mind as an aloha vivid exemplar of my overall aloha point... you say goodbye, and i say hello... -bowerbird > Achatinella Ahewahewa Aholenuimakiukai Aikanaka Aiohikupua Aiwohikupua Akahiakuleana Akanikolea Akikeehiale Alelekinakina Alihikaua Aukelanuiaiku Aukelenuiaiku Aukuuikalani Hakalanileo Halaaniani Halauoloolo Haleakala Halealii Halehuki Halemano Haleolono Halepaahao Halepaki Haloalena Haluluikekihiokamalama Hamakualoa Hanaaumoe Hanamaulu Hanapepe Hanualele Hauailiki Hauikalani Hawaiiakea Heakekoa Hekilikaakaa Hikapoloa Hilopaliku Himatione Hinaaikamalama Hinaaimalama Hinaakeahi Hinaikainalama Hinaikamalama Hinakahua Hinaluaikoa Hinaluaimoa Hinapaleaoana Hinauluohia Hinawaikolii Hiwahiwa Hoamakeikekula Hokiolele Hokuhookelewaa Holaniku Holoholoku Holualoa Honehone Honokaape Honokalahi Honokalani Honolahau Honolohau Honopuuwaiakua Honopuwai Honopuwaiakua Honouliuli Honuaula Hoohokukalani Hookaakaaikapakaakaua Hookamumu Hookeleiholo Hookeleipuna Hooleipalaoa Hoolilimanu Hoomakaukau Hualalai Huawaiakaula Huliamahi Hulihonua Hulumaniani Kaahualii Kaawaloa Kaawikiwiki Kaehaikiaholeha Kaelehuluhulu Kaeloikamalala Kaeloikamalama Kahakaauhae Kahakaekaea Kahakuikamoana Kahalaoaka Kahalaokolepuupuu Kahalaomapuana Kahalaopuna Kahalapmapuana Kahaookamoku Kahapaloa Kahaumana Kahauokapaka Kahawalea Kaheawai Kahekili Kahihikolo Kahikihonuakele Kahikikolo Kahikiku Kahikimoe Kahikinui Kahikiula Kahioamano Kahoiwai Kahoolawe Kahoupokane Kaialeale Kaihalulu Kaihuopalaai Kaiimamao Kaikamahine Kaikilani Kaikipaananea Kaikuahine Kaikunane Kailiokalauokekoa Kaipalaoa Kaipolohua Kaiwakaapu Kaiwilahilahi Kaiwiopele Kakaalaneo Kakahaekaea Kakakauhanui Kakalukaluokewa Kakuhihewa Kalaehina Kalaekini Kalaeloa Kalaepuni Kalahumoku Kalakaua Kalamaula Kalanialiiloa Kalaniamanuia Kalanikilo Kalanilonoakea Kalanimanuia Kalaniopuu Kalapana Kalapanakuioiomoa Kalaumeki Kalaupapa Kaleikini Kalelealuaka 
Kalenaihaleauau Kalewalo Kalokuna Kalonaikahailaau Kalopulepule Kaluapalena Kaluawilinae Kamaainau Kamaakamikioi Kamaakauluohia Kamahaina Kamahualele Kamaikaakui Kamakaaulani Kamakaiwa Kamakulua Kamalalawalu Kamalama Kamanuwai Kamapuaa Kamehamaha Kamehameha Kamelekapu Kamoamoa Kamohoalii Kamooinanea Kamooloa Kanaloakuaana Kaneapua Kaneaukai Kanehunamoku Kaneikamikioi Kanenaiau Kanepohihi Kaneulohia Kaneulupo Kanewahineikiaoha Kanikaea Kanikaniaula Kanikapiha Kanikawi Kanoakapa Kaohukolokaialea Kaoleioku Kaonohiokala Kapaahulani Kapahaelihonua Kapahielihonua Kapaihiahilani Kapakohana Kapalilua Kapapaapuhi Kapapaiakea Kapepeekauila Kapuaokaohelo Kapuaokaoheloai Kapuheeuanui Kapuhiokalaekini Kapukaihaoa Kapunohu Kapunokaoheloai Karolineninsel Kauaiapuni Kauakahialii Kauakuahine Kaukaalii Kaukaukamunolea Kaukihikamalama Kaulaailehua Kaulanaikipokii Kaulanapokii Kaululaau Kaumaielieli Kaumailiula Kaumakapili Kaumalumalu Kaunakahakai Kaunakakai Kaunalewa Kauwilanuimakehaikalani Kawahineokaliula Kawaihae Kawaipapa Kawalakii Kawaluna Kawaomaaukele Kawaunuiaola Kaweleau Keahumoa Keakahulilani Keakamilo Kealakaha Kealakekua Kealohikikaupea Kealohilani Keanapou Keaomelemele Keauleinakahi Keaulumoku Keawanui Keaweikekahialii Keawenuiaumi Keinohoomanawanui Kekaihawewe Kekalukaluokewa Kekalukaluokewaii Kekuhaupio Keleanuinohoonaapiapi Keliimalolo Keliiokaloa Keliiomakahanaloa Kenaloakuaana Kenntniss Keoneoio Kepakailiula Kepapaialeka Kihanuilulumoku Kihapiilani Kihawahine Kiimaluhaku Kikekaala Kilioopu Kilohana Kilokilo Kipahulu Kipapalaula Kipapalaulu Kipunuiaiakamau Koeniglichen Kohalalele Kohalaomapuana Koholalele Konikonia Kookoolau Koolauloa Koolaupoko Kosmogonie Kotzebue Kuaihelani Kuamooakane Kuapakaa Kuauamoa Kuhukulua Kuhuluhulumanu Kuikauweke Kuililoloa Kuilioloa Kukailani Kukamaulunuiakea Kukaniloko Kukaohialaka Kukeapua Kukuikiikii Kukuipahu Kukululaumania Kulanihakoi Kulukulua Kumukahi Kumukena Kumuniaiake Kumun uiaiake Kupaahulani Kupololiilialiimualoipo Kupololiilialiimuaoloipo Kupukupukehaiaiku Kupukupukehaikalani Kupuupuu Kuwahailo Laamaikahiki Laamaomao Lahainaluna Laieikawai Laielohelohe Lalakeenuiakane Lamaloloa Lanalananuiaimakua Laniihikapu Lanikahuliomealani Lanikuakaa Lanioaka Lanipipili Lapakahoe Laukapalala Laukapalili Laukiamanuikahiki Laukieleula Laupahoehoe Lepeamoa Liliuokalani Liluokalani Lolomauna Longapoa Lonoapii Lonoikamakahike Lonoikamakahiki Lonoikiaweawealoha Lonoikoualii Lonokaeho Lonopili Luahinekaikapu Luakalai Lulukaina Lupewale Maakuakeke Macculloch Mahealani Maheleana Mahinanuikonane Mahukona Maiauhaalenalenaupena Mailehaiwale Mailekaluhea Mailelaulii Mailepakaha Makahanaloa Makaukiu Makaulanei Makaweli Makeweli Makiioeoe Makuakane Malaekahana Malaiakalani Malamanui Malanaihaehae Malanaikuaheahea Malekahana Malelewaa Manaiakalani Maniniholokuaua Mantandua Marianen Marquesan Marquesas Maunakalika Maunakea Maunalahilahi Maunalei Maunaloa Maunauna Meeresweiten Melanesian Melanesien Melbourne Micronesian Mikronesien Moahelehaku Moanaikaiaiwe Moanaliha Moanalihaikawaokele Moananuiakea Moananuikalehua Moaulanuiakea Moerenhout Mokuekelekahiki Mokuhano Mokukeleikahiki Mokukelekahiki Mokuleia Molokini Moloklni Monographie Monowaikeoo Nakinowailua Nakolowailani Namakaokahai Namakaokalani Namoeluk Nanakuli Nathaniel Naulukohelewalewa Nihoalaki Niuhelewai Nuumealani Oioiapaiho Okipoepoe Olekulua Omaokamao Omaokamau Oneoneihonua Opelemoemoe Opukahonua Pahulumoa Pakaalana Palalahuakii Paleaikaahalanalana Palikaulu Paupauwela Peleioholani Pelekunu Petroglyphs Pihanakalani 
Piihonua Piimaiwaa Piimaiwae Pikoiakaala Pikoiakaalala Pioholowai Piokeanuenue Pleiades Pohakuokauai Poliahu Polomauna Pomaikai Puaakukui Puaamaumau Puaatihaloa Pueonuiokona Puniaiki Pupuakea Pupualenalena Pupuhuluena Puuanahulu Puukohala Puukohola Puumahawalea Puumaneo Puuoaoaka Puuonale Puuopapai Puupuukaamai Uhumakaikai Ukumehame Uweuwelekehau Waawaaikinaanao Waawaaikinaaupo Wahilani Waiahole Waiahulu Waialala Waiapuka Waihalau Waiohonu Waiolama Waiopuka Waiulaula Walewale Wawaikalani ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/dd394c62/attachment-0001.htm From Bowerbird at aol.com Mon Jul 2 15:09:19 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 18:09:19 EDT Subject: [gutvol-d] LoCC and Subject fields Message-ID: i said: > it's easy to "be found" in cyberspace if you play your cards right... > > play 'em wrong -- by using some library of congress stuff which > none of your end-users is hooked into -- and you'll be invisible. > > (this is _not_ to say that that stuff couldn't be of _some_ use, but > you'd have to make sure the cost-benefit ratio justified the work. > if you really want to pursue _that_ angle, then find a way to lower > the costs -- and the best suggestion i have for you there is to dig > into the amazon a.p.i. and find out if you can scrape info there -- > and to raise the benefits -- where the best suggestion i have for > _that_ is to get project gutenberg's e-texts pointed to by amazon, > and if you manage that, _then_ you will have accomplished much.) since jason's desired target was _libraries_, i should've said "worldcat" instead of "amazon", but the basic idea is exactly the same, of course. i'm kinda surprised that no one else responded to jason. is it true that you've all just given up on the p.g. catalog as being helpless? -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/b00e99a0/attachment.htm From shabam.dp at gmail.com Mon Jul 2 15:39:25 2007 From: shabam.dp at gmail.com (shabam) Date: Mon, 2 Jul 2007 15:39:25 -0700 Subject: [gutvol-d] Fwd: Educated Earth Website / Donation to PG (fwd) In-Reply-To: <20070702035722.GA16513@mail.pglaf.org> References: <20070702035722.GA16513@mail.pglaf.org> Message-ID: <1ac896090707021539j4578a778n70d3fab635bb989c@mail.gmail.com> Greg, Well, with 74 of 111 votes going to PG, I think PG is in. Be sure to let us know how much PG "wins". I'd never heard of them before, but this is a good way for them to get their name out there. They do not guarantee any amount to PG, so it could be $5 that the raise, but they get the organizations to tell their "Staff and visitors" to visit the site and vote for them. Interesting content. Looks like they are one of the many sites that look for interesting content to post on their site, maybe having some original content on occasion. They do need to update their catalog though. A lot of the UTube stuff came back saying "This video is no longer available" Ah well. If they get PG $5, then that is $5 PG would not have had otherwise. Overall, I'm not too impressed with the site, as so many of the videos I tried to watch were no longer available. They need some way to clean up these. 
Jason On 7/1/07, Greg Newby wrote: > > Has anyone seen the www.educatedearth.net site in action? > (We sent Ben info about sending us money) > > >[ben at educatedearth.net - Sun Jul 01 13:03:31 2007]: > > > >Hio. My name is Ben Lovatt, I'm the owner of a humanitarian > >science/technology website called EducatedEarth ( > >http://www.EducatedEarth.net ). We raise money in donations (in addition > >to 10% of our profits) and give them to a different organization every > >month. To decide which organization should receive the money, we have our > >members give us suggestions on companies and we let viewers of our site > >vote on where to donate it. Project Gutenberg has been nominated and is > in > >this month's poll. You're welcome to encourage your staff and website > >visitors to vote for you. > > > >If your organization was to win, how would I send this money to you? > > > >Thanks, > >Ben Lovatt > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -- Person to person lending. Lend money to others, and get a $25 bonus. http://www.prosper.com/join/shabam -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/e9a62d39/attachment.htm From Bowerbird at aol.com Mon Jul 2 17:09:39 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 2 Jul 2007 20:09:39 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 Message-ID: ok, to review, we've discussed the first two steps of digitization so far, namely (1) the scans (and the importance of doing them well), and (2) the o.c.r. (and the importance of using abby v8, the best program). there are many steps in the digitization process, and it's important to make sure that _none_ of them will become a weak link in the chain... after the o.c.r. is finished, step #3 in the recipe is the _o.c.r._cleanup._ in the lingo of distributed proofreaders, this is called "preprocessing"; (it's so-named because it's done before the text goes to the proofers, contrasted with "postprocessing", which is the d.p. step that happens after the text has gone through the proofing and formatting rounds.) it is in the preprocessing -- or, more accurately, the _absence_ of it, almost completely -- where d.p. reaches its full nadir of inefficiency. it doesn't take much intelligence to intuit that if you can correct errors in one fell swoop, that'll be more efficient that fixing 'em one at a time. and it is the preprocessing stage where you're fixing errors _globally_. here are some of the chores that i routinely do in "preprocessing": 1. do basic integrity checks, to find and fix missing or doubled pages 2. fix section headers, usually the text in and around chapter titles 3. fix frontmatter pages, which typically return inferior o.c.r. results 4. fix runheads and pagenumbers, for best navigational grounding 5. obtain and evaluate the "vocabulary" of the book -- the words in it these chores are ones that must be done _eventually_, and it is simply _more_efficient_ to do them sooner than later, so i do 'em right away... in other words, having these things be _correct_ from the very outset returns many beneficial aspects in your further handling of the text... i'm not gonna belabor the point, because i've already had to discuss it (both here and on the d.p. 
forums) many times more than it deserves -- it's common-sense that two numbering systems will be confusing -- but one of the most _maddening_ things that d.p. content providers _routinely_ fail to do is to name image-files using their pagenumber. that is, page 37 might be located in the file named "049.png", and then of course page 49 will be found in the file named "061.png"... this is ridiculous. it's best to name the file for page 37 as "037.png", so it's absolutely clear just by looking at its name what its content is. (it's also good to prepend the number with a string to make it unique -- every scan-set over at d.p. starts with "001.png" and goes up, so there's nothing to make the names unique. but forget that for now.) every time someone who wants to go to page 49 enters their "49", only to find out that they've ended up with the scan for page 37, so they must do the subtraction routine in their head to figure out that the offset is _12_, and thus for page 49 they must enter "61", that's some of their time and energy that was wasted unnecessarily. maybe you say it's not much, but when you multiply it by _multiple_ instances on every day of every week of every month of every year... so one of the first things i do with a d.p. scan-set is rename the files. when i want to see page 37, i want to enter "37" and be done with it. to do anything differently than that is to ignore a _basic_ efficiency... (and of course this helps you to quickly discover any missing pages, which -- as any experienced person knows -- are always a hassle.) also related to the vital matter of easy navigation around the text is the section-headers, which is why i clean those up right away... i like to be able to ground myself with an accurate table of contents. i also like the power of a "chapter menu" generated _automatically_, as well as the ability to "skim" the chapter headings, forward or back. for all of these things, consistent formatting of headers is required... i also find that the runheads and pagenumbers are an _essential_ element in providing "grounding" while i am working on the text... (which, thinking about it, is the very utility they provide to readers.) ironically, d.p. often strips away these runheads and pagenumbers. (even more ironically, they're considering a "meta-data" round where they will have volunteers _re-enter_ the pagenumbers they stripped! how's that for unbelievable stupidity?) as for the frontmatter pages, i often find that i have to shuffle some of them around, both for esthetics and to help in the image-naming (i.e., it's fine to delete blank pages to make numbers come out right.) *** but far and away the most important of the 5 things mentioned above is generating the "vocabulary" of the words that are used in the book. this is a straightforward task, and it's easy to program a tool to do it: 1. read in the o.c.r. text. 2. change spaces and tabs to line-ends, so every word is on its own line. 3. sort the lines. (use an ascii sort, so initial-cap words sort to the top.) 4. spellcheck the words, sorting them into piles that "pass" and "fail". 5. examine the initial-cap words, most of which will be names, and 6. move high-frequency ones and "looks right" ones to the "pass" pile. 7. examine the initial-lower words, most of which will be scannos, but 8. move high-frequency ones and "looks right" ones to the "pass" pile. the "pass" pile is the "vocabulary" for the book, and it should be used as the "dictionary" for all further spell-checking that you'll do on this book. 
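a rough sketch of that recipe in code -- with the "spellcheck" reduced to a lookup against a plain word list, the file names invented for illustration, and a simple frequency cutoff standing in for the hand inspection of steps 5 through 8:

```python
import re
from collections import Counter

def build_vocabulary(ocr_text, dictionary, auto_pass_count=3):
    # steps 1-2: split the raw o.c.r. into words
    counts = Counter(re.findall(r"[A-Za-z']+", ocr_text))
    passed, failed = set(), {}
    # step 3: ascii sort, so initial-cap words group together
    for word in sorted(counts):
        # step 4: spellcheck; frequent unknown words (mostly names)
        # are passed automatically, the rest go to the "fail" pile
        if word.lower() in dictionary or counts[word] >= auto_pass_count:
            passed.add(word)
        else:
            failed[word] = counts[word]
    return passed, failed

# hypothetical usage
dictionary = {w.strip().lower() for w in open("english_words.txt")}
vocab, suspects = build_vocabulary(open("book_ocr.txt").read(), dictionary)
print(len(vocab), "words pass;", len(suspects), "need a human look")
```

the "passed" set is the book's vocabulary; the "failed" dictionary, with its counts, is the list a correction tool can walk through.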
(you might not see the importance of this now, but do keep it in mind.) if you've gotten _reasonably_ good o.c.r. out of your scans, then you can generally even have the machine do this entire process _automatically_, by "passing" instances of a word that occurs 3 or more times in the book. (and remember, if you _haven't_ gotten "reasonably good" o.c.r., then you really need to go back and fix _that_ problem before you even proceed...) as for the "fail" pile, that's a real goldmine in disguise, as it allows you to zero in on the problems with a laser-like focus. i have a program that zooms me from one of these bad words to the next, pulling up the text (with the bad word pre-selected) _and_ the scan of the respective page, for a quick-and-easy check. when the "bad" word turns out to be correct, i click a button that (a) adds the word to the vocabulary, and (b) moves me to the next bad word. if the "bad" word is indeed wrong, i just do the edit. if the new word, as edited, is not in the dictionary, it will be added as well. this interface is _amazingly_fast_ at fixing mistakes. i mean, really fast... and all of this can -- and _should_ -- be done during preprocessing... if you want, you could fine-tune my system to ignore the bad words that occur only once in the book (and maybe even only twice or three times), on the assumption that it's just as "efficient" for the proofers who're doing each individual page to handle these infrequent bad words. but if it's me doing the proofing, i'd rather have the system direct me to the problems, and facilitate my handling of them, rather than make me locate each one, so then my "proofing pass" serves as a "verification check" on the change. and for those errors that pop up _repeatedly_, there is _no_question_ that it's more efficient to correct them on a _global_ basis than _individually_... sure enough, every once in a while you see some proofers observing that "it sure seems like it'd be a lot easier to make this change project-wide..." but they just get ignored, or patted on the head. notice that my system makes it very simple to add a word to the dictionary. (in the future, i'll make it just as easy to delete a word from the dictionary.) that brings up a very important point, namely that the book's vocabulary is constantly in flux, with words being added (and maybe deleted) from it, possibly right up until the very last word is proofed on the very last page... so one thing you need is a mechanism that lets you _check_ occurrences of a word, so that you can decide whether it was added _correctly_ or not. (whenever any word is added, it should be checked throughout the book.) *** there are lots and lots of other instances where preprocessing can help, over and above the situation of _correcting_words_ in a global manner. just to give some common cases: 1. delete any space before commas, semicolons, colons, and periods 2. adjust "spacey" quotemarks (ones with whitespace before and after) 3. change "1" inside a word to an "l" when doing so passes spellcheck 4. change "0" inside a word to an "o" when doing so passes spellcheck 5. change an "l" or an "o" to "1" or "0" if all other characters are numbers 6. lowercasing an uppercase "o" or "w" if it happens to occur mid-word 7. detecting and formatting section-headers in the appropriate manner 8. finding and fixing pagenumbers that were not recognized correctly 9. locating and correcting blank lines erroneously injected or deleted as indicated here, _punctuation_ in general is often an o.c.r. troublespot. 
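a few of those fixes are one-line regular expressions, and the digit-for-letter swaps only need the book's vocabulary from the step above. a hedged sketch (again, the "spellcheck" here is just a word-list lookup):

```python
import re

def preprocess(text, vocabulary):
    # 1. delete any space before commas, semicolons, colons, periods
    text = re.sub(r" +([,;:.])", r"\1", text)
    # 5. an "l" or "o" sitting between digits is almost always 1 or 0
    text = re.sub(r"(?<=\d)[lo](?=\d)",
                  lambda m: "1" if m.group(0) == "l" else "0", text)
    # 3-4. swap 1->l and 0->o inside a word, but only when the result
    #      passes the spellcheck against the book's vocabulary
    def fix_word(match):
        word = match.group(0)
        if word.isdigit():
            return word
        fixed = word.replace("1", "l").replace("0", "o")
        return fixed if fixed.lower() in vocabulary else word
    text = re.sub(r"\b\w*[10]\w*\b", fix_word, text)
    return text

# hypothetical usage
vocab = {"look", "onion", "mission"}
print(preprocess("l00k at the 0nion , and the m1ssion .", vocab))
# -> look at the onion, and the mission.
```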
(and, to be fair to the o.c.r., a good number of those excessive spaces are clearly present in the book, thanks to old-time typographical practices, so the o.c.r. results are actually accurate, they're just not what we now want.) so fixing punctuation glitches is one great things about preprocessing... also, since these changes are made _before_ the text goes to the proofers, you can usually can make 'em _automatically_ with only minimal checking, because they will be _verified_ by the proofers, who will catch any errors... this arena of preprocessing is _the_ one where d.p. is _most_ inefficient, so i ended up pointing out instances of it over and over and over again on their forums. it was one of the easiest fish to shoot in that barrel, since it comes up _constantly_. indeed, here are some posts from their messages boards _today_ where the lack of adequate preprocessing has reared its ugly head: > unflagging common player names in spaulding guide to baseball > http://www.pgdp.net/phpBB2/viewtopic.php?p=331080#331080 > > invisible utf8 characters in the code (could be deleted globally) > http://www.pgdp.net/phpBB2/viewtopic.php?p=341329#341329 > > tables (it's rather easy to handle table formatting programmatically) > http://www.pgdp.net/phpBB2/viewtopic.php?p=341302#341302 > > blank line(s) at the top of a section header causing bad "diffs" > http://www.pgdp.net/phpBB2/viewtopic.php?p=341183#341183 again, all of these examples were obtained from posts made _today_. and that's not unusual, because preprocessing ramifies in a big way... i could go on and on giving more examples, but i think it's clear now. any time you can get the machine to fix an error instead of a human, you're going to improve your efficiency. and your volunteer retention. and i'm not gonna pick on the d.p. "wordcheck" -- the new version of its spellchecker that enables people to add words to the dictionary -- because they're still feeling their way with it; but it needs improvement. hopefully, they will be able to figure out how to fix it. (i told 'em, but...) it's also interesting that -- now that they've "exiled" me -- they can act like it was all their idea to improve the preprocessing that they do, and thus go to work on that. indeed, that's just what they have done. (without me there saying "that's what i was telling you to do all along".) let's hope their efforts in this regard don't bog down, because this is an arena where they waste _far_ too much volunteer time and energy. ok, so the workflow so far: 1. get good scans. (crop and deskew, and test despeckling.) 2. get good o.c.r. (do a number of tests to get the best output.) 3. do preprocessing. (changes made globally are most efficient.) more later... -bowerbird p.s. i _will_ point out that, in order to get the most out of preprocessing, it's absolutely necessary that you create an interactive correction tool, and the people over at d.p. don't seem to realize that yet. hopefully they will, especially since it's not at all difficult to program such a correction tool... ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070702/314381eb/attachment-0001.htm From ricardofdiogo at gmail.com Mon Jul 2 17:39:47 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Tue, 3 Jul 2007 01:39:47 +0100 Subject: [gutvol-d] ok, let's talk about d.p. 
efficiency, item #3 In-Reply-To: References: Message-ID: <9c6138c50707021739h50e795b9v3c34e137304997e9@mail.gmail.com> 2007/7/3, Bowerbird at aol.com : > this is a straightforward task, and it's easy to program a tool to do it: (...) > it's not at all difficult to program such a correction You may find some inspiration by reading Project Gutenberg's FAQ http://www.gutenberg.org/wiki/Gutenberg:Tools_FAQ#What_programs_could_I_write_to_help_with_PG_work.3F > it's also interesting that -- now that they [DP]'ve "exiled" me -- they can > act like it was all their idea to improve the preprocessing that they do, > and thus go to work on that. indeed, that's just what they have done. > (without me there saying "that's what i was telling you to do all along".) ROFL. (sorry). From Frank.vanDrogen at bc.biol.ethz.ch Mon Jul 2 22:09:52 2007 From: Frank.vanDrogen at bc.biol.ethz.ch (Frank van Drogen) Date: Tue, 03 Jul 2007 07:09:52 +0200 Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 In-Reply-To: References: Message-ID: >(2) the o.c.r. (and the importance of using abby v8, the best program). Actually, Finereader 7.0 does much better then 8, on many aspects. Frank From Bowerbird at aol.com Mon Jul 2 23:21:15 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 02:21:15 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #1 Message-ID: juliet said: > on many points well, well. one thursday juliet is _banning_ me from discussions on her site, and the very next thursday she is joining a conversation i've initiated elsewhere. it's a good thing i am a huge fan of _irony_, isn't it? > on many points bowerbird is just uninformed or > making unfounded assumptions and accusations. that's a strong charge. let's see your documentation of it. i've spent a lot of time on your forums getting "informed", and my "assumptions and accusations" are _well_ founded and strongly grounded. and i'll be happy to relate a _ton_ of examples for any point that you believe is questionable just let me know, and prepare yourself for a big deluge... in the meantime, drop the ad hominem tactics, please. and if you can't respond to the topic directly, stay silent. > We do strongly encourage deskewed, > reasonably well cropped, decent scans. then how come you don't get more of them? why do so many of your scan-sets look crappy? i invite _anyone_ to step over to d.p. and look at their archived scan-sets: > http://www.pgdp.org/ols/ just pick a dozen, at random, and look at some sample pages. maybe you'll agree with juliet that they are all "decent" enough. or maybe you'll agree with me that _many_ of them fall short... i'm not telling _you_ what to think. i'll telling you what _i_ think. also, for those who do not realize the importance of nice scans, i strongly encourage you to download from the internet archive _any_ of the 400 books that were digitized by nicholas hodson. search t.i.a. for "nick hodson" or "athelstane" to find his work, or: > http://www.athelstane.co.uk/ nicholas is one of the most prolific book-digitizers on the planet, having done every aspect of each of those 400 books by himself. he creates a text file and an .html version, as well as a groomed text-to-speech version, and bundles the scan-set into a .pdf, so you can download that and see for yourself what a _respectable_ scan-set looks like -- deskewed and cropped with a nice margin. 
thumb through the scan-set and see how pleasant the esthetic is when the text-block is in a consistent location on the page-scan. once you've observed how it should be done, you know it's right. i downloaded the scan-set .pdf from this hodson book just now: > http://www.archive.org/details/Harry_Collingwood_A_Pirate_of_the_Caribbees and reminded myself just how delightful it is when it's done right. even nicholas isn't perfect, as i found a glitch on page 329!, but even his worst example is as good as the best you'll find from d.p. > When there have been content providers who are not doing > a minimally acceptable job at that, they have gotten notes from > me or someone else with authority and experience in scanning. well, i can believe that is true. at the same time, however, the fact still remains that some d.p. scan-sets are awful... here's that one that i pointed to a while back: > http://www.z-m-l.com/go/ortenmc/p123.html (as usual, you can adjust the number for other pages.) and here's the o.c.r. that came out of page 123: > http://www.z-m-l.com/go/ortenmc/p0-123.txt (and yes, i do believe it's possible to get better o.c.r. out of those scans, even though they're quite lousy.) yet, to see what a heroic job was done by the proofers: > http://www.z-m-l.com/go/ortenmc/p1-123.txt that's a book going through your system _now_, juliet... if that's a "minimally acceptable" job, then bob's my aunt. and -- just to remind people again -- i didn't pick it out. that book was suggested _to_ me, as a project _for_ me... i have looked at a _lot_ of d.p. scan-sets, and i witness some really crappy scans, and some really crappy o.c.r. scans that if i did 'em myself, i'd consider unacceptable. o.c.r. that if i did it myself, i'd _force_ myself to improve. i would simply be too embarrassed to put that to others who are expected to "proof" it... and, even with all of the many d.p. scan-sets i have seen, to be fully honest, i am only on the _rarest_ of occasions actually _impressed_ by the scanning job that was done... some d.p. scan-sets _are_ clean, to be sure. here's one: > http://www.z-m-l.com/go/goann/goannp123.html (again, adjust the zero-padded number for another page.) but even this very-clean scan-set has not been _cropped_. (click through the pages and see how the upper-left corner bobs and weaves from one place to another as you proceed. compare that to the rock-solid appearance of a hodson .pdf.) and again, let me repeat that it is a _batch_ process to deskew and crop a set of scans. it takes 5-10 minutes to set things up, but then you just click one button to transform _all_ the scans... and yet it makes a huge difference in the quality of the results. > Same for overly large scans, missing pages, etc. i didn't mention the issue of missing pages because i know you've _finally_ come to see what a terrible time-sink that they can be... but they still happen with a frequency that is too high; i detected a missing page in a book a month ago. so it still happens. plus: > http://www.pgdp.net/phpBB2/viewtopic.php?p=341491#341491 now, if you just realized that bad scans are also a huge time-drain... > As someone else pointed out, Finereader automatically > deskews pages unless they are extremely badly skewed. and as i pointed out, in response, this doesn't help the _proofers_ who have to _examine_ those crooked scans, day in and day out... nor does it help all the end-readers who will look at them too... and -- once again -- these are _batch_ operations. why fight 'em? 
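for what it's worth, the batch step doesn't even need a g.u.i. -- a rough sketch driving imagemagick's convert from python (the directory names and the 40% deskew threshold are just illustrative, and you'd want to eyeball a few pages before trusting -trim on a whole set):

```python
import glob, os, subprocess

os.makedirs("cropped", exist_ok=True)
for scan in sorted(glob.glob("scans/*.png")):
    out = os.path.join("cropped", os.path.basename(scan))
    subprocess.run(
        ["convert", scan,
         "-deskew", "40%",                             # straighten small rotations
         "-fuzz", "10%", "-trim", "+repage",           # crop to the text block
         "-bordercolor", "white", "-border", "40x40",  # put back a uniform margin
         out],
        check=True)
```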
teach your providers how _easy_ they are, and show 'em how much time and energy it saves (not to mention creating a nicer product), and then you won't even need to "require" them to do that for you. > Most content providers do draw text blocks for recognnition i heartily recommend you use spell-check on your messages... typos reflect badly on you in your position... > Most content providers do draw text blocks for recognnition > where the OCR doesn't get it right. this points out one of the most ironic twists in this little tea-kettle: cropping the scans such that the text-block is consistently located means that the "blocks" in the o.c.r. program work _much_ better, and can be drawn _only_one_time_ yet work for the _entire_book_. which means that cropping also saves time of the person scanning! not just the proofers down the line, but the scanner-person directly! still, look for yourself and see that _almost_no_d.p._scan-sets_ have been cropped consistently. don't take my word for it. or juliet's word. go look for yourself, and you will see that i'm accurate and she is not... > http://www.pgdp.org/ols/ go ahead. i'll wait. really. just pick a few books at random. indeed, if you can find _one_ scan-set that's cropped, say so! or, if you prefer to see a book that's _currently_ in the system, go to the "current projects" thread found in the d.p. forums: > http://www.pgdp.net/phpBB2/viewforum.php?f=2 again, pick out _any_ book at random, go to its forum thread and -- right there in the first message -- click on the link to go to the "project comments", where you'll click on "detail level" number 4, which gives a page of links to the scans in the project. view some. again, my bet is that you won't find one scan-set that is cropped... > Again, the worst offenders will here from me stealth scanno alert: _hear_ from me... > Again, the worst offenders will here from me > once the matter is brought to my attention. again, i don't know how you define "worst offenders"... but i can point to plenty of scan-sets i think are so bad i think it's unwise to subject them to volunteer proofers. > We expect all content providers to do some pre-processing nice dodge. but the "some" preprocessing you expect is not enough. and in the past, it seems to my eyes you did very little preprocessing. indeed, in many cases, i didn't seem to be able to see any done at all. (or, to put it more precisely, i saw evidence that you did _not_ do any of the preprocessing that i would consider to be blatantly necessary.) > Probably not as much as bowerbird would advocate, it has absolutely nothing to do what "how much" that _i_ "advocate". it's about how much it's _efficient_ to do. if an hour of preprocessing saves three or four hours of work later down the line, then you do it, and you don't even have to think twice about making that decision... but you guys haven't done enough preprocessing, or even enough _testing_ of preprocessing, to have _the_slightest_idea_ on how much time it can save down the line. indeed, you underestimate it _wildly_... i _know_ this, for a fact, because i have done those tests, carefully. (not that you have to _do_ the tests "carefully", because the results are _so_ striking the outcome is immediately obvious at the outset.) as for my so-called "lack of experience", i can tell you that i have a _lot_ of experience scanning and digitizing. 
and i was working on projects where i was paid a _flat_fee_, so it was in my direct self-interest to know _exactly_ the most efficient way of going about the entire process, and i learned very fast that any changes i could make _globally_ were golden. if my workflow was as inefficient as the one at distributed proofreaders, i would've been working for minimum wage. as it was, i made out big... > but again, but certainly we do far more than "nothing". you do closer to "nothing" than to what i've found to be efficient. > Some things might be a good idea, but in practice > require more effort than I'm willing to do. i'm certainly not suggesting that a preprocessor spend two hours to save the proofers one hour. that would be a bad use of time... not even suggesting they spend two hours to save two hours, as that would be a wash. (you can change "hours" to "minutes" if you prefer, although it's easy enough to see the equation is identical.) i'm not even suggesting that a preprocessor should spend 2 hours in order to save the proofers 3 hours, since you have more proofers. (although that starts to approach the point where it's questionable.) but if the preprocessor is unwilling to spend an hour of their time -- because it "requires more effort than i'm willing to do" -- and it _costs_ the proofers down the line a full _4_ hours of time, that is a flagrant abuse of the time and energy being volunteered to you. in light of those donations, it's morally wrong to allow a disparity in the amount of work that one volunteer can displace to another. and if your system allows _big_ displacements, it needs to be fixed. > Remember, again, that the content providers are all volunteers. i know that very well. did you really think i'd "forgotten" it? c'mon... it's cases where a content producer is unwilling to spend 2 hours to save the later proofers 4 hours that make your workflow inefficient. would preprocessors like it if the proofers laid 4 hours of work on _them_ to save 2 proofer hours? i'm quite sure the answer is "no". also... it might help to tell people here that there is _one_person_ at d.p. who's been testing what i've been saying on a big project over there. (his handle is dkretz, and the project is the encyclopedia brittanica.) he has _consistently_ been reporting that his results are _excellent_, and thus he highly recommends preprocessing as being worthwhile. (and he hasn't even experienced some of the deep benefits thus far.) implement my suggestions. (even if you claim them as your own.) once you do, you'll find out that i was giving you excellent advice... and then maybe you'll stop making the ad hominem attacks on me. (and, as a final reminder on this topic, i do _not_ advocate that it be the content provider who does this preprocessing work. it could be _anyone_. indeed, i believe this should be a specific _role_, just like the "postprocessor" is now. and, for the record, i suggest it's best for one person to fill both the preprocessor and the postprocessor roles.) > the majority of our content providers use Abbyy Finereader. but there are still _many_ books in the system from other programs, and i'd guess that a non-insubstantial number of the abbyy ones are not from abbyy v8, even at this late date. (do you keep track of that?) let it be fully known that i _do_ appreciate that you have come to know the importance of using the best program out there, and now advise it. (although it is equally important to use the _best_version_ of it as well.) 
but i'm making a point here to the people who might not know it yet, and -- because you once did many books with inferior o.c.r. apps -- there is a large degree of inefficiency in your system due to that fact, an inefficiency for which your _volunteer_proofers_ are paying a price. understand i'm not _blaming_ you for the inefficiencies of the past, even those that are still lagging into the present-day work on-site. but _neither_ am i willing to forget about present-day texts entirely, and the inefficiency they still represent, because you have changed your policy since. there's still a huge amount of lingering inefficiency. > The ones who don't use it, typically can't afford to buy it ok, surely you aren't trying to imply that i say they should buy it. however, by the same token, you're also not trying to imply that -- just because person x can't afford to buy the right program -- therefore person x has the right to inflict poor o.c.r. on proofers. are you? if someone can't do the task right, for want of the proper tools or for any other reason, then they should pick some other task. > The ones who don't use it, typically can't afford to buy it > or already have another good OCR program. what "another good o.c.r. program" would that be? because i'd like to run some tests to compare them. > I did approach Abbyy several years ago about various issues > relating to Finereader and DP's experience with it. But I got nowhere. then tell people that. so they know you have gone to bat for them. and _try_again_. and again and again. show me you really care, and _i_ will go and bug 'em. and you now how irritating i can be. or get someone else to intercede if you can't get the job done... because abbyy _is_ doing deals. they're collaborating with _many_ institutions. so i don't see why they wouldn't work with p.g., since it's been the premiere free cyberspace library from the get-go, right? > The only "old-book" version that I know about is one that > OCR's black letter and fraktur texts. at their home page, click "products", then click "o.c.r. software", for: > ABBYY FineReader XIX > The first omnifont OCR software for Fraktur and old European > language recognition. It is specially designed for converting > ancient documents and books printed in 18-20th centuries > into digital text. It combines all of the power of > ABBYY FineReader Corporate Edition with special capabilities > for reading old European languages. sounds like the ticket to me. > The only "old-book" version that I know about is one that > OCR's black letter and fraktur texts. The pricing on it is > over a thousand dollars, requiring both the purchase of > the OCR package and then upfront purchase of the use of it > on a fixed number of pages. It also only works with Windows, > as far as I know, so we wouldn't be able to run it on our LINUX server. > If the pricing has come down significantly (which I hope it will) > or there is some other package that I'm not aware of, I always > appreciate having these things pointed out. check out their site. talk to them. work something out. you can do it. but _at_least_ test it out, so you can see whether it's worth the money. and if it _is_, then purchase it, if that's what you have to do. you have the money, and your volunteers will donate more if they see it works... > All of this has been said in the DP forums, several times. but you haven't checked out the abbyy site, or talked to them, or worked something out. show you care, juliet. you can do it. 
*** in closing, i'm _glad_ you're finally starting to see the light on how preprocessing can make your system more efficient. that's good... (and i'm proud to have served in helping you to get that realization, even if it means you choose to dislike me because i did. it's worth it.) at the same time, this doesn't excuse the inefficiencies that you have foisted upon innocent _volunteers_ over the course of _years_ now... and it surely won't bring back the people who have already left. -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/fe63c726/attachment-0001.htm From Bowerbird at aol.com Mon Jul 2 23:33:51 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 02:33:51 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 Message-ID: frank said: > Actually, Finereader 7.0 does much better then 8, on many aspects. i've heard that on some occasions. (i've also heard v7 is faster.) but i trust nicholas hodson's experience on this issue most of all. he had about _400_ auto-corrections he'd make to his v7 output. (can't remember the exact numbers, but i do remember the ratio.) his testing showed that with v8, however, only 200 were necessary. i'm open to contrary results from tests that are performed rigorously. (jose menendez, for example, reports "almost perfect" o.c.r. out of his 1998 version of textbridge, and you can't argue with near-perfection.) but in the absence of such tests and such results, i believe nicholas... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/03a034ce/attachment.htm From shabam.dp at gmail.com Tue Jul 3 09:20:29 2007 From: shabam.dp at gmail.com (shabam) Date: Tue, 3 Jul 2007 09:20:29 -0700 Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 In-Reply-To: References: Message-ID: <1ac896090707030920i1890f49dg5a44e47b3cc3805@mail.gmail.com> > Actually, Finereader 7.0 does much better then 8, on many aspects. Not only that, but 7 is much cheaper than 8, (get it on eBay) and for those of us without an endless supply of cash, this is important. 6 works very well also. Besides, better OCR does not mean better product and getting it done faster. It only means that the round that runs the fastest gets less work to do, and a single person gets more responsibility. The type in projects I have run (a few black-letter and a couple handwritten manuscripts) run just as fast as similar projects that have good OCR. The type of project has a lot more to do with it than the quality of the OCR. "Alice's Adventures Underground" was a handwritten manuscript, and was typed in in P1. It ran through the rounds very quickly. My other children's books run just as quick. However "Assemble of Goddes" is blackletter poetry in middle English. Not very popular. It runs very slowly, and ran at the same speed as a blackletter, middle English, poetry that was typed in prior to P1. After they are done, the type-ins are just as high a quality as OCR'd projects. If a person is doing the project themselves, then the OCR is a lot more important. The idea behind DP is DISTRIBUTION. 
Having a bunch of pre-processing done by a single person, so that a group of people can work less, defies the idea of distributing the work. There has been some talk of distributing pre-processing, and some people do this by using the OCR pool (because they do not own an OCR program) or using scans someone else scanned (harvesting), but for the most part, this is all done by a single person. As a CP/PM I have to weigh the tradeoffs of higher quality preprocessing and more time to do other stuff. Doing stuff with my family wins. That said, I do provide a fairly high quality product, as do most of the CPs, and most PMs will double check them as well. I do think that page scans should be cropped (not too tight, but no 3 inch margins either), the project should be checked for missing pages (we have been talking about distributing this for years), and the scans should be readable. In some cases (like a picture book I just released) this means using grayscale or color images. The project should also be run through the preprocessing program we have (guiprep). Much beyond this, and the CP's and PM's time could be better spent in areas that need more people (the third proofing round and 2nd formatting round or post-processing and verification). Yes, DP is imperfect. So is everything else in life. We know this, but we are all volunteers. There are no paid staff to spend their lives making DP better, so we get improvements when the hardworking volunteer staff has time to take away from their families, work, social life, or what have you. We have been talking about these improvements for a long time (since the beginning of DP), and will continue to talk about these improvements. They come in spurts, and some of the best never get implemented (lack of money/programmer time/agreement). DP does a great job, and PG could not have as many high quality books as they do without the help of DP, even as imperfect as it is. Not everyone will agree that DP is a great thing. Some people will always have negative thoughts about us. That is their right. They often have incorrect information and think they are right, and don't realize that other people need to agree and then the time needs to be taken to do it. Perhaps if, instead of complaining, these people would spend some time helping to improve things, some of these things would get done, but until we have endless man hours and money and a big enough stick to convince enough people that our way is right, and theirs is not, some things will never get done. Any chance of talking about something meaningful? Jason -- Person to person lending. Lend money to others, and get a $25 bonus. http://www.prosper.com/join/shabam -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/b67d1f3d/attachment.htm From Bowerbird at aol.com Tue Jul 3 09:39:28 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 12:39:28 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #3 Message-ID: jason's post is an excellent example of the confused thinking over at d.p. i won't bother responding to it for now, but i strongly encourage all of you to read it, and _examine_ it closely, to see if you can tell why it's off-base... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/7e748ca8/attachment.htm From Bowerbird at aol.com Tue Jul 3 13:55:11 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 3 Jul 2007 16:55:11 EDT Subject: [gutvol-d] ok, let's talk about d.p. efficiency, item #4 Message-ID: ok, so the workflow so far: 1. get good scans. (crop and deskew, and test despeckling.) 2. get good o.c.r. (do a number of tests to get the best output.) 3. do preprocessing. (changes made globally are most efficient.) item #4 in this series is the _proofing_ itself. in my model, most of the "proofing" was done in step #3, where we have our _tools_ zoom us to the "bad words", i.e., infrequent words which had not passed spellcheck, and other aspects of the text that seem to be anomalies (e.g., punctuation irregularities and trackable glitches)... i have argued before that this laser-like focus on _errors_ is sufficient to take the error-level down to the state where any further "proofing" that is necessary can (and _should_) be done by end-users who will read the book for _content_. (this includes stealth scannos, publisher errors, and so on.) i call this "final round" of corrections by readers of a book "continuous proofreading", and conceive of it as a lengthy (e.g., 6-month) period where the book is _only_ available to the general public in the text-next-to-its-scan format, which communicates the proofing expectation to people. (those who strongly want to read the book will still do so, and they're precisely the ones who'll be the best proofers.) only after this 6-month period will a text be released fully. i've said repeatedly an error-rate of 1-error-per-10-pages is good enough for a book to go into continuous proofing, and my research shows that this rate is easily obtainable with the error-focusing tools that i have programmed, so i am certain my workflow will prove itself in the real world. just to make it _clear_, though, let me say it quite directly. i strongly believe there is no need to proof every word on every page against the original scan. if the scans are clean, and you did a good job of doing the o.c.r., and you were conscientious about careful use of a tool to clean the o.c.r., your "continuous proofers" who are reading for _content_ will move your e-text to a state that approaches perfection. now, as i said, i'm _absolutely_convinced_ this is the case... but i couldn't blame you if you are skeptical of my claims... after all, "proofing against the scan" methodology has been the primary means of doing this job for a very long time now. so even though i think that it is _wildly_inefficient_ to proof every word on every page against the scan, that's not why i say that _distributed_proofreaders_ is inefficient. even if you accept their working assumption that this is the way it _must_ be done, there's still a lot of inefficiency within their workflow. so, for the rest of this post, i will deal with that d.p. workflow. (even though -- for my purposes -- it's completely outdated.) i have said _many_times_ -- both here and elsewhere -- that the major inefficiency of the d.p. proofing workflow is that it assumes _every_ page in every book needs the same amount of attention, as evidenced by the fact that all pages in a book are subjected to the same number of "rounds". it seems to me that that belief is _patently_absurd_, and that some pages are clearly more difficult than others, and thus need more rounds. 
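(just to make "more difficult" concrete: here's a hedged perl sketch of the kind of per-page scoring a preprocessing tool might do -- count the words on each page that fail a wordlist lookup, plus a couple of punctuation oddities, and sort. the p*.txt file names and the wordlist location are assumptions for the example only, not how any existing d.p. tool works.)

#!/usr/bin/perl
# rank-pages.pl -- hypothetical sketch: score each o.c.r.'d page by its
# number of "suspect" items (words missing from a wordlist, plus a couple
# of punctuation oddities) so the hardest pages stand out.
# Assumes one text file per page (p*.txt) and a plain wordlist at
# /usr/share/dict/words -- both just illustrative choices.
use strict;
use warnings;

my %dict;
open my $wl, '<', '/usr/share/dict/words' or die "no wordlist: $!";
while (<$wl>) { chomp; $dict{lc $_} = 1 }
close $wl;

my %score;
for my $page (glob 'p*.txt') {
    open my $fh, '<', $page or die "$page: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    $score{$page} = 0;
    for my $word ($text =~ /[A-Za-z']+/g) {
        $score{$page}++ unless $dict{lc $word};
    }
    $score{$page}++ while $text =~ /,,|;;| \./g;   # crude punctuation checks
}
for my $page (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%-14s %5d suspect items\n", $page, $score{$page};
}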
the answer, of course, is a _roundless_system_, one that gives an individual page as many "rounds" as it needs, to be finished. how do we know how many rounds that is? the answer is simple: when a certain number of people have looked at a specific page and found "no corrections required" (n.c.r.), we can call it done... you can set the "certain number" at any level you like. i think 2 is enough, but you could make it 3 or even 4 if you wanted. but once that criterion number have given a page the "n.c.r.", you call it "finished" and move on. it's important to note that there might _still_ be an error on that page, but when there is, we'll expect the "continuous proofreading" process will find it. (so as long as the errors are under-1-in-10-pages, we're fine.) also note that it's dirt-simple to test whether any changes were made to a page, simply by whether the "before" equals the "after"; so we don't need to get into any complex ways to find the "diffs". so that's all i'm gonna say about "roundless" in this message now. d.p. has said it intends to move to a roundless system "eventually", and when they do, they'll reduce their inefficiency in a major way, so i dearly hope that that will happen _sooner_ rather than _later_. because the current system of sending too many pages through too many rounds is a big waste of the proofers time and energy. *** but -- since part of the purpose of this series is to document the inefficiencies in the distributed proofreaders workflow -- i must turn an eye toward the proofing rounds as they currently exist... first and foremost among the problems, of course, is the horrid spellchecker that was in use for many _years_ up until recently, which i documented fully in the message that i posted yesterday entitled "i decided to think about the future, not the past"... i won't repeat that here, except to say that i'm _greatly_ relieved they have plugged that hole. it's impossible for me to ignore the pain and suffering proofers were subjected to for so many years -- most especially those proofers in p2 who were _required_ to use that inferior tool, instead of being given a much better one -- so it'll take a long while before that bruise heals completely, but at least the bleeding is now stopped, and i'm appreciative for that. to recap, i started my constructive criticism of d.p. inefficiency in late 2003, which makes it some three-and-a-half years back, so i'm glad the last three-and-a-half _months_ have seen progress. but, you know, it's a little bit late, in my opinion. ok, a _lot_ late. for those who might be curious, here's the u.r.l. taking you to my christmas 2003 "constructive criticism" thread on the d.p. forums: > http://www.pgdp.net/phpBB2/viewtopic.php?t=5963 you will find that i laid out most of these points a long time ago... *** at any rate, here is a brief run-through of my remaining comments on the state of the _proofing_ interface of distributed proofreaders. i think there needs to be a better way of flagging possible errors. the wordcheck screen is ok, but it has to be summoned separately. in addition, the fact that wordcheck flags _all_ punctuation, and not just the punctuation that's probably wrong, is very bad overflagging. next, there needs to be a system of automated diffs, so that proofers are given quick feedback on every single page that they have done... (by "automated" diffs, i mean that the earlier proofer is notified by an automatic process that the diff is available for their inspection.) 
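(to show how little machinery that takes, here's a toy perl sketch of the before-equals-after test doing double duty -- counting n.c.r. passes for the roundless rule, and spotting exactly the passes where a diff exists and the earlier proofer could be notified. the threshold of 2 and the sample page history are invented for the example; a real system would pull them from its database.)

#!/usr/bin/perl
# ncr.pl -- toy sketch of the roundless rule described above: a page is
# finished once enough consecutive passes come back with "before" equal
# to "after"; any pass that does change the text is exactly the case
# where a diff exists and the earlier proofer could be notified.
# The threshold and the sample page history are invented for the example.
use strict;
use warnings;

my $NCR_NEEDED = 2;    # consecutive no-change passes required

sub page_is_done {
    my @versions = @_;              # oldest first: o.c.r., then each pass
    my $ncr = 0;
    for my $i (1 .. $#versions) {
        if ($versions[$i] eq $versions[$i - 1]) {
            $ncr++;                 # no corrections required this pass
        } else {
            $ncr = 0;               # a change: a diff exists for feedback
        }
    }
    return $ncr >= $NCR_NEEDED;
}

# four versions: raw o.c.r., one correcting pass, then two clean passes
my @page = ("Tbe quick brown fox.\n",
            "The quick brown fox.\n",
            "The quick brown fox.\n",
            "The quick brown fox.\n");
print page_is_done(@page) ? "page finished\n" : "needs another pass\n";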
in addition to the training aspects, a benefit of automated feedback is that an automated diff lets the earlier proofer _verify_ work done by the _later_ proofer, so if the later proofer introduces any errors, the earlier proofer can draw attention to the goof. very important... i believe it's the case that there are efforts underway (but again, initiated in the last 3.5 months rather than the last 3.5 years) to bring about these improvements, and i welcome that "initiative". i'm sure there are some other things i'm forgetting right now -- it's been a long time since i proofed my pages over at d.p. -- but that's my general wrap-up on the suggestions i would make. in addition, i will say that there are a bunch of nice features on the current interface, including a wide variety of pop-up menus that help proofers with difficult things like greek characters, and so on. the interface also gives people a choice between "horizontal" and "vertical" display, which is a nice option that i don't give people, so score one exclusively for the d.p. side in that regard. and once the wordcheck capability stabilizes on some solid ground, there's a good chance people will figure out how to make it perform all kinds of useful additional functions that will help out proofers... but the development i am most keen on, in regard to the interface for proofers, is the development effort of dkretz, who is designing a new interface to be used in the upcoming "dp50" offshoot of d.p. while i believe that -- in most regards -- it will be very similar to the current interface, at least in terms of existing functionalities, dkretz has focused on building in a wide variety of error-flagging, and that holds the potential for greatly facilitating the proofing... also, another wrinkle could potentially be extremely fascinating, namely that dkretz is planning on using a form of "light" markup, rather than the kludgy markup that has evolved on the main site... this "light" markup falls mainly into the category of "formatting" -- at least as the distinction is currently made on the main site -- but since "light" markup doesn't _obfuscate_ the text (which was the primary reason that formatting was split off from proofing), my sense is that the proofers will end up doing all the formatting -- helped along by the ability to have that formatting displayed, in a direct-feedback way that lets them quickly do corrections -- and there will be no _need_ for a formatting round, let alone _2_, which will create a tremendously more efficient overall workflow. indeed, it seems to me that even the _postprocessing_ role will also be greatly diminished in importance, because the proofers will be doing more and more of the tasks associated with that... but i'll handle that more directly in my _next_ post in this series, which will focus on _formatting_, the next big step in our recipe. but that probably won't be until next week, so enjoy the holiday! -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070703/22a4979f/attachment.htm From Bowerbird at aol.com Wed Jul 4 16:30:58 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 4 Jul 2007 19:30:58 EDT Subject: [gutvol-d] independence day Message-ID: there's this anonymous person who writes a blog that purports to be "the secret diary of steve jobs". 
fake steve has become one of _the_ most-read blogs in cyberspace, thanks to entries like this one today: > http://fakesteve.blogspot.com/2007/07/music-industry-nobs-have-finally.html here's the heart of that post: > The music companies are in a dying business, and they know it. > Sure, they act all cool because they hang around with rock stars. > But beneath all the glamour these guys are actually operating > two very low-tech businesses. One is a form of loan-sharking: > they put up money to make records, then force recording artists > to pay the money back with exorbitant interest. The other business > is distribution. They've got big warehouses and they control the > shipment of little plastic boxes that happen to have music in them. > The guys running the labels are pretty stupid -- most are just dirtbags > who started out as band managers or promoters -- but now at long last > they are kinda sorta finally vaguely getting clued in to the fact that > both parts of their business model are fucked. boom. i suppose it's rather easy to see how this extrapolates to _books_. all in all, it's a classic fake-steve entry, especially for independence day... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070704/e3d3858b/attachment.htm From tb at baechler.net Fri Jul 6 05:41:40 2007 From: tb at baechler.net (Tony Baechler) Date: Fri, 6 Jul 2007 05:41:40 -0700 Subject: [gutvol-d] Hugh McGuire - Jon Udell's Interviews With Innovators Message-ID: <20070706124140.GC10531@investigative.net> Hello all, This is slightly off topic but might be of interest to some of you as LibriVox uses PG files as their source material if I understand correctly. Anyway, here is an interview with the founder of LibriVox. From: "GigaVox Media (All Channels)" Audiobooks are an excellent way to make books available to everyone. When Hugh McGuire founded LibriVox in 2005, he wanted to take advantage of the masses of book lovers across the world to record and make available a catalog of audiobooks. On this week's Interviews with Innovators, Jon Udell speaks with McGuire about the origins, growth and distinctive architecture behind LibriVox. URL: http://feeds.gigavox.com/~r/gigavox/network/~3/110500128/detail1783.html Enclosure: http://feeds.gigavox.com/~r/gigavox/network/~5/110500129/ITC.INNO-HughMcGuire-2007.04.18.mp3 ----- End forwarded message ----- From j.hagerson at comcast.net Fri Jul 6 07:17:46 2007 From: j.hagerson at comcast.net (John Hagerson) Date: Fri, 6 Jul 2007 09:17:46 -0500 Subject: [gutvol-d] Unwrap lines utility? Message-ID: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> I am corresponding with someone who would like to be able to unwrap the paragraphs from some of the older, plain text, material in our collection. I provided him the naive, three search-and-replace solution, but he says that his attempt to implement it on his computer with the file he has chosen causes his word processor to lock up. He is running Microsoft Windows XP. Has anyone already written a utility to do this? If so, please send me a pointer to it. Thank you very much. From desrod at gnu-designs.com Fri Jul 6 07:59:45 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Fri, 06 Jul 2007 10:59:45 -0400 Subject: [gutvol-d] Unwrap lines utility? 
In-Reply-To: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> References: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> Message-ID: <1183733985.2664.279.camel@localhost.localdomain> On Fri, 2007-07-06 at 09:17 -0500, John Hagerson wrote: > I provided him the naive, three search-and-replace solution, but he > says that his attempt to implement it on his computer with the file he > has chosen causes his word processor to lock up. I use Text::Wrap to do the same exact thing, available in CPAN: http://search.cpan.org/~muir/Text-Tabs+Wrap-2006.1117/lib/Text/Wrap.pm I can't speak to why your friend's word processor "locks up", but then again, I don't run legacy operating systems or applications like Windows, so I won't be of much help there. Perl works on Windows, and this module is available there. It might be easier to just use that instead. -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com Skype...: 860-967-3820 From Bowerbird at aol.com Fri Jul 6 09:33:00 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 6 Jul 2007 12:33:00 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: john said: > unwrap the paragraphs from some of the > older, plain text, material in our collection http://www.z-m-l.com/go/unwrap.pl 1. paste text into box and click button. 2. copy unwrapped text underneath box. 3. paste into new document. the script won't wrap lines that start with one (or more) blanks, so you can use that rule to immunize any lines you don't want to wrap... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070706/923ba39a/attachment.htm From desrod at gnu-designs.com Fri Jul 6 09:50:55 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Fri, 06 Jul 2007 12:50:55 -0400 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: References: Message-ID: <1183740656.2664.291.camel@localhost.localdomain> On Fri, 2007-07-06 at 12:33 -0400, Bowerbird at aol.com wrote: > the script won't wrap lines that start with one > (or more) blanks, so you can use that rule to > immunize any lines you don't want to wrap... Nor does your script work with multiple lines NOT separated by a blank line. Tsk. Tsk. -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com Skype...: 860-967-3820 From ricardofdiogo at gmail.com Fri Jul 6 11:20:23 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Fri, 6 Jul 2007 19:20:23 +0100 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: References: Message-ID: <9c6138c50707061120u5fd11a45j91697ab173934bbc@mail.gmail.com> 2007/7/6, Bowerbird at aol.com : > http://www.z-m-l.com/go/unwrap.pl ISO-8859-1 doesn't seem to work with it too. With some improvements, it may be a good tool. From ricardofdiogo at gmail.com Fri Jul 6 11:24:44 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Fri, 6 Jul 2007 19:24:44 +0100 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> References: <000001c7bfd8$6f4c48e0$1f12fea9@sarek> Message-ID: <9c6138c50707061124o3a4c5eb6x11debe3c77fa785f@mail.gmail.com> 2007/7/6, John Hagerson : > I am corresponding with someone who would like to be able to unwrap the > paragraphs from some of the older, plain text, material in our collection. 
I > provided him the naive, three search-and-replace solution, but he says that > his attempt to implement it on his computer with the file he has chosen > causes his word processor to lock up. > Same used to happen with my old computer. Guess it must be a memory/processing thing. I then started to it by processing several separated text chunks. From Bowerbird at aol.com Fri Jul 6 12:37:13 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 6 Jul 2007 15:37:13 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: david said: > Nor does your script work with > multiple lines NOT separated by a blank line. > Tsk. Tsk. um, "multiple lines not separated by a blank line"? i don't understand what you're saying here, david. p.g. e-texts have a blank line between paragraphs. perhaps you could point me to a nonworking file? *** michael said: > Wouldn't be be even easier just to i/o from file to file? people who want that should backchannel me. if i get enough requests, i'll set it up that way... *** ricardo said: > ISO-8859-1 doesn't seem to work with it too. perhaps you could point me to a nonworking file? -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070706/d71404d7/attachment.htm From desrod at gnu-designs.com Fri Jul 6 13:35:32 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Fri, 06 Jul 2007 16:35:32 -0400 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: References: Message-ID: <1183754132.27406.1.camel@localhost.localdomain> On Fri, 2007-07-06 at 15:37 -0400, Bowerbird at aol.com wrote: > um, "multiple lines not separated by a blank line"? > i don't understand what you're saying here, david. > p.g. e-texts have a blank line between paragraphs. Stick this into your form, you'll see what happens. Note, this should unwrap to two lines, as it is presented here. Your code joins the two lines into one long line, breaking it. --- What is Lorem Ipsum? Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. --- -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com Skype...: 860-967-3820 From joshua at hutchinson.net Fri Jul 6 14:12:49 2007 From: joshua at hutchinson.net (joshua at hutchinson.net) Date: Fri, 6 Jul 2007 21:12:49 +0000 (UTC) Subject: [gutvol-d] Unwrap lines utility? Message-ID: <12549565.1183756369760.JavaMail.?@fh1037.dia.cp.net> As much as it pains me to defend the bird ... Your example, by PG text formatting rules, SHOULD rewrap into a single paragraph. In PG texts, paragraphs are denoted by a blank line between them (two newline characters). The original question was about rewrapping some PG texts, so bowerbird's methodology is good. Please, don't make me agree with the bird again. It makes my head hurt. ;) Josh >----Original Message---- >From: desrod at gnu-designs.com > >On Fri, 2007-07-06 at 15:37 -0400, Bowerbird at aol.com wrote: >> um, "multiple lines not separated by a blank line"? >> i don't understand what you're saying here, david. >> p.g. e-texts have a blank line between paragraphs. > >Stick this into your form, you'll see what happens. > >Note, this should unwrap to two lines, as it is presented here. 
Your >code joins the two lines into one long line, breaking it. > >--- >What is Lorem Ipsum? >Lorem Ipsum is simply dummy text of the printing and typesetting >industry. Lorem Ipsum has been the industry's standard dummy text ever >since the 1500s, when an unknown printer took a galley of type and >scrambled it to make a type specimen book. >--- > From Bowerbird at aol.com Fri Jul 6 15:21:19 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 6 Jul 2007 18:21:19 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: david said: > --- > What is Lorem Ipsum? > Lorem Ipsum is simply dummy text of the printing and typesetting > industry. Lorem Ipsum has been the industry's standard dummy text ever > since the 1500s, when an unknown printer took a galley of type and > scrambled it to make a type specimen book. > --- as i said earlier, project gutenberg e-texts have a blank line between paragraphs. so this text here is non-representative. the task of _restoring_ paragraphing to a text when the blank lines have been eliminated is an interesting one -- and a useful one too, because the blank lines are eliminated in text copied out of a .pdf -- but that's not what john was asking for... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070706/ad2428be/attachment.htm From Bowerbird at aol.com Sat Jul 7 12:12:27 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 7 Jul 2007 15:12:27 EDT Subject: [gutvol-d] enjoying the cinema Message-ID: if anyone out there is wondering if a cinema-screen is worth what it costs -- and they're pretty cheap now -- i can assure you that the answer is most definitely "yes!" in particular, i am finding that the ability to comfortably fit _3_ "panels" on a single screen is _very_ productive... here's a sample page from one of my text-cleaning tools: > http://www.z-m-l.com/go/triple.jpg as you can see, the original scan is displayed on the left, the text-field for making corrections is on the right, and the formatted version of the text is shown in the middle, for easy comparison to the original scan for correctness. also on the right, at the bottom, there are buttons that allow you to take some actions associated with the page. and all this functionality has more than ample real estate, with text that is sized large enough even for older eyes... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070707/be7e5288/attachment.htm From robert_marquardt at gmx.de Sun Jul 8 10:56:26 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Sun, 08 Jul 2007 19:56:26 +0200 Subject: [gutvol-d] an Esperanto book to pick up Message-ID: http://librivox.org/forum/viewtopic.php?p=142572#142572 A HTML version is available which is from a PD book and the additions (preface, footnotes etc) have been declared PD by the author. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From shabam.dp at gmail.com Sun Jul 8 16:05:54 2007 From: shabam.dp at gmail.com (shabam) Date: Sun, 8 Jul 2007 16:05:54 -0700 Subject: [gutvol-d] Unwrap lines utility? 
In-Reply-To: <12549565.1183756369760.JavaMail.?@fh1037.dia.cp.net> References: <12549565.1183756369760.JavaMail.?@fh1037.dia.cp.net> Message-ID: <1ac896090707081605u3349433al22442e60e2bed09e@mail.gmail.com> John, What editor is he using? That could be part of the problem. Some text editors (such as MS Word) are more likely to crash on large search and replaces. Or they might appear to lock up, when they are working, and taking a really long time to run. I use a program called Edit Plus for this. It is shareware, and it does these types of replaces fairly quickly. Another choice is to break the file into smaller chunks that the editor can handle. Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070708/e0009019/attachment.htm From j.hagerson at comcast.net Sun Jul 8 17:20:39 2007 From: j.hagerson at comcast.net (John Hagerson) Date: Sun, 8 Jul 2007 19:20:39 -0500 Subject: [gutvol-d] Unwrap lines utility? In-Reply-To: <1ac896090707081605u3349433al22442e60e2bed09e@mail.gmail.com> Message-ID: <004e01c7c1be$fb0c41e0$1f12fea9@sarek> Thank you for the replies on and off the list. The gentleman with whom I am corresponding is using Microsoft Word on a computer running Windows XP. He has only 400MB of memory installed on the machine, so a larger file does cause issues. I have pointed him to GutenMark and some of the other utilities. However, he would prefer a graphical interface to a command line. I don't know exactly what he wants to do with some of the books. I know that he wants to "fill the page" with text (meaning full justification, which he can do in Word). Thanks again for your assistance. John -----Original Message----- From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of shabam Sent: Sunday, July 08, 2007 6:06 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Unwrap lines utility? John, What editor is he using? That could be part of the problem. Some text editors (such as MS Word) are more likely to crash on large search and replaces. Or they might appear to lock up, when they are working, and taking a really long time to run. I use a program called Edit Plus for this. It is shareware, and it does these types of replaces fairly quickly. Another choice is to break the file into smaller chunks that the editor can handle. Jason From piggy at netronome.com Mon Jul 9 08:24:53 2007 From: piggy at netronome.com (La Monte Henry Piggy Yarroll) Date: Mon, 09 Jul 2007 11:24:53 -0400 Subject: [gutvol-d] Next project In-Reply-To: References: <000a01c7bb24$c062b6a0$f2226546@Lydia> Message-ID: <46925345.10107@netronome.com> I'd add that I've used advertisements to prove publication dates for various works without explicit publication dates. Ads can be downright valuable. Michael Hart wrote: > > Personally, I _LIKE_ to see that ads from a hundred years ago, > I think it gives a greater perspective on the life of the time > with the first ads for rentable rooms in NYC with kitchenettes > and the various ship names, travel arrangements, etc. . . . > > ... > Michael S. Hart > Founder > Project Gutenberg > > > On Sat, 30 Jun 2007, Robert Marquardt wrote: > >> On Sat, 30 Jun 2007 10:41:26 -0400, you wrote: >> >>> ... >>> I think I will also include the ads as they are unique period >>> ones including music, seeds, manure and a small home printing press. >>> >>> Dick >> >> >> The Periodicals Bookshelf should give some examples. 
>> http://www.gutenberg.org/wiki/Category:Periodicals_Bookshelf >> I think the "Bulletin de Lille" may be what you look for. >> >> BTW ads are still sh-- eh manure. From da.ajoy at gmail.com Mon Jul 9 13:11:41 2007 From: da.ajoy at gmail.com (Daniel Ajoy) Date: Mon, 09 Jul 2007 15:11:41 -0500 Subject: [gutvol-d] Unwrap lines utility? Message-ID: <4692502D.20873.13845EEC@da.ajoy.gmail.com> I use Clippy http://wots.coolfreepage.com/link.php?id=SW3 Daniel From Bowerbird at aol.com Mon Jul 9 17:22:00 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 9 Jul 2007 20:22:00 EDT Subject: [gutvol-d] knock me down! july 9th, 2007! Message-ID: good grief! i just got knocked down! hard! i went to google books looking for a book -- "the story of patsy", if you must know -- and saw an option there to "view plain text". sure enough, google is giving us the o.c.r.! this is a _game-changer_, folks. hallelujah! -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070709/7c42df8a/attachment.htm From Bowerbird at aol.com Tue Jul 10 13:51:52 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 10 Jul 2007 16:51:52 EDT Subject: [gutvol-d] game-changer Message-ID: yes sir, a game-changer, and not just in one way, in lots of ways. which means tomorrow, 7/11/2007, is your lucky day, because you get to read one of those _oh-so-rare_ posts from bowerbird where he says "i was wrong", and this one will be even-more-rare, since he says he will say it "more than once". we took this to mean "twice", and asked him to confirm it, and he corrected us to _repeat_ "more than once", leading us to think he might even say it _3_ times! which, reports claim, would stun even the most hardened observers... -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070710/236e6dde/attachment.htm From Bowerbird at aol.com Tue Jul 10 16:15:27 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 10 Jul 2007 19:15:27 EDT Subject: [gutvol-d] Unwrap lines utility? Message-ID: i wrote this on sunday, but didn't send it out. its relevance has been increased by the recent developments, however (since it is _google_ who's losing the paragraphing on o.c.r. that it gives to umichigan, and now to the general public), and thus, here it the message... in some future posts, i'll deal some more with the general question of paragraphing that's missing throughout the book, but the issues are roughly the same as i outlined here when they occur on a pagebreak... now, back to the message as written... *** warning: this message is of interest only to extreme text-cleaning geeks. other people should opt out now... *** since hacker-david brought it up, here's a slight reworking of his example. >?? --- >?? What is Lorem Ipsum? >?? Lorem Ipsum is the dummy text of the printing and typesetting industry. > Lorem Ipsum has been the industry's standard dummy text ever since the > 1500s, when an unknown printer took a galley of type and scrambled it > to make a type specimen book. >?? --- remember that david said his example would be broken into 2 paragraphs, the question and the answer. 
however, in this reworking, it's ambiguous as to whether it should be _2_ or _3_ paragraphs, since the first line of the answer now contains one complete sentence, and nothing more. so it could be broken as the original example, or instead as three paragraphs, as shown here: what is lorem ipsum? lorem ipsum is the dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. *** believe it or not, this _does_ have some immediate relevant applicability. as i noted, text copied out of a .pdf often is stripped of its empty lines. sadly, the paragraph indents are _also_ lost, which makes it even more difficult to restore the paragraphing. and perhaps even worse, umichigan has lost the empty lines in the text that it is releasing to the general public from the google scanning project. so this release -- which would be an _excellent_ thing, were the text not marred with this (and many other) problems, seeing that google itself is _not_ releasing the actual text, just the images -- is actually kind of sad... > http://mdp.lib.umich.edu/cgi/m/mdp/pt?view=text;id=39015016881628;u=1;num=129 on the home front, one of the tasks in digitizing a book is to make sure that all the paragraphing is correct, and one of the places where this is the most difficult one paragraph ends at the bottom of one page, with a new one beginning at the top of the next page. much of the time, it's clear when a sentence ends on the bottom line of a page that it's the last line of a paragraph because the line doesn't make it anywhere close to the right margin. but sometimes it's not. for instance, take a look at this page and tell me if the paragraph ended: > http://www.z-m-l.com/go/mabie/mabiep122.html go to the next page to see if you were correct. and try these: > http://www.z-m-l.com/go/mabie/mabiep040.html > http://www.z-m-l.com/go/mabie/mabiep067.html > http://www.z-m-l.com/go/mabie/mabiep208.html > http://www.z-m-l.com/go/mabie/mabiep214.html again, you must go to the next page to see if you were correct, to see if the line at the top of the next page was _indented_, indicating that it is the start of a new paragraph. (if you like this game, i've appended a bunch more tests, from another book, one that has more paragraphs in it.) distributed proofreaders has an ongoing "discussion" about whether or not to put a blank line at the top of a page where the first line is indented as a new paragraph, for reasons that i cannot comprehend. _of_course_ there must be a blank line, to indicate clearly the first line is the start of a new paragraph, a decision that cannot be reliably made on the previous page. fortunately, the number of pages that have to be checked for this in a typical book is relatively small, and can be located fairly easily by a computer routine that summons them for human eyeballs... 
-bowerbird > http://www.z-m-l.com/go/myant/myantf013.html > http://www.z-m-l.com/go/myant/myantp014.html > http://www.z-m-l.com/go/myant/myantp040.html > http://www.z-m-l.com/go/myant/myantp074.html > http://www.z-m-l.com/go/myant/myantp077.html > http://www.z-m-l.com/go/myant/myantp084.html > http://www.z-m-l.com/go/myant/myantp093.html > http://www.z-m-l.com/go/myant/myantp112.html > http://www.z-m-l.com/go/myant/myantp123.html > http://www.z-m-l.com/go/myant/myantp126.html > http://www.z-m-l.com/go/myant/myantp135.html > http://www.z-m-l.com/go/myant/myantp137.html > http://www.z-m-l.com/go/myant/myantp172.html > http://www.z-m-l.com/go/myant/myantp206.html > http://www.z-m-l.com/go/myant/myantp209.html > http://www.z-m-l.com/go/myant/myantp236.html > http://www.z-m-l.com/go/myant/myantp261.html > http://www.z-m-l.com/go/myant/myantp268.html > http://www.z-m-l.com/go/myant/myantp274.html > http://www.z-m-l.com/go/myant/myantp304.html > http://www.z-m-l.com/go/myant/myantp317.html > http://www.z-m-l.com/go/myant/myantp321.html > http://www.z-m-l.com/go/myant/myantp407.html ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070710/2b6fa066/attachment.htm From Bowerbird at aol.com Wed Jul 11 12:40:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 11 Jul 2007 15:40:35 EDT Subject: [gutvol-d] they didn't care Message-ID: > http://radar.oreilly.com/archives/2007/07/clay_shirky_a_s.html on the o'riley blogs, jimmy guterman points to: > a dizzying presentation by Clay Shirky in which he likens > the guardians of a Shinto shrine to the perl community. > It also includes one of the best sentences I've heard all year, > one that will ring true for anyone who's tried to convince > entrenched thinkers of the value of innovation: > ?They didn't care that they'd seen it work in practice > because they already knew it wouldn't work in theory.? ring a bell? watch the movie: > http://www.supernova2007.com/downloads/shirky.mov it's rare you hear a techie using _love_ as operative concept, but shirky does, and does it well. -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070711/b66e1bd7/attachment.htm From Bowerbird at aol.com Wed Jul 11 13:02:23 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 11 Jul 2007 16:02:23 EDT Subject: [gutvol-d] i was wrong Message-ID: i love to admit it when "i was wrong". i have argued that google book scanning would _not_ release their o.c.r. results, even for the public-domain books they are scanning, because it is clearly against their best interests from a _business_ point-of-view... google is spending hundreds of millions of dollars on the project, so why just hand over those expensive results to competitors? it makes no business sense... however, as i reported on monday, google is indeed now giving us their o.c.r. results. so i was _wrong_. wrong wrong wrong. dead wrong. sorry about that. i think this also shows that google is _not_ just acting within the tight constraints of a "business" standpoint. which -- considering how some people like to paint 'em as advertising moneygrubbers -- is fairly enlightening. 
i hope when people sum up the totals for "do no evil", google will be rewarded for releasing this text to us... now, i'm not some google fanboy here. anyone who has looked at the o.c.r. they're releasing will recognize that it is shit. it is badly in need of correction. and i'm _certain_ that google has the know-how in-house to clean it up... so if they really want to impress us, release _that_ text... in the meantime, though, i'm not gonna complain about the low quality of this o.c.r. text, i'm just gonna clean it... the other thing that must be mentioned here, in fairness, is that there is _some_ reason to believe that google has released this text only because they feel that they _must_, to avoid any criticism (and perhaps even legal action) from visually-impaired people who can't use screen-readers on the page-scans that google was _originally_ offering to us. if that's the case, then maybe the act isn't quite so generous after all. then again, perhaps they've done it in the name of "accessibility" -- even though they don't feel that they must -- because they _think_it's_the_right_thing_to_do_, in which case i believe they should be applauded. and then we should take the text they've made available, and run with it. run far with it. anyway, one more time, for all those people out there who can't hear me say it enough: i was wrong. *** and that's not the only thing i was wrong about. i said that nicholas hodson had found that finereader v8 was significantly better than v7, but in attempting to confirm that, it appears that my memory wasn't entirely accurate, and that v8 might not give recognition that is all _that_ much better... i'm following up on it, and will give you the solid facts later, but in view of the fact that everyone agrees that v8 is slower, the case that it is _clearly_ superior to v7 is somewhat shaky. so let me just say "i was wrong" about that too, and get it over. -bowerbird ************************************** See what's free at http://www.aol.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070711/157baf53/attachment.htm From schultzk at uni-trier.de Wed Jul 11 23:56:19 2007 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Thu, 12 Jul 2007 08:56:19 +0200 Subject: [gutvol-d] i was wrong In-Reply-To: References: Message-ID: Hi BB, Am 11.07.2007 um 22:02 schrieb Bowerbird at aol.com: > i love to admit it when "i was wrong". > > i have argued that google book scanning would _not_ > release their o.c.r. results, even for the public-domain > books they are scanning, because it is clearly against > their best interests from a _business_ point-of-view... > > google is spending hundreds of millions of dollars > on the project, so why just hand over those expensive > results to competitors? it makes no business sense... Ahh! You have erred Again. I can not give you an it's true motives. Yet I would say they are building a market. How does Google makes it's money! If Google can sayso and so many are visting us that means $$$$$ for them. They probably realized that most prefer text over images. So as any good company it has change it's business model. Actually, nothing surprising. [Snip, snip, snip] > anyway, one more time, for all those people out there who > can't hear me say it enough: i was wrong. > So you are human after all !! ;-))) (I could help that one) regards Keith. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070712/ce93e0e7/attachment.htm From Bowerbird at aol.com Thu Jul 12 11:09:05 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 12 Jul 2007 14:09:05 EDT Subject: [gutvol-d] where i'm going with this Message-ID: scans and o.c.r. output of thousands and thousands of public-domain books are now readily available to any american with internet access... thank you google! while the o.c.r. often needs heavy corrections, once that editing is done, the text of the entire book can be put into one file and transmitted at will. this text file can be accompanied by image-scans of the illustrations, or even included in the same zip file that contains all the page-scans (in which case the illustrations wouldn't have to be treated separately). the robustness of text-files has been proven over and over, historically. while other file-formats come and go, blowin' in the wind, the text-file -- especially when it is converted to new physical media -- is _solid_... combined with .png or .jpg (both for page-scans and/or illustrations), we can be assured that this package will be readable far into the future. now, if only there was a way to code formatting into those text-files. oh wait, there is, thanks to light-markup systems. we're good to go! *** google has been giving us the scans, and now the text as well. so we have everything we need in order to proof an entire book. that is, even _before_ the text is corrected, the o.c.r. output of a book can be packaged with the scan-set, where both are used as _input_ to a tool that helps a person _do_ the corrections for that book, by comparing o.c.r. text for each page to the image-scan for the page... so now you see where i'm going with this... because long-time lurkers on this list will remember that i have been saying for many years now that i've written such a tool -- codenamed "banana-cream" -- but this has been returned with much skepticism by my antagonists here. they've challenged it, calling it "vaporware"... because i never released the thing, their little echo-chamber did a fine job of convincing itself that their allegations had some merit... um, sorry charlie... banana-cream has been alive and well all this time, waiting for her time to go on stage, and now that glorious time has come, yes it has. the funny thing is, i was already grooming her for appearance soon. because i was tired of keeping her under wraps. i was gonna use the books i have on z-m-l.com as her examples, and i'll probably continue with that "controlled environment" approach, but will now also whip her into shape to tame the wild google monster too. the time has finally come that i saw as inevitable some 2.5 years back: > http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005& post=2005-01-01,1 *** so let's talk a minute about how a member of the general public might go about this task of cleaning up the o.c.r. for a book... now, one of the things that happens, in the course of this cleaning, is that you find yourself jumping all over the book, checking out things. in order to perform this frequent behavior as expeditiously as possible, you need to boil it down to its essence. if you want to jump to page 69, the best tool will let you type in a "6" and then a "9" and then hit [enter] and boom! you're on page 69, with text on one side, image on the other, press 123[enter] and boom!, you should be on page 123, just like that... once again, editable text on one side, the page-scan on the other side. 
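to make that jump-to-page idea concrete, here is a minimal console sketch of the text-beside-scan navigation -- emphatically not banana-cream itself, just an illustration of the behaviour described above and below. the per-page file layout (pages/069.txt, scans/069.png) and BASE_URL are invented for the example; the page-scan is fetched from the web only when a local copy isn't already cached, along the lines of the re-use-don't-re-download approach described a little further down.

# a rough sketch (not banana-cream) of jump-to-page proofing navigation.
# the directory names, file-naming scheme, and BASE_URL are assumptions
# made up purely for illustration.

import os
import urllib.request

BASE_URL = "http://example.org/mybook/scans/"   # hypothetical scan location
TEXT_DIR = "pages"                              # one .txt file of o.c.r. text per page
SCAN_DIR = "scans"                              # one .png page-scan per page

def scan_path(page):
    """Return the local path of the page-scan, downloading it once if absent."""
    name = "%03d.png" % page
    local = os.path.join(SCAN_DIR, name)
    if not os.path.exists(local):               # cache miss: fetch and keep for re-use
        os.makedirs(SCAN_DIR, exist_ok=True)
        urllib.request.urlretrieve(BASE_URL + name, local)
    return local

def page_text(page):
    """Return the o.c.r. text for one page, or a placeholder if it is missing."""
    path = os.path.join(TEXT_DIR, "%03d.txt" % page)
    if not os.path.exists(path):
        return "(no text for this page yet)"
    with open(path, encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    while True:
        entry = input("page> ").strip()         # e.g. type 69 and hit enter
        if not entry.isdigit():
            break                               # anything non-numeric quits
        page = int(entry)
        print(page_text(page))                  # the editable text for one side...
        print("scan:", scan_path(page))         # ...and the scan for the other

run it in a folder holding the per-page text files, type 69 and hit [enter], and you get back the o.c.r. text for page 69 plus the local path of its scan.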
> http://snowy.arsc.alaska.edu/bowerbird/bc003.jpg and, if you have room in the middle, a formatted display of the text, so you can compare to ensure you used the proper z.m.l. for the situation. > http://z-m-l.com/go/triple.jpg (yes, "digitizing" a text means formatting it as well, not just proofing it.) so, we see the basic user-interface for the tool -- a text/scan hybrid -- where we can jump directly to any page. when displaying a page, the tool should fetch the scan from the same directory containing the text file or -- if the scan isn't there -- download it from the web to that local folder. it's important to save it for reuse, not to download the thing every time. moreover, this approach should be utilized by _any_ viewer-program; it's important that we have each book stored online, so that anyone can access it from there at any time. but there's no reason to have a person continually download the same material over and over while re-reading. so a plan to methodically mirror the content locally is the best approach... again, that's what banana-cream does... it will also download the scans in a batch via an unattended process, which is how you'll likely do it, but if you have broadband, you can have it download _while_ you work and you'll find it hard to stay up with its download (e.g., 12 pages/minute)... of course, you want the program to _flag_possible_errors_ that it finds on a page, so you can check 'em. if they do need correcting, you want the tool to _faciliate_ that. again, that, too, be what banana-cream do... all this will become much more clear when i actually release the app, so i'm gonna go off and work on that for a little while. see you later. (i'd say you can expect to see the initial release of the basic engine as early as next week, with various checks being bundled in on a regular basis after that, providing there is any interest in the app. of course, if no one cares, then i won't bother to work on it much.) *** (don't worry, the series about the inefficiency of the d.p. workflow _will_ continue, it's just that with all of these recent developments, i've got more important things to do right now...) *** oh yeah, having the o.c.r. from google readily available also means -- since we have the o.c.r. from the open content alliance as well -- that we can also now focus some serious attention on the strategy of locating glitches in o.c.r. via _comparison_ of different o.c.r. passes. my research into this has found that this approach can give results that are simply _amazing_. i can't remember how much of that i've shared with this list, but i documented my research _thoroughly_ over on the d.p. forums. (and what a big waste of time _that_ was!) if you want to review it (and see how d.p. people ignored it), visit the thread i started over there titled "revolutionary o.c.r. proofing": > http://www.pgdp.net/phpBB2/viewtopic.php?t=24008 -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070712/1af92a9c/attachment-0001.htm From Bowerbird at aol.com Thu Jul 12 15:57:08 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 12 Jul 2007 18:57:08 EDT Subject: [gutvol-d] putting your novel online for free Message-ID: think only unknowns put their novel online for free? think again. 
these days, even nobel-prize-winners-for-literature -- like elfriede jelinek (class of 2004) -- are doing it. let's make sure we _reward_ brave pioneers like this, with our appreciation and a little bit of cold hard cash, so our cyberlibrary of the future is an appealing place... > http://www.elfriedejelinek.com/ -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070712/3e6c3285/attachment.htm From Bowerbird at aol.com Fri Jul 13 10:37:13 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 13 Jul 2007 13:37:13 EDT Subject: [gutvol-d] digitizing rare and inaccessible books for print-on-demand Message-ID: kirtas, the company that makes the neat scanning machines > http://www.kirtastech.com/ made a very interesting announcement last month: > http://www.kirtas-tech.com/newsletterNew.asp?ID=26 they are joining together with some libraries (cincinnati public and toronto public) and universities (emory, university of maine) to scan thousands of rare and inaccessible books and then distribute them via amazon.com's print-on-demand service. no word on whether the _electronic_ versions will be free... on the one hand, since this is being pitched (at least by kirtas) as a way that these institutions can generate a cash-flow that supports the cost of the scanning, you might presume "no"... on the other hand, one of the things that libraries _do_ is to provide access to books for free, so you might presume "yes". either way, it's good news for book digitizers, because we can always buy one copy of the printed book, digitize it ourselves, and then release the resultant electronic copy out to the wild... -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070713/0a0bce41/attachment.htm From Bowerbird at aol.com Sun Jul 15 21:49:00 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 00:49:00 EDT Subject: [gutvol-d] crop circles Message-ID: another excellent reason to crop your scans consistently and fairly tightly is that they will work much better on the iphone if you do... :+) -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/57c5bdde/attachment.htm From desrod at gnu-designs.com Mon Jul 16 12:43:45 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Mon, 16 Jul 2007 15:43:45 -0400 Subject: [gutvol-d] Speaking of OCR and Captcha... Message-ID: <1184615025.9208.2.camel@localhost.localdomain> I didn't see mention of this on the list, and a quick Google search intersecting Project Gutenberg with this other project didn't produce many relevant results, but I think this could have real promise for digitizing the PG archives of OCR'd scans: --- http://www.recaptcha.net/ reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. 
More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly. But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct. Currently, we are helping to digitize books from the Internet Archive --- -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com http://projects.plkr.org/ Skype...: 860-967-3820 From f.fuchs at gmx.net Mon Jul 16 13:52:49 2007 From: f.fuchs at gmx.net (Franz Fuchs) Date: Mon, 16 Jul 2007 22:52:49 +0200 Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library Message-ID: http://www.aaronsw.com/weblog/openlibrary links to http://demo.openlibrary.org/about From Bowerbird at aol.com Mon Jul 16 14:17:52 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 17:17:52 EDT Subject: [gutvol-d] Speaking of OCR and Captcha... Message-ID: luis von ahn is a brilliant fellow. a macarthur fellow, in fact, and that's a good thing, because i think only a genius could make this effective... let me know if it ends up being worthwhile... -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/fd00b87f/attachment.htm From Bowerbird at aol.com Mon Jul 16 14:20:34 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 17:20:34 EDT Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library Message-ID: somebody's gotta clean up that rat's nest over at internet archive. and aaron might be just the guy. i'd call him a boy genius, except he is no longer a boy, and he hasn't won the macarthur prize. yet. let me know if this ends up being worthwhile... -bowerbird ************************************** Get a sneak peak of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/64212d87/attachment.htm From bzg at altern.org Mon Jul 16 16:47:31 2007 From: bzg at altern.org (Bastien) Date: Tue, 17 Jul 2007 01:47:31 +0200 Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library In-Reply-To: (Bowerbird@aol.com's message of "Mon\, 16 Jul 2007 17\:20\:34 EDT") References: Message-ID: <87odibc3n0.fsf@bzg.ath.cx> Bowerbird at aol.com writes: > and aaron might be just the guy. i'd call him a boy genius, except > he is no longer a boy, and he hasn't won the macarthur prize. yet. Hey! This list was more fun when you were the only "genius" around this place... i'm certainly getting a bit old. 
-- Bastien From Bowerbird at aol.com Mon Jul 16 19:59:45 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 16 Jul 2007 22:59:45 EDT Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library Message-ID: bastien said: > i'm certainly getting a bit old. i'm an old fart too... :+) but aaron, he's a whippersnapper. he was an internet celebrity at 16. (for coding, not for dating paris...) now, at 21, he's a college dropout. (stanford, if i remember correctly...) oh well, didn't seem to hurt bill gates _or_ steve jobs, i'm sure he'll recover. ;+) -bowerbird p.s. aaron is also the co-inventor of "markdown", the famous light markup. and i see he worked it into the new site. ************************************** Get a sneak peek of the all-new AOL at http://discover.aol.com/memed/aolcom30tour -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20070716/7826a307/attachment.htm From robert_marquardt at gmx.de Mon Jul 16 23:46:57 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Tue, 17 Jul 2007 08:46:57 +0200 Subject: [gutvol-d] 10.000 downloads of the SF CD Message-ID: <7cpo93hh63hun29mbdgbbmo7fpafbc6aeh@4ax.com> Should happen today. A good 1.4 terabyte of downloads. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From bzg at altern.org Tue Jul 17 09:00:44 2007 From: bzg at altern.org (Bastien) Date: Tue, 17 Jul 2007 18:00:44 +0200 Subject: [gutvol-d] Aaron Swartz: Announcing the Open Library In-Reply-To: (Franz Fuchs's message of "Mon\, 16 Jul 2007 22\:52\:49 +0200") References: Message-ID: <877iozqatv.fsf@bzg.ath.cx> "Franz Fuchs" writes: > http://www.aaronsw.com/weblog/openlibrary > http://demo.openlibrary.org/about I think it might be interesting to connect the Open Library and the Freebase Project: http://www.freebase.com -- Bastien From hart at pglaf.org Thu Jul 26 09:01:12 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 26 Jul 2007 09:01:12 -0700 (PDT) Subject: [gutvol-d] !@!Re: OCR question (fwd) Message-ID: ---------- Forwarded message ---------- Date: Thu, 26 Jul 2007 11:55:46 -0400 From: Zack To: Michael S. Hart Subject: Re: OCR question Sounds good, thanks. Michael Hart wrote: > > With you permission, I will forward you question around. > > Michael > > > On Sun, 22 Jul 2007, Zack wrote: > >> Hello, >> >> I have a photocopy of an 1836 book that I made >> in a library, and I'd like to OCR it and submit it to >> your project. However most of the page images >> of pages that were not flat when I photographed them, >> and I was wondering if you might know of any program >> that can transcribe text that is curved. My hope is >> that some computer science grad student somewhere >> has worked on this problem and found a way to >> find the lines of text even when they are curved, and >> has made his/her software free. >> >> Thanks, >> Zack Smith >> > From greg at durendal.org Thu Jul 26 09:51:45 2007 From: greg at durendal.org (Greg Weeks) Date: Thu, 26 Jul 2007 12:51:45 -0400 (EDT) Subject: [gutvol-d] !@!Re: OCR question (fwd) In-Reply-To: References: Message-ID: >> On Sun, 22 Jul 2007, Zack wrote: >> >>> Hello, >>> >>> I have a photocopy of an 1836 book that I made >>> in a library, and I'd like to OCR it and submit it to >>> your project. However most of the page images >>> of pages that were not flat when I photographed them, >>> and I was wondering if you might know of any program >>> that can transcribe text that is curved. 
My hope is >>> that some computer science grad student somewhere >>> has worked on this problem and found a way to >>> find the lines of text even when they are curved, and >>> has made his/her software free. A program called unpaper can undo much of the distortion. I use the gimp and do it by hand usually. -- Greg Weeks http://durendal.org:8080/greg/ From piggy at netronome.com Thu Jul 26 11:18:51 2007 From: piggy at netronome.com (La Monte Henry Piggy Yarroll) Date: Thu, 26 Jul 2007 14:18:51 -0400 Subject: [gutvol-d] !@!Re: OCR question (fwd) In-Reply-To: References: Message-ID: <46A8E58B.7010408@netronome.com> Greg Weeks wrote: >>> On Sun, 22 Jul 2007, Zack wrote: >>> >>> >>>> Hello, >>>> >>>> I have a photocopy of an 1836 book that I made >>>> in a library, and I'd like to OCR it and submit it to >>>> your project. However most of the page images >>>> of pages that were not flat when I photographed them, >>>> and I was wondering if you might know of any program >>>> that can transcribe text that is curved. My hope is >>>> that some computer science grad student somewhere >>>> has worked on this problem and found a way to >>>> find the lines of text even when they are curved, and >>>> has made his/her software free. >>>> > > A program called unpaper can undo much of the distortion. I use the gimp > and do it by hand usually. > > There is a web service which does a pretty good job here: http://quito.informatik.uni-kl.de/dewarp/dewarp.php I wish they would just release the source code instead; then we could fix the fact that it only works with jpeg images.
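a small practical footnote to that jpeg complaint: whatever format the photographed pages are saved in, they can be batch-converted to jpeg before being handed to the dewarping service. a minimal sketch, assuming the Pillow imaging library (an outside assumption -- it is not mentioned in the thread) and made-up folder names:

# batch-convert photographed pages to jpeg so the jpeg-only dewarping
# service can accept them. Pillow ("pip install Pillow") and the folder
# names are assumptions for the sake of the example.

import os
from PIL import Image

SRC_DIR = "photos"   # the library photographs, in whatever format they were saved
OUT_DIR = "jpegs"    # jpeg copies to hand to the dewarping service

os.makedirs(OUT_DIR, exist_ok=True)
for name in sorted(os.listdir(SRC_DIR)):
    base, ext = os.path.splitext(name)
    if ext.lower() not in (".png", ".tif", ".tiff", ".bmp"):
        continue                                 # skip files that are already jpeg, or not images
    img = Image.open(os.path.join(SRC_DIR, name))
    img = img.convert("L")                       # grayscale is plenty for o.c.r. work
    img.save(os.path.join(OUT_DIR, base + ".jpg"), "JPEG", quality=90)
    print("wrote", base + ".jpg")

the same loop is also a handy place to grayscale or downsample the photos before any o.c.r. pass.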