From Bowerbird at aol.com Wed Nov 1 15:50:04 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Wed Nov 1 15:50:09 2006
Subject: [gutvol-d] gvd061101 -- the niceties of book typography
Message-ID: <4ad.3840eb24.327a8cac@aol.com>
here's your next issue in our "open-source" project, "babelfish".
first, a little more background on our sample book, "my antonia".
you can download an .html version from project gutenberg at...
oh wait... there _is_ no .html version from project gutenberg,
just the plain-text version. gee, that's too bad, isn't it?
that means all the people who would like an .html version are
out of luck. and without an .html version, we cannot easily get
automatic conversions to the various e-book formats. even the
offered plucker conversion will likely be a straight text-dump...
a straight text-dump is _not_ an "electronic-book", not in my book.
oh well, you _can_ download it in .html from manybooks.net:
> http://manybooks.net/titles/catherwietext95myant11.html
indeed, the "custom .html" lets you specify certain parameters,
like the fonts used (ok, you get a choice of 5, but it _is_ a choice),
size (10pt-16pt), leading (1x, 1.25x, 1.5x, 1.75x), justification,
and various indentation and margin parameters. quite nifty!
it'd be neat if project gutenberg offered something like this...
(i should add i was unable to get a working download from this.)
i give huge props to matthew mcclintock, who runs manybooks.net.
since the sidelining of blackmask.com, he is the go-to guy for those
people who want to find p.g. e-texts in the various handheld formats.
he's put a converter on his site that can export out to all these formats:
> PDF
> PDF Large Print
> eReader
> Doc
> Plucker
> iSilo
> zTXT
> Rocketbook
> iPod Notes
> Sony
> TCR
> iRex iLiad PDF
> Custom PDF
> Custom HTML
> RTF
> Newton
> Mobipocket
wow. that's very impressive. it seems that we don't even have to build
any conversion capability at all, we just have to feed matthew some files!
i appreciate all the hard work that he's done in providing such a service!
there is a glitch, though. and it's the same one we experienced above:
a straight text-dump is _not_ an "electronic-book", not in my book,
and some manybooks.net conversions are simply straight text-dumps.
(that's not matthew's fault, i'm just stating it as a pure observation.)
i won't dwell on this, i'll just give a few examples.
i downloaded the regular .pdf version, so you can download that
if you want to look at the exact thing that i saw when i wrote this...
here's the "titlepage" of "my antonia", as shown in the .pdf:
> http://snowy.arsc.alaska.edu/bowerbird/misc/anttitle.jpg
ouch. not very pretty. the original titlepage looked like this:
> http://snowy.arsc.alaska.edu/bowerbird/myant/myantf003.png
that's what we expect a title page to look like.
and here's the scan of page #181 of the book:
> http://snowy.arsc.alaska.edu/bowerbird/myant/myantp181.png
but here's what that same page looks like in the .pdf.
> http://snowy.arsc.alaska.edu/bowerbird/misc/weevil.jpg
not only is the new chapter not at the head of a page,
it isn't bigger or bold like we expect of headers, either.
even worse, the poem in the epigraph is not just
unformatted, but it is even incorrectly wrapped...
typographical niceties like these have been the hallmark
of paper-books for over 100 years, so it's embarrassing
when our newfangled e-books fail to clear that standard.
and many people report that it is a huge turn-off to them.
(and it's hard to tell 'em not to be so picky, because frankly,
when something falls so far below expectations, it _is_ bad.)
nor are some of the _advances_ that we expect of e-books
present here (e.g., there is no hotlinked table of contents).
in these areas, we want our open-source project to do better.
we want it to be able to make a first-rate e-book in .html form,
to the extent that such an animal is possible, so that the various
converter-programs have optimal input for best possible output.
(in this regard, one of the first things i do to a p.g. e-text is to
strip off the header and footer. sorry, chaps, but they're ugly.
and besides, the very first item in an e-book file should be
_the_title_of_the_book_, and the next should be the author.
again, sorry, but that's just the way that it should be, period.)
***
before today's exercise, let me remind people once again that...
...i am a beginner with perl...
my code is _not_ something you should emulate.
(kids, do not try this at home. you might get hurt.)
my formatting of that code, especially, is "unusual",
and will not look very familiar to most perl people.
so be it. i hate those stupid curly braces. hate 'em.
i'll repeat: i'm a beginner with perl.
moreover, _that_is_the_point_. (cue the ring of a bell here.)
you don't need to do anything more than copy sample code
out of a programming primer to get some good functionality,
_providing_ that the file-format of your e-book is dirt-simple.
if your format is complex, like docbook or .tei or x.m.l., then
you're gonna need a sophisticated programmer to get _any_
functionality out of your e-texts, and it'll be slow in coming...
simple is better.
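to show just how little code a dirt-simple format needs, here's a
sketch in python. (the single formatting rule it assumes -- "paragraphs
are separated by blank lines" -- is for illustration only, it is not a
statement of the actual .zml spec.)

```python
# a dirt-simple e-book format: paragraphs are separated by blank lines.
# (an illustrative assumption, not the actual .zml conventions.)

def get_paragraphs(text):
    """split plain text into paragraphs on runs of blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

sample = "first line\nof paragraph one.\n\nparagraph two."
print(get_paragraphs(sample))
```

that's primer-level code, and it already gives you the structure you
need to display pages, count paragraphs, or feed a converter.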
so i thank my "critics" who characterized my perl as elementary.
they've done a better job of making my point than i could have...
***
for your reference:
> http://www.greatamericannovel.com/myant/myantp123.html
you will remember that i am still looking for a contribution to our
open-source thing, in the form of c.s.s., but here goes anyway...
so today's assignment is: churn out the code for that page 123.
> #!/usr/bin/perl
> use CGI::Carp qw(fatalsToBrowser);
>
> ########## read the file in...
> $filename="/home2/yoursiteinfohere/public_html/myant/myant-lf.zml";
> open (inf,"$filename") or print "that file was not available...\n";
> read (inf,$thebook,2222222); close inf;
>
> ########## changes made here include the c.s.s. stylesheet...
> print "content-type: text/html\n\n";
> print ''; print "\n"; print "\n";
> print '';
>
> ########## and all the lines on the page...
> foreach $oneline (@oneline) {
> $nn++; if ($nn ne "1" and $nn ne "2" and $nn < $maxminustwo) {
> print $oneline;
> if ($oneline ne "") {print ' '; print "\n";}
> if ($oneline eq "") {print ''; print "\n";}
> }}
>
> ########## then the pagenumber...
> print ''; print "\n";
> print "\n";
>
> ########## now put in the error-reporting form...
> print ''; print "\n";
> print "\n";
you can see the results of this code by running this script:
> http://www.greatamericannovel.com/scgi-bin/babelfish10.pl
there are a number of things to notice about this particular routine,
all of which will be dealt with in further detail in coming days...
first, i've reworked the .html so as to make use of a .css stylesheet.
(this lets me indent paragraphs, instead of using blank lines between.
it also allows me to have a proportional-spaced font, not that dreadful
monospaced font that is the default whenever you use the "pre" tag.
and of course the c.s.s. will help us in the future, on the pages which
have various structural features that we will want to display properly.)
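for what it's worth, the c.s.s. involved is tiny. here's a sketch of
the kind of rules i mean (the selectors and font choices here are
placeholders for illustration, not the actual contents of my stylesheet):

```css
/* indent paragraphs instead of separating them with blank lines */
p { text-indent: 1.5em; margin: 0; }

/* a proportional font, not the monospaced default of the "pre" tag */
body { font-family: georgia, serif; line-height: 1.25; }
```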
second, i have included some links to help the user with navigation,
with one set of them at the top, and an identical set at the bottom...
third, i've pulled in the scan of the page, for easy comparison.
this is necessary when we want to do "continuous proofreading".
fourth, i've added a form that readers can use to report errors,
another essential aspect of our "continuous proofing" system...
i was going to add in each of these things on a separate day,
but i figured you could absorb the shock of all of them at once.
still, at the heart of this routine, we're displaying the text of a page,
something we had already worked out previously. and indeed, this
routine to display a page is the main "engine" in an e-book program.
as to this code, it does a good job of presenting one page, #123.
the links to the surrounding pages (like page 122 and 124)
are hardwired, however, so tomorrow's exercise will require
that we turn them into variables, so that this routine will be
able to present _any_ page in the book, not just page 123...
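for the impatient, the gist of tomorrow's exercise is small: let the
page-number ride in as a variable instead of being hardwired. here's a
sketch of the idea in python (the script name and query parameter are
made up for illustration):

```python
def page_links(pagenumber, lastpage):
    """build previous/next links for any page, instead of
    hardwiring "page 122" and "page 124" into the routine;
    clamp at the first and last pages of the book."""
    prevpage = max(1, pagenumber - 1)
    nextpage = min(lastpage, pagenumber + 1)
    return ("babelfish.pl?page=%d" % prevpage,
            "babelfish.pl?page=%d" % nextpage)

print(page_links(123, 419))
```

once the links are computed this way, the same routine serves every
page in the book.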
go ahead, feel free to have a pass at modifying this routine.
after all, that's the point of open-source, that people can
just jump in and join the coding fun any time they want to!
***
now, for some other commentary...
***
oh geez, part 4...
in other news today, some people are unhappy with
the iliad e-book-machine, because it takes 40 seconds
to boot up, and you have to shut it down if you're not
reading because otherwise the battery will run down...
it ends up that people consider this slow boot-up time
to be very "unpaperlike", which is the main claim to fame
that e-ink has been bragging about. that's not all, either, since
a relatively slow page-turning time is another liability...
and we won't even talk about a price that is over $700.
our good friend david rothman has this to say:
> Shortcomings like this should long have been solved
only an idiot would have had the expectation that
an early version of this product would be _free_ of
such "shortcomings" as this one...
and only a _pure_ idiot would have led other people on,
in terms of creating that stupid expectation in them...
and only the most _extreme_ of pure idiots would then
lash out at the product-maker for failing to live up to
the unreasonable expectations that the idiot had created.
the unmitigated bile of unrealized hype can be very nasty.
-bowerbird
p.s. above, i commented on the lack of formatting on an epigraph.
as you can see, by referring to my version of that same page,
> http://www.greatamericannovel.com/myant/myantp181.html
i have chosen to format the poem differently than it was formatted
in the paper-book, which is my prerogative as a re-publisher...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20061101/95ae46c6/attachment-0001.html
From jon at noring.name Wed Nov 1 18:11:09 2006
From: jon at noring.name (Jon Noring)
Date: Wed Nov 1 18:17:49 2006
Subject: [gutvol-d] Line by line proofing of OCR text?
Message-ID: <73794793.20061101191109@noring.name>
Everyone,
For quite a while I've been wondering if line-by-line proofing of
OCR text will result in more accurate results, with higher efficiency
compared to the side-by-side page editing used in the initial
proofing stage at Distributed Proofreaders.
The difficulty I have with page-by-page editing is that the OCR text
and the original page scans are big blocks that sit side-by-side, and
I find during proofing that I have to move my eyes back and forth,
which to me is fairly tiring as I try to realign my view -- it
definitely slows me down, and I always sense I may be missing something.
Now, if instead we had the following in our proofing window display:
original scan line --> development. He is always able to raise capi-
OCR text/edit window --> developmenl, He is always.able to ra6e capi-
[Of course the original scan line is an actual image of the line, not
ASCII text as shown above. It is scaled as close as possible to the
OCR text line below which is user-editable. And the OCR text example
is something I made up, so don't criticize the choice of OCR errors!
Certainly the standard PG/DP scripts can be run to remove some to most
of the OCR errors before the line-by-line human proofing stage.]
This alignment allows me to do a vertical comparison, which I think
may make it easier to spot any OCR errors. It should, at least for
some people, increase the speed and accuracy of proofing. Well,
that's the hypothesis at least.
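To make the hypothesis concrete, here is a rough sketch (Python, purely
illustrative) of the kind of vertical comparison such a tool could
automate: stack the OCR line over a marker line flagging the columns
where it disagrees with a reference string. (During actual proofing the
reference is the scan image, not a corrected string; the code only
illustrates why column alignment makes discrepancies pop out.)

```python
import difflib

def diff_markers(ocr_line, ref_line):
    """Return a marker line with '^' under the columns where the OCR
    output disagrees with the reference text, so the two lines can be
    stacked vertically and scanned in a single pass."""
    markers = [" "] * max(len(ocr_line), len(ref_line))
    sm = difflib.SequenceMatcher(None, ocr_line, ref_line)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            # mark every affected column (at least one, for insertions)
            for i in range(i1, max(i2, i1 + 1)):
                if i < len(markers):
                    markers[i] = "^"
    return "".join(markers).rstrip()

ocr = "developmenl, He is always.able to ra6e capi-"
ref = "development. He is always able to raise capi-"
print(ocr)
print(diff_markers(ocr, ref))
```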
Now, certainly it will be argued that the proofer should be able to
see the entire page scan, such as for context, simple pleasure, and to
see if there were errors in generating the page image line. I agree!
But the original page can certainly be displayed to the proofer in a
separate window or to the side. So it is possible to have both (as well
as offer both proofing methods -- there are definitely pages with odd
text layouts where page-level proofing may be more appropriate.)
So, asking the proofing mavens here, has this been tested? What are
the fatal flaws in this? I can't help but think that this has already
been thought of and discarded by Charles Franks when he started DP.
But then, technology has changed the last few years, and maybe this
idea may again be considered.
Jon
From grythumn at gmail.com Wed Nov 1 18:41:05 2006
From: grythumn at gmail.com (Robert Cicconetti)
Date: Wed Nov 1 18:47:24 2006
Subject: [gutvol-d] Line by line proofing of OCR text?
In-Reply-To: <73794793.20061101191109@noring.name>
References: <73794793.20061101191109@noring.name>
Message-ID: <15cfa2a50611011841x5abc4bacxa00c2b468827825c@mail.gmail.com>
I've used similar techniques when single-proofing in an OCR program,
and the trouble is one often needs to zoom out for context... plus the
fact that we'd need to extract character or line position
information from the OCR engines to automate matching the text to the
image.
However, what you've asked for can be manually approximated using the
horizontal interface at DP... enlarge the font size in the text window,
and increase the zoom level in the image area. You'll get three or
four lines of text, one above the other.
R C
On 11/1/06, Jon Noring wrote:
> Everyone,
>
> For quite a while I've been wondering if line-by-line proofing of
> OCR text will result in more accurate results, with higher efficiency
> compared to the side-by-side page editing used in the initial
> proofing stage at Distributed Proofreaders.
From schultzk at uni-trier.de Thu Nov 2 00:05:28 2006
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Thu Nov 2 00:05:35 2006
Subject: [gutvol-d] gvd061030 -- let's get it started in here
In-Reply-To: <45477F3D.70205@perathoner.de>
References:
<1162248212.5857.1.camel@localhost.localdomain>
<45477F3D.70205@perathoner.de>
Message-ID: <064A2694-3938-46C7-810D-4651CDCABDC6@uni-trier.de>
Hi Marcello,
I will ask my question again: Do you know what you are doing?
Why don't you take your comments to another list, please?
You are worse than a kindergarten kid.
Personally, I do not know any of the systems, but from what I have
heard and seen they are all primitive; none do the JOB, and none will.
I know what it takes to do the job! It is my profession: linguistics.
Have you heard of SGML? If that is too complex (not complicated), then
use XML. But the problem is not the format, it is getting the
formatting done.
According to my analysis so far, automatic formatting can only be done
to a max of 80%. The rest has to be proofed manually.
In the early days of PG I discussed the matter with Michael Hart:
plain ASCII is not enough. Today, computers have advanced and
computing power is abundant. My opinion is that PG should use a markup
language from the start.
Sure, the scanning and especially the proofing of the text will take
a little longer, but the benefits are far greater.
The markup should contain:
Chapter,
section,
character formatting,
PG header,
picture, sound, etc.
tags.
Hey, XML can do all that. All we need is a common XML template. One
format! A known structure, a few filters, and voila: a neat package,
exactly what everybody is trying to create.
If you scan into Word and use a few macros (or one big one), you can
get 90-95% of the markup done.
Now add 10% more time for proofing, and you guys and gals have just
what you will ever need.
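A minimal template along those lines might look like this (the tag
names here are invented purely for illustration, not an existing PG
schema; the point is only that one common structure suffices):

```xml
<!-- hypothetical common template for a PG text -->
<pgbook>
  <pgheader>Project Gutenberg boilerplate goes here</pgheader>
  <chapter n="1">
    <section>
      <p>Plain paragraph text, with <i>character formatting</i>.</p>
      <picture src="illustration-001.png"/>
    </section>
  </chapter>
</pgbook>
```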
regards
Keith.
Am 31.10.2006 um 17:52 schrieb Marcello Perathoner:
> David A. Desrosiers wrote:
>
>> Its obvious from reading the snippets, that it is indeed copied
>> out of a
>> rudimentary Perl primer, and not touched by anyone who has a strong
>> grasp of the power of the language at hand.
>
> He's a baby that makes poo in the chamberpot for the first time and
> thinks his parents are watching him because they want poo.
>
>
>> Exactly what is it you are trying to prove with this anyway? We
>> know how
>> to write parsers that can chew up and spit out a Gutenberg etext into
>> other formats, I don't think that's the core of the problem here.
>
> He's just inventing warm water (and trying to get credit for it).
>
> This parser is online. It converts any PG text into a plucker
> database.
> And it is open source and written in gasp! python. We have served
> 130,000 plucker texts in October this way. The only guy who hasn't
> noticed yet is him who notices everything.
>
> There are a few other PG parsers around like GutenMark and my PG to
> TEI
> converter. All of them are open source and working today. So its only
> natural that you-know-who will hold his non-working
> at-the-rate-its-going-never-to-be-released zml parser against them,
> just
> for the fun of causing confusion. Ever wondered who pays him to
> fuzz and
> fudge?
>
>
>
> --
> Marcello Perathoner
> webmaster@gutenberg.org
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d@lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d
From hyphen at hyphenologist.co.uk Thu Nov 2 00:55:49 2006
From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
Date: Thu Nov 2 00:56:22 2006
Subject: [gutvol-d] gvd061030 -- let's get it started in here
In-Reply-To: <064A2694-3938-46C7-810D-4651CDCABDC6@uni-trier.de>
References:
<1162248212.5857.1.camel@localhost.localdomain>
<45477F3D.70205@perathoner.de>
<064A2694-3938-46C7-810D-4651CDCABDC6@uni-trier.de>
Message-ID:
On Thu, 2 Nov 2006 09:05:28 +0100, "Schultz Keith J."
wrote:
|Hi Marcello,
|
| I will ask my question again: Do you know what you are doing?
|
|
| Why do not you can your comments to another list, please.
|
| Your are worse than a kindergarden kid.
|
| Personally, I do not know any of the systems, but from what I
| heard and seen they are all primitive and none do the JOB and will.
|
| I know what it takes top do the job! It is my profession: Linguistics.
|
| Have you heard of SGML? If that is to complex( not complicated) then
|use
| XML. But, the problem is not the format, but getting the formatting
|done.
| According to my analysis so far automatic formating can only be done
|to a
| max of 80%. The rest has to be proofed manually.
|
| In the early days of PG I have discussed the matter with Micheal
|Hart that
| plain ASCII is not enough. Today, computers have advance and
|computing power
| is abundant. My opinion is that PG start using a markup language
|from the start.
| Sure the scanning and especially the proofing of the text will take
|a little longer,
| but the benifits are far greater.
|
| The markup should conatin:
| Chapter
| section,
| Character formating,
| PG Header,
| picture, sound, etc
| tags.
|
| Hey XML can do all that. All we need is a common xml template. One
|format! a known straucture
| a few filters and voila. a neat package exactly what everybody is
|trying to create.
| If you scan into word, and use a few macros(or one big one) you can
|get 90-95% of the mark-up done.
| Now add 10% mor time for proofing and you guys and gals have just
|what you will ever need.
ROTFLMAO
When you learn to format things in plain text someone might listen.
--
Dave Fawthrop For Yorkshire Dialect
http://www.gutenberg.org/author/John_Hartley
http://www.gutenberg.org/author/F_W_Moorman
19,000 free e-books at Project Gutenberg! http://www.gutenberg.org
From marcello at perathoner.de Thu Nov 2 03:54:07 2006
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu Nov 2 03:54:11 2006
Subject: [gutvol-d] gvd061030 -- let's get it started in here
In-Reply-To: <064A2694-3938-46C7-810D-4651CDCABDC6@uni-trier.de>
References: <1162248212.5857.1.camel@localhost.localdomain> <45477F3D.70205@perathoner.de>
<064A2694-3938-46C7-810D-4651CDCABDC6@uni-trier.de>
Message-ID: <4549DC5F.8010002@perathoner.de>
Schultz Keith J. wrote:
> Hey XML can do all that. All we need is a common xml template. One
> format! a known straucture
> a few filters and voila. a neat package exactly what everybody is
> trying to create.
This is the reason why we have to shut up BB: People reading this list
will think that the most vociferous person represents the consensus in
PG research. Not so. BB just pesters everybody who doesn't want to hear
with *his* at best half-baked ideas about text representation and
delivery. Nobody takes BB seriously, and you shouldn't either.
The state of PG research is:
Consensus has been reached about using a subset of TEI as master format
for PG texts (since PGXML seems to be dead). Which subset is still being
discussed.
There are at least 2 different working toolchains to convert subsets of
TEI to end user formats. Files produced with these toolchains have been
posted.
Of course, everything is still in active research and can change a lot.
But nobody seriously considers using anything other than TEI or XML as
master format.
--
Marcello Perathoner
webmaster@gutenberg.org
From mattsen at arvig.net Thu Nov 2 03:57:26 2006
From: mattsen at arvig.net (Chuck MATTSEN)
Date: Thu Nov 2 04:11:57 2006
Subject: [gutvol-d] gvd061030 -- let's get it started in here
In-Reply-To: <4549DC5F.8010002@perathoner.de>
References:
<1162248212.5857.1.camel@localhost.localdomain>
<45477F3D.70205@perathoner.de>
<064A2694-3938-46C7-810D-4651CDCABDC6@uni-trier.de>
<4549DC5F.8010002@perathoner.de>
Message-ID:
On Thu, 02 Nov 2006 05:54:07 -0600, Marcello Perathoner
wrote:
> This is the reason why we have to shut up BB: People reading this list
> will think that the most vociferous person represents the consensus in
> PG research. Not so. BB just pesters eveybody who doesn't want to hear
> with *his* at best half-baked ideas about text representation and
> delivery. Nobody takes BB seriously, and you shouldn't too.
Oh, I dunno ... I think any thinking person reading the list will quickly
be able to discern the intent behind, and value of, any frequent flyer.
:-)
--
Chuck Mattsen (Mahnomen, MN)
mattsen@arvig.net
From joshua at hutchinson.net Thu Nov 2 05:29:55 2006
From: joshua at hutchinson.net (joshua@hutchinson.net)
Date: Thu Nov 2 05:30:02 2006
Subject: [gutvol-d] Line by line proofing of OCR text?
Message-ID: <15656271.1162474195305.JavaMail.?@fh1038.dia.cp.net>
At first blush, it seems like a small return on investment. The
programming required (as well as the difference in how scans/OCR are
prepared) would be very significant, while the increase in quality
would be minuscule. DP gets very good results with their current
method, and I think a better return on the programming investment would
be to implement a "roundless" system, where each page is proofed again
and again until a certain "confidence" level is reached. Easy pages may
only be seen a couple of times, while a particularly nasty page might
get seen by scores of people. (See the DP forums for lengthy
discussions of how this system might work.)
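The heart of such a roundless queue is easy to sketch (Python, with a
made-up confidence rule purely for illustration: a page counts as done
once the last two passes produced no changes):

```python
def needs_another_pass(history, required_clean=2):
    """Decide whether a page needs more proofing passes.  history is
    the list of successive text versions of the page; the page counts
    as done once the last required_clean passes made no changes."""
    if len(history) < required_clean + 1:
        return True  # not enough passes yet to build any confidence
    recent = history[-(required_clean + 1):]
    return any(a != b for a, b in zip(recent, recent[1:]))

# an easy page settles after a couple of clean passes:
print(needs_another_pass(["text", "text fixed", "text fixed", "text fixed"]))
```

The real scheduling logic would be more involved, of course, but the
bookkeeping itself is not where the developer time goes.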
But, as always, the bottleneck is developer time. We ALWAYS have more
work than we have volunteers to do it.
Josh
>----Original Message----
>From: jon@noring.name
>Date: Nov 1, 2006 21:11
>To:
>Subj: [gutvol-d] Line by line proofing of OCR text?
>
>Everyone,
>
>For quite a while I've been wondering if line-by-line proofing of
>OCR text will result in more accurate results, with higher efficiency
>compared to the side-by-side page editing used in the initial
>proofing stage at Distributed Proofreaders.
>
>The difficulty I have with page-by-page editing is that the OCR text
>and the original page scans are big blocks that sit side-by-side, and
>I find during proofing that I have to move my eyes back and forth,
>which to me is fairly tiring as I try to realign my view -- it
>definitely slows me up and I always sense I may be missing something.
>
>Now, if instead we had the following in our proofing window display:
>
>original scan line --> development. He is always able to raise
capi-
>OCR text/edit window --> developmenl, He is always.able to ra6e
capi-
>
>[Of course the original scan line is an actual image of the line, not
>ASCII text as shown above. It is scaled as close as possible to the
>OCR text line below which is user-editable. And the OCR text example
>is something I made up, so don't criticize the choice of OCR errors!
>Certainly the standard PG/DP scripts can be run to remove some to
most
>of the OCR errors before the line-by-line human proofing stage.]
>
>This alignment allows me to do a vertical comparison, which I think
>may make it easier to spot any OCR errors. It should, at least for
>some people, increase the speed and accuracy of proofing. Well,
>that's the hypothesis at least.
>
>Now, certainly it will be argued that the proofer should be able to
>see the entire page scan, such as for context, simple pleasure, and
to
>see if there were errors in generating the page image line. I agree!
>But the original page can certainly be displayed to the proofer in a
>separate window or to the side. So it is possible to have both (as
well
>as offer both proofing methods -- there are definitely pages with odd
>text layouts where page-level proofing may be more appropriate.)
>
>So, asking the proofing mavens here, has this been tested? What are
>the fatal flaws in this? I can't help but think that this has already
>been thought of and discarded by Charles Franks when he started DP.
>But then, technology has changed the last few years, and maybe this
>idea may again be considered.
>
>Jon
>
>
>_______________________________________________
>gutvol-d mailing list
>gutvol-d@lists.pglaf.org
>http://lists.pglaf.org/listinfo.cgi/gutvol-d
>
From joshua at hutchinson.net Thu Nov 2 05:33:18 2006
From: joshua at hutchinson.net (joshua@hutchinson.net)
Date: Thu Nov 2 05:33:21 2006
Subject: [gutvol-d] gvd061030 -- let's get it started in here
Message-ID: <24690987.1162474398658.JavaMail.?@fh1038.dia.cp.net>
Are you sure you meant to address those comments to Marcello?
What you are talking about is what Marcello has done. bowerbird is
the kindergarten kid you seem to be talking about...
Josh
>----Original Message----
>From: schultzk@uni-trier.de
>Date: Nov 2, 2006 3:05
>To: "Project Gutenberg Volunteer Discussion"
>Subj: Re: [gutvol-d] gvd061030 -- let's get it started in here
>
>Hi Marcello,
>
> I will ask my question again: Do you know what you are doing?
>
>
> Why do not you can your comments to another list, please.
>
> Your are worse than a kindergarden kid.
>
> Personally, I do not know any of the systems, but from what I
> heard and seen they are all primitive and none do the JOB and will.
>
> I know what it takes top do the job! It is my profession:
Linguistics.
>
> Have you heard of SGML? If that is to complex( not complicated)
then
>use
> XML. But, the problem is not the format, but getting the
formatting
>done.
> According to my analysis so far automatic formating can only be
done
>to a
> max of 80%. The rest has to be proofed manually.
>
> In the early days of PG I have discussed the matter with Micheal
>Hart that
> plain ASCII is not enough. Today, computers have advance and
>computing power
> is abundant. My opinion is that PG start using a markup language
>from the start.
> Sure the scanning and especially the proofing of the text will
take
>a little longer,
> but the benifits are far greater.
>
> The markup should conatin:
> Chapter
> section,
> Character formating,
> PG Header,
> picture, sound, etc
> tags.
>
> Hey XML can do all that. All we need is a common xml template. One
>format! a known straucture
> a few filters and voila. a neat package exactly what everybody is
>trying to create.
> If you scan into word, and use a few macros(or one big one) you
can
>get 90-95% of the mark-up done.
> Now add 10% mor time for proofing and you guys and gals have just
>what you will ever need.
>
>
>
> regards
> Keith.
>
>Am 31.10.2006 um 17:52 schrieb Marcello Perathoner:
>
>> David A. Desrosiers wrote:
>>
>>> Its obvious from reading the snippets, that it is indeed copied
>>> out of a
>>> rudimentary Perl primer, and not touched by anyone who has a
strong
>>> grasp of the power of the language at hand.
>>
>> He's a baby that makes poo in the chamberpot for the first time and
>> thinks his parents are watching him because they want poo.
>>
>>
>>> Exactly what is it you are trying to prove with this anyway? We
>>> know how
>>> to write parsers that can chew up and spit out a Gutenberg etext
into
>>> other formats, I don't think that's the core of the problem here.
>>
>> He's just inventing warm water (and trying to get credit for it).
>>
>> This parser is online. It converts any PG text into a plucker
>> database.
>> And it is open source and written in gasp! python. We have served
>> 130,000 plucker texts in October this way. The only guy who hasn't
>> noticed yet is him who notices everything.
>>
>> There are a few other PG parsers around like GutenMark and my PG
to
>> TEI
>> converter. All of them are open source and working today. So its
only
>> natural that you-know-who will hold his non-working
>> at-the-rate-its-going-never-to-be-released zml parser against
them,
>> just
>> for the fun of causing confusion. Ever wondered who pays him to
>> fuzz and
>> fudge?
>>
>>
>>
>> --
>> Marcello Perathoner
>> webmaster@gutenberg.org
>>
>> _______________________________________________
>> gutvol-d mailing list
>> gutvol-d@lists.pglaf.org
>> http://lists.pglaf.org/listinfo.cgi/gutvol-d
>
>_______________________________________________
>gutvol-d mailing list
>gutvol-d@lists.pglaf.org
>http://lists.pglaf.org/listinfo.cgi/gutvol-d
>
From bill at williamtozier.com Thu Nov 2 05:36:41 2006
From: bill at williamtozier.com (William Tozier)
Date: Thu Nov 2 05:36:52 2006
Subject: [gutvol-d] Line by line proofing of OCR text?
In-Reply-To: <73794793.20061101191109@noring.name>
References: <73794793.20061101191109@noring.name>
Message-ID: <680E209C-05DF-476C-B31F-F47D08FB88BD@williamtozier.com>
On Nov 1, 2006, at 9:11 PM, Jon Noring wrote:
> So, asking the proofing mavens here, has this been tested? What are
> the fatal flaws in this? I can't help but think that this has already
> been thought of and discarded by Charles Franks when he started DP.
> But then, technology has changed the last few years, and maybe this
> idea may again be considered.
Far from being flawed, it's how I proof as well. As another
respondent already pointed out, the DP proofing interface can be
restructured to do something like this. Unfortunately, the diversity
of proofers' abilities, habits, and interface preferences makes it hard
to standardize this sort of thing. Even fonts differ from platform to
platform; if we're not working in Flash or some other typographically
fixed standard, this sort of thing is sunk.
I'd say, though, that more important than presenting single lines to
the reader, the act of forcing the gaze of the proofer to *follow*
lines is what you're really looking for.
When proofing, I always ensure that the insertion cursor in the
page's text field touches every character -- essentially I click
before the first letter, and right-arrow through the entire text. Not
least because this spell-checks every word (client-side, on my Mac),
but also because the result is a word-by-word serial visit to every
portion of the page. Even without Flash, we could imagine a number of
interface elements that do this sort of thing: Something that
serially highlights every word, two per second; an audible reader; a
requirement that the cursor visit each letter before a page is
considered done.
When I was a professional proofreader in a large academic printer,
there were a number of old tried-and-true tricks we were taught:
reading the text backwards, reading it aloud to a partner complete
with punctuation, &c. But they all boiled down to getting the reader
to look at the typeset page as a proofer, not a reader. Slowing them
down to the point where their eyes' habits were no longer
comfortable, and they saw more of everything. Prohibiting saccades,
among other things, and allowing them to pay attention to short- and
medium-scale textual patterns at the same time.
There are nearsighted little old ladies and 24-inch monitor-users
among us at DP, and their ability to customize the interface and the
presentation of the work is probably much more a boon than a threat:
it invites more people to work. What we might consider is changing
what that work is, to make it more obvious that it is not the kind of
reading they're used to.
-----
Bill Tozier
AIM: vaguery@mac.com
blog: http://williamtozier.com/slurry
plazes: http://beta.plazes.com/user/BillTozier
skype: vaguery
"Nature, however picturesque, never yet made a poet of a dullard."
--Hjalmar Hjorth Boyesen
From desrod at gnu-designs.com Thu Nov 2 06:20:32 2006
From: desrod at gnu-designs.com (David A. Desrosiers)
Date: Thu Nov 2 06:20:54 2006
Subject: [gutvol-d] gvd061101 -- the niceties of book typography
In-Reply-To: <4ad.3840eb24.327a8cac@aol.com>
References: <4ad.3840eb24.327a8cac@aol.com>
Message-ID: <1162477232.10976.36.camel@localhost.localdomain>
On Wed, 2006-11-01 at 18:50 -0500, Bowerbird@aol.com wrote:
> ...i am a beginner with perl...
^^^^^^^^
You spelled "dangerous" wrong. ;)
--
David A. Desrosiers
desrod@gnu-designs.com
http://gnu-designs.com
From jon at noring.name Thu Nov 2 07:54:35 2006
From: jon at noring.name (Jon Noring)
Date: Thu Nov 2 07:54:54 2006
Subject: [gutvol-d] Line by line proofing of OCR text?
In-Reply-To: <680E209C-05DF-476C-B31F-F47D08FB88BD@williamtozier.com>
References: <73794793.20061101191109@noring.name>
<680E209C-05DF-476C-B31F-F47D08FB88BD@williamtozier.com>
Message-ID: <187839134.20061102085435@noring.name>
I'll answer both Joshua and Bill in this message...
Joshua Hutchinson wrote:
> At first blush, it seems like a small return on investment. The
> programming required (as well as the difference in how scans/ocr are
> prepared) would be very significant, while the increase in quality
> would be minuscule. DP gets very good results with their current
> method and I think a better return on the programming investment would
> be to implement a "roundless" system, where each page is proofed again
> and again until a certain "confidence" level is reached. Easy pages may
> only be seen a couple times, while a particularly nasty page might get
> seen by scores of people. (See DP forums for lengthy discussions of how
> this system might work.)
A roundless approach definitely is smarter. Compare a page edit with
the prior edit, and when one does not see any new corrections, maybe
twice or three times in a row, there's high confidence the page has
been proofed to zero errors.
Since it seems like the real bottleneck at present in DP (at least
this is my understanding) is the latter stages, not the initial
proofing, there should be no loss in throughput by implementing
this page edit comparison to get, hopefully, very high accuracy.
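A sketch of this stopping rule in Python (the function name and shape are mine, not any actual DP code): a page counts as done once the last few passes each left the text unchanged.

```python
def page_is_done(versions, quiet_passes=2):
    """Return True once the most recent `quiet_passes` proofing
    passes each left the page text unchanged from the pass before.

    `versions` is the page text after each successive pass,
    oldest first (versions[0] is the raw OCR output)."""
    if len(versions) < quiet_passes + 1:
        return False  # not enough passes yet to be confident
    recent = versions[-(quiet_passes + 1):]
    # every adjacent pair in the recent window must be identical
    return all(a == b for a, b in zip(recent, recent[1:]))
```

An easy page that two proofers pass through unchanged is retired; a page that keeps accumulating corrections simply never satisfies the quiet window and keeps circulating.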
> But, as always, the bottleneck is developer time. We ALWAYS have more
> work than we have volunteers to do it.
Yep, this is one of the Laws of the Universe: There will never be
enough developers to do the job as one wants.
Bill Tozier wrote:
> I'd say, though, that more important than presenting single lines to
> the reader, the act of forcing the gaze of the proofer to *follow*
> lines is what you're really looking for.
Yes, this is definitely one of the problems I have with the current
system, knowing where one is on both the original page scan and the
text edit box. It requires effort for the mere mortal to realign
oneself as one goes back and forth between the original and the
proofed text. This realignment is, for a mere mortal like me at least,
pretty uncomfortable and quite inefficient.
For those with photographic memories (I am not of this elite), the
page-by-page approach probably works well. So, yes,
everyone is different in their abilities and preferences to proof
pages. I think the line-by-line approach should at least be
experimented with, and I'll look into doing so.
> When proofing, I always ensure that the insertion cursor in the
> page's text field touches every character -- essentially I click
> before the first letter, and right-arrow through the entire text. Not
> least because this spell-checks every word (client-side, on my Mac),
> but also because the result is a word-by-word serial visit to every
> portion of the page. Even without Flash, we could imagine a number of
> interface elements that do this sort of thing: Something that
> serially highlights every word, two per second; an audible reader; a
> requirement that the cursor visit each letter before a page is
> considered done.
Interesting.
> When I was a professional proofreader in a large academic printer,
> there were a number of old tried-and-true tricks we were taught:
> reading the text backwards, reading it aloud to a partner complete
> with punctuation, &c. But they all boiled down to getting the reader
> to look at the typeset page as a proofer, not a reader.
Again interesting. The line-by-line approach definitely forces this
naturally, because usually there's little interesting content-wise in
a single line to distract -- it also eliminates reading since one is
doing a vertical comparison, rather than horizontal.
> There are nearsighted little old ladies and 24-inch monitor-users
> among us at DP, and their ability to customize the interface and the
> presentation of the work is probably much more a boon than a threat:
> it invites more people to work. What we might consider is changing
> what that work is, to make it more obvious that it is not the kind of
> reading they're used to.
One thing I like with the line-by-line system is that it might even
allow proofing on limited hardware, like PDA's. Here we might not even
allow the proofer to make any edits -- but simply to flag whether the
text is right or not. (Hmmm, this is interesting). If the line gets
flagged 2-3 times that no edits occurred, we assume it is proofed to
zero errors. If flagged as having an error, then someone else can
actually do the edit. I surmise that with the quality of OCR today,
plus the PG/DP tools to pre-process an OCR text, that in an *average*
book the percentage of lines with errors will be fairly low (less than
10% ???). Anyway...
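The flag-only scheme reduces to a small tally per line. A hypothetical sketch (the names and thresholds are mine, purely for illustration):

```python
def triage(lines, flags, quiet_needed=2):
    """Split line numbers into done / needs-edit / pending.

    `flags` maps a line number to the list of proofer verdicts
    collected so far, each verdict either "ok" or "error"."""
    done, needs_edit, pending = [], [], []
    for n in range(len(lines)):
        verdicts = flags.get(n, [])
        if "error" in verdicts:
            needs_edit.append(n)   # hand off to someone who can edit
        elif verdicts.count("ok") >= quiet_needed:
            done.append(n)         # flagged clean enough times
        else:
            pending.append(n)      # keep showing it to proofers
    return done, needs_edit, pending
```

Since a PDA proofer only ever sends back an "ok" or "error" verdict per line, the device needs no text entry at all.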
*****
Now, it is my understanding that most advanced OCR packages can
produce an XML document of the raw OCR text, and the XML data includes
the bounding box information (the coordinates on the original page
scan where a word occurs) and line information. (I'm sure what I just
described is well-known among most of the PG/DP OCR experts, but I'm
sharing it with the others here who may not be aware.)
For example, here's a link which Branko Collin posted a few months ago
in a comment to the TeleRead blog. It points to one of these XML
documents, produced by DJVU OCR:
http://ia201107.eu.archive.org/2/items/englishbookbindings00davenuoft/englishbookbindings00davenuoft_djvuxml.xml
(depending upon one's browser, you may have to look at the source to
see the bare document.)
This XML document contains all the raw OCR text associated with each
scanned page in the DJVU book.
Here's a snippet of the markup from somewhere in the middle, for
"page 36":
XXVIII
GENERAL
INTRODUCTION
the
eighteenth
century
a
new
grace
was
added
by
the
inlaying
of
a
leather
of
a
second
colour.
For each line, we can easily determine the top line and bottom line
coordinates so the "strip" of the page scan associated with the line
can be displayed (as well as where the first word in the line starts
and where the final word ends -- useful for alignment of the strip
with the editable text.)
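As a sketch of how that extraction might look (the element names mirror the DjVu XML style, but the sample document and function are mine, and the left,bottom,right,top ordering of the coords attribute is an assumption):

```python
import xml.etree.ElementTree as ET

# A made-up miniature of a DjVu-XML hidden-text layer.
SAMPLE = """<OBJECT>
  <LINE>
    <WORD coords="120,310,180,290">the</WORD>
    <WORD coords="190,312,340,290">eighteenth</WORD>
  </LINE>
  <LINE>
    <WORD coords="118,350,200,330">century</WORD>
  </LINE>
</OBJECT>"""

def line_strips(xml_text):
    """For each LINE, return (top, bottom, left, right) of the strip
    of the page scan that covers it, derived from its words' boxes.
    Assumes coords="left,bottom,right,top" with y growing downward."""
    strips = []
    for line in ET.fromstring(xml_text).iter("LINE"):
        boxes = [tuple(map(int, w.get("coords").split(",")))
                 for w in line.iter("WORD")]
        left   = min(b[0] for b in boxes)   # first word's left edge
        bottom = max(b[1] for b in boxes)   # deepest descender row
        right  = max(b[2] for b in boxes)   # final word's right edge
        top    = min(b[3] for b in boxes)   # highest ascender row
        strips.append((top, bottom, left, right))
    return strips
```

Cropping the page scan to each (top, bottom) pair gives exactly the "strip" image to display above the editable text.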
[We have a knotty problem if changes are made to the text in a line,
in rewriting the edits back into the XML (I won't explain why.) So we
only use the XML bounding box information to give us the coordinates
of the 'strip' in the image associated with a line, but we won't
update the original XML document. We might produce a different XML doc
with the edited results, though, viz.
the eighteenth century a new grace was added
by the inlaying of a leather of a second colour.
...
Jon Noring
From marcello at perathoner.de Thu Nov 2 08:55:46 2006
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu Nov 2 08:55:51 2006
Subject: [gutvol-d] Line by line proofing of OCR text?
In-Reply-To: <187839134.20061102085435@noring.name>
References: <73794793.20061101191109@noring.name> <680E209C-05DF-476C-B31F-F47D08FB88BD@williamtozier.com>
<187839134.20061102085435@noring.name>
Message-ID: <454A2312.7080700@perathoner.de>
Jon Noring wrote:
> Yes, this is definitely one of the problems I have with the current
> system, knowing where one is on both the original page scan and the
> text edit box. It requires effort for the mere mortal to realign
> oneself as one goes back and forth between the original and the
> proofed text. This realignment is, for a mere mortal like me at least,
> pretty uncomfortable and quite inefficient.
The quick fix would be to implement a function that puts a horizontal
ruler on the image window if you click on it. (And scrolls the window so
the ruler is in the vertical middle.)
A few lines of JavaScript will do that. Firefox even supports
opacity, so you can highlight a portion of the text.
>
> the
> eighteenth
Why not break the whole text down into words and use it as captcha
(http://en.wikipedia.org/wiki/Captcha) for the PG website? Everybody who
wants to download a file has to decipher a word. Haha, only serious.
--
Marcello Perathoner
webmaster@gutenberg.org
From Bowerbird at aol.com Thu Nov 2 11:49:02 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Thu Nov 2 11:49:09 2006
Subject: [gutvol-d] gvd061102 -- thoughts on a thursday
Message-ID:
today i'll wait for someone else to contribute to babelfish,
our little open-source project here...
but i have some other thoughts...
***
jon noring said:
> I can't help but think that this has already been thought of
> and discarded by Charles Franks when he started DP.
well, that's probably because charles floated this very idea at
the meetings held in san francisco for the 10,000th e-text...
so that's where you "got" the idea, jon.
i've tested the method, and yes, it would work just fine, except
there's no need to do line-by-line proofing at _all_ these days,
so finding a "better" way to go about doing it is irrelevant...
besides, i'm not sure how this would fit in the d.p. interface,
what with the slicing of each scan into dozens of files...
as for the coordinates of each line or word...
although it's simple enough to get that coordinate information
from an o.c.r. program, it is also very simple to write a routine
that collects the information just by examining the actual scan.
a screenshot of output from such a routine can be seen here:
> http://snowy.arsc.alaska.edu/bowerbird/misc/line-determination.jpg
the number to the left of each line gives its topmost pixel, while
the number to its right gives the row of its bottommost pixel.
considering that the setting of this type was a _manual_ process,
the leading is amazingly consistent throughout, as you'll notice.
those typesetters really had their craft down...
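in pseudo-python, the core of such a routine is just a scan down the
pixel rows of the bitmap (a sketch, not the actual code -- the tiny 0/1
bitmap stands in for a real binarized scan):

```python
def find_lines(bitmap, min_dark=1):
    """Scan a page image top-to-bottom and return (top_row, bottom_row)
    for each run of consecutive rows containing ink.

    `bitmap` is a list of rows; each row is a list of pixels, 1 = dark."""
    lines, top = [], None
    for y, row in enumerate(bitmap):
        inked = sum(row) >= min_dark
        if inked and top is None:
            top = y                     # a new text line begins here
        elif not inked and top is not None:
            lines.append((top, y - 1))  # the line ended on the row above
            top = None
    if top is not None:                 # a line runs to the bottom edge
        lines.append((top, len(bitmap) - 1))
    return lines
```

on a clean scan the gaps between lines are rows with no dark pixels at
all, so no o.c.r. coordinate data is needed to recover the leading.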
i use this routine to highlight a line -- as shown in the graphic --
where a possible error might exist. my program also selects the
questionable text -- in the editfield displayed next to the scan --
because automating this boring manual work of doing a correction
makes the process go much more quickly. the proofer's attention is
drawn to the red-highlighted line on the scan so they can read that,
and then focus on the text-in-question to correct it when necessary.
***
jon said:
> A roundless approach definitely is smarter.
gee, when both jon and josh agree with me,
i figure it won't be long before things change.
unfortunately "not long" is not the same thing as
"soon" over in the land of distributed proofreaders.
meanwhile, any more reaction to the duguid article?
heck, noring's little "idea" about a proofing wrinkle
has pulled more commentary than duguid's piece...
so it's a good thing duguid took his piece to the public,
instead of letting it get buried by taking it to d.p. alone.
***
jon said:
> Yep, this is one of the Laws of the Universe:
> There will never be enough developers to
> do the job as one wants.
you should try the "open-source community",
where there are scads of programmers who
will happily do your programming for free...
***
marcello said:
> Nobody takes BB seriously
wishful thinking!
the .tei folks have been touting their "solution"
for 5 years now, and nothing has yet materialized.
and -- in the 3 years i've been on this listserve --
the size of the library doubled to 20,000 e-texts.
when i've mirrored the whole thing in z.m.l. format,
and can maintain the entire library in my spare time,
while p.g. is still trying to figure out what kind of .tei
they're gonna settle on, and then goes begging for
the expertise needed to maintain that complex format,
let alone get any useful functionality out of it, we'll see
who takes whom seriously...
-bowerbird
From jon at noring.name Thu Nov 2 12:21:11 2006
From: jon at noring.name (Jon Noring)
Date: Thu Nov 2 12:21:31 2006
Subject: [gutvol-d] gvd061102 -- thoughts on a thursday
In-Reply-To:
References:
Message-ID: <1455561431.20061102132111@noring.name>
jon noring said:
>> I can't help but think that this has already been thought of
>> and discarded by Charles Franks when he started DP.
> well, that's probably because charles floated this very idea at
> the meetings held in san francisco for the 10,000th e-text...
> so that's where you "got" the idea, jon.
Is that the meeting held at the Internet Archive which you and I
attended? I don't remember Charles mentioning this technique, nor
again when I met him in Las Vegas a few months later. So if he did,
it has bounced around in my subconscious for a while and only now
is emerging as I see a need for it.
Charles (if you're still there), and Juliet, was the line-by-line
editing method mentioned at the PG/IA bash?
> you should try the "open-source community",
> where there are scads of programmers who
> will happily do your programming for free...
Agreed, but even there, there are never enough volunteers to do all that
is often needed.
Jon
From Bowerbird at aol.com Thu Nov 2 14:44:25 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Thu Nov 2 14:44:31 2006
Subject: [gutvol-d] sometimes i read the funniest things
Message-ID:
sometimes i read the funniest things. :+)
like when carlo said this, about me, over on the d.p. boards:
> he regularly googles himself, and might come back if he
> is named in an open forum. This one is currently not open,
> hence it is not indexed, but it is better to edit the posts
> with the name anyway.
don't be silly, carlo. i haven't done a vanity search in ages,
mostly because "bowerbird" turns up too many false alarms.
(it seems there are some birds called that, in australia and
new guinea. who knew?) ;+)
you really think i care that i'm mentioned over there?
especially since the mentions are uniformly asinine?
i read the d.p. boards to learn stuff about digitizing.
i like to do my homework.
***
for instance, laura said this, yesterday:
> For what it's worth, this is how Wikipedia handles equations
> on the pages where the need exists. Any mathematical equation
> is enclosed in