From Bowerbird at aol.com Thu Nov 1 08:56:55 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 1 Nov 2007 11:56:55 EDT Subject: [gutvol-d] a post about the .html versions Message-ID: here's a message that just came across on the rocketbook listserve... > Some Gutenberg books are now available in HTML format, which is > a great improvement on simple text when used on the REB1200. > However the newer ones use CSS extensively to create books that > look great when read in a browser, but like crap when converted to > IMP by the ebook librarian (I'm using the breeno one). e.g. text is > indented and truncated, pictures are distorted, etc. I recall that > there were a set of HTML format rules somewhere that specified > exactly what was legal. Does anyone have a link to it? > Are all CSS styles ignored cleanly? Any ideas or comments? for your information... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071101/85cb9bcb/attachment.htm From Bowerbird at aol.com Thu Nov 1 13:10:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 1 Nov 2007 16:10:35 EDT Subject: [gutvol-d] how to be typographically beautiful Message-ID: i'd certainly like to talk about e-book topics... even if nobody else here seems to want to... :+) so anyway, here's one... there's widespread agreement out in the world that project gutenberg's e-texts are extremely useful... but the flip-side is that plain-ascii is _quite_ugly_. what i've noticed, though, is that it doesn't take much to bring them up to the standards of book typography. furthermore, i've noticed that there's general agreement what needs to be done, as evidenced by common practice. the best examples are what is done by d.p. post-producers when they create an .html version. but i have also examined a .pdf library produced by the people over at planetpdf.com, the books from blackmask, manybooks, feedbooks, etc., and whatever other various conversions i could get my hands on, including those geared to handhelds (rocketbook, sony, etc.). even the various layouts of blogs and web-pages are useful, since they are keyed to make electronic text more readable... the idea is that you've loaded a plain-ascii p.g. e-text into your word-processor or desktop-publishing program with the objective of making it beautiful. what exactly do you do? please add to this, the start of a list, off the top of my head: 1. get rid of that ugly legalese at the top of the file. 2. make the title-page and front-matter look nice. 3. hotlink the table of contents. make one if necessary. 4. make all the headers big, bold, and distinctive, and 5. start chapters on a new page, maybe even a recto. 6. get rid of the empty lines between paragraphs, and 7. use book-style indents on each paragraph instead. 8. use full justification. or at least half-ragged. 9. use a reasonable line-width. full-screen is too wide. 10. white-space is free in an e-book, so use it liberally. 11. make block-quotes distinctive, for remix purposes. 12. links are great, but spare us the ugly blue underlines. 13. is an unlucky number. 14. don't put pagenumbers inside the text/paragraphs. 15. turn pg-ascii underscored text into _real_ italics. 16. pictures (even doodad thingees) enliven the text. 17. navigation aids among chapters are quite useful. 18. footnotes should have links going _both_ ways. 19. 
if it works better that way, turn a table on its side. 20. resize tables and images so they fit on one screen. 21. give your readers the luxury of generous leading! 22. (leaving some space for you...) 23. (leaving some space for you...) 24. (leaving some space for you...) 25. (leaving some space for you...) 26. show where we are in the book (page 39 of 208). 27. make the framework of the document _obvious_. 28. what the heck, just for the fun of it, make an index! 29. make the typesize big enough to be read easily! 30. get rid of that ugly legalese at the bottom of the file. these are general strategies. not all of them will be applicable to any one specific situation, and some (e.g., #8) are up to the preferences of the individual. and obviously, some of these could be fragmented into a very large number of sub-points, like #10... but these are the tricks that i've seen being used to bring some typographical beauty to p.g. e-texts... -bowerbird p.s. feedbooks.com creates very beautiful e-books... ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071101/598843be/attachment.htm From lee at novomail.net Thu Nov 1 15:02:12 2007 From: lee at novomail.net (Lee Passey) Date: Thu, 01 Nov 2007 15:02:12 -0700 Subject: [gutvol-d] More useless data Message-ID: <472A4CE4.4020704@novomail.net> Quick ... what is the most commonly downloaded book from project Gutenberg in the last three years? I promised Jon Noring some data a few months back, and I thought I'd deliver it in this forum, because some other people might find it interesting. As most people here know, TPTB at project Gutenberg deny having any download statistics beyond the past 30 days. Fortunately, for years now the Internet Archive has been trolling the internet, making periodic snapshots of web sites, including Project Gutenberg. So I went to the Internet Archive and captured the Project Gutenberg statistics pages since September 2004. I collated all the data, and came up with a list of 408 files which have appeared in the "30 day - Top 100" since that date. I added and resorted them, and now have a list of the most popular downloads from the PG web site since Sept. 2004. And the most popular download during the past three years is: (drumroll please) The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci (etext 5000) The rest of the top ten are: 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & 1661) 3 The Art of War by Sun Tzu (132, 17405 & 20594) 4 Le Kamasutra by Vatsyayana (14609) 5 Pride and Prejudice by Jane Austen (1342 & 20686) 6 The War of the Worlds by H. G. Wells (36 & 8976) 7 Ulysses by James Joyce (4300) 8 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert Hubbard (12933) 9 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) 10 Hand Shadows to Be Thrown upon the Wall by Henry Bursill (12962) 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) 12 Alice's Adventures in Wonderland by Lewis Carroll (11, 19573 & 928) (I know this is 12, but I couldn't bear to leave out Alice and Huck.) (Caveat: I was unable to get precise 30 day intervals, so this list is an approximation. A /very good/ approximation, but an approximation nonetheless.) (Caveat bis: These data are derived from that reported on the PG web site. They are only as good as PG's reporting.) 
Of course, because the PG corpus is always growing, this kind of linear analysis may over-weight early downloads. So I changed the collation algorithm a bit. I started with a 6 month baseline, and then as I added each 30 day list I increased the weighting by 4%. That is, the data as of Feb. 2005 was counted at 100%, but the data from Feb-Mar was counted at 104%, the data from Mar-Apr was counted at 108%, the data from Apr-May was counted at 112%, etc. Thus, more recent downloads got counted more heavily that more distant downloads. So what is the adjusted top ten list? 1 The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci (5000) 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & 1661) 3 The Art of War by Sun Tzu (132, 17405 & 20594) 4 Le Kamasutra by Vatsyayana (14609) 5 Pride and Prejudice by Jane Austen (1342 & 20686) 6 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) 7 How to Speak and Write Correctly by Joseph Devlin (6409) 8 Ulysses by James Joyce (4300) 9 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert Hubbard (12933) 10 The War of the Worlds by H. G. Wells (36 & 8976) 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) As you can see, the Manual of Surgery is more popular recently, and Hand Shadows less so. Alice dropped to 14, so I didn't feel like I could include her. What is interesting is that the addition of new files to the PG corpus has not had much affect on the most popular file downloads. The data for all 400+ files can be found at http://www.passkeysoft.com/~lee/zero.txt and http://www.passkeysoft.com/~lee/four.txt. Bowerbird, if you want to know where to start in your conversion process to z.m.l., I would suggest the books on this list. I hand manipulated the first 50 entries in each file, to try to count multiple editions of the same book as a single entry, the remaining data is raw. Enjoy! -- Nothing of significance below this line. From hart at pglaf.org Thu Nov 1 14:48:20 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 1 Nov 2007 14:48:20 -0700 (PDT) Subject: [gutvol-d] !@!Re: More useless data In-Reply-To: <472A4CE4.4020704@novomail.net> References: <472A4CE4.4020704@novomail.net> Message-ID: With your permission, I'd like to include your files, and perhaps this report, in a Project Gutenberg file. Thanks!!! Michael S. Hart Founder Project Gutenberg On Thu, 1 Nov 2007, Lee Passey wrote: > Quick ... what is the most commonly downloaded book from project > Gutenberg in the last three years? > > I promised Jon Noring some data a few months back, and I thought I'd > deliver it in this forum, because some other people might find it > interesting. > > As most people here know, TPTB at project Gutenberg deny having any > download statistics beyond the past 30 days. Fortunately, for years now > the Internet Archive has been trolling the internet, making periodic > snapshots of web sites, including Project Gutenberg. So I went to the > Internet Archive and captured the Project Gutenberg statistics pages > since September 2004. I collated all the data, and came up with a list > of 408 files which have appeared in the "30 day - Top 100" since that > date. I added and resorted them, and now have a list of the most popular > downloads from the PG web site since Sept. 2004. 
> > And the most popular download during the past three years is: > > (drumroll please) > > The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci > (etext 5000) > > The rest of the top ten are: > > 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & > 1661) > 3 The Art of War by Sun Tzu (132, 17405 & 20594) > 4 Le Kamasutra by Vatsyayana (14609) > 5 Pride and Prejudice by Jane Austen (1342 & 20686) > 6 The War of the Worlds by H. G. Wells (36 & 8976) > 7 Ulysses by James Joyce (4300) > 8 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert > Hubbard (12933) > 9 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) > 10 Hand Shadows to Be Thrown upon the Wall by Henry Bursill (12962) > 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) > 12 Alice's Adventures in Wonderland by Lewis Carroll (11, 19573 & 928) > > (I know this is 12, but I couldn't bear to leave out Alice and Huck.) > > (Caveat: I was unable to get precise 30 day intervals, so this list is > an approximation. A /very good/ approximation, but an approximation > nonetheless.) > > (Caveat bis: These data are derived from that reported on the PG web > site. They are only as good as PG's reporting.) > > Of course, because the PG corpus is always growing, this kind of linear > analysis may over-weight early downloads. So I changed the collation > algorithm a bit. I started with a 6 month baseline, and then as I added > each 30 day list I increased the weighting by 4%. That is, the data as > of Feb. 2005 was counted at 100%, but the data from Feb-Mar was counted > at 104%, the data from Mar-Apr was counted at 108%, the data from > Apr-May was counted at 112%, etc. Thus, more recent downloads got > counted more heavily that more distant downloads. > > So what is the adjusted top ten list? > > 1 The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci > (5000) > 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & > 1661) > 3 The Art of War by Sun Tzu (132, 17405 & 20594) > 4 Le Kamasutra by Vatsyayana (14609) > 5 Pride and Prejudice by Jane Austen (1342 & 20686) > 6 Manual of Surgery by Alexander Miles and Alexis Thomson (17921) > 7 How to Speak and Write Correctly by Joseph Devlin (6409) > 8 Ulysses by James Joyce (4300) > 9 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert > Hubbard (12933) > 10 The War of the Worlds by H. G. Wells (36 & 8976) > 11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640) > > As you can see, the Manual of Surgery is more popular recently, and Hand > Shadows less so. Alice dropped to 14, so I didn't feel like I could > include her. What is interesting is that the addition of new files to > the PG corpus has not had much affect on the most popular file downloads. > > The data for all 400+ files can be found at > http://www.passkeysoft.com/~lee/zero.txt and > http://www.passkeysoft.com/~lee/four.txt. > > Bowerbird, if you want to know where to start in your conversion process > to z.m.l., I would suggest the books on this list. > > I hand manipulated the first 50 entries in each file, to try to count > multiple editions of the same book as a single entry, the remaining data > is raw. > > Enjoy! > > -- > Nothing of significance below this line. 
> > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From lee at novomail.net Thu Nov 1 16:36:41 2007 From: lee at novomail.net (Lee Passey) Date: Thu, 01 Nov 2007 16:36:41 -0700 Subject: [gutvol-d] !@!Re: More useless data In-Reply-To: References: <472A4CE4.4020704@novomail.net> Message-ID: <472A6309.9020502@novomail.net> Michael Hart wrote: > > With your permission, I'd like to include your files, > and perhaps this report, in a Project Gutenberg file. No need to ask for permission; anything I post to a public forum I always dedicate to the public domain. -- Nothing of significance below this line. From marcello at perathoner.de Thu Nov 1 15:57:04 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 01 Nov 2007 23:57:04 +0100 Subject: [gutvol-d] More useless data In-Reply-To: <472A4CE4.4020704@novomail.net> References: <472A4CE4.4020704@novomail.net> Message-ID: <472A59C0.1090809@perathoner.de> Lee Passey wrote: > And the most popular download during the past three years is: > > (drumroll please) > > The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci > (etext 5000) The "Notebooks" were featured in a prominent blog in a "read a page a day" series. Probably the same couple hundred people accessed the whole book every day just to read the "page of the day". This went on for almost 2 years. The "Manual of Surgery" gets requested almost exclusively referring to a google-image search containing the words "penis enlargement". But you cleanly missed the *5 most popular PG downloads of all times!!!* (Because I filter them or they would have topped the list every day from day one to eternity). Try to guess ... Give up? Scroll down ... Drum roll !!! #6557 The Fall of the House of Usher.mp3 #9695 Bleak House by Charles Dickens.mp3 #6550 The House of Mapuhi by Jack London.mp3 #9280 House of Mirth by Edith Wharton.mp3 #9714 A House to Let by Charles Dickens.mp3 Explaining the rationale behind the filtering is left as an exercise to Lee. -- Marcello Perathoner webmaster at gutenberg.org From hart at pglaf.org Thu Nov 1 18:30:50 2007 From: hart at pglaf.org (Michael Hart) Date: Thu, 1 Nov 2007 18:30:50 -0700 (PDT) Subject: [gutvol-d] More useless data In-Reply-To: <472A59C0.1090809@perathoner.de> References: <472A4CE4.4020704@novomail.net> <472A59C0.1090809@perathoner.de> Message-ID: And we leave the rationale that all five on your list contain: "House" to whom? On Thu, 1 Nov 2007, Marcello Perathoner wrote: > Lee Passey wrote: > >> And the most popular download during the past three years is: >> >> (drumroll please) >> >> The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci >> (etext 5000) > > The "Notebooks" were featured in a prominent blog in a "read a page a > day" series. Probably the same couple hundred people accessed the whole > book every day just to read the "page of the day". This went on for > almost 2 years. > > The "Manual of Surgery" gets requested almost exclusively referring to a > google-image search containing the words "penis enlargement". > > > But you cleanly missed the > > *5 most popular PG downloads of all times!!!* > > (Because I filter them or they would have topped the list every day from > day one to eternity). > > Try to guess ... > > Give up? > > Scroll down ... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Drum roll !!! 
> > > #6557 The Fall of the House of Usher.mp3 > #9695 Bleak House by Charles Dickens.mp3 > #6550 The House of Mapuhi by Jack London.mp3 > #9280 House of Mirth by Edith Wharton.mp3 > #9714 A House to Let by Charles Dickens.mp3 > > > Explaining the rationale behind the filtering is left as an exercise to Lee. > > > > -- > Marcello Perathoner > webmaster at gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From j.hagerson at comcast.net Thu Nov 1 19:08:46 2007 From: j.hagerson at comcast.net (John Hagerson) Date: Thu, 1 Nov 2007 21:08:46 -0500 Subject: [gutvol-d] More useless data In-Reply-To: Message-ID: <003b01c81cf5$4d501bc0$1f12fea9@sarek> It appears that "House" is a musical genre of relatively recent vintage. These files may have been found by searching "house and mp3." I fear that many of the people who download these particular audio books are disappointed by what they contain. John Hagerson -----Original Message----- From: gutvol-d-bounces at lists.pglaf.org [mailto:gutvol-d-bounces at lists.pglaf.org] On Behalf Of Michael Hart Sent: Thursday, November 01, 2007 8:31 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] More useless data And we leave the rationale that all five on your list contain: "House" to whom? From Bowerbird at aol.com Fri Nov 2 10:20:42 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 2 Nov 2007 13:20:42 EDT Subject: [gutvol-d] on the importance of remixability Message-ID: kottke, the link blogger, interviews yochai benkler, the author of "the wealth of networks", over here: > http://www.kottke.org/07/11/yochai-benkler on why he made his book free online, benkler says: > But for me what was more important than simply > the freedom to download, was the freedom to > do things with the book. That's why I held out for > licensing the book under a CC noncommercial > sharealike license. The fact that people were able > to take the book and convert it into other formats, > including making readings of some portions; that > some people began to translate portions of the book; > these were the reasons that mattered. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071102/6a2124bf/attachment.htm From Bowerbird at aol.com Fri Nov 2 11:10:01 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Fri, 2 Nov 2007 14:10:01 EDT Subject: [gutvol-d] your pudding sampler menu Message-ID: my tool-chain is starting to cohere across the entire workflow, so here's a reminder about the pudding samples available right now. all of these are in-progress, so constructive criticism is welcomed. give -- cross-platform viewer-program for z.m.l. (dated now, but...) > download from the "zml-talk" group at yahoogroups zandbox -- cross-platform z.m.l. authoring tool > backchannel me for a copy banana cream -- cross-platform proofreading engine > backchannel me for a copy babelfish -- prototype web-app viewer-program for z.m.l. 
> http://z-m-l.com/go/babelfish19.pl verylovely -- canned online zml-to-html conversion demo > http://www.z-m-l.com/go/vl3.pl zmldingus -- live online zml-to-html conversion app > http://www.z-m-l.com/go/zmldingus093.pl "continuous proofreading" mode: various sample books > http://z-m-l.com/go/myant/myantp001.html > http://z-m-l.com/go/mabie/mabiep001.html > http://z-m-l.com/go/tolbk/tolbkp001.html > http://z-m-l.com/go/sgfhb/sgfhbp001.html > http://z-m-l.com/go/ahmmw/ahmmwp001.html .pdf samples -- sample of the zml-to-pdf conversion process > http://z-m-l.com/oyayr/oyayr.zml > http://z-m-l.com/oyayr/oya-sunday.pdf > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.zml > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01b.pdf .html samples -- sample of the zml-to-html conversion process > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.zml > http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.html -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071102/caad43a5/attachment.htm From robert_marquardt at gmx.de Fri Nov 2 23:30:31 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Sat, 03 Nov 2007 07:30:31 +0100 Subject: [gutvol-d] your pudding sampler menu In-Reply-To: References: Message-ID: This message like a few others was put into the spam folder by my mail provider. Interestingly by a human designed filter. Very amusing :-) -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From lee at novomail.net Sat Nov 3 08:45:21 2007 From: lee at novomail.net (Lee Passey) Date: Sat, 03 Nov 2007 08:45:21 -0700 Subject: [gutvol-d] Diff tools Message-ID: <472C9791.9070600@novomail.net> I'm making good progress on my TEIification of Mark Twains _Puddn'head_Wilson_. What I want to do at this point is "diff" one or more versions, including a couple I have OCRed myself. What I /don't/ want to do is strip out markup before performing the diff. Is anyone aware of any tool I can use to diff two (or more) files without degrading or normalizing the text? For example, something that can compare an XHML file with an allegedly identical impoverished text file? From traverso at posso.dm.unipi.it Sat Nov 3 13:48:05 2007 From: traverso at posso.dm.unipi.it (Carlo Traverso) Date: Sat, 3 Nov 2007 21:48:05 +0100 (CET) Subject: [gutvol-d] Diff tools In-Reply-To: <472C9791.9070600@novomail.net> (message from Lee Passey on Sat, 03 Nov 2007 08:45:21 -0700) References: <472C9791.9070600@novomail.net> Message-ID: <20071103204805.1ACF993B66@posso.dm.unipi.it> >>>>> "Lee" == Lee Passey writes: Lee> I'm making good progress on my TEIification of Mark Twains Lee> _Puddn'head_Wilson_. What I want to do at this point is Lee> "diff" one or more versions, including a couple I have OCRed Lee> myself. What I /don't/ want to do is strip out markup before Lee> performing the diff. Lee> Is anyone aware of any tool I can use to diff two (or more) Lee> files without degrading or normalizing the text? For example, Lee> something that can compare an XHML file with an allegedly Lee> identical impoverished text file? I fear that I do not understand. It isn't clear to me what you want to have as result: do you want a list of differences, including those originating from the markup? Or you want to build a version with markup including markup for the variants of the text? 
I personally would like to have the second, and I more or less know how I would build a tool to get from a TEI file and a TXT file a TEI file with the variants marked, with some manual tweaking necessary where the modifications cross other markup. The key ingredients would be wdiff and some code for diff analysis that I already have. Carlo From Bowerbird at aol.com Sat Nov 3 14:50:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sat, 3 Nov 2007 17:50:35 EDT Subject: [gutvol-d] Diff tools Message-ID: c'mon, guys, .tei is a worldwide standard, so there just has to be _lots_and_lots_ of diff tools that will do whatever anyone wants... you're just not _looking_ hard enough... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071103/b94742d0/attachment.htm From jeroen.mailinglist at bohol.ph Sat Nov 3 15:39:43 2007 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Sat, 03 Nov 2007 23:39:43 +0100 Subject: [gutvol-d] Diff tools In-Reply-To: <20071103204805.1ACF993B66@posso.dm.unipi.it> References: <472C9791.9070600@novomail.net> <20071103204805.1ACF993B66@posso.dm.unipi.it> Message-ID: <472CF8AF.1010907@bohol.ph> I am not aware of any open source tool that does this, but Beyond Compare allows you to specify filters to run before the compare itself, which you can use to filter out tags, etc., without modifying the files in question. This will allow you to find differences in the character sequences, while ignoring markup. It shouldn't be too much work to build similar functionality in one of the many open source alternatives available. Jeroen Carlo Traverso wrote: >>>>>> "Lee" == Lee Passey writes: >>>>>> > > Lee> I'm making good progress on my TEIification of Mark Twains > Lee> _Puddn'head_Wilson_. What I want to do at this point is > Lee> "diff" one or more versions, including a couple I have OCRed > Lee> myself. What I /don't/ want to do is strip out markup before > Lee> performing the diff. > > Lee> Is anyone aware of any tool I can use to diff two (or more) > Lee> files without degrading or normalizing the text? For example, > Lee> something that can compare an XHML file with an allegedly > Lee> identical impoverished text file? > > I fear that I do not understand. > > It isn't clear to me what you want to have as result: do you want a > list of differences, including those originating from the markup? > > Or you want to build a version with markup including markup for the > variants of the text? > > I personally would like to have the second, and I more or less know > how I would build a tool to get from a TEI file and a TXT file a TEI > file with the variants marked, with some manual tweaking necessary > where the modifications cross other markup. The key ingredients would > be wdiff and some code for diff analysis that I already have. 
> > Carlo > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > From marcello at perathoner.de Sat Nov 3 15:45:13 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 03 Nov 2007 23:45:13 +0100 Subject: [gutvol-d] Diff tools In-Reply-To: <472C9791.9070600@novomail.net> References: <472C9791.9070600@novomail.net> Message-ID: <472CF9F9.8060005@perathoner.de> Lee Passey wrote: > What I want to do at this point is "diff" one or > more versions, including a couple I have OCRed myself. What I /don't/ > want to do is strip out markup before performing the diff. Do you want to compare the text or the tagging? If you want to compare the text, I strongly advise to strip the markup before doing the diff. > Is anyone aware of any tool I can use to diff two (or more) files > without degrading or normalizing the text? For example, something that > can compare an XHML file with an allegedly identical impoverished text file? You get more than 2 megagoogles for "xml diff". I guess there might be something for you in it. -- Marcello Perathoner webmaster at gutenberg.org From lee at novomail.net Sun Nov 4 10:37:11 2007 From: lee at novomail.net (Lee Passey) Date: Sun, 04 Nov 2007 11:37:11 -0700 Subject: [gutvol-d] Diff tools In-Reply-To: <20071103204805.1ACF993B66@posso.dm.unipi.it> References: <472C9791.9070600@novomail.net> <20071103204805.1ACF993B66@posso.dm.unipi.it> Message-ID: <472E1157.7080004@novomail.net> Carlo Traverso wrote: [snip] > It isn't clear to me what you want to have as result: do you want a > list of differences, including those originating from the markup? > > Or you want to build a version with markup including markup for the > variants of the text? I'm afraid that I may have sacrificed clarity for brevity. What I am looking for is more the second than the first. Let me see if I can illustrate with a few use cases. I have a 24-bit full-color image of page 56 of a particular edition of Puddn'head. I take that scan and downsample it in various ways resulting in 10 additional images, which may be gray-scale or black and white using different threshold values. I ran all 11 images through ABBYY, with various degrees of success. In three of the 11 result files one word was mis-recognized all in the same way. In four of the 11 result files one word was mis-recognized in different ways. In three of the 11 result files one word was incorrectly characterized as italic. What I want is a process by which I can diff all the versions, and via a voting algorithm "fix" the errors (inserting a marker so a human can revisit the change later). In two of these cases surrounding markup is irrelevant, but in one of them it is significant. To be honest, I don't really think I'm going to find a tool that will do precisely what I want, but I'm hoping to find some components I can cobble together to get close. In a second use case, I have an OCRed version of Puddn'head from a scan set obtained from Google. I also have the Google OCR text, but the text has been saved without markup. I want a process whereby I can compare the Google OCR text to my OCR marked up text (and perhaps texts from other sources as well, such as the Internet Archive), giving me an output that I can use in an automated procedure to merge changes back into the /marked/ text. 
The key here is automation; removing the markup, normalizing the files, diffing the normalized files, and then relying on a human to search the marked up version to find where changes need to be made is not the desired outcome. > I personally would like to have the second, and I more or less know > how I would build a tool to get from a TEI file and a TXT file a TEI > file with the variants marked, with some manual tweaking necessary > where the modifications cross other markup. The key ingredients would > be wdiff and some code for diff analysis that I already have. I think this is very close to what I'm looking for. From lee at novomail.net Sun Nov 4 10:48:27 2007 From: lee at novomail.net (Lee Passey) Date: Sun, 04 Nov 2007 11:48:27 -0700 Subject: [gutvol-d] Diff tools In-Reply-To: <472CF9F9.8060005@perathoner.de> References: <472C9791.9070600@novomail.net> <472CF9F9.8060005@perathoner.de> Message-ID: <472E13FB.1050900@novomail.net> Marcello Perathoner wrote: [snip] > You get more than 2 megagoogles for "xml diff". I guess there might be > something for you in it. And therein lies the problem. The solution may lie in the 2 megagoogles, but I probably won't ever find it; there's just too many results. An alternative to the Google brute force method (which so far has already led me to "HTML Compare", which unfortunately is too UI oriented) is a more targeted approach by posting a message to a forum or two where there is a possibility that someone has already encountered a tool similar to that which I am seeking, and can give me a more direct pointer. Both approaches are useful, and usually complementary. From Bowerbird at aol.com Sun Nov 4 13:52:13 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Sun, 4 Nov 2007 16:52:13 EST Subject: [gutvol-d] Diff tools Message-ID: carlo said: > Lee> Is anyone aware of any tool I can use to diff two (or more) > Lee> files without degrading or normalizing the text? For example, > Lee> something that can compare an XHML file with an allegedly > Lee> identical impoverished text file? > > I fear that I do not understand. > > It isn't clear to me what you want to have as result: do you want a > list of differences, including those originating from the markup? > > Or you want to build a version with markup including markup for the > variants of the text? c'mon carlo, it's obvious what lee wants... he wants to use the "impoverished text file" to compare to -- and make corrections to -- the file he has already marked up. perhaps you could suggest to him that he should apply markup to the "impoverished text file" -- remember, it's _easy_ to do! -- after which he can do a straightforward comparison of the files... :+) (but for those of you who contemplate doing this in the future, make the corrections _first_, and only _then_ apply the markup.) -bowerbird p.s. i see lee has made more posts, so i might have to take him out of my spam folder for this thread, since this promises to be _very_ juicy... let's see if i can be strong enough to avoid this temptation! ;+) ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071104/1dd5aea2/attachment.htm From Bowerbird at aol.com Mon Nov 5 14:26:38 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 5 Nov 2007 17:26:38 EST Subject: [gutvol-d] give one, get a year Message-ID: thinking of doing that give-one-get-one on the o.l.p.c.? 
t-mobile just sweetened the deal for you, with a free year of hotspot wi-fi... > http://www.olpcnews.com/laptops/xo1/olpc_xo_sales_commitments.html -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071105/7d0d9737/attachment.htm From jon at noring.name Mon Nov 5 17:48:55 2007 From: jon at noring.name (Jon Noring) Date: Mon, 5 Nov 2007 18:48:55 -0700 Subject: [gutvol-d] [forward] ANN: P5 Version 1.0 of the TEI Guidelines has been released Message-ID: <541883530.20071105184855@noring.name> [Posted to TEI-L by Christian Wittern , Institute for Research in Humanities, Kyoto University. Forwarding it here for those interested. Jon] Dear TEI users, After more than 6 years of, at times quite intensive, development, it is with great pleasure that I announce the release of version 1.0 of P5, the latest and greatest version of the Guidelines of the Text Encoding Initiative, which officially happened Nov. 2 at the TEI Members Meeting in Maryland. You will find the new version online at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html; a PDF and even printed books are expected to appear in due time. The main development work has been carried out by the TEI Technical Council and the Editors Lou Burnard and Syd Bauman. I would like to take this opportunity to warmly thank all previous members of the Council but also especially the current members, who shared quite a big of the work, which magically increased as the release date was approaching: David Birnbaum, Tone Merete Bruvik, Arianna Ciula, James Cummings, Matthew Driscoll, Daniel O'Donnel, Dot Porter, Sebastian Rahtz Laurent Romary, Conal Tuohy, John Walsh. Christian Wittern Chair, TEI Technical Council -- Christian Wittern Institute for Research in Humanities, Kyoto University 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN From piggy at netronome.com Tue Nov 6 10:09:35 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Tue, 06 Nov 2007 13:09:35 -0500 Subject: [gutvol-d] Diff tools In-Reply-To: <472E13FB.1050900@novomail.net> References: <472C9791.9070600@novomail.net> <472CF9F9.8060005@perathoner.de> <472E13FB.1050900@novomail.net> Message-ID: <4730ADDF.10909@netronome.com> Lee Passey wrote: > Marcello Perathoner wrote: > > [snip] > > >> You get more than 2 megagoogles for "xml diff". I guess there might be >> something for you in it. >> > > And therein lies the problem. The solution may lie in the 2 megagoogles, > but I probably won't ever find it; there's just too many results. > > An alternative to the Google brute force method (which so far has > already led me to "HTML Compare", which unfortunately is too UI > oriented) is a more targeted approach by posting a message to a forum or > two where there is a possibility that someone has already encountered a > tool similar to that which I am seeking, and can give me a more direct > pointer. > > Both approaches are useful, and usually complementary. > I see that ubuntu gutsy has xmldiff which sounds like it can solve the xml-xml problem. When I encounter the megagoogle problem (and what I'm looking for is not on the first page), I turn to clusty.com. (In the interests of full disclosure: I have several close friends who work for the company that runs clusty.com.) 
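For readers following this diff-tools thread: below is a minimal sketch, in Python, of the approach outlined above -- Marcello's "strip the markup before doing the diff," plus the map back into the markup that Carlo and Lee are after. It strips tags for comparison purposes only, but remembers, for every character of text that survives, its offset in the marked-up file, so each difference found against the "impoverished" text file can be located automatically in the original markup (the automated merge-back Lee asked for). The function names, the tag-stripping regex, and the report format are all illustrative; this is not an existing tool, and it handles only tags (no entities, comments, or CDATA).

    import difflib
    import re
    import sys

    TAG_RE = re.compile(r'<[^>]+>')

    def strip_markup_with_map(marked_up):
        """Return (plain, offsets): the text with tags removed, plus, for
        each character kept, its index in the original marked-up string."""
        plain, offsets = [], []
        last = 0
        for m in TAG_RE.finditer(marked_up):
            for i in range(last, m.start()):
                plain.append(marked_up[i])
                offsets.append(i)
            last = m.end()
        for i in range(last, len(marked_up)):
            plain.append(marked_up[i])
            offsets.append(i)
        return ''.join(plain), offsets

    def collapse_whitespace(text, offsets):
        """Collapse whitespace runs to single spaces so that line-wrapping
        differences are not reported as textual differences."""
        out, out_off, in_space = [], [], False
        for ch, off in zip(text, offsets):
            if ch.isspace():
                if not in_space:
                    out.append(' ')
                    out_off.append(off)
                in_space = True
            else:
                out.append(ch)
                out_off.append(off)
                in_space = False
        return ''.join(out), out_off

    def diff_against_plain(marked_up, plain_text):
        """Yield (change, offset_in_marked_up, old, new) tuples."""
        stripped, offsets = strip_markup_with_map(marked_up)
        stripped, offsets = collapse_whitespace(stripped, offsets)
        plain, _ = collapse_whitespace(plain_text, list(range(len(plain_text))))
        matcher = difflib.SequenceMatcher(None, stripped, plain, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                continue
            # Point back into the *marked-up* file, where a merge would edit.
            where = offsets[i1] if i1 < len(offsets) else len(marked_up)
            yield tag, where, stripped[i1:i2], plain[j1:j2]

    if __name__ == '__main__':
        with open(sys.argv[1], encoding='utf-8') as f:
            marked_up = f.read()
        with open(sys.argv[2], encoding='utf-8') as f:
            plain_text = f.read()
        for change, where, old, new in diff_against_plain(marked_up, plain_text):
            print('%-7s at offset %d: %r -> %r' % (change, where, old, new))

Run it as, say, "python diffmap.py puddnhead.xhtml puddnhead.txt" (hypothetical filenames); each reported difference carries an offset into the marked-up file, which is what an automated merge, or a voting pass across several OCR outputs, would key on. A character-level SequenceMatcher is slow on a whole book; a word-level comparison in the spirit of wdiff scales better, but the offset-map idea is the same.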
From Bowerbird at aol.com Tue Nov 6 13:12:09 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Tue, 6 Nov 2007 16:12:09 EST Subject: [gutvol-d] Diff tools Message-ID: piggy said: > I see that ubuntu gutsy has xmldiff which > sounds like it can solve the xml-xml problem. oh geez, i thought the absence of messages meant that lee had gotten a solution. instead, it looks like it means _no_one_ has a solution for this relatively common task... and yes, i know that if you have to _explain_ a joke, that means it's not very _funny_... but i suppose _some_ people got the laugh when i made my earlier message, so i don't see the harm in explaining it to those who have no sense of humor... the xml/tei crowd loved to tell you, over the years, how -- because there's so many institutions using heavy markup --there are all kinds of tools now for dealing with it, and we could depend on open-source to create even more... but the fact of the matter is that, when they get around to actually digitizing a book, those tools seem to disappear... indeed, here is a _very_basic_ task -- comparing a new digitization to a previous one, to find the differences -- and nobody's stepping up to say "here's a tool to do it..." let alone _lots_ of people pointing to _lots_ of such tools. nobody is saying, "let me go and ask on another listserve, where i'm sure they'll give us some answers right away..." and folks, this is for an extremely straightforward e-text! > http://www.gutenberg.org/etext/102 indeed, i've even included it as one of my z.m.l. demo-files: > http://z-m-l.com/go/vl3.pl > http://www.z-m-l.com/go/vlpuddnhead.zml there's little in this e-text that requires even light-markup, let alone heavy-markup: chapter headings, epigraphs, and not a whole lot more, if i remember it correctly... sheesh... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071106/fa92b7b2/attachment.htm From Bowerbird at aol.com Wed Nov 7 11:09:52 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 7 Nov 2007 14:09:52 EST Subject: [gutvol-d] p.o.d. of the entire catalog Message-ID: at some point in time down the line, i'll be able to offer p.o.d. of the entire p.g. catalog. as i have said before, i will direct part of the proceeds to the p.g. foundation and part to michael hart for his longstanding devotion. one question here now is about the i.s.b.n. numbers. (and yes, i know the "n" at the end of "i.s.b.n." stands for "number", so that "i.s.b.n. numbers" is redundant.) i.s.b.n. are much cheaper in big blocks than small ones. immensely so. (because they are a part of the system that's designed to impose a high cost of entry on any small publishers, to the benefit of the larger houses.) would p.g. be willing to pick up the cost of the i.s.b.n.? it'd likely be smartest to buy a block of 50,000 or so... what conditions, if any, would be imposed in return? and how long would an official decision on this take, from request to approval to the issuance of a check? please understand that this is _not_ asking for a "favor". i've asked for favors, like when i asked for web-space. i have no trouble discerning when i'm asking for a favor, or saying that's what i'm doing. but this is not that case. either way, it's not going to make any difference at all in the amount of money that p.g. 
ultimately receives, because if the decision is a "no", i'll absorb the cost, so the upshot is that it means it'll be that much longer until the project moves into the black and p.g. gets any cash. so either way, p.g. will underwrite it, directly or indirectly. so the only question is whether p.g. owns the i.s.b.n. or not, and thus has the ability to continue selling the publications once i go to heaven... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/92e8e842/attachment.htm From hart at pglaf.org Wed Nov 7 12:13:31 2007 From: hart at pglaf.org (Michael Hart) Date: Wed, 7 Nov 2007 12:13:31 -0800 (PST) Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: What do 50,000 ISBN's cost??? mh On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > at some point in time down the line, i'll be able to offer > p.o.d. of the entire p.g. catalog. as i have said before, > i will direct part of the proceeds to the p.g. foundation > and part to michael hart for his longstanding devotion. > > one question here now is about the i.s.b.n. numbers. > (and yes, i know the "n" at the end of "i.s.b.n." stands > for "number", so that "i.s.b.n. numbers" is redundant.) > > i.s.b.n. are much cheaper in big blocks than small ones. > immensely so. (because they are a part of the system > that's designed to impose a high cost of entry on any > small publishers, to the benefit of the larger houses.) > > would p.g. be willing to pick up the cost of the i.s.b.n.? > it'd likely be smartest to buy a block of 50,000 or so... > what conditions, if any, would be imposed in return? > and how long would an official decision on this take, > from request to approval to the issuance of a check? > > please understand that this is _not_ asking for a "favor". > i've asked for favors, like when i asked for web-space. > i have no trouble discerning when i'm asking for a favor, > or saying that's what i'm doing. but this is not that case. > > either way, it's not going to make any difference at all > in the amount of money that p.g. ultimately receives, > because if the decision is a "no", i'll absorb the cost, so > the upshot is that it means it'll be that much longer until > the project moves into the black and p.g. gets any cash. > so either way, p.g. will underwrite it, directly or indirectly. > > so the only question is whether p.g. owns the i.s.b.n. or not, > and thus has the ability to continue selling the publications > once i go to heaven... > > -bowerbird > > > > ************************************** > See what's new at http://www.aol.com > From creeva at gmail.com Wed Nov 7 12:19:50 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 7 Nov 2007 15:19:50 -0500 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <2510ddab0711071219h64ae8401x46ac424f4ad780f2@mail.gmail.com> According to ISBN.org $1,570.00 per 1000 plus 180 processing fee - that's the highest lot number I could find by purchasing online - bowerbird may have a cheaper lot for getting that large of a lot. On Nov 7, 2007 3:13 PM, Michael Hart wrote: > > What do 50,000 ISBN's cost??? > > mh > > On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > > > at some point in time down the line, i'll be able to offer > > p.o.d. of the entire p.g. catalog. 
as i have said before, > > i will direct part of the proceeds to the p.g. foundation > > and part to michael hart for his longstanding devotion. > > > > one question here now is about the i.s.b.n. numbers. > > (and yes, i know the "n" at the end of "i.s.b.n." stands > > for "number", so that "i.s.b.n. numbers" is redundant.) > > > > i.s.b.n. are much cheaper in big blocks than small ones. > > immensely so. (because they are a part of the system > > that's designed to impose a high cost of entry on any > > small publishers, to the benefit of the larger houses.) > > > > would p.g. be willing to pick up the cost of the i.s.b.n.? > > it'd likely be smartest to buy a block of 50,000 or so... > > what conditions, if any, would be imposed in return? > > and how long would an official decision on this take, > > from request to approval to the issuance of a check? > > > > please understand that this is _not_ asking for a "favor". > > i've asked for favors, like when i asked for web-space. > > i have no trouble discerning when i'm asking for a favor, > > or saying that's what i'm doing. but this is not that case. > > > > either way, it's not going to make any difference at all > > in the amount of money that p.g. ultimately receives, > > because if the decision is a "no", i'll absorb the cost, so > > the upshot is that it means it'll be that much longer until > > the project moves into the black and p.g. gets any cash. > > so either way, p.g. will underwrite it, directly or indirectly. > > > > so the only question is whether p.g. owns the i.s.b.n. or not, > > and thus has the ability to continue selling the publications > > once i go to heaven... > > > > -bowerbird > > > > > > > > ************************************** > > See what's new at http://www.aol.com > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/8c2fe8de/attachment.htm From creeva at gmail.com Wed Nov 7 12:21:41 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 7 Nov 2007 15:21:41 -0500 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <2510ddab0711071221o37cf506etdeb4f5dee6b89904@mail.gmail.com> I got my information from this page in case I'm misreading it - I'll let you interpret the numbers yourself On Nov 7, 2007 3:13 PM, Michael Hart wrote: > > What do 50,000 ISBN's cost??? > > mh > > On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > > > at some point in time down the line, i'll be able to offer > > p.o.d. of the entire p.g. catalog. as i have said before, > > i will direct part of the proceeds to the p.g. foundation > > and part to michael hart for his longstanding devotion. > > > > one question here now is about the i.s.b.n. numbers. > > (and yes, i know the "n" at the end of "i.s.b.n." stands > > for "number", so that "i.s.b.n. numbers" is redundant.) > > > > i.s.b.n. are much cheaper in big blocks than small ones. > > immensely so. (because they are a part of the system > > that's designed to impose a high cost of entry on any > > small publishers, to the benefit of the larger houses.) > > > > would p.g. be willing to pick up the cost of the i.s.b.n.? > > it'd likely be smartest to buy a block of 50,000 or so... > > what conditions, if any, would be imposed in return? 
> > and how long would an official decision on this take, > > from request to approval to the issuance of a check? > > > > please understand that this is _not_ asking for a "favor". > > i've asked for favors, like when i asked for web-space. > > i have no trouble discerning when i'm asking for a favor, > > or saying that's what i'm doing. but this is not that case. > > > > either way, it's not going to make any difference at all > > in the amount of money that p.g. ultimately receives, > > because if the decision is a "no", i'll absorb the cost, so > > the upshot is that it means it'll be that much longer until > > the project moves into the black and p.g. gets any cash. > > so either way, p.g. will underwrite it, directly or indirectly. > > > > so the only question is whether p.g. owns the i.s.b.n. or not, > > and thus has the ability to continue selling the publications > > once i go to heaven... > > > > -bowerbird > > > > > > > > ************************************** > > See what's new at http://www.aol.com > > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/2cc7e7b7/attachment.htm From creeva at gmail.com Wed Nov 7 12:45:38 2007 From: creeva at gmail.com (Brent Gueth) Date: Wed, 7 Nov 2007 15:45:38 -0500 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: <2510ddab0711071219h64ae8401x46ac424f4ad780f2@mail.gmail.com> References: <2510ddab0711071219h64ae8401x46ac424f4ad780f2@mail.gmail.com> Message-ID: <2510ddab0711071245s19471a48p17725f57c9e242d2@mail.gmail.com> Since my last mail didn't include the link www.isbn.org has this as their price list - more concise then the link I meant to send anyways ISBN List of Services & Fees REGULAR PROCESSING (15 business days turnaround) ISBN price list (the categories below include the combined processing fees and registration fees): 10 ISBNs: $275.00 100 ISBNs: $995.00 1000 ISBNs: $1,750.00 ________________________________ PRIORITY PROCESSING (48 business hours turnaround) ISBN price list (the categories below include the combined processing fees and registration fees): 10 ISBNs: $375.00 100 ISBNs: $1,095.00 1000 ISBNs: $1,850.00 ________________________________ EXPRESS PROCESSING (24 business hours turnaround) ISBN price list (the categories below include the combined processing fees and registration fees): 10 ISBNs: $400.00 100 ISBNs: $1,120.00 1000 ISBNs: $1,875.00 ________________________________ SELECTING BAR CODES: The EAN-13 bar codes are the bar code translations for ISBNs. Most bookstores, distributors, and industry related sectors require EAN-13 bar codes on books and book type products. Bar code price list: 1-5 bar codes: $25 per bar code (i.e. 3 bar codes at $25 per unit will total $75) 6-10 bar codes: $23 per bar code (i.e. 6 bar codes at $23 per unit will total $138) 11-100 bar codes: $21 per bar code (i.e. 11 bar codes at $21 per unit will total $231) ________________________________ On Nov 7, 2007 3:19 PM, Brent Gueth wrote: > According to ISBN.org $1,570.00 per 1000 plus 180 processing fee - that's the highest lot number I could find by purchasing online - bowerbird may have a cheaper lot for getting that large of a lot. > > > > > > On Nov 7, 2007 3:13 PM, Michael Hart wrote: > > > > > What do 50,000 ISBN's cost??? 
> > > > mh > > > > On Wed, 7 Nov 2007, Bowerbird at aol.com wrote: > > > > > at some point in time down the line, i'll be able to offer > > > p.o.d. of the entire p.g. catalog. as i have said before, > > > i will direct part of the proceeds to the p.g. foundation > > > and part to michael hart for his longstanding devotion. > > > > > > one question here now is about the i.s.b.n. numbers. > > > (and yes, i know the "n" at the end of "i.s.b.n." stands > > > for "number", so that "i.s.b.n. numbers" is redundant.) > > > > > > i.s.b.n. are much cheaper in big blocks than small ones. > > > immensely so. (because they are a part of the system > > > that's designed to impose a high cost of entry on any > > > small publishers, to the benefit of the larger houses.) > > > > > > would p.g. be willing to pick up the cost of the i.s.b.n.? > > > it'd likely be smartest to buy a block of 50,000 or so... > > > what conditions, if any, would be imposed in return? > > > and how long would an official decision on this take, > > > from request to approval to the issuance of a check? > > > > > > please understand that this is _not_ asking for a "favor". > > > i've asked for favors, like when i asked for web-space. > > > i have no trouble discerning when i'm asking for a favor, > > > or saying that's what i'm doing. but this is not that case. > > > > > > either way, it's not going to make any difference at all > > > in the amount of money that p.g. ultimately receives, > > > because if the decision is a "no", i'll absorb the cost, so > > > the upshot is that it means it'll be that much longer until > > > the project moves into the black and p.g. gets any cash. > > > so either way, p.g. will underwrite it, directly or indirectly. > > > > > > so the only question is whether p.g. owns the i.s.b.n. or not, > > > and thus has the ability to continue selling the publications > > > once i go to heaven... > > > > > > -bowerbird > > > > > > > > > > > > ************************************** > > > See what's new at http://www.aol.com > > > > > _______________________________________________ > > gutvol-d mailing list > > gutvol-d at lists.pglaf.org > > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > > > From jon at noring.name Wed Nov 7 12:54:58 2007 From: jon at noring.name (Jon Noring) Date: Wed, 7 Nov 2007 13:54:58 -0700 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <1123139035.20071107135458@noring.name> > What do 50,000 ISBN's cost??? Hmmm, hard to say. I went to the ISBN.org site, which sells ISBNs for the United States: http://www.isbn.org/standards/home/isbn/us/isbn-fees.asp The largest block shown there is 1000 ISBNs, for $1750. It is a sliding scale, so there is hope: order 10: $27.50 each order 100: $9.95 each order 1000: $1.75 each Nothing is said if someone wanted to order a much larger block of ISBNs, such as 50,000. But I think one can safely say it is unlikely Bowker will sell ISBN's for a lot less than $1.75 each. How much lower they'd go, I don't have a clue. Nor do I know if they'd give PGLAF a break. (If they don't go below $1.75 per ISBN, then 50,000 will sell for $87,500 -- gulp.) Btw, we have to understand that there will be very few orders, if any, for the vast majority of PG texts. So in some ways these obscure titles will be "subsidized" by the better selling titles. Anyway, if this amount of money is too much for PGLAF, Bowerbird has offered to buy the ISBNs (if I read what he said correctly.) 
Now if one can find a POD company willing to sell PG ebooks using some identifier other than ISBN, that would be a better way to go. I've always liked UUID as an identifier -- it's free and can be generated by anyone. ISBN is a terrible book identifier anyway, which I've written about in the past -- and even worse for ebooks.

Jon Noring

From Bowerbird at aol.com Wed Nov 7 15:17:31 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 7 Nov 2007 18:17:31 EST Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog Message-ID:

michael said: > What do 50,000 ISBN's cost???

i have no idea. it's been a long time since i cared, and got a block.

> 10 ISBNs: $275.00 > 100 ISBNs: $995.00 > 1000 ISBNs: $1,750.00

yeah, that's the pricing structure i remembered... :+) if you want 10, they're $27.50 each. if you want 100, they're $9.95 each. if you want 1000, they're $1.75 each. these are _numbers_, for crying out loud. there's very little _good_ reason why they should be "cheaper when you buy in bulk", they'd save a little in "administrative costs", sure, but there's very little good reason why those costs should be anything but trivial... no, this is simply the big publishing industry protecting itself from small competitors by raising the cost of entry as high as they can... so the price curve has a truly ridiculous slope. and wait til you see how cheap it is for 10,000!

-bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/6be2367b/attachment-0001.htm

From donovan at abs.net Wed Nov 7 15:30:31 2007 From: donovan at abs.net (D Garcia) Date: Wed, 7 Nov 2007 19:30:31 -0400 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <200711071830.32572.donovan@abs.net>

On Wednesday 07 November 2007 15:13, Michael Hart wrote: > What do 50,000 ISBN's cost??? > > mh

Why would PG want to buy ISBNs when so many of the titles already have one assigned?

From jon at noring.name Wed Nov 7 15:40:07 2007 From: jon at noring.name (Jon Noring) Date: Wed, 7 Nov 2007 16:40:07 -0700 Subject: [gutvol-d] !@! Re: p.o.d. of the entire catalog In-Reply-To: References: Message-ID: <1906343196.20071107164007@noring.name>

Bowerbird wrote: > no, this is simply the big publishing industry > protecting itself from small competitors by > raising the cost of entry as high as they can...

From those I've talked with in the publishing industry, most large U.S. publishers would prefer ISBN to be free, as it is in many other countries. ISBN pricing is a real sticking point. Interestingly, the high cost of ISBNs in the U.S. has led many larger publishers to reuse the same ISBN for different ebook formats of the same title (which is a no-no per the ISBN ISO Standard), and this has led to problems in the ebook retail market where the PDF, LIT and MobiPocket (to name three) versions of a title are rolled into the same ISBN. Imagine what would happen if the hard cover and soft cover print versions of a title were given the same ISBN number?

> so the price curve has a truly ridiculous slope. > > and wait til you see how cheap it is for 10,000!

Well, Bowker does not include a 10,000 option in its order form. But it is possible that for very large clients they will give a further per-number discount.
Maybe Greg or Michael, representing PGLAF, should give Bowker a call and see if there is a further discount for very big accounts, and maybe a special discount for PG being a non-profit. Until Greg or Michael does that, it is premature to say with any certainty what Bowker will charge for 50,000 ISBN numbers. (No doubt PG could negotiate with Bowker -- everything is negotiable -- but how much Bowker will discount for 50,000 is hard to predict.) Jon Noring From jon at noring.name Wed Nov 7 16:23:02 2007 From: jon at noring.name (Jon Noring) Date: Wed, 7 Nov 2007 17:23:02 -0700 Subject: [gutvol-d] my thoughts on ISBN Message-ID: <209164248.20071107172302@noring.name> This discussion of getting 50,000 ISBN numbers, and then the comment that PG is already assigning ISBNs (to what?, some titles?), brings up an interesting side topic. Wikipedia has good background summary of ISBN: http://en.wikipedia.org/wiki/Isbn It was developed in 1966 by British (paper) book sellers, well before the digital era. It is an ISO standard. Note that sellers developed it, not publishers. When developed, and until recently, ISBN was intended to be a "Manifestation" identifier. That is, it was intended to identify the particular "object" for sale -- it was NOT a title ("Expression") identifier. That's why ISBN has close ties to barcodes. It's more like a UPC code. So the hard cover of a title is given a different ISBN from the paperback edition, and from the large print edition, etc. And this is important for retailers who have to keep track of sales since they sell "objects", not "titles". To a retailer, if they can sell 10,000,000 books, they don't care what the titles are. That is, they sell *books*, not *titles*. And ISBN is a book identifier, not a title identifier. As the ebook era dawned (for the large publishers sort of began about 1999/2000), book publishers all of a sudden saw that a title may need to be cast into a number of formats, so all of a sudden the need for ISBNs for a single title substantially increased. Since the large publishers are still pretty frugal folk, several of them decided that an "ebook" is an "ebook" and simply assigned the same ISBN no matter the format. All of a sudden, many publishers are now using ISBN as an "Expression" identifier, which is a "no-no" per the ISO standard (but understandable given the high cost Bowker charges for ISBN.) This has forced ebook retailers to internally expand the ISBN to include the ebook format code, since otherwise how can they and their customers differentiate between the different format versions (e.g., PDF from LIT)? PG can certainly use an ISBN as a sort of "Expression" code, but it is non-standard. And if so, then why even use ISBN? Thus, IMHO PG should only concern itself with ISBN when there is a market need for it. Otherwise PG should stay away from ISBN like the plague. Since POD may require ISBN, that may be a need. But I'd arrange with a POD provider to see if they'd accept a "home grown" ID that may look like an ISBN but is not an ISBN -- maybe uses hexadecimal instead of decimal notation, or something else, so it won't "clash" with any valid ISBNs out there. (Hmmm, the ISO standard behind ISBN might actually have some odd extensions that are never used, but there to use...) 
Jon Noring From Bowerbird at aol.com Wed Nov 7 16:47:47 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 7 Nov 2007 19:47:47 EST Subject: [gutvol-d] =?iso-8859-1?q?!=40!_Re=3A=A0_p=2Eo=2Ed=2E_of_the_enti?= =?iso-8859-1?q?re_catalog?= Message-ID: donovan said: > Why would PG want to buy ISBNs when > so many of the titles already have one assigned? well, because that's what a publisher does when you republish a public-domain book, so the i.s.b.n. points to _your_ publication, and not to any of the _previous_ editions... more specifically, when these versions go in google's system, and an end-user clicks to get a printed copy, then i will get the order. also, _my_ publications will be "full-view", so we circumvent that ridiculous situation where a public-domain book is "locked up" by publishers to increase hard-copy sales. there are a lot of p.g. e-texts that've been "repurposed" to print. i'm _fine_ with that, right up until they put it in the "limited view" section of google print. so i'm fixing that... and giving people the option of supporting project gutenberg and michael hart with the purchase of a nicely-formatted hard-copy of their favorite books from project gutenberg, nice formatting they can't easily get otherwise. so that's why... but i thought all that would be fairly obvious. -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071107/698b3ab1/attachment.htm From lee at novomail.net Wed Nov 7 19:47:13 2007 From: lee at novomail.net (Lee Passey) Date: Wed, 07 Nov 2007 20:47:13 -0700 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) In-Reply-To: <200711071830.32572.donovan@abs.net> References: <200711071830.32572.donovan@abs.net> Message-ID: <473286C1.9000309@novomail.net> D Garcia wrote: > Why would PG want to buy ISBNs when so many of the titles already have one > assigned? ISBNs were not invented until 1966. Because most of the PG corpus is in the Public Domain, and first published long before 1966, if a title /does/ have one or more ISBNs assigned it's one that was assigned in a subsequent printing by some publisher. As demonstrated by Mr. Perathoner's example of September 5, some of the most popular titles can have dozens, if not hundreds, of ISBNs, each assigned by a different publisher. Indeed, the ISBN is most useful in identifying a /publisher/ not a title or an author. In fact, the only real use I can see for an ISBN is so when a bookstore owner is running low on stock (or has a request for a rare book) s/he can go to Bowker's Books In Print, find the publisher, and call in another order. For someone outside the retail chain ISBNs are virtually useless. Most of the PG corpus probably did come from books that had ISBNs, and some are an amalgam of multiple books each having its own ISBN. Whatever these ISBNs were (if they existed at all), however, is lost in the mists of time. The prices mentioned here for ISBNs is if you obtain them from the U.S. ISBN agency, which is R.R.Bowker Co. Project Gutenberg is an international organization, so if it really wanted to obtain a block of ISBNs for its own use it makes sense to me to obtain them from a /non/ U.S. agency from which they are typically available at a /much/ reduced price (free, if some reports can be believed). 
It may be that the need for an ISBN comes from a POD provider, and there is no intent to ever register a title for inclusion in Books In Print. If the POD provider doesn't validate the ISBN, its possible to just make one up. Of course, it would be bad form to claim an ISBN that some other company has, or may have, the rights to use. However, ISBNs have an interesting property: the last digit in the ISBN is a checksum digit. Because this checksum digit is based on a calculation modulo 11, for every valid ISBN there are 10 /invalid/ ISBNs which differ only by the last, or checksum, digit. If PG wanted to create a number that could be used in place of an ISBN, without risk that it would ever collide with a real ISBN it would suffice to create a method of generating unique 9-digit numbers, compute the standard ISBN checksum, and then add 1 (or some other number less than 11) to the checksum before computing the modulus. You wouldn't be able to register the publication with Books In Print, but for all other uses it ought to be just fine. From robert_marquardt at gmx.de Wed Nov 7 23:41:47 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Thu, 08 Nov 2007 08:41:47 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki Message-ID: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> My idea is to create an "Adventskalender" for the Christmas time. Here random example found by Google image search http://www.gedichte-garten.de/adventskalender/adventskalender.shtml A free Christmas or winter image and some numbers placed on it should do the trick. The numbers linked to books from the Christmas Bookshelf. Rigged up on Nov 31 and deleted on Dec 25. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From marcello at perathoner.de Thu Nov 8 04:26:36 2007 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu, 08 Nov 2007 13:26:36 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> Message-ID: <4733007C.1050503@perathoner.de> Robert Marquardt wrote: > My idea is to create an "Adventskalender" for the Christmas time. > Here random example found by Google image search > http://www.gedichte-garten.de/adventskalender/adventskalender.shtml > > A free Christmas or winter image and some numbers placed on it should > do the trick. The numbers linked to books from the Christmas > Bookshelf. Rigged up on Nov 31 and deleted on Dec 25. We are in a quandary here: the current "supported" ImageMap extension is for MediaWiki 1.9+. We are still running MediaWiki 1.6.8 because ibiblio used to have PHP4. Since ibiblio switched to PHP5 somwhere in August I had no time to upgrade to the current version. All I can do in the short term is to install this outdated version: http://www.mediawiki.org/wiki/Extension:ImageMap_%28McNaught%29 How is that with you? -- Marcello Perathoner webmaster at gutenberg.org From robert_marquardt at gmx.de Thu Nov 8 06:38:58 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Thu, 08 Nov 2007 15:38:58 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <4733007C.1050503@perathoner.de> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> <4733007C.1050503@perathoner.de> Message-ID: On Thu, 08 Nov 2007 13:26:36 +0100, you wrote: >We are still running MediaWiki 1.6.8 because ibiblio used to have PHP4. 
>Since ibiblio switched to PHP5 somwhere in August I had no time to >upgrade to the current version. > >All I can do in the short term is to install this outdated version: > > http://www.mediawiki.org/wiki/Extension:ImageMap_%28McNaught%29 > > >How is that with you? I could not yet find out how the .map file works, but as long as we get it working it should be good enough. We can uninstall the extension at the end of the year. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From nwolcott2ster at gmail.com Thu Nov 8 06:38:17 2007 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Thu, 8 Nov 2007 09:38:17 -0500 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) References: <200711071830.32572.donovan@abs.net> <473286C1.9000309@novomail.net> Message-ID: <00a501c82215$0862a800$660fa8c0@atlanticbb.net> ISBN's are free in Canada. The application form is on their website. (google isbn canada). You do need a Canada address however for them to send you the ISBN's. Canada ISBN's are not searchable at Barnes and Noble for example, at least when I tried at their store fpr one (a real book) nothing came up. They may have been using Booksin Print which only has US ISBN's. nwolcott2 at post.harvard.edu ----- Original Message ----- From: "Lee Passey" To: "Project Gutenberg Volunteer Discussion" Sent: Wednesday, November 07, 2007 10:47 PM Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) > D Garcia wrote: > > > Why would PG want to buy ISBNs when so many of the titles already have one > > assigned? > > ISBNs were not invented until 1966. Because most of the PG corpus is in > the Public Domain, and first published long before 1966, if a title > /does/ have one or more ISBNs assigned it's one that was assigned in a > subsequent printing by some publisher. As demonstrated by Mr. > Perathoner's example of September 5, some of the most popular titles can > have dozens, if not hundreds, of ISBNs, each assigned by a different > publisher. > > Indeed, the ISBN is most useful in identifying a /publisher/ not a title > or an author. In fact, the only real use I can see for an ISBN is so > when a bookstore owner is running low on stock (or has a request for a > rare book) s/he can go to Bowker's Books In Print, find the publisher, > and call in another order. For someone outside the retail chain ISBNs > are virtually useless. > > Most of the PG corpus probably did come from books that had ISBNs, and > some are an amalgam of multiple books each having its own ISBN. Whatever > these ISBNs were (if they existed at all), however, is lost in the mists > of time. > > The prices mentioned here for ISBNs is if you obtain them from the U.S. > ISBN agency, which is R.R.Bowker Co. Project Gutenberg is an > international organization, so if it really wanted to obtain a block of > ISBNs for its own use it makes sense to me to obtain them from a /non/ > U.S. agency from which they are typically available at a /much/ reduced > price (free, if some reports can be believed). > > It may be that the need for an ISBN comes from a POD provider, and there > is no intent to ever register a title for inclusion in Books In Print. > If the POD provider doesn't validate the ISBN, its possible to just make > one up. > > Of course, it would be bad form to claim an ISBN that some other company > has, or may have, the rights to use. However, ISBNs have an interesting > property: the last digit in the ISBN is a checksum digit. 
Because this > checksum digit is based on a calculation modulo 11, for every valid ISBN > there are 10 /invalid/ ISBNs which differ only by the last, or checksum, > digit. > > If PG wanted to create a number that could be used in place of an ISBN, > without risk that it would ever collide with a real ISBN it would > suffice to create a method of generating unique 9-digit numbers, compute > the standard ISBN checksum, and then add 1 (or some other number less > than 11) to the checksum before computing the modulus. You wouldn't be > able to register the publication with Books In Print, but for all other > uses it ought to be just fine. > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From creeva at gmail.com Thu Nov 8 07:10:35 2007 From: creeva at gmail.com (Brent Gueth) Date: Thu, 8 Nov 2007 10:10:35 -0500 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) In-Reply-To: <00a501c82215$0862a800$660fa8c0@atlanticbb.net> References: <200711071830.32572.donovan@abs.net> <473286C1.9000309@novomail.net> <00a501c82215$0862a800$660fa8c0@atlanticbb.net> Message-ID: <2510ddab0711080710s424fe56dx98b8e5eeefcc614@mail.gmail.com> Beyond the Canadian suggestion - why don't we work with the creative commons folks to come up with an open format since I'm sure they are going to run into the same issue at some point. If a collaboration worked together for a .10 or .5 maintenance fee for each I'm sure there would be a large adoption for the open community to start registering more items if the barrier to entry was significantly lowered. On Nov 8, 2007 9:38 AM, Norm Wolcott wrote: > ISBN's are free in Canada. The application form is on their website. (google > isbn canada). You do need a Canada address however for them to send you the > ISBN's. Canada ISBN's are not searchable at Barnes and Noble for example, at > least when I tried at their store fpr one (a real book) nothing came up. > They may have been using Booksin Print which only has US ISBN's. > > > nwolcott2 at post.harvard.edu > > ----- Original Message ----- > From: "Lee Passey" > To: "Project Gutenberg Volunteer Discussion" > Sent: Wednesday, November 07, 2007 10:47 PM > Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) > > > > D Garcia wrote: > > > > > Why would PG want to buy ISBNs when so many of the titles already have > one > > > assigned? > > > > ISBNs were not invented until 1966. Because most of the PG corpus is in > > the Public Domain, and first published long before 1966, if a title > > /does/ have one or more ISBNs assigned it's one that was assigned in a > > subsequent printing by some publisher. As demonstrated by Mr. > > Perathoner's example of September 5, some of the most popular titles can > > have dozens, if not hundreds, of ISBNs, each assigned by a different > > publisher. > > > > Indeed, the ISBN is most useful in identifying a /publisher/ not a title > > or an author. In fact, the only real use I can see for an ISBN is so > > when a bookstore owner is running low on stock (or has a request for a > > rare book) s/he can go to Bowker's Books In Print, find the publisher, > > and call in another order. For someone outside the retail chain ISBNs > > are virtually useless. > > > > Most of the PG corpus probably did come from books that had ISBNs, and > > some are an amalgam of multiple books each having its own ISBN. 
Whatever > > these ISBNs were (if they existed at all), however, is lost in the mists > > of time. > > > > The prices mentioned here for ISBNs is if you obtain them from the U.S. > > ISBN agency, which is R.R.Bowker Co. Project Gutenberg is an > > international organization, so if it really wanted to obtain a block of > > ISBNs for its own use it makes sense to me to obtain them from a /non/ > > U.S. agency from which they are typically available at a /much/ reduced > > price (free, if some reports can be believed). > > > > It may be that the need for an ISBN comes from a POD provider, and there > > is no intent to ever register a title for inclusion in Books In Print. > > If the POD provider doesn't validate the ISBN, its possible to just make > > one up. > > > > Of course, it would be bad form to claim an ISBN that some other company > > has, or may have, the rights to use. However, ISBNs have an interesting > > property: the last digit in the ISBN is a checksum digit. Because this > > checksum digit is based on a calculation modulo 11, for every valid ISBN > > there are 10 /invalid/ ISBNs which differ only by the last, or checksum, > > digit. > > > > If PG wanted to create a number that could be used in place of an ISBN, > > without risk that it would ever collide with a real ISBN it would > > suffice to create a method of generating unique 9-digit numbers, compute > > the standard ISBN checksum, and then add 1 (or some other number less > > than 11) to the checksum before computing the modulus. You wouldn't be > > able to register the publication with Books In Print, but for all other > > uses it ought to be just fine. > > _______________________________________________ > > gutvol-d mailing list > > gutvol-d at lists.pglaf.org > > http://lists.pglaf.org/listinfo.cgi/gutvol-d > > _______________________________________________ > gutvol-d mailing list > gutvol-d at lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Thu Nov 8 15:10:35 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 8 Nov 2007 18:10:35 EST Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) Message-ID: brent said: > why don't we work with the creative commons folks > to come up with an open format yeah! fight the power! up with the people! right on! :+) listen, folks, i'm glad i got you all "musing" and everything, but i should've just asked michael backchannel about this... i will almost certainly need isbn's -- real ones, the u.s. kind, which will have to be purchased from the bowker b*st*rds -- because my guess is that's what google requires these days... (and objective number 1 is the google system, so we _know_ that people are informed they can read these books for free.) and even if google doesn't, then the p.o.d. place i use might. (because objective number 2 is giving people pretty output.) and bookstores absolutely do. not that i intend to put books in bookstores, but i'm not gonna turn down any orders either. so bowker's books-in-print is one target. and so is amazon. but, you know, best of luck with that whole revolution thing. no longer will we allow the i.s.b.n. to step down on our neck! -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071108/59a9d87d/attachment.htm From jon at noring.name Thu Nov 8 15:16:22 2007 From: jon at noring.name (Jon Noring) Date: Thu, 8 Nov 2007 16:16:22 -0700 Subject: [gutvol-d] Musings on ISBNs (was Re: p.o.d. of the entire catalog) In-Reply-To: References: Message-ID: <1488200996.20071108161622@noring.name> Bowerbird said: > listen, folks, i'm glad i got you all "musing" and everything, > but i should've just asked michael backchannel about this... Probably. And definitely Greg or Michael needs to call the Bowker folk to get their pricing for 50,000 ISBNs. My "musings" were to aid in understanding the role and alternatives to ISBN, but I also noted the pragmatic reality of getting U.S. ISBNs. Jon From gutenberg at gagravarr.org Sat Nov 10 10:23:58 2007 From: gutenberg at gagravarr.org (Nick Burch) Date: Sat, 10 Nov 2007 18:23:58 +0000 (GMT) Subject: [gutvol-d] UK based volunteer to scan a few books? Message-ID: Hi All I hope this is the right volunteer list to post on for this... I've got 7 books from the late 19th century, which I've checked and are out of copyright, and seem interesting enough to contribute to the project. However, I don't have a scanner. Is there a volunteer in the UK who'd be interested in scanning them in, if I were to post the books to them? (I'll happily pay for postage) Nick From desrod at gnu-designs.com Sat Nov 10 17:08:57 2007 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Sat, 10 Nov 2007 20:08:57 -0500 Subject: [gutvol-d] "Turning the Pages of an eBook - Realistic Electronic Books" Message-ID: <1194743337.6413.2.camel@localhost.localdomain> I just found this Google video on YouTube, and found some of the items discussed (as well as all the eye-candy demos), to be quite interesting, especially with regard to our recent discussions about digitizing ebooks in a way that represents the "real" book structure. http://www.youtube.com/watch?v=9Y-BM3Z5xy0 -- David A. Desrosiers desrod at gnu-designs.com setuid at gmail.com http://projects.plkr.org/ Skype...: 860-967-3820 From ajhaines at shaw.ca Sun Nov 11 14:49:02 2007 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sun, 11 Nov 2007 14:49:02 -0800 Subject: [gutvol-d] Multi-volume book set with master index Message-ID: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> I'm working on a 4-volume set of books. The set's master index is in volume 4. Which is preferred: - to also include the index in volumes 1-3 of the set (for readers' convenience), or - to leave those volumes as they are? Regards, Al From ralf at ark.in-berlin.de Mon Nov 12 01:28:58 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 12 Nov 2007 10:28:58 +0100 Subject: [gutvol-d] Multi-volume book set with master index In-Reply-To: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> References: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> Message-ID: <20071112092858.GA28414@ark.in-berlin.de> > Which is preferred: > > - to also include the index in volumes 1-3 of the set (for readers' > convenience), or > - to leave those volumes as they are? I have the same problem somewhere on the horizon and I settled my plans with doing all four without index and a fifth complete with index edition. YMMV. 
ralf From gbnewby at pglaf.org Mon Nov 12 08:16:52 2007 From: gbnewby at pglaf.org (Greg Newby) Date: Mon, 12 Nov 2007 08:16:52 -0800 Subject: [gutvol-d] Multi-volume book set with master index In-Reply-To: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> References: <000501c824b5$0e565ac0$6401a8c0@ahainesp2400> Message-ID: <20071112161652.GB6326@mail.pglaf.org> On Sun, Nov 11, 2007 at 02:49:02PM -0800, Al Haines (shaw) wrote: > I'm working on a 4-volume set of books. The set's master index is in volume > 4. > > Which is preferred: > > - to also include the index in volumes 1-3 of the set (for readers' > convenience), or > - to leave those volumes as they are? > Hi, Al. It's definitely up to you. If the index will be live (that is, hyperlinks into the right locations in the HTML documents for the different volumes), it will be challenging to set up the links to external files (because you won't know the eBook #). We can pre-assign a set of eBook #s, but even so that's not so user-friendly (since people could rename after download). If it's not live/linked, then this is less of an issue. To me, having a duplication of the index, with references to each separate volume, would be slightly more user-friendly at the expense of making the individual volumes' files larger. -- Greg From jon at noring.name Tue Nov 13 10:42:28 2007 From: jon at noring.name (Jon Noring) Date: Tue, 13 Nov 2007 11:42:28 -0700 Subject: [gutvol-d] Announcing: The Digital Text Community mailing list Message-ID: <111525527.20071113114228@noring.name> Everyone, I am announcing the start of "The Digital Text Community", a public mailing list (on YahooGroups) devoted to serious discussion of digitizing "ink-on-paper" publications. The full group description is found at the group's "home page" at: http://groups.yahoo.com/group/digital-text/ The primary reason why I am starting DTC is that there is, suprisingly, no independent forum to discuss the various technical and non-technical issues of digitizing "ink-on-paper" publications. Current discussion on digitizing paper publications is disjointly spread around in various nooks and crannies of the Internet. For example, there are forums for particular digitization projects such as those run by Project Gutenberg (e.g. "gutvol-d") and Distributed Proofreaders (an online set of forums.) And then there are forums which touch upon various issues of text digitization but which is not their main focus. Examples are Book People (which John Mark Ockerbloom is closing the end of the month) and The eBook Community (a YahooGroup which I administer.) The summary purpose of DTC is given in the last paragraph of the DTC group description: "This group is not affiliated with any particular project or organization, but rather is independent. It is hoped this group will be a bridge between the various text digitization projects, enabling information exchange for everyone?s benefit." Do consider subscribing to DTC. If you need any help with subscribing to the group, let me know. Look forward to seeing you there! Jon Noring The Digital Text Community Administrator From Bowerbird at aol.com Thu Nov 15 09:08:34 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 15 Nov 2007 12:08:34 EST Subject: [gutvol-d] rumor-mongers Message-ID: the rumor-mongers have another "announce-date" for the kindle. (that's amazon's reader-machine, if you're out of the rumor loop.) this time it's monday, november 19th. could be. but i doubt it. i seriously doubt it. 
but nonetheless, because one guy says he is invited to an amazon press-conference, and he says it "might" be for the purpose of announcing the kindle... well, heck... that's all the good reason that the rumor-mongers need to rerun their tired speculation again. and so we have it. mobileread runs it: > http://www.mobileread.com/forums/showthread.php?t=16111 and the rothman teleblawg runs it: > http://www.teleread.org/blog/?p=7637 of course, they ran similar items back in september, promising that _october_ would be the due-date, and they also followed up on a n.y. times article (a retraction of the october prediction that moved it up to "end-year"), until i reminded them that only the most clueless of companies would release a niche gadget product _then_, after the _very_end_ of the year's big gift-buying season... and, believe me, amazon is _not_ a "clueless" company. and of course, they _also_ ran similar items back this _spring_, predicting that the release would be then. oh, and of course, they _also_ ran similar items _last_fall_ -- yes, they're now over a year late on their original predictions -- so maybe you should examine their track-record on this and decide you just don't have time for this kind of noise... for me, on the other hand, this stuff is wildly amusing... i don't know how i'd get through a week without having teleblawg speculation giving me laughs along the way... i definitely couldn't make this stuff up. -bowerbird p.s. and rothman, once again, even brought back his _$50_ pricepoint, this time talking about o.l.p.c., in an entry today. somehow, the rumors always seem sparkly and fresh-baked. it's amazing, isn't it? ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071115/6bb74589/attachment.htm From sly at victoria.tc.ca Fri Nov 16 14:55:14 2007 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri, 16 Nov 2007 14:55:14 -0800 (PST) Subject: [gutvol-d] International Year of Languages Message-ID: I'm just going to throw out an idea here to see what people think. The U.N. General Assembly has declared 2008 to be the "International Year of Languages". What do fellow gutvol-d inhabitants think of the idea of having a day, or perhaps a week, where we try to have texts posted in as many languages as possible. This could mean "saving up" some of them, so as to have them all ready around the same time. I could envision that making a good press release. I also have ideas for different places and people I could go to, to encourage more participation in different langauges. This might be easier if I can say that it is to be done for a special day, or event. I've tried to see if there is one particular day, or time of year that would be most appropriate. I can find a number of schools having some kind of "World Language Day" in 2008, but they are all on different days. What might work best for the purposes of PG? Feedback? Andrew From piggy at netronome.com Fri Nov 16 19:35:02 2007 From: piggy at netronome.com (La Monte H.P. Yarroll) Date: Fri, 16 Nov 2007 22:35:02 -0500 Subject: [gutvol-d] International Year of Languages In-Reply-To: References: Message-ID: <473E6166.9060601@netronome.com> Andrew Sly wrote: > I'm just going to throw out an idea here to see what people think. > > The U.N. General Assembly has declared 2008 to be the > "International Year of Languages". 
> > What do fellow gutvol-d inhabitants think of the idea of having > a day, or perhaps a week, where we try to have texts posted in > as many languages as possible. This could mean "saving up" some > of them, so as to have them all ready around the same time. > I think this is a delightful idea. I have a nice set of small Georgian books I've been meaning to put through DPEU. I also have an Enga book I think I can clear. > I could envision that making a good press release. > > I also have ideas for different places and people I could > go to, to encourage more participation in different langauges. > This might be easier if I can say that it is to be done for a > special day, or event. > > I've tried to see if there is one particular day, or time of > year that would be most appropriate. I can find a number of > schools having some kind of "World Language Day" in 2008, > but they are all on different days. What might work best for > the purposes of PG? > What about trying to pick a day for each language appropriate to that language? > Feedback? > > Andrew > From ricardofdiogo at gmail.com Sat Nov 17 12:46:46 2007 From: ricardofdiogo at gmail.com (Ricardo F Diogo) Date: Sat, 17 Nov 2007 20:46:46 +0000 Subject: [gutvol-d] International Year of Languages In-Reply-To: <473E6166.9060601@netronome.com> References: <473E6166.9060601@netronome.com> Message-ID: <9c6138c50711171246q5e9dc194r19781d79da1a7867@mail.gmail.com> 2007/11/17, La Monte H.P. Yarroll : > What about trying to pick a day for each language appropriate to that > language? Sounds great. For Portuguese it'd be June 10 (Day of Portugal, Camoes and the Portuguese Communities) and November 5 (Day of Portuguese Language in Brazil). Ricardo From Bowerbird at aol.com Mon Nov 19 10:42:27 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Mon, 19 Nov 2007 13:42:27 EST Subject: [gutvol-d] the game Message-ID: ok, the game just got a little more interesting: :+) > http://www.amazon.com/gp/product/B000FI73MA/ref=sa_menu_kdp3/103-5393010-8448654 i'm interested in the utility of this thing as a general web-browser -- how well will it work, and how will such use impact the costs? -- but in general i am impressed with this. and bezos _did_ get it out before thanksgiving -- with even 3 days to spare! -- so that's good. so then, let's see what the reviews are from the early buyers... -bowerbird ************************************** See what's new at http://www.aol.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071119/208a4134/attachment.htm From Bowerbird at aol.com Thu Nov 22 12:38:21 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Thu, 22 Nov 2007 15:38:21 EST Subject: [gutvol-d] happy thanksgiving Message-ID: have a happy thanksgiving, all! :+) including any native american "indians" out there! -bowerbird ************************************** Check out AOL's list of 2007's hottest products. (http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071122/5e3175bc/attachment.htm From robert_marquardt at gmx.de Sun Nov 25 21:13:22 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Mon, 26 Nov 2007 06:13:22 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> Message-ID: <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> On Thu, 08 Nov 2007 08:41:47 +0100, you wrote: >My idea is to create an "Adventskalender" for the Christmas time. >Here random example found by Google image search >http://www.gedichte-garten.de/adventskalender/adventskalender.shtml > >A free Christmas or winter image and some numbers placed on it should do the trick. The numbers linked to books from the >Christmas Bookshelf. Rigged up on Nov 31 and deleted on Dec 25. Marcelo has installed the extension, but now i am completely unable to do the work and time is running short. The calendar should be rigged up at Nov 31. Can i get some help? I asked Juliet Sutherland from DP to give me a list of 24 Christmas books. To complete the work we need a free picture. Best a simple winter landscape instead of such agressive Santa pictures. The numbers 1 to 24 have to be painted upon it with equal-sized boxes around it. This should be not too complicated. I fear i have to lean on Marcelo to create the imagemap. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From ralf at ark.in-berlin.de Mon Nov 26 03:00:31 2007 From: ralf at ark.in-berlin.de (Ralf Stephan) Date: Mon, 26 Nov 2007 12:00:31 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> Message-ID: <20071126110031.GA6402@ark.in-berlin.de> You wrote > To complete the work we need a free picture. Best a simple winter landscape instead of such agressive Santa pictures. Take 24: http://commons.wikimedia.org/wiki/Winter ralf From robert_marquardt at gmx.de Mon Nov 26 21:34:32 2007 From: robert_marquardt at gmx.de (Robert Marquardt) Date: Tue, 27 Nov 2007 06:34:32 +0100 Subject: [gutvol-d] I want the imagemap extension for the Wiki In-Reply-To: <20071126110031.GA6402@ark.in-berlin.de> References: <43f5j3lminejf1v89jee16u88h0ogim6ta@4ax.com> <77lkk3hobo9ttk6nk044hl674ag5qm0neu@4ax.com> <20071126110031.GA6402@ark.in-berlin.de> Message-ID: <92bnk39lhdl5cvqpqtni5vqps9kd5hnjab@4ax.com> On Mon, 26 Nov 2007 12:00:31 +0100, you wrote: >You wrote >> To complete the work we need a free picture. Best a simple winter landscape instead of such agressive Santa pictures. > >Take 24: > >http://commons.wikimedia.org/wiki/Winter > > >ralf Thanks, Landscape_in_Bavarian_in_wintertime.jpg is ideal. Maybe ifind the energy to do some work. -- Robert Marquardt (Team JEDI) http://delphi-jedi.org From Bowerbird at aol.com Wed Nov 28 13:20:08 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 28 Nov 2007 16:20:08 EST Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend Message-ID: hope you all had a lovely long weekend... now it's back to work... :+) *** here's a long message, on many topics, conveniently combined here to minimize wear-and-tear on your delete-key finger... ;+) *** the challenge... 
as a way of presenting the challenge to myself, i've put up a graphic showing the opening page of 3 different editions of "alice in wonderland", all of them attained from the internet archive: > http://z-m-l.com/misc/thechallenge.png as you can see, the o.c.a. gets very good o.c.r. indeed, on some books, it's amazingly accurate. you can also see that, even still, it ain't perfect... the objective is to use the different versions to converge upon an _error-free_ version of each -- with as little human interaction as possible -- retaining linebreaks idiosyncratic to each edition. for example, load this into another browser-tab: > http://z-m-l.com/misc/thechallenge2.png and toggle between the two to show differences. i'll be making my own tool to accomplish this, but i welcome the efforts of other programmers too... perhaps then we could compare _our_ outputs to come to an even _more_ satisfying convergence... if you _are_ a programmer who'd like to take this on, you should also take a quick look at both of these files: > http://z-m-l.com/go/pap/pride%20and%20prejudice(4).txt > http://z-m-l.com/go/pap/pride_and_prejudice(4).html in addition, you might want to examine these demos: > http://snowy.arsc.alaska.edu/bowerbird/oneoo/oneoo-compweball.html > http://snowy.arsc.alaska.edu/bowerbird/oneoo/oneoo-compwebone.html which involved a similar comparison-across-editions. *** oh, and by the way, i've created a version of "alice" based on yet _another_ edition -- from google -- which you can peruse here, if you would care to: > http://z-m-l.com/go/aiwon/aiwonp001.html the scans are fairly crappy, actually _really_ crappy, but hey, sometimes you get what you pay for, right? and the o.c.r. -- as you'd imagine -- was atrocious, so i used the p.g. e-text as my base. what i found was that that file -- which i'd thought was _clean_ -- actually contains quite a bit of noise. some of that might be due to it coming from another edition, yes, (the british spellings, for sure); however, there were also a few outright _errors_. someone might want to clean up the p.g. file if they can find the source-text. not that i'm crabbing about "faithfulness", mind you, or "trustworthiness", or any of those other bogeymen; just saying there are errors that you might want to fix. actually, i've often thought that "alice" was the answer for why we didn't want to have p.g. e-texts adhere to a specific version, at least in our low-bandwidth past... the two footnotes that say "later editions added this" were a particularly adept way of handling that matter, given the alternative of mounting _another_ edition varying from the first only by the additional passages. indeed, you'll note that -- even in the edition i posted, which is fairly "faithful" to the 1898 edition's scans -- i included those two notes, as a worthwhile addition... anyway, there are errors in pg#11, if anyone cares... oh yeah, and my version isn't totally clean yet either. i made so many changes that it needs a second pass. plus, since i've changed the linebreaks, you will need a tool like the one i outlined above to do the job right. 
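For anyone tempted to take up the challenge, here is a minimal sketch (in Python, and not bowerbird's tool) of the comparison step: line up two OCR'd editions word by word and report the places where they disagree, so a human, or a third edition, can settle the vote. The filenames are hypothetical, and retaining each edition's own linebreaks would need the kind of markup bookkeeping discussed later in this file.

    # a minimal sketch of the cross-edition comparison described above:
    # list the spots where two OCR'd editions disagree at the word level
    import difflib
    import re

    def words(text):
        return re.findall(r"\S+", text)

    def disagreements(edition_a, edition_b):
        a, b = words(edition_a), words(edition_b)
        matcher = difflib.SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                yield " ".join(a[i1:i2]), " ".join(b[j1:j2])

    # hypothetical filenames for two editions of "alice"
    oca = open("alice-oca.txt").read()
    pg = open("alice-pg11.txt").read()
    for left, right in disagreements(oca, pg):
        print("%-30r  %r" % (left, right))   # e.g. 'nictures'  'pictures'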
but still, in the absence of anything better, it is there: > http://z-m-l.com/go/aiwon/aiwon.zml *** since thanksgiving has come and gone, i can report on this year's update to a prediction i made two years ago: > http://www.teleread.org/blog/2005/11/29/you-can-buy-the-mit-100-lapt op-for-200/ at that time, david rothman had been yammering on for several years about "the coming $50 e-book-machine"; when he repeated this ridiculousness yet another time, i finally decided i would call him on this little bit of spin. rothman responded with: > Folks, tune in a year from now, and we?ll see who?s right i bet him that it would be _five_years_ before his cheapo machine would be readily available, or i'd buy him lunch. he came back with a lot of blah-blah-blah about o.l.p.c. well, the first year came and went, with no such machine, not at $50, and not $100, and not $150, not even _$200_, which he'd predicted for 2005, indeed not at _any_ price... so he was eating crow for thanksgiving... and now the second year has come and gone, and _still_ no $50 machine, or $100 machine. the o.l.p.c. _is_ finally available for sale -- as a charity case (i gave, did you?) -- but it's a _$200_ machine, and you have to buy _2_ of 'em. (but you only get one, as the other is your "contribution"; i'm cool with that, since it's an _extremely_ good cause...) so david's "prediction"? 2 years late, with 100% over-run. meaning he was eating crow again for _this_ thanksgiving. and, of course, you know about the other machines now. the sony costs about $300, amazon's will run you $400, there are a few other contenders in that same range, and the iliad tops out all the prices at an ungodly $600-plus. if all this doesn't make you realize that a $50 prediction back in 2005 (or even going all the way back to _1992_, which is what rothman constantly likes to remind us) is _pure_folly_, then you, my friends, grasp reality poorly... an e-book-machine is a _computer_. it needs a _chip_ and a _screen_, which are the expensive elements of any computer. so you can't make a cheap e-book-machine. and when you can make an inexpensive e-book-machine, you'll be able to make an inexpensive _computer_ as well, and _no_one_ will want a limited-usage e-book-machine, not when they can get a full computer for the same price. so it's _mindless_spin_ to talk of cheap e-book-machines. anyway, this is pretty much what i was expecting all along. next year the o.l.p.c. (and its commercial rivals) will cost about $200 (without requiring you buy more than one)... the thanksgiving after that, the price'll be around $100, and the year after that -- 5 years from my original bet -- the price _might_ drop to as low as $50. (or might not...) rothman, completely wrong. bowerbird, completely right. and hey, the o.l.p.c. has done a _big_ favor to _everyone_. by issuing the mere _threat_ to create a low-price laptop, and having the crack mary lou jepsen make good on that threat by solving the way-too-expensive-screen problem (the e-ink greedsters thought that they had a monopoly), the commercial side has been forced to attend to the task. otherwise they would have put if off as long as they could... *** anyway, back to the digitization workbench... 
a supporter of the "epub" file-format digitized a copy of "woodcraft", an early classic environmentally-geared book: > http://www.zianet.com/jgray as i've said all along, i love it when people use that format, because it requires them to put in a whole bunch of work laying out the _structure_ of the book, so it's as easy as pie for me to then remix all of their work into z.m.l. pudding... so i did that. i thought it'd be a good exercise for my .pdf converter, so i ran that, making some improvements to it along the way. i decided to keep the linebreaks in the _text_ version of the file which i had obtained from the site listed above, which meant i had to use a pointsize that's fairly _small_. (more later, since that's a problem with p.g. e-texts too.) there was a _glossary_ in this edition, so i expanded the _footnote_ routines to handle glossary items as well, and that's a nice addition. it finds the terms automatically, so -- other than enclosing the words within [brackets] in the glossary section -- there's nothing else you'll need to do. (the routine finds the terms in the body-text all by itself, and creates the front-links and back-links automatically.) kinda nifty, if i do say so myself. indeed, i got kind of link-happy. first, i figured that i would create an _html_ version of the text as well. easy enough... then, i decided to have every page of the .pdf _link_up_to_ the .html version online, to demonstrate how you would do scholarly references in my z.m.l. cyberlibrary infrastructure. so each .pdf page (which represents _my_ "original" p-page) also links up to the online .html version, which _also_ mimics the "original" p-page, even displaying the "scan" next to it... so what we've got are _throughly_cross-linked_versions_ that are faithful (gawd, there's that stupid word again) "reprints" of the "original" p-book. this interlocking mesh makes me happy. there were also two references in the book to _other_ books, each time to a specific _page_ in that other book, so i linked those references in the .pdf to those _pages_ in those _books_, again demonstrating how scholarly references are accomplished. some people make a big deal out of such interbook linking, but i show that it's a very simple matter of straightforward execution. in addition, images are two-way-linked to the list of illustrations, and to the next-and-previous illustrations, _and_ to a full-page version of the illustration, plus to an online version of the image. more links than you can shake a stick at, and i wasn't done yet... jon noring has a reference i.d. for every _paragraph_ in his demo version of "my antonia", so i figured i had to match that capacity... and then i decided i'd do that one better, just to be interesting... so i included in the .pdf links to every _line_ in the .html version. but you cannot win this game if you only think one move ahead. so i decided instead that i would link to every _word_ in the .html. that's right, click on any _word_ in that .pdf, and your browser will jump to that exact _word_ in the canonical .html version online... and if noring finds it necessary, i'll link to every goshdarn _letter_. i've already coded the routine, jon. all i have to do is toggle a flag, and _boom_ it goes. don't make me push that button, jon... :+) anyway, all these .pdf links are created automatically, as per z.m.l. 
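a minimal sketch of that glossary idea, in python, as an illustration of the approach rather than the z.m.l. routine itself: terms marked [like this] in the glossary get a link from their first appearance in the body, and each glossary entry gets an anchor plus a back-link.

    import re

    def link_glossary(body_html, glossary_html):
        # terms are whatever the glossary encloses in [brackets]
        terms = re.findall(r"\[([^\]]+)\]", glossary_html)
        for term in terms:
            slug = re.sub(r"\W+", "-", term.lower())
            # front-link: wrap the first occurrence of the term in the body
            pat = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
            body_html, n = pat.subn(
                '<a id="body-%s" href="#gloss-%s">\\g<0></a>' % (slug, slug),
                body_html, count=1)
            # glossary entry: anchor, plus a back-link if the term was found
            back = ' <a href="#body-%s">[back]</a>' % slug if n else ""
            glossary_html = glossary_html.replace(
                "[%s]" % term,
                '<span id="gloss-%s">%s</span>%s' % (slug, term, back), 1)
        return body_html, glossary_html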
rework the parameters, as far as rewrapping or resizing the text, or changing the number of lines per page, and all of the links are automatically recomputed and recreated, without any intervention. after all, that's the kind of thing that computers are good at, right? *** ok, so here's where i admit that i was _wrong_. it doesn't happen too often, folks, because i'm not wrong very often, but when i am, i always admit it, and that's what i'm doing right now, so listen up. whenever i used to think about conversions _from_ z.m.l. format into other formats -- even ones like .html and .pdf which i clearly acknowledged as _useful_ ones -- i downplayed them in my mind. that's because my mission is to make those formats _unnecessary_. so offering a _conversion_ facility just seemed like a waste of time... thus, i put these converters way _way_ down on my list of priorities. indeed, the only reason they were on my list of priorities _at_all_ is because i figured i had to have parity with the heavy-markup crowd, (and -- at least early on -- the ability to convert to "any other format" was one of their big selling points. but they've backed off that now.) i assumed _some_ people would get some usage out of converters, but i never thought that _i_ would have much use for them, if any... however... now that i've got some excellent versions of these converters done, i realize that i was wrong, wrong, wrong. i am going to have _lots_ of use for these babies, yes i am, both the .html versions _and_ .pdf, and -- most especially, i realize now -- the _combination_ of the two! i've been able to imbue them with the same kind of super-navigation that i've always had in my z.m.l. viewer-program, so they are a _great_ way to demo that fantastic feature, so people will be able to get ideas about what z.m.l. means in practice. but also, with the ability to _link_ the versions to each other -- especially the .html version on the web, which serves as the "canonical" version for reference purposes -- i've attained a coherent synergistic package that will be very hard to beat. (just to give one example, i've always thought of annotations like this.) and with the ability of these formats to go places where my viewer-app might not run, i've basically got all the bases covered for my approach. which means that i'm gonna get lotsa mileage outta these converters... so, i was wrong, and i wouldn't have discovered that if i hadn't persisted in coding these converters, so i'd like to thank the people who made me think that these converters were "necessary" in some fashion or another. anyway, i'll be posting all of these files online in the next few days, and i'll let you know when they're available... *** here's the "secret diary of the amazon kindle": > november 18th -- businessweek cover-strory goes up on the web > november 19th -- press conference where jeff announces the thing > november 20th -- whoa! we've now sold all 36 units we had in stock! > november 21st -- place order for another 36 units, with a _rush_ on it. > november 22nd -- thanksgiving been berry berry good to us, yes sir... > november 23rd -- sold out again! place new order, doubling size (72). > november 24th -- back friday rocked! place _another_ double order! > november 25th -- ok, things are settling down, after the initial frenzy. > november 26th -- monday's are _always_ kinda slow with web orders... > november 27th -- maybe we can place a double order (72) tomorrow. > november 28th -- make it a single order (36) -- better safe than sorry. 
> november 29th -- we seem to have settled in at 9 orders per day. ok... > november 30th -- yep, another 9 orders today. (well, 8, but that's close.) > december 1st -- christmas is on the way, so let's gear up some hype, ok? > > 252 -- total units ordered > 240 -- total units sold > ---- > 012 -- units still held in stock that's just my little "funny" on the people who are saying "wow, the kindle is _sold_out_, so it _must_ be a success!" since we don't know how many units they had in the first place. (and it's ironic, because of all the _rumors_ about the kindle, i don't think _a_single_one_ ever mentioned a _production_run_.) but hey, even if the kindle turns out to be a complete bust, it won't "fail", as bezos has deep enough pockets to keep it around forever if he wants. nobody uses the "wiki" that is offered for every book on the amazon site, but amazon lets it hang around anyway. it'll be the same with the kindle. and perhaps even more importantly, the kindle _won't_ be "a complete bust". yeah, yeah, the d.r.m. stinks -- "defective by design", as the expression goes -- but ordinary people are amazingly tolerant of d.r.m. (until it bites their butt)... and yeah, yeah, there's a wide range of other problems with the kindle as well. but so what? _every_ e-book-machine that has gotten put into enough hands has managed to find a good number of fans. people _loved_ their rocketbooks. they loved their ipaqs. they now love their sony-readers, and love their iliads... and it's all for the exact same reasons that people love _paper_ books, because the love you feel for the _content_ slops over to the medium on which you read. so a good percentage of the people who _buy_ a kindle will _love_ their kindle... not matter _what_ any "critics" say. and that's the bottom line. *** so, really, a back-and-forth on the positives and negatives of the kindle is just the sound of a lot of people yacking... but -- to _my_ mind, anyway -- what _is_ an interesting is why did bezos announce-and-release this thing the way he did? i thought he would be smart enough to pre-announce it and use amazon's huge hype-machine to spin away all of the negatives before the machine was released, and to hype enough interest to make large crowds of buyers appear immediately. if he really wanted to sell this thing for christmas, he'd have announced it in june, and produced enough units that they were available in brick-and-mortar stores... as it was, this mid-november release was too little and too late, and it missed out on its chance for a long marketing campaign, and had to face criticism right away. all of this makes me think that amazon felt forced to announce before they wanted. and the reason _that_ makes me scratch my head is because i saw the _same_thing_ happen with the recent charity-angle sale of the o.l.p.c., which negroponte had been insisting for years that he wouldn't do. (and he had good reasons for that decision.) are these two premature releases related? might one of them have caused the other? if the o.l.p.c. machine proves to be a good book-reader, could it have been seen by amazon to be a "first mover" whom they needed to compete against? or vice versa? or... what if _both_ these premature releases were caused by another development? what if -- as some have speculated -- apple is announcing a tablet-mac in january? of even just a new paperback-sized ipod touch? (what a sweet e-reader that'd be!) if amazon and/or o.l.p.c. 
got wind of an upcoming tablet-mac, they might've thought they had to get _something_ out the door, and pronto, or be completely swept away... -bowerbird ************************************** Check out AOL's list of 2007's hottest products. (http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071128/74649405/attachment-0001.htm From jon at noring.name Wed Nov 28 14:04:32 2007 From: jon at noring.name (Jon Noring) Date: Wed, 28 Nov 2007 15:04:32 -0700 Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend In-Reply-To: References: Message-ID: <1579433771.20071128150432@noring.name> Bowerbird wrote, in part: > jon noring has a reference i.d. for every _paragraph_ in his demo > version of "my antonia", so i figured i had to match that capacity... Was your motive to match that capacity because I did it, or because it's simply a good thing to do for the benefit of users? It's hard to tell if your motives are to show me up, or to benefit the end-user. Those who read your messages may get the impression you have one very large chip on your shoulder. > and then i decided i'd do that one better, just to be interesting... > so i included in the .pdf links to every _line_ in the .html version. This is also doable in XML since I mark the location of line breaks, an "id" can be added to those if desired. Or having "id" on all the major block-level stuff, one can use the formalism of XPointer to address right down to a letter in a word. > but you cannot win this game if you only think one move ahead. The winners here should be the users, not the developers. > so i decided instead that i would link to every _word_ in the .html. > that's right, click on any _word_ in that .pdf, and your browser will > jump to that exact _word_ in the canonical .html version online... > > and if noring finds it necessary, i'll link to every goshdarn _letter_. > i've already coded the routine, jon.? all i have to do is toggle a flag, > and _boom_ it goes.? don't make me push that button, jon...???? :+) Great work! The important thing is that you've come to realize, as I have been talking about for years, the importance of robust inter- and intra-publication linking. Glad to see you are implementing this in your system. Jon Noring From Bowerbird at aol.com Wed Nov 28 16:04:18 2007 From: Bowerbird at aol.com (Bowerbird at aol.com) Date: Wed, 28 Nov 2007 19:04:18 EST Subject: [gutvol-d] the sad man still sitting back at the poker table Message-ID: after i'd spent a little time joking around with a few of the other players, cashed in my chips, and had a nice seafood meal in the casino restaurant (during which i tossed back more than a couple of glasses of champagne), i was leaving the joint when i spotted one lonely player still back at the table. he dealt some cards around, to the empty chairs, and then i heard him mutter, "i'll see your bet, and raise you a _new_ e-book listserve", as if the game were still on, and he had any chips left. i snorted out a big laugh, and hit the road... -bowerbird ************************************** Check out AOL's list of 2007's hottest products. (http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071128/9e3b892d/attachment.htm

From lee at novomail.net Thu Nov 29 11:19:36 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 29 Nov 2007 12:19:36 -0700
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: 
References: 
Message-ID: <474F10C8.8050003@novomail.net>

Bowerbird at aol.com wrote:

> hope you all had a lovely long weekend...

Well, it was a bit frustrating for the very reasons you allude to below.

[snip]

> the objective is to use the different versions to
> converge upon an _error-free_ version of each
> -- with as little human interaction as possible --
> retaining linebreaks idiosyncratic to each edition.

[snip]

> i'll be making my own tool to accomplish this, but
> i welcome the efforts of other programmers too...
> perhaps then we could compare _our_ outputs to
> come to an even _more_ satisfying convergence...

I find that the whole problem space quickly gets very thorny. You see, the line breaks you want to retain constitute markup, as do little things like blank lines or indentation to represent paragraph breaks. So the problem becomes how to compare multiple versions of OCRed text without losing the markup.

My strategy has been to leverage the GNU diff program, which is quite sophisticated and quite powerful. diff, like all difference engines I am aware of, takes a line-oriented approach: it identifies lines of text which are different, in the context of other lines which are identical. So, in order to use a difference engine like diff (or Beyond Compare, for that matter) the texts to be compared need to be normalized so that, as much as possible, similar text begins similarly. Additionally, good normalization will allow differing text to be synchronized regularly. So the goal is to create normalized texts consisting of a number of lines which start in uniform locations, and which are relatively short, but not so short that a difference engine can't resync as needed.

The basic unit of language seems to me to be the sentence, so it makes sense that a good starting point would be to start each sentence on its own line. Now it's really hard for a computer program to figure out what /is/ a sentence without Natural Language Processing, so I decided to simply start a new line at the first whitespace following sentence-ending punctuation (.?!). This will sometimes cause lines to be broken in odd places (e.g. Dr., Mr., z.m.l. or e.g.) but creation of several smaller lines for comparison purposes is not really a drawback in this instance.

Of course, older texts, particularly 19th century texts, use extremely long sentences, so simply creating lines according to punctuation doesn't really create lines which are short enough for comparison purposes. So I chose, for no other reason than gut feeling, to also wrap lines at 50 characters, at whitespace delimiters.

My experience showed, however, that one of the most common OCR errors is in interpreting random defects in the paper as punctuation, or in miscounting the number of spaces between words. A single perceived (but not real) punctuation mark can throw off several lines of text. So I designed my word-wrapping function to not count whitespace and punctuation when determining a break point. I chose to preserve blank lines that would normally be wrapped otherwise, exclusively because it made it a little more convenient for me during development; there is no reason it should be necessary.
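(A quick sketch in Python of the normalization step described above; the function name and its details are illustrative guesses from the description, not the actual tool, and it skips the blank-line and markup handling:)

import re

def normalize(text, width=50):
    """Break text into short, comparable lines: start a new line at the
    first whitespace after sentence-ending punctuation, then wrap at
    roughly `width` characters, counting only word characters so that a
    stray OCR'd punctuation mark or extra space can't shift the breaks."""
    flat = ' '.join(text.split())                  # collapse existing breaks
    flat = re.sub(r'([.?!])\s+', r'\1\n', flat)    # sentence-ish boundaries
    out = []
    for sentence in flat.split('\n'):
        line, count = [], 0
        for word in sentence.split():
            line.append(word)
            count += sum(ch.isalnum() for ch in word)  # ignore punctuation/space
            if count >= width:
                out.append(' '.join(line))
                line, count = [], 0
        if line:
            out.append(' '.join(line))
    return '\n'.join(out)

Each OCR edition would be run through the same routine before being handed to diff; the markup that the collapse discards is what the separate data segment described below is meant to preserve.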
Thus, the first chapter of Alice in Wonderland from IA, normalized, would be (I have prefaced each line with '>' to try and prevent mail clients from wrapping the quotes):

> DOWN THE BABBIT-HOLE.
>
> ALICE was beginning to get very tired of sitting by her
> sister on the bank, and of having nothing to do: once or twice
> she had peeped into the book her sister was reading, but it had
> no nictures or conversations in it, "and what is
>
> 2 DOWN THE
>
> the use of a boot," thought Alice, "without pictures or
> conversations?"
>
> So she was considering in her own mind, (as well as she could, for
> the hot day made her feel very sleepy and stupid,) whether the
> pleasure of making a daisy - chain would be worth the trouble

This same passage from the 2003 Perathoner edition would be:

> Down the Rabbit-Hole
>
> Alice was beginning to get very tired of sitting by her
> sister on the bank, and of having nothing to do: once or twice
> she had peeped into the book her sister was reading, but it had
> no pictures or conversations in it, "and what is the use of a
> book," thought Alice "without pictures or conversation?"
>
> So she was considering in her own mind (as well as she could, for
> the hot day made her feel very sleepy and stupid), whether the
> pleasure of making a daisy-chain would be worth the trouble

As you can see, the two passages line up quite well. If the header/footer text can be extracted from the IA text the two passages would probably line up precisely.

The obvious problem with this normalization process is that important markup (for you line breaks, for me much more) is lost. My solution to this problem is thanks to Matt Russotto, who pointed out to me that markup can be stored segregated from its text. Thus, when normalizing any marked-up text, whenever markup is encountered you could record in a separate data segment the place where the markup occurs in the normalized text. For example, if your markup for a line break is "\n", page breaks are "\pg", and paragraphs are "\p", and you were normalizing the IA text of Alice you might have a data segment something like:

\n:1:21 \n:2:0 \p:3:0 \n:3:40 \n:4:33 \n:5:19 \n:6:2 \n:6:48 \n:7:0
\pg:8:0 \n:8:10 \n:9:0 \n:10:32 \n:11:15 \n:12:0 \p:13:0 \n:13:39

It should now be possible to "de-normalize" the normalized text by adding back in the markup and get a file identical to what you started with. (This is an important test and validation point; before continuing development, make sure that you can normalize and de-normalize files without data loss or change.)

Now, using the above two normalized passages from _Alice in Wonderland_, you should be able to use diff's patch capability to merge changes from one normalized text into the other normalized text, then use your de-normalize routine to add the markup back into the corrected text.

Not surprisingly, this "merge and de-normalize" process is much more complex than it sounds. As a trivial example, if the merge process causes lines to be added to or deleted from the master text, all of the markup locations stored in the data segment will become invalid. Likewise, if a change causes a word length to change (as in the infamous 'modem' vs. 'modern' scanno) the location of your line break is going to shift incorrectly.
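(Another rough Python sketch, this time of the markup-segregation and round-trip idea just described; the helper names and token handling are hypothetical, inferred from the "token:line:column" entries above rather than taken from any real tool:)

def strip_markup(lines, tokens=('\\pg', '\\p', '\\n')):
    """Remove inline markup tokens from normalized lines, recording each
    one as 'token:line:column' (column measured in the cleaned line)."""
    segment, cleaned = [], []
    for lineno, line in enumerate(lines, start=1):
        out, i = '', 0
        while i < len(line):
            for tok in tokens:                      # check '\pg' before '\p'
                if line.startswith(tok, i):
                    segment.append('%s:%d:%d' % (tok, lineno, len(out)))
                    i += len(tok)
                    break
            else:
                out += line[i]
                i += 1
        cleaned.append(out)
    return cleaned, segment

def restore_markup(cleaned, segment):
    """De-normalize: re-insert each recorded token. Working from the
    bottom right upward keeps earlier insertions from shifting the
    columns of the entries still to be applied."""
    lines = list(cleaned)
    entries = [e.rsplit(':', 2) for e in segment]
    entries.sort(key=lambda e: (int(e[1]), int(e[2])), reverse=True)
    for tok, lineno, col in entries:
        n, c = int(lineno) - 1, int(col)
        lines[n] = lines[n][:c] + tok + lines[n][c:]
    return lines

The validation test mentioned above then amounts to checking that restore_markup(*strip_markup(lines)) returns the original lines unchanged; the hard part, as the message goes on to say, is keeping the recorded positions valid once a merge starts adding, deleting, or resizing lines.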
I think that the "merge" process is the most complex and error-prone component of the total solution, and I don't currently know how it can be done completely reliably, but I do believe that this paradigm can be used to automate a large part of what is now a purely human effort.

From traverso at posso.dm.unipi.it Thu Nov 29 15:54:26 2007
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 30 Nov 2007 00:54:26 +0100 (CET)
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <474F10C8.8050003@novomail.net> (message from Lee Passey on Thu, 29 Nov 2007 12:19:36 -0700)
References: <474F10C8.8050003@novomail.net>
Message-ID: <20071129235426.26FEB93B71@posso.dm.unipi.it>

Why don't you try wdiff? A lot can be done with it (or with mdiff, of which wdiff is a component).

Carlo

From Bowerbird at aol.com Thu Nov 29 18:13:37 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 29 Nov 2007 21:13:37 EST
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
Message-ID: 

carlo said:

> Why don't you try wdiff?

because it's easier for me to write my own program that will
produce better results than i could get out of wdiff?

far _far_ better results, as in not even a little bit close...

but maybe that's because _i_ don't know how to get wdiff
to best do what i specified. if you do, feel free to share it.
i'm sure people other than me will benefit from a tutorial.

but frankly, i'm quite skeptical wdiff can even _do_ the job.
let alone do it well. so go ahead, carlo, prove me wrong...

-bowerbird

p.s. by the way, this is the same mistake you all made at d.p.
with "wordcheck", i.e., having it depend on the aspell checker.
that dependence meant the programmer had to twist himself
into a pretzel, and _still_ ended up giving you inferior results
compared to what he'd have gotten programming that himself.
and i told you this, point blank, in advance. but evidently, this
was the type of info that was "damaging to your community..."
and yeah, maybe "your leaders are making stupid decisions"
_is_ a message that's too radical to let your minions be exposed to...
but, like i said, prove me wrong...

**************************************
Check out AOL's list of 2007's hottest products.
(http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071129/08b8ece0/attachment.htm

From lee at novomail.net Thu Nov 29 21:35:23 2007
From: lee at novomail.net (Lee Passey)
Date: Thu, 29 Nov 2007 22:35:23 -0700
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <20071129235426.26FEB93B71@posso.dm.unipi.it>
References: <474F10C8.8050003@novomail.net> <20071129235426.26FEB93B71@posso.dm.unipi.it>
Message-ID: <474FA11B.9000809@novomail.net>

Carlo Traverso wrote:

> Why don't you try wdiff? A lot can be done with it (or with mdiff, of
> which wdiff is a component).

An interesting suggestion; what did you have in mind?

As you know, wdiff is a front-end to GNU diff which attempts to solve the very problem I mentioned at the beginning of my post: for diffs to be effective, the input files must be normalized. My approach to normalization was to try to force each sentence to begin on a new line, and to wrap sentences thereafter in short segments (approx. 50 characters).
wdiff's approach is to normalize the text by putting each /word/ on a separate line, and then making an attempt to reassemble the results into a usable format.

One of the wrinkles we face is the requirement Bowerbird established that markup must be retained throughout the process (a requirement which I believe is fundamental). I'm afraid I don't see how wdiff can be used while still meeting that requirement. My approach was to record markup separate from the raw text, with pointers back into the text. For this to work (and I'd welcome alternative suggestions), when changes get merged back into the "master" text (a fairly arbitrary selection, probably based on which version has retained the most markup) the pointers will probably need to be adjusted as corrections are made. Thus, I don't see how wdiff could be used to create a patch file (which might be edited by hand before use) which is then used to patch the master, and finally add the markup back in.

On the other hand, maybe the lesson from wdiff is not that the program itself could be used but that the approach could be used. Maybe the normalization process should create a file with "lines of words" which the "de-normalization" process could deal with more effectively. It's definitely something I'll experiment with, but if you have any suggestions as to how wdiff could be integrated into the process, please share them.

Remember, however, the two most fundamental requirements:

1. markup must be retained from beginning to end (although if it is removed in interim steps that's not a big deal), and

2. The process must be mostly automated; what I am trying to achieve is a mostly automated process which may require some slight human intervention, not a mostly manual process that is augmented by some slight machine-assistance.

From marcello at perathoner.de Thu Nov 29 23:00:04 2007
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri, 30 Nov 2007 08:00:04 +0100
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <474FA11B.9000809@novomail.net>
References: <474F10C8.8050003@novomail.net> <20071129235426.26FEB93B71@posso.dm.unipi.it> <474FA11B.9000809@novomail.net>
Message-ID: <474FB4F4.3040607@perathoner.de>

Lee Passey wrote:

Given these two files:

>
> 'Tis the voice of the sluggard;
> I heard him complain,
> "You have waked me too soon,
> I must slumber again."
>

and this:

> 'Tis the voice of the Lobster; I heard him declare,
> 'You have baked me too brown, I must sugar my hair.'

what *exact* results do you expect from the diff?

--
Marcello Perathoner
webmaster at gutenberg.org

From robert_marquardt at gmx.de Fri Nov 30 01:31:40 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Fri, 30 Nov 2007 10:31:40 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
Message-ID: 

Have a look at the tst version here:
http://www.gutenberg.org/wiki/User:Marcello/ImageMapTest
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org

From klofstrom at gmail.com Fri Nov 30 03:22:42 2007
From: klofstrom at gmail.com (Karen Lofstrom)
Date: Fri, 30 Nov 2007 01:22:42 -1000
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: 
References: 
Message-ID: <1e8e65080711300322r35c60b59hed40ac52dc3fb4b5@mail.gmail.com>

On Nov 29, 2007 11:31 PM, Robert Marquardt wrote:

> Have a look at the tst version here

Perfect picture!

Myself, I'd prefer a different font for the numbers -- something serif or ornate. Perhaps a different color? Silver?
I'd put an ornate frame around the picture, and elaborate the lines dividing it into sections. Something more Victorian Christmassy.

Also, I'd like the numbers in the same position in each rectangle (bottom center?), but then I'm a stickler for symmetry. Feel free to ignore me as an outlier, unless others feel the same way.

The concept as a whole, however, is just fine. Thanks so much for working on it.

--
Karen Lofstrom

From johnson.leonard at gmail.com Fri Nov 30 03:36:26 2007
From: johnson.leonard at gmail.com (Leonard Johnson)
Date: Fri, 30 Nov 2007 06:36:26 -0500
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: 
References: 
Message-ID: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com>

On Nov 30, 2007 4:31 AM, Robert Marquardt wrote:

> Have a look at the tst version here:
> http://www.gutenberg.org/wiki/User:Marcello/ImageMapTest
> --
> Robert Marquardt (Team JEDI) http://delphi-jedi.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d

I like it as is.

Is this going to remain on the user wiki? Is there a possibility for a link from the main page?

Len Johnson
--
http://members.cox.net/leaonarddjohnson/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071130/c4c29a94/attachment.htm

From robert_marquardt at gmx.de Fri Nov 30 06:14:56 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Fri, 30 Nov 2007 15:14:56 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <1e8e65080711300322r35c60b59hed40ac52dc3fb4b5@mail.gmail.com>
References: <1e8e65080711300322r35c60b59hed40ac52dc3fb4b5@mail.gmail.com>
Message-ID: <3c60l39oof53ho2b6p2vo20srrdbdjli6i@4ax.com>

On Fri, 30 Nov 2007 01:22:42 -1000, you wrote:

>Myself, I'd prefer a different font for the numbers -- something serif
>or ornate. Perhaps a different color? Silver? I'd put an ornate frame
>around the picture, and elaborate the lines dividing it into sections.
>Something more Victorian Christmassy.
>
>Also, I'd like the numbers in the same position in each rectangle
>(bottom center?), but then I'm a stickler for symmetry. Feel free to
>ignore me as an outlier, unless others feel the same way.

I had to ask for help because i am not able to do any work right now. I got this and accepted it as it is.

Yes, there are many ideas for the designs, but you could work on it for weeks and drown in all those designs.
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org

From robert_marquardt at gmx.de Fri Nov 30 06:22:18 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Fri, 30 Nov 2007 15:22:18 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com>
References: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com>
Message-ID: <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com>

On Fri, 30 Nov 2007 06:36:26 -0500, you wrote:

>I like it as is.
>
>Is this going to remain on the user wiki? Is there a possibility for a link
>from the main page?

Of course. Just like the Christmas Bookshelf we promoted last year (and will promote this year also)
The SF CD promotion will be replaced tomorrow. Next year i think we should promote the Children bookshelves.

The Advent Calendar page will be removed on Dec 25. Next year we can create a new one.
I am not sure if we should do it again next year though.
Better do something new like a Christmas CD.
In two years maybe a calendar again, but with audio books. I will challenge Librivox for that.
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org

From lee at novomail.net Fri Nov 30 09:12:48 2007
From: lee at novomail.net (Lee Passey)
Date: Fri, 30 Nov 2007 10:12:48 -0700
Subject: [gutvol-d] hope you all had a lovely thanksgiving weekend
In-Reply-To: <474FB4F4.3040607@perathoner.de>
References: <474F10C8.8050003@novomail.net> <20071129235426.26FEB93B71@posso.dm.unipi.it> <474FA11B.9000809@novomail.net> <474FB4F4.3040607@perathoner.de>
Message-ID: <47504490.90600@novomail.net>

Marcello Perathoner wrote:

> Lee Passey wrote:
>
>
>
> Given these two files:
>
>>
>> 'Tis the voice of the sluggard;
>> I heard him complain,
>> "You have waked me too soon,
>> I must slumber again."
>>
>>
>
> and this:
>
>> 'Tis the voice of the Lobster; I heard him declare,
>> 'You have baked me too brown, I must sugar my hair.'
>>
>
> what *exact* results do you expect from the diff?
>

An excellent question. First, let me thank you for the example; it has helped me refine my own algorithms to be more precise.

Step one: normalize the two files. Using my current algorithm, creating lines of approx. 50 characters, you get:

[start poem.xml.norm]
'Tis the voice of the sluggard; I heard him complain, "You have
waked me too soon, I must slumber again."
[end poem.xml.norm]

and

[start poem.txt.norm]
'Tis the voice of the Lobster; I heard him declare, 'You have
baked me too brown, I must sugar my hair.'
[end poem.txt.norm]

Step two: compare the two normalized files. The resulting diff file is:

[start poem.diff]
1,2c1,9
< 'Tis the voice of the sluggard; I heard him complain, "You have
< waked me too soon, I must slumber again."
---
> 'Tis the voice of the Lobster; I heard him declare, 'You have
> baked me too brown, I must sugar my hair.'
>
>
>
>
>
>
>
[end poem.diff]

That was the easy part.

Step 3 is more complex: decide which of the two competing versions is the one you want in the result. The portion of the diff file that represents the markup can be discarded at this point. For a completely automated solution, you would want to repeat this process with other versions of the same text, and perhaps using a voting algorithm select the text which the majority of versions consider correct. Other options include considering one text as canonical, or actually having a human edit the diff file so that only desired changes remain. So far, this step is where I have expended the least amount of effort.

Step 4 is the hardest: merging accepted changes from the diff file back into the "master" file. Interestingly, your example is quite easy to merge back in. Assuming that all the changes from the text file are preferable, my current program yields:

[start newpoem.xml]
'Tis the voice of the Lobster; I heard him declare, 'You have
baked me too brown, I must sugar my hair.'
[end newpoem.xml]

What I am discovering is that the "de-normalization" program, which merges the changes and restores the markup, seems to be following the 80/20 rule: 80% of the cases can be solved fairly easily; the remaining 20% of the cases will require 4 times the effort. Actually, it's starting to look more like a 95/5 rule; the 5% of the changes which are anomalous seem to be highly intractable. Mr.
Traverso's suggestion to use word-based normalization may help solve these problems; but in some cases I'm afraid that the only solution may be to embed a milestone in the resulting output and require a human to resolve the discrepancy.

From Bowerbird at aol.com Fri Nov 30 13:38:09 2007
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 30 Nov 2007 16:38:09 EST
Subject: [gutvol-d] hope you had a lovely holiday
Message-ID: 

oh gee, there's some activity in my spam folder.

do i open it up? or leave it be?

it's friday, the weekend!, so i do believe i'll ignore it.
maybe monday i'll look at it. or maybe not.

(if anyone wants to advise me to not even bother,
as it's worthless, those'll be welcome words to my ears.)

meanwhile, i'm still looking forward to carlo's tutorial.

-bowerbird

**************************************
Check out AOL's list of 2007's hottest products.
(http://money.aol.com/special/hot-products-2007?NCID=aoltop00030000000001)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20071130/28ae7ade/attachment.htm

From piggy at netronome.com Fri Nov 30 15:40:25 2007
From: piggy at netronome.com (La Monte H.P. Yarroll)
Date: Fri, 30 Nov 2007 18:40:25 -0500
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com>
References: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com> <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com>
Message-ID: <47509F69.20301@netronome.com>

Robert Marquardt wrote:
> On Fri, 30 Nov 2007 06:36:26 -0500, you wrote:
>
>> I like it as is.
>>
>> Is this going to remain on the user wiki? Is there a possibility for a link
>> from the main page?
>>
>
> Of course. Just like the Christmas Bookshelf we promoted last year (and will promote this year also)
> The SF CD promotion will be replaced tomorrow. Next year i think we should promote the Children bookshelves.
>
> The Advent Calendar page will be removed on Dec 25. Next year we can create a new one.
> I am not sure if we should do it again next year though. Better do something new like a Christmas CD.
> In two years maybe a calendar again, but with audio books. I will challenge Librivox for that.
>

Will the links be enabled separately day by day?

From robert_marquardt at gmx.de Fri Nov 30 21:58:46 2007
From: robert_marquardt at gmx.de (Robert Marquardt)
Date: Sat, 01 Dec 2007 06:58:46 +0100
Subject: [gutvol-d] The Advent Calendar will be up tomorrow
In-Reply-To: <47509F69.20301@netronome.com>
References: <748ba8e50711300336g6066d752h6ba65f375fdeb3ed@mail.gmail.com> <3n60l3devcg49a88g6h5gu6m46gedtt5mb@4ax.com> <47509F69.20301@netronome.com>
Message-ID: <2vt1l3dhg8beqp31guqq8aaltc06198r91@4ax.com>

On Fri, 30 Nov 2007 18:40:25 -0500, you wrote:

>Will the links be enabled separately day by day?

No. Just like a chocolate calendar you should be able to abuse it.
--
Robert Marquardt (Team JEDI) http://delphi-jedi.org