From hart at pglaf.org Thu Jun 1 06:51:06 2006
From: hart at pglaf.org (Michael Hart)
Date: Thu Jun 1 06:51:07 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <486.1ad8b05.31afe6bf@aol.com>
References: <486.1ad8b05.31afe6bf@aol.com>
Message-ID:

On Thu, 1 Jun 2006 Bowerbird@aol.com wrote:

> karl said:
>> Sure, these ASCII files are also useful for special purposes,
>> but telling us again and again that's the best solution
>> for all books and all times, is highly arguable.
>
> to my mind, the only problem with the ascii files is
> the absence of book typography -- bold headings,
> justified lines, bottom-balanced pages, pagination!,
> properly rendered footnotes, all the looks-nice stuff,
> leading to a display that is so boring it becomes tedium.

Some interesting points there, particularly that last one,
as I have had multiple comments from our readers that they
LIKE not having such boringly justified right margination,
as it helps them better keep track of what line is next.

As for the footnotes, I still agree with those who want an
appendix containing all of them, rather than having breaks
between pages contain them. I like this with paper books,
and even more with eBooks, as it is trivial to switch from
the text to the footnote and back again.

I'll leave the pagination and margination issues to reader
choice as their own personal decisions, along with fonts.

Michael

From hart at pglaf.org Thu Jun 1 07:24:10 2006
From: hart at pglaf.org (Michael Hart)
Date: Thu Jun 1 07:24:12 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <44a.28419a8.31af625f@aol.com>
References: <44a.28419a8.31af625f@aol.com>
Message-ID:

On Wed, 31 May 2006 Bowerbird@aol.com wrote:

> sebastien said:
>> Most of the time the original typesetting does not matter much.
>
> different people can disagree on that.

And how!

>> I believe you are missing the point.
>> Michael doesn't care as much about collections of pictures
>> as he does about digitalized text.
>
> different people disagree with michael.

Please stop quoting from what I said about illustrations
when bandwidth was a serious issue. . .not that everyone
has broadband these days. . .or wants pictures; however,
the point was made long ago when making an eBook larger,
often many times larger, would stop people from reading.

Don't forget just how much effort we put into making an
illustrated copy of Alice in Wonderland with the best of
several resolution tests for each illustration, just for
the purpose of making it small enough for more readers.

However, this is all pretty much in the past now for the
people on this list, but we should never forget that the
world at large still may have bandwidth issues, and this
new attention to reading eBooks on cell phones may play
a major role in accentuating this issue.

>> As long as scans and/or OCR technologies are so disappointing,
>> we'll have to rely on higher-level human brains with initiatives
>> such as PGDP or ebooksgratuits.com
>
> or methodologies which are better.

>> Of course having easy access to pictures is useful and
>> much better than nothing and serves you well, but
>> that's not what PG and ebooks are about.
>
> different people can disagree on that too.

>> ebooks are much more than photographs of regular analog books.
>
> yes, but photographs of regular analog books
> _might_ qualify as e-books, for _some_ people.
>
> different people can disagree on that too.

>> 3. is the top we are heading for.
>> 2. is just a step on the way.
>
> but #2 might serve the needs of person x just fine.

>> I did that and got
>> 20845628 bytes for 604 pages.
>
> scans are resource hogs. nobody disagrees about that.
>
> one argument is that since these resources are now plentiful,
> it doesn't matter that scans are resource hogs.
>
> different people can disagree on that too.
>
> as long as we can easily move scan-sets to digitized text,
> i don't see much purpose in continuing to debate these two
> as if they were competitors. they're not. they're complementary.
>
> -bowerbird

From hart at pglaf.org Thu Jun 1 09:09:51 2006
From: hart at pglaf.org (Michael Hart)
Date: Thu Jun 1 09:09:54 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <44a.28419a8.31af625f@aol.com>
References: <44a.28419a8.31af625f@aol.com>
Message-ID:

On Wed, 31 May 2006 Bowerbird@aol.com wrote:

> sebastien said: [snip] see previous message

>> ebooks are much more than photographs of regular analog books.
>
> yes, but photographs of regular analog books
> _might_ qualify as e-books, for _some_ people.
>
> different people can disagree on that too.

>> 3. is the top we are heading for. 2. is just a step on the way.
>
> but #2 might serve the needs of person x just fine.

>> I did that and got
>> 20845628 bytes for 604 pages.
>
> scans are resource hogs. nobody disagrees about that.
>
> one argument is that since these resources are now plentiful,
> it doesn't matter that scans are resource hogs.
>
> different people can disagree on that too.
>
> as long as we can easily move scan-sets to digitized text,
> i don't see much purpose in continuing to debate these two
> as if they were competitors. they're not. they're complementary.
>
> -bowerbird

Several issues worth thinking about here:

File size, bandwidth, storage: important to whom?
Are all scans good for OCR?
Do raw scans qualify as eBooks?

File size, bandwidth, storage: important to whom?

Perhaps the way to think about this is to consider just how
many more or fewer readers we would get if the file sizes
were that much larger or smaller.

In the end, I think we should provide both.

Are all scans good for OCR?

Some operations deliberately do not put their high resolution
scans online for downloading; rather, an automated process
reduces the resolution, so these scans are no longer suitable
for OCRing. Requests for those higher resolution scans seem
to have a very limited success rate.

The odds of being able to create a complete eBook, using those
scans that are usually made available, are perhaps about 1/4
to 1/3, based on the reports you have probably already seen.

Once you go through the effort of scanning missing pages,
rescanning the pages that did not work with your OCR programs,
etc., it often might seem worth the effort simply to scan the
entire book with the higher resolution scans that you can then
post for others to use.

Do raw scans qualify as eBooks?

Obviously those who would prefer to claim a larger number of
eBooks using a smaller amount of effort would prefer to be
able to claim raw scans = eBooks.

As mentioned in the various steps above, scanning, such as it
is, can be nearly completely automated, to the point of cutting
off book bindings, feeding the pages to the scanner in the same
way as copier machines let you feed in stacks of pages, and then
claiming the result of that minimal labor as eBook output in
the catalog.
This is the "quick and dirty approach" and doesn't cost much in terms of time, effort or money and it does provide a reasonably readable output if pages go through smoothly. Apparently they don't always go so smoothly, as many of the books were reported to have missing pages not to mention pages scanned poorly enough to be a problem; the report I recall mentioned some 30% as being acceptable: but these do not take into account some setups intentionally created to be not suitable for OCR. *** I suppose the real question comes down to purposes for making eBooks. Obviously Google, Yahoo, Amazon, and those Library of Congress projects all have different purposes: and it remains to be seen how much of the purposes will be revealed as they each start to move from a single percentage point of their goals to counting a majority of their collection as completed. The various university projects still seem to be a great deal concerned with keep their eBooks out of the hands of the public, as has Google, though the Google philosophy may be in the process of change. Right now it's hard to tell what Google has chosen as their goal; will they really try to do millions of books in the next 54 months after perhaps stats of .1 million in the first 18 months? Will Google change their philosophy per downloading scans, and or downloading their full text searching database? Until Google decides to actually proofread eBooks, I don't think they will want anyone to see what an eBook from Google looks like in full text: simply because it would be too obvious that proofreading, even on a moderate basis, is not part of the plan. However, I _DO_ think that the "second pass" eBook collection, whether done by Google or others, will be good enough, simply due to advanced technology, someone will do it all over again, 10 times better and 10 times faster and 10 times cheaper. However, I don't predict this before 2020. So, there it is in a nutshell, what eBooks will be in the near and distant future, as I see it. Will raw scans ever be the default? No. Why? Because full text will become easier to and people will keep making more and more full text eBooks in contrast to the raw scans. Obviously raw scans will continue to be cheap/easy for another few years, perhaps long enough for the Google, Yahoo, etc., efforts to claim some success in that area, but by the time they could claim any real success we will find that full text is coming along fast enough that the Google efforts would be lost in the shuffle as better full text emerges. My own goal has always been for the public to have their own home eLibraries, just as they have their own home computers. These eLibraries should be an entirely flexible set of products that can be read in virtually any hardware/software combination for the world at large to use. Such libraries are not dependent on particular search engines, or formats or any other particular product. Everyone will be free to keep their own copies of these libraries-- the number of persons owning libraries from now on will rise on the same order as did people owning a book after the invention of Gutenberg's Press. Thanks!!! Give the world eBooks in 2006!!! Michael S. 
Hart
Founder
Project Gutenberg

Blog at http://hart.pglaf.org

From hart at pglaf.org Thu Jun 1 09:16:03 2006
From: hart at pglaf.org (Michael Hart)
Date: Thu Jun 1 09:16:04 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To:
References: <3b0.31c4827.31adca78@aol.com>
Message-ID:

On Wed, 31 May 2006, Karl Eichwalder wrote:

> Michael Hart writes:
>
>> No one seems to think Gallica is really an eBook collection; raw
>> scans seem to be most of what is available, and even those are a
>> set of low-res versions that is not really suitable for OCRing.
>
> OCRing is important, but OCR without the scans nearby is often not
> enough. I think gallica is one of the best e-book collections. Their
> PDF are very useful (you can download complete books as PDFs pretty
> easily and they are readable)! This way I can access the Bulletin
> Monumental.
>
>> I must admit that I am relying on my friends here, as my Français
>> is not really good enough to know if I didn't miss something that
>> would have provided better results on their site.
>
> Sure, you must know the way to create and download PDFs:

Each .pdf file seemed to just hold a .gif file. . .or is there
something else going on there that was missed?

> www.gallica.fr ->
> Recherche ->
> "Mots du titre" - enter the title, for example "Bulletin Monumental"
> In the "Résultat de la recherche" click on "Bulletin Monumental"
> Select the volume you are interested in, for example "1861 (Sér. 2)"
> Now "Télécharger" and "ok" if you are interested in the complete book
>
> Then wait, PDF preparation takes time. Click
> Vous pouvez le télécharger "en cliquant ici." or use the supplied FTP
> address.

And this is supposed to prepare the book as a single .pdf file?

Searchable?

Thanks!!!

Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg

Blog at http://hart.pglaf.org

From Bowerbird at aol.com Thu Jun 1 10:03:04 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Thu Jun 1 10:03:14 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
Message-ID: <388.4a90b23.31b077c8@aol.com>

michael said:
> as I have had multiple comments from our readers that they
> LIKE not having such boringly justified right margination,
> as it helps them better keep track of what line is next.

and the beauty of a good viewer-app is it lets each user decide.

> As for the footnotes, I still agree with those who want an
> appendix containing all of them, rather than having breaks
> between pages contain them.

it doesn't have to be "either/or" with e-books, it can be "both".

> I like this with paper books, and even more with eBooks, as it is
> trivial to switch from the text to the footnote and back again.

i think it's even better to have both displayed at the same time;
no switching required.

> I'll leave the pagination and margination issues to reader
> choice as their own personal decisions, along with fonts.

i agree.

-bowerbird

From Bowerbird at aol.com Thu Jun 1 10:10:36 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Thu Jun 1 10:10:46 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
Message-ID: <249.b88eb40.31b0798c@aol.com>

michael said:
> Please stop quoting from what I said about illustrations
> when bandwidth was a serious issue. . .
> not that everyone has broadband these days. . .or wants pictures;
> however, the point was made long ago when making an eBook larger,
> often many times larger, would stop people from reading.

i too have changed my position on this only recently, when the
penetration of broadband in homes in the u.s. passed over 50%.

but like you, i am still cognizant that not everyone has broadband,
and that those who don't are on the poor side of the digital divide
and thus must be given priority in our thinking. since this has been
a cornerstone of my thinking all along, i'm sure it will continue
to be. i know -- personally -- many people with hand-me-down
machinery, far too many for me to have this vital issue slip from
my radar-screen...

-bowerbird

p.s. i myself just moved to an ibook g4 with o.s.x.
a little over a year ago...

From nwolcott2ster at gmail.com Thu Jun 1 12:42:42 2006
From: nwolcott2ster at gmail.com (Norm Wolcott)
Date: Thu Jun 1 12:49:08 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
References: <3b0.31c4827.31adca78@aol.com>
Message-ID: <005e01c685b4$5b75eba0$650fa8c0@gw98>

The gallica pdf's are mostly very low resolution. Where there are
diagrams they hardly come out at all, especially mathematical ones
with small letters on them. It may be helpful to have a copy of the
book nearby. OCR'ing pdf's is not for the faint-hearted, as they are
not designed for this purpose. However they are good for layout of
the original publications and for copyright use, as the date of
publication is usually given. Also shows the title page often omitted
from other pdf files.

I believe some gallica are available in text format if you push the
"text" button.

nwolcott2@post.harvard.edu

----- Original Message -----
From: "Karl Eichwalder"
To:
Sent: Wednesday, May 31, 2006 4:03 PM
Subject: Re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

[snip] see previous message
From hart at pglaf.org Thu Jun 1 13:32:18 2006
From: hart at pglaf.org (Michael Hart)
Date: Thu Jun 1 13:32:19 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <005e01c685b4$5b75eba0$650fa8c0@gw98>
References: <3b0.31c4827.31adca78@aol.com> <005e01c685b4$5b75eba0$650fa8c0@gw98>
Message-ID:

On Thu, 1 Jun 2006, Norm Wolcott wrote:

> I believe some gallica are available in text format if you push the
> "text" button.

From what my French friends tell me, this is only around 1% of them,
and that sometimes the full text versions disappear after a while.

mh

From traverso at dm.unipi.it Thu Jun 1 13:40:12 2006
From: traverso at dm.unipi.it (Carlo Traverso)
Date: Thu Jun 1 13:39:12 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <005e01c685b4$5b75eba0$650fa8c0@gw98> (nwolcott2ster@gmail.com)
References: <3b0.31c4827.31adca78@aol.com> <005e01c685b4$5b75eba0$650fa8c0@gw98>
Message-ID: <200606012040.k51KeCF16696@pico.dm.unipi.it>

>>>>> "Norm" == Norm Wolcott writes:

    Norm> The gallica pdf's are mostly very low resolution. Where
    Norm> there are diagrams they hardly come out at all, especially
    Norm> mathematical ones with small letters on them. It may be
    Norm> helpful to have a copy of the book nearby. OCR'ing pdf's is
    Norm> not for the faint-hearted, as they are not designed for this
    Norm> purpose. However they are good for layout of the original
    Norm> publications and for copyright use, as the date of
    Norm> publication is usually given. Also shows the title page
    Norm> often omitted from other pdf files.

But why do you download the pdf from gallica? For OCR you should
download the tiff, which is perfectly suited, and does not pose
conversion problems. The gallica pdf is just a wrapper for the tiff
files (compare a gallica pdf with a gallica tiff: the tiff is
integrally contained in the pdf, with some extra wrapper for every
page).

For example FineReader, if you feed it a pdf, passes it through
ghostscript, substantially "printing" the pdf and converting the
resulting bitmap; if you choose the wrong dpi while converting, you
lose resolution. A tiff file, instead, it uses directly (tiff is the
internal image format in FineReader).

The gallica pdf is OK if you want to read (but a multipage tiff viewer
is even better). But not for OCR. You cannot blame gallica if you do
not tick the correct box when you download.

Carlo Traverso

From Bowerbird at aol.com Thu Jun 1 15:11:26 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Thu Jun 1 15:11:36 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
Message-ID: <3eb.30681fd.31b0c00e@aol.com>

michael said:
> Perhaps the way to think about this is to consider
> just how many more or fewer readers we would get if
> the file sizes were that much larger or smaller.

there are something like 100,000 books available at google.
d.p. digitizes about 2,000 books a year. they can't keep up.

> In the end, I think we should provide both.

in the end, users will turn exclusively to "digital reprints"
-- digital text that mimics the scans so accurately that
there's really no good reason to consult the scans at all.
after 10 or 20 years of nobody downloading the scans,
we'll be able to feel comfortable taking them offline...

> Some operations deliberately do not put their high
> resolution scans online for downloading; rather, an
> automated process reduces the resolution, so these
> scans are no longer suitable for OCRing.

yeah, that's sad. but what are you gonna do about it?

> The odds of being able to create a complete eBook,
> using those scans that are usually made available,
> are perhaps about 1/4 to 1/3, based on the reports you
> have probably already seen.

yeah, that's sad too. but that's a quality-control issue
that i suspect the scanning operations will solve soon...

> Once you go through the effort of scanning missing
> pages, rescanning the pages that did not work with
> your OCR programs, etc., it often might seem worth
> the effort simply to scan the entire book with the
> higher resolution scans that you can then post for
> others to use.

i don't think -- for most books -- that will be the case.
but perhaps that's because i don't see much use for
high-resolution scans. i am _not_ in love with scans.
like i said above, they will eventually be left behind.

the important point _today_, though, is that we have
a shitload of scan-sets, more than we can process now,
and it's silly to ignore them when we _could_ offer them
for people to _read_ now, even if they aren't digitized...

> Do raw scans qualify as eBooks?

does it matter? they are what they are. no more, no less.
and almost everyone sees them for exactly what they are.

> This is the "quick and dirty approach" and doesn't
> cost much in terms of time, effort or money

um, scanning does indeed take time, effort, and money,
at least if you're doing it on a scale of millions of books...

> I suppose the real question comes down to
> purposes for making eBooks.

i'm not sure of that. we make e-books for people to read,
and so their text can be searched and easily repurposed...
scans get us part of the way. digital text gets us the rest...

> The various university projects still seem to be a
> great deal concerned with keeping their eBooks out of
> the hands of the public, as has Google, though the
> Google philosophy may be in the process of change.

the michigan librarian pledged that all public-domain books
scanned from their library will be made available to the public.
i assume he meant the scan-sets. but from them, we will soon
be able to automatically get digital text, so there's no difference.

> Right now it's hard to tell what Google has chosen
> as their goal; will they really try to do millions
> of books in the next 54 months after perhaps stats
> of .1 million in the first 18 months?

they most certainly will.

> Will Google change their philosophy per downloading scans,

if we open up negotiations with them, _maybe_. we can hope.

> and/or downloading their full text searching database?

they'll never make their text-database public, as that's the
competitive edge for which they are paying many millions...
do you really think they're gonna hand it over to microsoft?

> Until Google decides to actually proofread eBooks,

if you mean "ensure that their digital text is highly accurate"
-- which can be completely orthogonal to "proofreading" --
then you can be certain that they will "decide" to take that step.
inaccurate text gives bad search results; google won't tolerate that.

> My own goal has always been for the public to have their own
> home eLibraries, just as they have their own home computers.

that's the goal for a lot of us.
-bowerbird

From Bowerbird at aol.com Thu Jun 1 15:21:57 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Thu Jun 1 15:22:03 2006
Subject: [gutvol-d] roger frank, breakout programming star over at
	distributed proofreaders
Message-ID: <441.2a0911d.31b0c285@aol.com>

recently i mentioned 3 programmers over at d.p.

lately, roger frank is doing some excellent work too.
some of it involves the task of creating a project-specific
dictionary, which is a very powerful tool d.p. has mostly
ignored up to now.

another excellent arena roger is working on involves the
flagging of suspicious words. a brief overview is at:
> http://pgdp.rfrank.net/ruby/dp-view.html

text-to-html conversion routines are another thing that
roger has been working on. all of these ideas are _fine_.

-bowerbird

From hart at pglaf.org Fri Jun 2 09:22:37 2006
From: hart at pglaf.org (Michael Hart)
Date: Fri Jun 2 09:22:39 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries (fwd)
Message-ID:

From one of my French friends much more familiar with Gallica.

On Thu, Jun 01, 2006 at 09:16:03AM -0700, Michael Hart wrote:
> Each .pdf file seemed to just hold a .gif file. . .or is there
> something else going on there that was missed?

Gallica just has _pictures_. It is hardly more than a scanning bank.
They can either serve them as PDF or TIFF files. When they did the
work to make a TXT file (which they did for about 1% of their books),
they give that too. (ex: _L'île des pingouins_, both on Gallica as
text and on PG).

> And this is supposed to prepare the book as a single .pdf file?

Yes.

> Searchable?

No, just a bunch of pictures. The document is just like a document
with a picture of a different painting, or a photograph of a
different landscape, on each page.

It's like HTML: HTML can have text (-> searchable) or display a
sequence of pictures (-> non searchable, even if they are pictures
of pages with text). PDF is more confusing because the layout depends
less on the viewer than with HTML (one can define custom
colors/sizes/margins in HTML with CSS and the like; not so with PDF).

Make the experiment with the ZIP file I pointed to. It contains a PDF
file, small & light & searchable. The PDF file produced by Gallica
(take the example given by the other person) is heavy and not
searchable. I did the tests with xpdf but it should be the same with
Acrobat Reader.

To know whether a PDF file is text or a picture, I'm not sure what
to do. Here are hints:
- pdftotext will only work with a text-PDF
- searching, too
- if the letters look dirty, with noise, or the lines are not quite
  horizontal, it is a picture.
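Those hints are easy to automate. What follows is a minimal sketch in
JavaScript (Node), assuming the pdftotext tool mentioned above (it
ships with xpdf, and later with poppler) is installed and on the PATH;
the filename and the 100-character threshold are illustrative guesses,
not part of anyone's actual workflow:

const { execFileSync } = require("child_process");

// "pdftotext file.pdf -" writes any extractable text to stdout.
// A picture-only PDF, like the Gallica ones described above,
// yields essentially nothing.
function looksLikeTextPdf(path) {
  const text = execFileSync("pdftotext", [path, "-"], { encoding: "utf8" });
  // Ignore whitespace and form feeds; the threshold is a rough heuristic.
  return text.replace(/\s/g, "").length > 100;
}

console.log(looksLikeTextPdf("bulletin-monumental.pdf")
  ? "text-PDF (searchable)"
  : "picture-PDF (scans only)");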
From hart at pglaf.org Fri Jun 2 09:52:55 2006
From: hart at pglaf.org (Michael Hart)
Date: Fri Jun 2 09:52:57 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <3eb.30681fd.31b0c00e@aol.com>
References: <3eb.30681fd.31b0c00e@aol.com>
Message-ID:

On Thu, 1 Jun 2006 Bowerbird@aol.com wrote:

> michael said:
>> Perhaps the way to think about this is to consider
>> just how many more or fewer readers we would get if
>> the file sizes were that much larger or smaller.
>
> there are something like 100,000 books available at google.
> d.p. digitizes about 2,000 books a year. they can't keep up.

We work with all possible sources to get eBooks.

>> In the end, I think we should provide both.
>
> in the end, users will turn exclusively to "digital reprints"
> -- digital text that mimics the scans so accurately that
> there's really no good reason to consult the scans at all.

I seem to get plenty of messages from scholarly types who
think source scans will always be in high demand, at the
ivory tower level, at least.

> after 10 or 20 years of nobody downloading the scans,
> we'll be able to feel comfortable taking them offline...

after 10-20 years the actual hardware requirements will
appear so drastically reduced that the load will be nil.

>> Some operations deliberately do not put their high
>> resolution scans online for downloading; rather, an
>> automated process reduces the resolution, so these
>> scans are no longer suitable for OCRing.
>
> yeah, that's sad. but what are you gonna do about it?

Once you provide a better alternative, you force those
who should have done it originally to do it better too.

>> The odds of being able to create a complete eBook,
>> using those scans that are usually made available,
>> are perhaps about 1/4 to 1/3, based on the reports you
>> have probably already seen.
>
> yeah, that's sad too. but that's a quality-control issue
> that i suspect the scanning operations will solve soon...

I was under the impression that much of this low quality
was intentional, so I don't think those will be improving,
at least until someone provides a better mousetrap.

>> Once you go through the effort of scanning missing
>> pages, rescanning the pages that did not work with
>> your OCR programs, etc., it often might seem worth
>> the effort simply to scan the entire book with the
>> higher resolution scans that you can then post for
>> others to use.
>
> i don't think -- for most books -- that will be the case.

All depends on how much effort it is for the particular person
in question. . .if it's a lot of effort to get the materials,
but low effort to do the scanning, you may as well replace the
entire file with your better examples of what should be done.

> but perhaps that's because i don't see much use for
> high-resolution scans. i am _not_ in love with scans.
> like i said above, they will eventually be left behind.

1. Makes for better OCR

2. The scholarly types, as above.

> the important point _today_, though, is that we have
> a load of scan-sets, more than we can process now,
> and it's silly to ignore them when we _could_ offer them
> for people to _read_ now, even if they aren't digitized...

Yes, and we should.

>> Do raw scans qualify as eBooks?
>
> does it matter? they are what they are. no more, no less.
> and almost everyone sees them for exactly what they are.

It matters to the integrity of the eBook world.

>> This is the "quick and dirty approach" and doesn't
>> cost much in terms of time, effort or money
>
> um, scanning does indeed take time, effort, and money,
> at least if you're doing it on a scale of millions of books...

_I_ have no intention of quitting until I can give away a million
books, and I have about the same intention of spending any real
money on it.

It will be interesting to see who can put a million eBooks online
first, and how good they are.

>> I suppose the real question comes down to
>> purposes for making eBooks.
>
> i'm not sure of that. we make e-books for people to read,
> and so their text can be searched and easily repurposed...

This is obviously NOT the goal of many.
> scans get us part of the way. digital text gets us the rest...

Yep. . .scans are just one step, I say it's the easiest.

>> The various university projects still seem to be a
>> great deal concerned with keeping their eBooks out of
>> the hands of the public, as has Google, though the
>> Google philosophy may be in the process of change.
>
> the michigan librarian pledged that all public-domain books
> scanned from their library will be made available to the public.
> i assume he meant the scan-sets. but from them, we will soon
> be able to automatically get digital text, so there's no difference.

I can only hope he meant something more worthwhile to the masses
than what most of the current scan-sets provide and that he will
be able to find some way to keep the ball rolling.

>> Right now it's hard to tell what Google has chosen
>> as their goal; will they really try to do millions
>> of books in the next 54 months after perhaps stats
>> of .1 million in the first 18 months?
>
> they most certainly will.

We'll see, and I am taking bets.

>> Will Google change their philosophy per downloading scans,
>
> if we open up negotiations with them, _maybe_. we can hope.

What is it that St. Augustine was quoted as saying? A bit like:

"Work as though everything depends on you,
Pray as though everything depends on God."

I think we should work as though it all depends on us,
and hope that Google will get somewhere.

>> and/or downloading their full text searching database?
>
> they'll never make their text-database public, as that's the
> competitive edge for which they are paying many millions...

They claim all those millions are spent on scanning, not OCR.

> do you really think they're gonna hand it over to microsoft?

Or to the world at large?

>> Until Google decides to actually proofread eBooks,
>
> if you mean "ensure that their digital text is highly accurate"
> -- which can be completely orthogonal to "proofreading" --
> then you can be certain that they will "decide" to take that step.
> inaccurate text gives bad search results; google won't tolerate that.

Actually, you have it backwards there. . .think about it. . . .

Google's monster speciality is SEARCH ENGINES!!!

They are MUCH more interested in writing a search engine that will
read fuzzy OCR text than in increasing the accuracy of the text.

>> My own goal has always been for the public to have their own
>> home eLibraries, just as they have their own home computers.
>
> that's the goal for a lot of us.

!!!

> -bowerbird

Thanks!!!

Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg

Blog at http://hart.pglaf.org

From marcello at perathoner.de Fri Jun 2 10:29:01 2006
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri Jun 2 10:29:05 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To:
References: <3eb.30681fd.31b0c00e@aol.com>
Message-ID: <4480755D.3090301@perathoner.de>

Michael Hart wrote:

> Google's monster speciality is SEARCH ENGINES!!!
>
> They are MUCH more interested in writing a search engine that will
> read fuzzy OCR text than in increasing the accuracy of the text.

You mean a search engine that finds "I)arwin" when I search for
"Darwin"?

That search engine would have to automagically decide that "I)" looks
quite a bit the same as "D". But that's the same thing an OCR software
already does: match characters against ink stains. If they come up
with some better algorithm to do that, they would be foolish not to
use it directly on the scanned texts.
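Marcello's "I)arwin" example is easy to make concrete. The sketch
below expands a query over a hand-made table of look-alike glyph
confusions; both the table and the approach are illustrative only,
not anything Google is known to use:

// Common OCR confusions: the left form is the intended glyph,
// the right form is what the OCR often produces instead.
const confusions = [
  ["D", "I)"], ["m", "rn"], ["h", "li"], ["l", "1"], ["O", "0"],
];

// Build a pattern matching the term itself plus each single-confusion
// variant, so a search for "Darwin" also hits "I)arwin".
function fuzzyPattern(term) {
  const escape = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const alternatives = [escape(term)];
  for (const [good, bad] of confusions) {
    if (term.includes(good)) {
      alternatives.push(escape(term.split(good).join(bad)));
    }
  }
  return new RegExp(alternatives.join("|"), "g");
}

const page = "On the Origin of Species, by Charles I)arwin.";
console.log(page.match(fuzzyPattern("Darwin"))); // [ "I)arwin" ]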
Somewhere they have to keep the OCRed text of their books. It would
take many fewer cycles to clean up the text (once) instead of having
the search engine do a fuzzy match every time a user does a search.

--
Marcello Perathoner
webmaster@gutenberg.org

From hart at pglaf.org Fri Jun 2 10:36:35 2006
From: hart at pglaf.org (Michael Hart)
Date: Fri Jun 2 10:36:36 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <4480755D.3090301@perathoner.de>
References: <3eb.30681fd.31b0c00e@aol.com> <4480755D.3090301@perathoner.de>
Message-ID:

On Fri, 2 Jun 2006, Marcello Perathoner wrote:

> Michael Hart wrote:
>
>> Google's monster speciality is SEARCH ENGINES!!!
>>
>> They are MUCH more interested in writing a search engine that will
>> read fuzzy OCR text than in increasing the accuracy of the text.
>
> You mean a search engine that finds "I)arwin" when I search for
> "Darwin"?
>
> That search engine would have to automagically decide that "I)" looks
> quite a bit the same as "D".

Someone posted a number of such examples they found a while back,
and it appeared as if that was the general idea.

> But that's the same thing an OCR software already does: match
> characters against ink stains. If they come up with some better
> algorithm to do that, they would be foolish not to use it directly on
> the scanned texts.

I think they will probably wait several iterations of improvement
before it becomes obvious to them that they should improve the text.

> Somewhere they have to keep the OCRed text of their books. It would
> take many fewer cycles to clean up the text (once) instead of having
> the search engine do a fuzzy match every time a user does a search.

They probably have enough computing power not to be worried about
that, but perhaps eventually they will have a large enough collection
for the thought to come.

Michael

From Bowerbird at aol.com Fri Jun 2 11:31:55 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Fri Jun 2 11:32:16 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
Message-ID: <3fb.31210cc.31b1de1b@aol.com>

michael said:
> We work with all possible sources to get eBooks.

all of your sources, combined, won't be able to keep up.
not until the digitization becomes nearly-automatic.
which, as i've said, is not that far down the line anyway.

> I seem to get plenty of messages from scholarly types
> who think source scans will always be in high demand,
> at the ivory tower level, at least.

i'm not sure they know what they want.
in fact, i'm almost sure that they don't...

> after 10-20 years the actual hardware requirements will
> appear so drastically reduced that the load will be nil.

it won't be the resources required (or not required) that
makes us take down the scans, it will be the lack of demand.
digital reprints will do the same job, better, with fewer resources.

> Once you provide a better alternative, you force those
> who should have done it originally to do it better too.

if you can scare up the $250-million budget, please do.
as i've said elsewhere, that's what we spend in _two_days_
on the war in iraq, so you wouldn't _think_ it's all that hard
to find the same amount for such a culturally important task.
but i don't see anyone except google stepping up to the plate.

> I was under the impression that much of this low quality
> was intentional, so I don't think those will be improving,
> at least until someone provides a better mousetrap.

that's exactly why i asked "what can we do about it?"
> All depends on how much effort it is for the particular person
> in question. . .if it's a lot of effort to get the materials,
> but low effort to do the scanning, you may as well replace the
> entire file with your better examples of what should be done.

i agree. and in the cases where we can't use google's scans,
that's what we'll have to do. let's just hope, though, that that
won't be the case for the bulk of those 10-million unique titles.

> 1. Makes for better OCR

i'm rooting for better o.c.r. i sincerely hope that it happens,
and i suspect the abbyy folks still have tricks up their sleeves.
and, just to remind everybody here, they have _already_ made
a version of their software that's specially-adapted for old books,
a version that nobody here, to my knowledge, has even _tried_,
so y'all will need to do some convincing in order to convince me
that you're really as concerned with the o.c.r. thing as you claim.

but as for me, i'm not counting on the o.c.r. much at all;
i'll take what is currently available in regard to o.c.r. tech.
my aim is to jack up the post-o.c.r. correction routines,
using a wide array of automagic.

> 2. The scholarly types, as above.

let 'em use their scholar dollars to create whatever they need.
i can't be bothered with their esotericism. i just love the books.

> Yes, and we should.

well, i'm glad i finally got _that_ tooth pulled! ;+)

> It matters to the integrity of the eBook world.

my integrity does not turn on semantics.

> _I_ have no intention of quitting
> until I can give away a million books,

that's what i love about you, big boy, your dedication.

> It will be interesting to see who can put a million eBooks
> online first, and how good they are.

i agree.

> Yep. . .scans are just one step, I say it's the easiest.

depends on how many you do.

> I can only hope he meant something more worthwhile
> to the masses than what most of the current scan-sets provide
> and that he will be able to find some way to keep the ball rolling.

i'll be happy to help him out, just like i'm happy to help you out.

> We'll see, and I am taking bets.

pizza. loser buys in the winner's city. you can fly out to
santa monica and spend some time with me sometime when
it's wintry cold there in illinois.

> I think we should work as though it all depends on us,
> and hope that Google will get somewhere.

you're not the best person to do that negotiation anyway.

> They claim all those millions are spent on scanning, not OCR.

it is. and that's why it will take a _negotiation_ to get them to
release the public-domain scans. they won't do it "just because".
but i think there _are_ some things we can offer in negotiation.

one would be the quality-control that we're willing to do for 'em.
although i think they'll realize soon they need to do this themselves,
at the time when they've still got the book right there by the scanner,
it never hurts to have another entity take a look at your work later...

another would be an offer to serve as their "reading room", which
would mean we'd dish the pages to people for reading, so google
could instead concentrate completely on being "the search engine".
(this might mean we'd have to agree not to furnish our own search
capability, but as long as their engine is nicely integrated into our
presentation regime, i don't think that would be a problem at all;
many websites use google as their search-engine even at present.)
and perhaps most importantly, what we could offer is huge help
in the form of friend-of-the-court briefs that would be supportive
of google's scanning project in facing their various legal challenges.
public opinion will be very important when this comes to judgment,
and a good-faith effort like turning their public-domain scans loose
could go a _long_ way in drumming up public support for their work.
on the other hand, a selfish attitude on google's part would make 'em
look bad, and that appearance could be quite devastating to their case.

i assume all of these points are reasonably apparent to google already,
so the "negotiation" wouldn't have to be antagonistic in nature. indeed,
it might be very short and very sweet, and we could find ourselves with
100,000 scan-sets on our machines before we knew it. that possibility
sounds too good to me to pass up without giving it serious consideration.

> Actually, you have it backwards there. . .think about it. . . .
> Google's monster speciality is SEARCH ENGINES!!!
> They are MUCH more interested in writing a search engine that will
> read fuzzy OCR text than in increasing the accuracy of the text.

if you can search fuzzy text, you can correct fuzzy text. that's the point.
if google lets its text remain fuzzy, it will be because they _decided_to_.
and there are a couple reasons why they might well decide to do that,
but i'd rather not take a chance of making them real by discussing them.
still, as i've said, i myself will show the world how to correct fuzzy text,
within 5 years, assuming that abbyy hasn't already solved the problem.

-bowerbird

From ke at gnu.franken.de Fri Jun 2 11:50:26 2006
From: ke at gnu.franken.de (Karl Eichwalder)
Date: Fri Jun 2 12:38:18 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries (fwd)
In-Reply-To: (Michael Hart's message of "Fri, 2 Jun 2006 09:22:37 -0700 (PDT)")
References:
Message-ID:

Michael Hart writes:

> Gallica just has _pictures_.

Gallica _has_ pictures and that's very nice.

>> Searchable?
>
> No, just a bunch of pictures.

Searching isn't the only thing that matters. Think about children's
books where pictures are very important. The same is valid for books
about architecture, etc.

As I said earlier, we need both sides of the coin--the pictures and
the text, or the text and the pictures (= scans). Not necessarily
within the same file (PDF, Djvu, or .tar.bz2), but catalogued or
archived in a way that it is possible to download the wanted files
easily.

--
http://www.gnu.franken.de/ke/ | ,__o
                              | _-\_<,
                              | (*)/'(*)
Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C

From maitri.vr at gmail.com Fri Jun 2 10:19:06 2006
From: maitri.vr at gmail.com (maitri venkat-ramani)
Date: Sat Jun 3 08:04:12 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To:
References: <3eb.30681fd.31b0c00e@aol.com>
Message-ID: <6ebf94650606021019o20137648s60df33e87b8ebd67@mail.gmail.com>

Page scans are not eBooks. They are not universally searchable,
readable and editable, and have full ability to become proprietary,
i.e. owned or copy-protected.
Maitri

On 6/2/06, Michael Hart wrote:

[snip] see previous message
From Bowerbird at aol.com Sat Jun 3 10:17:07 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Sat Jun 3 10:17:14 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
Message-ID: <490.1d7f4ba.31b31e13@aol.com>

maitri said:
> Page scans are not eBooks. They are not universally searchable,
> readable and editable, and have full ability to become proprietary,
> i.e. owned or copy-protected.

thanks for joining the thread. where ya been?

-bowerbird

From prosfilaes at gmail.com Sat Jun 3 10:45:28 2006
From: prosfilaes at gmail.com (David Starner)
Date: Sat Jun 3 10:52:36 2006
Subject: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
In-Reply-To: <6ebf94650606021019o20137648s60df33e87b8ebd67@mail.gmail.com>
References: <3eb.30681fd.31b0c00e@aol.com>
	<6ebf94650606021019o20137648s60df33e87b8ebd67@mail.gmail.com>
Message-ID: <6d99d1fd0606031045q69c419f6x90af037faf55cadc@mail.gmail.com>

On 6/2/06, maitri venkat-ramani wrote:
> Page scans are not eBooks. They are not universally searchable,
> readable and editable, and have full ability to become proprietary,
> i.e. owned or copy-protected.

Everything can be taken proprietary; in fact, most proprietary eBook
formats are text based. It's far easier to lock up text than it is to
lock up images; if you display images, at the least you can grab them
with a digital camera.

From cannona at fireantproductions.com Sun Jun 4 10:44:57 2006
From: cannona at fireantproductions.com (Aaron Cannon)
Date: Sun Jun 4 11:04:40 2006
Subject: [gutvol-d] any one know java script?
Message-ID: <7.0.1.0.0.20060604124119.01a19e00@fireantproductions.com>

Hi all. I'm looking for someone who knows JavaScript to help out
with creating a script that will select a specific range of check
boxes. I have a script that will check all check boxes, but I need
one that will, if for example I check the first and then the 19th
check box, check boxes 2-18 as well. This is for the PG CD/DVD
request system.

If anyone knows how to do something like this, or if my above
description wasn't clear, please let me know.

Thanks!

Sincerely
Aaron Cannon

--
E-mail: cannona@fireantproductions.com
Skype: cannona
MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.)

From tb at baechler.net Mon Jun 5 00:26:45 2006
From: tb at baechler.net (Tony Baechler)
Date: Mon Jun 5 00:24:12 2006
Subject: [gutvol-d] any one know java script?
In-Reply-To: <7.0.1.0.0.20060604124119.01a19e00@fireantproductions.com>
References: <7.0.1.0.0.20060604124119.01a19e00@fireantproductions.com>
Message-ID: <7.0.1.0.2.20060605002200.02c814b0@baechler.net>

Hi,

On the client side, you can get the GreaseMonkey extension for
Firefox that will do what you want. You still need to know
javascript but it has a lot of sample scripts. On the server side,
I don't know. You would probably have to embed it in the html page.
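What Aaron describes is a shift-click-style range select, and it takes
only a few lines of client-side JavaScript. Here is a minimal sketch;
the onclick hook shown in the closing comment is an assumption about
the form's markup, since the actual PG CD/DVD request page may be
wired differently:

var lastChecked = null;

// Collect every checkbox on the page, in document order.
function allBoxes() {
  var inputs = document.getElementsByTagName("input");
  var boxes = [];
  for (var i = 0; i < inputs.length; i++) {
    if (inputs[i].type === "checkbox") boxes.push(inputs[i]);
  }
  return boxes;
}

// Called from each checkbox; checks everything between the
// previously clicked box and this one.
function rangeCheck(box) {
  var boxes = allBoxes();
  if (lastChecked !== null && box.checked) {
    var a = -1, b = -1;
    for (var i = 0; i < boxes.length; i++) {
      if (boxes[i] === lastChecked) a = i;
      if (boxes[i] === box) b = i;
    }
    if (a >= 0 && b >= 0) {
      for (var j = Math.min(a, b); j <= Math.max(a, b); j++) {
        boxes[j].checked = true;
      }
    }
  }
  lastChecked = box;
}

// Assumed hook in the form's HTML (hypothetical markup):
//   <input type="checkbox" name="cd19" onclick="rangeCheck(this)">

Checking the first box just records it; checking the 19th then fills
in 2-18, which matches the behavior Aaron asked for. A GreaseMonkey
user script, per Tony's suggestion, could attach the same handler
without touching the server-side page.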
For the blind and others prevented from reading by a print
disability, there is a site called http://www.bookshare.org/ . This
site has a $50 annual fee and a $25 set up fee but has many books on
Java, programming, javascript, and other technology related items. I
highly recommend it. Unfortunately it is only open to US citizens.
They have all of the O'Reilly books. O'Reilly Media publishes only
computer and tech books. My point is that I've seen a couple
javascript books there and one of them would probably do what you
want. If you already have many scanned books that you've scanned for
yourself and include the title, author and copyright pages, you can
upload them and get a free or reduced subscription cost.

From cannona at fireantproductions.com Mon Jun 5 08:03:05 2006
From: cannona at fireantproductions.com (Aaron Cannon)
Date: Mon Jun 5 08:27:25 2006
Subject: [gutvol-d] any one know java script?
In-Reply-To: <7.0.1.0.2.20060605002200.02c814b0@baechler.net>
References: <7.0.1.0.0.20060604124119.01a19e00@fireantproductions.com>
	<7.0.1.0.2.20060605002200.02c814b0@baechler.net>
Message-ID: <7.0.1.0.0.20060605100004.01dc07a0@fireantproductions.com>

Yeah, figuring it out on my own is the last-case option. :) I'll
probably turn to bookshare if I don't find someone who knows how to
do it. I'll have to renew though, as my subscription has lapsed. No
big deal though, I was meaning to anyway.

Thanks for the info.

Sincerely
Aaron Cannon

At 02:26 AM 6/5/2006, you wrote:

[snip] see previous message

--
E-mail: cannona@fireantproductions.com
Skype: cannona
MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.)

From mlockey at magma.ca Mon Jun 5 11:52:54 2006
From: mlockey at magma.ca (Michael Lockey)
Date: Mon Jun 5 11:53:03 2006
Subject: [gutvol-d] DP-Canada progress report
In-Reply-To: <20060528190002.AE8348CB9B@pglaf.org>
Message-ID: <200606051852.k55IqtwP008277@mail3.magma.ca>

We are experiencing some delays while awaiting the DP-EU source.
They're heavily overworked there, but not a problem...
The forums are up at dp-can.cybernetik.ca/phpbb2, so please feel free to log on and make suggestions as the site becomes more accessible. Content providers may start entering through userid dpscans with password of image$. The files go into the directory \inetpub\dp-uploads. Users can create their own subdirectories, e.g. "\vasa", and thereunder put their projects, e.g. "\vasa\FamousParrotsIhaveKnown". (I might note that I'm not ready to put this up with a proper web address until it's a lot more stable than it is: the situation is only temporary, of course.) Michael Lockey (Hoping to surprize and amuse you once we are flying; many thanks to Don Kretz for all his work.) From traverso at dm.unipi.it Mon Jun 5 12:06:22 2006 From: traverso at dm.unipi.it (Carlo Traverso) Date: Mon Jun 5 12:04:47 2006 Subject: [gutvol-d] DP-Canada progress report In-Reply-To: <200606051852.k55IqtwP008277@mail3.magma.ca> (mlockey@magma.ca) References: <200606051852.k55IqtwP008277@mail3.magma.ca> Message-ID: <200606051906.k55J6M820121@pico.dm.unipi.it> >>>>> "Michael" == Michael Lockey writes: Michael> We are experiencing some delays while awaiting the DP-EU Michael> source. They're heavily overworked there, but not a Michael> problem... Michael> The forums are up at dp-can.cybernetik.ca/phpbb2, so Michael> please feel free to log on and make suggestions as the Michael> site becomes more accessible. Michael> Content providers may start entering through userid Michael> dpscans with password of image$. The files go into the Michael> directory \inetpub\dp-uploads. Users can create their own Michael> subdirectories, e.g. "\vasa", and thereunder put their Michael> projects, e.g. "\vasa\FamousParrotsIhaveKnown". An important piece of info missing: ISO-8859-1 or UTF-8? (a second piece of info: are zip files allowed?) Carlo From squadette at gmail.com Mon Jun 5 12:21:38 2006 From: squadette at gmail.com (Alexey Mahotkin) Date: Tue Jun 6 08:03:38 2006 Subject: [gutvol-d] typo in Leonardo Da Vinci Notebooks Message-ID: hello, a rather common OCR-style typo: http://www.gutenberg.org/dirs/etext04/8ldvc10.txt "zvith", should be "with". Thank you for all your work, --alexm From sly at victoria.tc.ca Tue Jun 6 08:22:21 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Tue Jun 6 08:22:25 2006 Subject: [gutvol-d] typo in Leonardo Da Vinci Notebooks In-Reply-To: References: Message-ID: Greetings Alexey... Thanks for mentioning this. However, gutvol-d is a general discussion mailing list; It is entirely possible that this error will never be dealt with if you mention it here. I've forwarded it to our "errata" email address. Also, see the faq at: http://www.gutenberg.org/faq/R-26 Thanks, Andrew On Mon, 5 Jun 2006, Alexey Mahotkin wrote: > hello, > > a rather common OCR-style typo: > > http://www.gutenberg.org/dirs/etext04/8ldvc10.txt > > "zvith", should be "with". > > > Thank you for all your work, > > --alexm > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From mlockey at magma.ca Tue Jun 6 16:39:45 2006 From: mlockey at magma.ca (Michael Lockey) Date: Tue Jun 6 16:39:57 2006 Subject: [gutvol-d] DP-Canada progress report In-Reply-To: <200606051906.k55J6M820121@pico.dm.unipi.it> Message-ID: <200606062339.k56Ndjuv025949@mail3.magma.ca> >An important piece of info missing: ISO-8859-1 or UTF-8? >(a second piece of info: are zip files allowed?) 
>Carlo That's, of course, dependent on DP-EU's help (for which we are NOT pushing: they're busy, we're happy for what we can get!) Zip files are allowed. Cheers, Michael From marcello at perathoner.de Tue Jun 6 16:52:02 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Tue Jun 6 16:52:06 2006 Subject: [gutvol-d] All people with accounts on ibiblio! PG site moving to wiki Message-ID: <44861522.2050905@perathoner.de> I'm moving the static part of the PG site to a wiki. This will allow more people to participate in the site maintenance and improvement. Everybody who currently has shell access on ibiblio should stop editing the html pages and transfer their content to the wiki instead. The wiki will soon replace most of the PG site, except for the online catalog and a few other pages. The wiki can be reached at: http://www.gutenberg.org/wiki/ 1. Currently all new users have to be added by a sysop (me). If you speak wiki and want an account, mail me your username and initial password. You will be added to the "gutenberg" group. 2. The wiki has a 'private' section. All pages starting with "Gutenberg:" are editable by the "gutenberg" group only. This will be the 'official' PG site. 3. The rest of the wiki works just like wikis are supposed to work. You may put it to any use you like that helps "producing and distributing ebooks". -- Marcello Perathoner webmaster@gutenberg.org From ajhaines at shaw.ca Wed Jun 7 09:12:30 2006 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Wed Jun 7 09:13:38 2006 Subject: [gutvol-d] Dagger/sword symbol, Scandinavian countries Message-ID: <000501c68a4d$2e067d90$6401a8c0@ahainesp2400> A couple of questions, just to satisfy my curiosity: - A book I'm working on has a small dagger or sword symbol (point down, handle up) next to some dates. It looks something like the "dagger" symbol in Windows' Arial font, Unicode U2020. Two examples, with an exclamation mark substituting for the symbol, are "Occam ! c. 1349" and "Colet ! 1519". What does this symbol mean? - On several books' copyright page, I've seen the statement "all rights reserved, including that of translation into foreign languages, including the Scandinavian." Why are Scandinavian languages specially noted like this? Al From dixonm at pobox.com Wed Jun 7 09:53:26 2006 From: dixonm at pobox.com (Meredith Dixon) Date: Wed Jun 7 10:04:44 2006 Subject: [gutvol-d] Dagger/sword symbol, Scandinavian countries In-Reply-To: <000501c68a4d$2e067d90$6401a8c0@ahainesp2400> References: <000501c68a4d$2e067d90$6401a8c0@ahainesp2400> Message-ID: <44870486.2070009@pobox.com> Al Haines (shaw) wrote: > - A book I'm working on has a small dagger or sword symbol (point down, > handle up) next to some dates....What does this symbol mean? Date of death. From hyphen at hyphenologist.co.uk Wed Jun 7 11:24:12 2006 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Wed Jun 7 11:24:23 2006 Subject: [gutvol-d] Dagger/sword symbol, Scandinavian countries In-Reply-To: <44870486.2070009@pobox.com> References: <000501c68a4d$2e067d90$6401a8c0@ahainesp2400> <44870486.2070009@pobox.com> Message-ID: On Wed, 07 Jun 2006 12:53:26 -0400, Meredith Dixon wrote: |Al Haines (shaw) wrote: | |> - A book I'm working on has a small dagger or sword symbol (point down, |> handle up) next to some dates....What does this symbol mean? | |Date of death. ROTFLMAO They murder all authors in Scandinavia ;-) -- Dave Fawthrop "Intelligent Design?" my knees say *not*. "Intelligent Design?" my back says *not*. More like "Incompetent design". 
Sig (C) Copyright Public Domain From sly at victoria.tc.ca Wed Jun 7 18:19:00 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Wed Jun 7 18:19:37 2006 Subject: [gutvol-d] Dagger/sword symbol, Scandinavian countries In-Reply-To: <000501c68a4d$2e067d90$6401a8c0@ahainesp2400> References: <000501c68a4d$2e067d90$6401a8c0@ahainesp2400> Message-ID: On Wed, 7 Jun 2006, Al Haines (shaw) wrote: > A couple of questions, just to satisfy my curiosity: > > - A book I'm working on has a small dagger or sword symbol (point down, > handle up) next to some dates. It looks something like the "dagger" symbol > in Windows' Arial font, Unicode U2020. Two examples, with an exclamation > mark substituting for the symbol, are "Occam ! c. 1349" and "Colet ! 1519". > What does this symbol mean? Yes, as already mentioned, the dagger symbol is used to indicate date of death. You might also occasionally see it used as a footnote marker. And yes, U+2020 is the correct code point for this character. > - On several books' copyright page, I've seen the statement "all rights > reserved, including that of translation into foreign languages, including > the Scandinavian." Why are Scandinavian languages specially noted like > this? > I only have guesses here. It could be something to do with the state of international laws at the time. Or perhaps there had been a significant number of unauthorized Scandinavian translations. I do know that many of the Finnish texts in PG are translations of a surprisingly broad spectrum of works from other languages. For example, Beaumarchais' "Marriage of Figaro", Edward Bellamy's "Looking Backward, 2000 to 1887", Dante's "Divine Comedy", Dickens' "David Copperfield", and works by Epictetus, Gustave Flaubert, Goethe, Henrik Ibsen, Moliere, Nietzsche, Shakespeare, Sir Walter Scott, Tolstoy, Harriet Beecher Stowe, and Jules Verne. Andrew From hyphen at hyphenologist.co.uk Thu Jun 8 02:40:15 2006 From: hyphen at hyphenologist.co.uk (Dave Fawthrop) Date: Thu Jun 8 02:40:28 2006 Subject: [gutvol-d] URL for a single author in PG catalogue? Message-ID: I am trying to beg a link to the PG catalogue for John Hartley's books from my local Library. He was a Halifax poet and they specialise in paper copies of his works. Is there a way of giving them a link to the PG catalogue which will go straight to John Hartley's books? This must be quite a common problem, could someone consider including it in the PG FAQ? -- Dave Fawthrop "Intelligent Design?" my knees say *not*. "Intelligent Design?" my back says *not*. More like "Incompetent design". Sig (C) Copyright Public Domain From sly at victoria.tc.ca Thu Jun 8 02:48:41 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu Jun 8 02:48:46 2006 Subject: [gutvol-d] URL for a single author in PG catalogue? In-Reply-To: References: Message-ID: According to the explanation at http://www.gutenberg.org/howto-link the best form to use is: http://www.gutenberg.org/author/John_Hartley This has been implemented for quite some time now. For an example of it in use, see the Gutenberg link from: http://en.wikipedia.org/wiki/John_Hartley_%28poet%29 Andrew On Thu, 8 Jun 2006, Dave Fawthrop wrote: > I am trying to beg a link to the PG catalogue for John Hartley's books from > my local Library. He was a Halifax poet and they specialise in paper > copies of his works. > > Is there a way of giving them a link to the PG catalogue which will go > straight to John Hartley's books? > > This must be quite a common problem, could someone consider including it in > the PG FAQ?
> > From marcello at perathoner.de Thu Jun 8 10:40:03 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu Jun 8 10:40:07 2006 Subject: [gutvol-d] Works of Bertolt Brecht Message-ID: <448860F3.4020506@perathoner.de> Famous German playwright Bertolt Brecht died 14 Aug. 1956. His works will therefore be in the public domain in life+50 countries by January. I'm a big fan of Brecht so I'm willing to put quite a lot of work into an electronic edition of his works. I have an edition of his "Gesammelte Werke" (collected works, 20 volumes) Copyright Suhrkamp Verlag Frankfurt am Main 1967 that I'm willing to sacrifice to the good cause. Is this edition eligible for processing by DP-EU or any other DP? -- Marcello Perathoner webmaster@gutenberg.org From fvandrog at scripps.edu Thu Jun 8 11:20:13 2006 From: fvandrog at scripps.edu (Frank van Drogen) Date: Thu Jun 8 11:20:13 2006 Subject: [gutvol-d] Works of Bertolt Brecht In-Reply-To: <448860F3.4020506@perathoner.de> References: <448860F3.4020506@perathoner.de> Message-ID: <7.0.1.0.0.20060608111844.01d02908@scripps.edu> At 10:40 AM 6/8/2006, you wrote: >Famous German playwright Bertolt Brecht died 14 Aug. 1956. His works >will therefore be in the public domain in life+50 countries by January. > >I'm a big fan of Brecht so I'm willing to put quite a lot of work into >an electronic edition of his works. > >I have an edition of his "Gesammelte Werke" (collected works, 20 >volumes) Copyright Suhrkamp Verlag Frankfurt am Main 1967 that I'm >willing to sacrifice to the good cause. Is this edition eligible for >processing by DP-EU or any other DP? DP-EU would be perfectly happy to process his works; and the Gesammelte Werke should be fine from copyright perspectives. Frank From Bowerbird at aol.com Thu Jun 8 12:06:24 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Jun 8 12:06:41 2006 Subject: [gutvol-d] Dagger/sword symbol, Scandinavian countries Message-ID: <37c.44730fd.31b9cf30@aol.com> andrew said: > Edward Bellamy's "Looking Backward, 2000 to 1887" gosh i enjoyed that book as a youngster... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060608/ecde0406/attachment.html From hart at pglaf.org Thu Jun 8 13:13:03 2006 From: hart at pglaf.org (Michael Hart) Date: Thu Jun 8 13:13:05 2006 Subject: [gutvol-d] !@! 4 Weeks: The Big Push, Well Not So Big This Time Message-ID: As most of you are aware, it is 4 weeks until we complete our 35th year of Project Gutenberg history, and we have about 380 eBooks left to make it to 20,000. This would be about 95 per week. . .we did 82 this week. So it's not such a Big Push as we did to get to 10,000, but a rather smaller push, which is why you haven't heard me say an awful lot about it. . .things are working out to be a much closer match to reaching 20,000 on our 35th anniversary than anyone, myself included, would likely have predicted. However, especially since I am planning on taking a week off right at July 4th, when I am best man at my best friend's wedding, I am trying to get as much as possible done before I leave, as soon as I can after sending out the Newsletter a week before. I am working on the July 5th Newsletter, and will have it out in a fairly complete manner half a day after the previous one goes out, and am hoping that some of our volunteers will have the wherewithal to update it and send it out July 5th with an entirely up to date revision, that hopefully will hit 20,000.
If you have any books that are near completion, but would not be totally through all the various processes, we can put them in the "PrePrints" section now, where perhaps a few people in the next few weeks can help with them. More later, I'm just trying to make it one day at a time right now. . . . Thanks!!! Give the world eBooks in 2006!!! Michael S. Hart Founder Project Gutenberg Blog at http://hart.pglaf.org From cannona at fireantproductions.com Fri Jun 9 07:58:11 2006 From: cannona at fireantproductions.com (Aaron Cannon) Date: Fri Jun 9 08:00:03 2006 Subject: [gutvol-d] All people with accounts on ibiblio! PG site moving to wiki In-Reply-To: <44861522.2050905@perathoner.de> References: <44861522.2050905@perathoner.de> Message-ID: <7.0.1.0.0.20060609095424.0190f0a0@fireantproductions.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello Marcello and all. The idea of a wiki is a great one! The only problem I have with it is that it is really slow. How are you planning on dealing with that? Perhaps a daily static dump of the Gutenberg Namespace could be made and visitors could be referred to that. Updates wouldn't show up immediately, but it might not matter in most cases. Again, thanks for setting this up! It will make updating pages much easier. Sincerely Aaron Cannon At 06:52 PM 6/6/2006, you wrote: >I'm moving the static part of the PG site to a wiki. This will allow >more people to participate in the site maintenance and improvement. > >Everybody who currently has shell access on ibiblio should stop editing >the html pages and transfer their content to the wiki instead. The wiki >will soon replace most of the PG site, except for the online catalog and >a few other pages. > >The wiki can be reached at: > > http://www.gutenberg.org/wiki/ > > >1. Currently all new users have to be added by a sysop (me). If you >speak wiki and want an account, mail me your username and initial >password. You will be added to the "gutenberg" group. > >2. The wiki has a 'private' section. All pages starting with >"Gutenberg:" are editable by the "gutenberg" group only. This will be >the 'official' PG site. > >3. The rest of the wiki works just like wikis are supposed to work. You >may put it to any use you like that helps "producing and distributing >ebooks". > > >-- >Marcello Perathoner >webmaster@gutenberg.org > >_______________________________________________ >gutvol-d mailing list >gutvol-d@lists.pglaf.org >http://lists.pglaf.org/listinfo.cgi/gutvol-d - -- E-mail: cannona@fireantproductions.com Skype: cannona MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.3 (MingW32) - GPGrelay v0.959 Comment: Key available from all major key servers. iD8DBQFEiY0cI7J99hVZuJcRAtx/AJ4mB2XR+BMwybvzg/Nz+CgIMfyHyQCfW9G4 qByJ8dJRiFwkB1shgl6S5Co= =ck3b -----END PGP SIGNATURE----- From hart at pglaf.org Thu Jun 8 09:36:06 2006 From: hart at pglaf.org (Michael Hart) Date: Fri Jun 9 08:27:54 2006 Subject: [gutvol-d] Thank You for all of your work (fwd) Message-ID: As usual at this time of the year, I will be sending you some "Thank You Notes" from our Project Gutenberg readers. Here is one message, in its entirety, that I hope you enjoy! Thanks!!! Give the world eBooks in 2006!!! Michael S.
Hart Founder Project Gutenberg Blog at http://hart.pglaf.org ---------- Forwarded message ---------- Date: Sat, 03 Jun 2006 17:48:23 +0100 From: Amy To: hart@pobox.com Subject: Thank You for all of your work Dear Project Gutenberg, I don't know if many people take the time to thank you, but I just wanted to express my gratitude for the services you provide. Thank you all for your work and dedication. Your work is profoundly appreciated. I am a Peace Corps Volunteer working deep in rural Namibia. I had always admired Project Gutenberg (even donating some time through Distributed Proofreaders) but I have only begun to realize how truly important it is since I have been here. When I was in America it was nice to have access to books whenever I felt like it, without having to go to the trouble of going to a library or bookshop, but here it is vital. Bookshops are rare in Namibia (the nearest one to my village is over 250 kilometres away) and the books they sell are often very, very expensive (especially considering my limited financial resources). Also, the books they sell are often only in Afrikaans or German, neither of which I understand (Peace Corps taught me the tribal language in my village -- KhoeKhoe -- instead). Libraries are even rarer and often badly understocked. I am trying to build up a school library, but we are dependent on donations and it is much more important to get easy-to-read picture books to help the children with their English than to get classics for my own consumption. Project Gutenberg has become my library. I didn't realize the importance of plain vanilla texts until I got here and realized how slow and expensive internet is. The zipped plain vanilla texts often take less than 5 or 10 minutes to download and provide hours of reading enjoyment. Thank you for being an equalizing force in literacy, allowing books to reach those who would otherwise have a hard time getting them. Your work is thoroughly appreciated. I have shared your site with other volunteers who also enjoy it. Thank you so much. I am immensely grateful. Sincerely, Amy Elizabeth Pedersen From Bowerbird at aol.com Fri Jun 9 09:36:40 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 9 09:36:56 2006 Subject: [gutvol-d] All people with accounts on ibiblio! PG site moving to wiki Message-ID: <493.292ea68.31bafd98@aol.com> i see that _marcello_ has blocked _me_ as a "troll". ironic, eh? t.e.i. is lagging, but the smear campaign continues. whatever, i've got better things to think about... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060609/9ff28a67/attachment.html From marcello at perathoner.de Fri Jun 9 12:26:41 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Jun 9 12:26:45 2006 Subject: [gutvol-d] All people with accounts on ibiblio! PG site moving to wiki In-Reply-To: <493.292ea68.31bafd98@aol.com> References: <493.292ea68.31bafd98@aol.com> Message-ID: <4489CB71.6080302@perathoner.de> Bowerbird@aol.com wrote: > i see that _marcello_ has blocked _me_ as a "troll". Proactive conflict management. -- Marcello Perathoner webmaster@gutenberg.org From marcello at perathoner.de Fri Jun 9 12:40:02 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Jun 9 12:40:07 2006 Subject: [gutvol-d] All people with accounts on ibiblio!
PG site moving to wiki In-Reply-To: <7.0.1.0.0.20060609095424.0190f0a0@fireantproductions.com> References: <44861522.2050905@perathoner.de> <7.0.1.0.0.20060609095424.0190f0a0@fireantproductions.com> Message-ID: <4489CE92.1080007@perathoner.de> Aaron Cannon wrote: > Hello Marcello and all. The idea of a wiki is a great one! The only > problem I have with it is that it is really slow. How are you > planning on dealing with that? Perhaps a daily static dump of the > Gutenberg Namespace could be made and visitors could be referred to > that. Updates wouldn't show up immediately, but it might not matter > in most cases. ibiblio is slow currently. They say they are moving to new servers at the end of summer. That may alleviate the problem. Currently only about 4% of requests are accessing the pages we are going to put on the wiki. Also, MediaWiki is slower if you are logged in. If you are not logged in, most pages will come out of the page cache. Of course, the page cache will not fill before the public starts using the wiki. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Fri Jun 9 13:15:19 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 9 13:15:25 2006 Subject: [gutvol-d] All people with accounts on ibiblio! PG site moving to wiki Message-ID: <370.4a7ddeb.31bb30d7@aol.com> marcello said: > Proactive conflict management. well, i suppose if you really cannot help yourself from becoming entangled, that's understandable. good luck with your wiki. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060609/a09c6a6c/attachment.html From kreeder at mailsnare.net Sun Jun 11 08:36:37 2006 From: kreeder at mailsnare.net (kreeder@mailsnare.net) Date: Sun Jun 11 08:56:23 2006 Subject: [gutvol-d] Auction of rare books at the end of June Message-ID: <20060611153637.tqof2bgqo0w0c0cs@horde.mailsnare.net> This article appeared in the Cincinnati Enquirer last week, thought I'd share it in case others might also find it interesting: Treasure of rare books on the block Historical Society expects $4.5M+ from auction By Margaret A. McGurk Enquirer staff writer In the middle of the 1920s, a newly married Cornelius J. Hauck began to collect books. At first, he and his wife, Harriet Wesche, looked only for botanical subjects: trees, plants and flowers. In the next 40 years, the hobby blossomed into a passionate love affair with everything rare and glorious in the realm of the written word. The scion of a prominent Cincinnati brewery and banking family, Hauck bought books printed on paper, chiseled in stone, carved into jade, wrapped in leather and silver and jewels. Taken all together, those books form a spectacular treasure that stayed locked in a vault for 40 years, unknown to most outside the Cincinnati Historical Society, to which Hauck donated the collection in 1966, a year before his death. This month, its anonymity ends. On June 27 and 28, Christie's auction house in New York will sell the Hauck collection under the title "The History of the Book." Prices are expected to exceed $4.5 million. "Given the fact that we are really a regional history organization, it doesn't make sense for us to keep them," said Douglass W. McDonald, head of the Cincinnati Museum Center, which includes the historical society. Money from the sale will be used to care for the 50,000-plus books in the historical society collection. 
"We think it is important for these works to be put in the hands of people who can bring them more to the public's attention, and . . . where the world's scholars will be made aware of this collection." Francis Wahlgren, head of the books and manuscripts department at Christie's, said Hauck's books remained largely unknown in part because they were bought with the help of an unusually discreet adviser, Emil Offenbacher of New York. Offenbacher, a book dealer, bought many of the pieces on Hauck's behalf at estate sales and auctions in the '30s and '40s but did not reveal Hauck's identity. "His name is not bantered about the room," Wahlgren said. "Many book dealers would let that out, (that) they had a big client in Cincinnati and so forth. That never happened with Offenbacher." As a result, "There are things in there none of us have seen in 40 or more years," he said. "They are museum pieces in the sense that any examples that have survived tend to be in museums. They're unobtainable." The collection includes 900 items, to be sold in 700 lots, including ancient cuneiform tablets, illuminated manuscripts, rare bindings, sacred texts in Arabic and Hebrew and fragments of Greek papyrus, as well as modern miniatures and first editions. Because of the breadth of the collection, Christie's enlisted specialists in jewelry, silver, Asian art, Islamic artifacts, decorative arts and many other areas to assess and catalog the items. "No book collection has ever required such a team effort," he said. At least one local archivist regrets that the museum center did not make a greater effort to find a way to keep the collection intact, and in Cincinnati. Kevin Grace, University of Cincinnati archivist and head of the rare books department for the UC library system, said: "It's disappointing that they didn't try and get a local buyer first. It's a shame it's going to be dispersed and leave the city." The museum's decision to sell came as a surprise, he said. "We didn't find out about it until Christie's had it listed as an upcoming auction. If we'd known before, it might have given us the time to court somebody to endow the purchase. . . . We already have a very fine rare-book collection, and this would add to it. And since it was a Cincinnati-compiled collection, it would be nice to have it remain in the city." Museum spokesman Rodger Pille said some institutions outside Cincinnati that specialize in rare books were contacted informally about the possibility of buying the entire collection, "but at the end of the day, we determined that the auction provided a way for every one of those institutions to supplement their collections." The collection has never been exhibited in full, although a few items were shown during the museum center's "Prized Possessions" show in 2000. In recent weeks, about 40 pieces were displayed in London, Paris and Munich to entice European buyers, Wahlgren said. "In the book world, it's a huge source of excitement," he said. "This means a major new collector will be brought to light. A book from this collection will be known as the 'Hauck copy.'" * * * * Hauck collection The collection's single most valuable item, with an estimated sale price of $600,00 to $800,000, is "The Book of Friendship", an illuminated manuscript created between 1596 and 1633 to memorialize the crowned heads of Europe. A 20th-century Chinese-Tibetan portable "pocket shrine" carries the catalog's lowest price estimate, at $50-$150. A number of items are listed at less than $500. 
The newest book in the collection is a 1955 limited edition of Surrealist poems by Paul Éluard, listed at $1,500 to $2,000. The oldest is a Mesopotamian cuneiform cone dating to 2250 B.C., being sold with a newer but similar item; estimated price for both is $1,000 to $1,500. Francis Wahlgren, head of the books and manuscripts department at Christie's auction house, said his personal favorite among the 900 items in the Hauck collection is a 17th-century Dutch merchant's book on coins that has its own set of scales. Wahlgren described it as the original owner's "Blackberry, his technology at hand." See some of the rare items and get more info about the collection at Cincinnati.com. Keyword: photos [Note: I think I found the appropriate page at this site, but my browser showed it to be empty.] * * * * "The History of the Book" auction will be at Christie's, 20 Rockefeller Plaza in Manhattan, beginning at 10 a.m. June 27 and 28. Viewing days are June 23-26. The 679-page catalogs are $35 and can be ordered online at www.christies.com or by phone at 800-395-6300. From mattsen at arvig.net Sun Jun 11 09:23:09 2006 From: mattsen at arvig.net (Chuck MATTSEN) Date: Sun Jun 11 09:45:19 2006 Subject: [gutvol-d] Auction of rare books at the end of June In-Reply-To: <20060611153637.tqof2bgqo0w0c0cs@horde.mailsnare.net> References: <20060611153637.tqof2bgqo0w0c0cs@horde.mailsnare.net> Message-ID: On Sun, 11 Jun 2006 10:36:37 -0500, wrote: > See some of the rare items and get more info about the collection at > Cincinnati.com. Keyword: photos [Note: I think I found the appropriate > page > at this site, but my browser showed it to be empty.] Seems okay here: http://news.enquirer.com/apps/pbcs.dll/gallery?Avis=AB&Dato=20060606&Kategori=LIFE&Lopenr=606003&Ref=PH&SectionCat=all or http://tinyurl.com/o3df7 -- Chuck Mattsen (Mahnomen, MN) mattsen@arvig.net From Bowerbird at aol.com Sun Jun 11 10:38:58 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sun Jun 11 10:39:04 2006 Subject: [gutvol-d] utf8 prototyping Message-ID: <111.5fbb62fa.31bdaf32@aol.com> i've begun prototyping utf8 capability in my apps. if anyone would like to help test that, let me know. i remember getting a bunch of flak back when i advocated stripping a few diacritical marks in english texts for the sake of wide compatibility (since english readers understand it fine anyway). here's a chance for those people to show that they weren't just flapping their yaps... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060611/a21aa8d4/attachment.html From Bowerbird at aol.com Mon Jun 12 11:12:15 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Jun 12 11:12:33 2006 Subject: [gutvol-d] Fwd: an open letter to the google book scanning people Message-ID: <438.3673597.31bf087f@aol.com> Skipped content of type multipart/alternative-------------- next part -------------- An embedded message was scrubbed...
From: Bowerbird@aol.com Subject: an open letter to the google book scanning people Date: Mon, 12 Jun 2006 14:11:41 EDT Size: 4837 Url: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060612/bd86f00e/attachment.mht From Bowerbird at aol.com Mon Jun 12 11:41:00 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Jun 12 11:41:09 2006 Subject: [gutvol-d] translucent windows Message-ID: <3f9.47be333.31bf0f3c@aol.com> the mac allows windows to have varying background opacity, from totally opaque through translucent to fully transparent... can anyone think of _any_ possible e-book use for transparent windows? because it looks really cool, and even though it's not cross-plat, i'd _love_ to be able to find _some_ reason to implement it... any reason. but alas, i'm coming up empty... ;+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060612/fda82cc7/attachment.html From realmjit at yahoo.com Mon Jun 12 12:19:44 2006 From: realmjit at yahoo.com (Mjit Raindancer-Stahl) Date: Mon Jun 12 12:26:27 2006 Subject: [gutvol-d] Re: translucent windows In-Reply-To: <20060612190003.E006C8CBC7@pglaf.org> Message-ID: <20060612191944.83601.qmail@web30210.mail.mud.yahoo.com> > > can anyone think of _any_ possible > e-book use for transparent windows? Anatomy books. My favorite anatomy books allow the reader to view the human body system by system, with each system on a clear overlay. M'jit AIM/Yahoo!IM/Ebay: Realmjit realmjit@yahoo.com | answerwitch@gundo.com __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From Bowerbird at aol.com Tue Jun 13 13:05:51 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Jun 13 13:05:59 2006 Subject: [gutvol-d] viewer-program for p.g. e-texts Message-ID: <40f.39b4f8c.31c0749f@aol.com> one of the best viewer-programs around -- for those of you on the p.c. platform -- is "ybook", by simon hayes. and it's free... it even has a hookup with the p.g. catalog, so you can download e-texts from inside it. ybook also lets you wrap a book you've written in a standalone executable .exe, which is nifty... http://members.iinet.net.au/~simonh/spacejock/yBook.html -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060613/9433cd27/attachment.html From Bowerbird at aol.com Tue Jun 13 13:07:48 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Jun 13 13:07:54 2006 Subject: [gutvol-d] dan poynter and e-books Message-ID: <383.42ff849.31c07514@aol.com> on another listserve, dan poynter -- the guru of self-publishing -- says this: > I have been reading (many) books > on my Pocket PC for years. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060613/a6dd7fb1/attachment.html From Bowerbird at aol.com Wed Jun 14 11:50:10 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Jun 14 11:50:24 2006 Subject: [gutvol-d] annotating the movies Message-ID: <4cd.1556433.31c1b462@aol.com> wanna create your own version of mystery science theater 3000, complete with smart-ass comments coming from the audience members pictured in silhouette down front? then get a mac and run "peanut gallery". 
> http://peanutgallery.kaisakura.com/ and you can be such an audience-member, meaning that "it's ok to talk during the film". as the website says: > Interact with each other via Maya-rendered 30fps* > animated characters, inline real-time text chat, and voice. > Peanut Gallery isn't just a video player ? it's a Shared Media Experience! -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060614/c59893a5/attachment.html From sly at victoria.tc.ca Thu Jun 15 12:26:55 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu Jun 15 12:26:59 2006 Subject: [gutvol-d] PG text in library catalog Message-ID: Well! I've had my first experience of running into a Project Gutenberg citation in a major "traditional" library catalog. This was through Amicus, a collection of records from Canadian libraries. The only unfortunate thing is that it is presented via NetLibrary, which limits and controls access to its texts. NAME(S):*Burroughs, Edgar Rice, 1875-1950 NetLibrary, Inc TITLE(S): The mucker [electronic resource] / Edgar Rice Burroughs PUBLISHER: Champaign, Ill. (P.O. Box 2782, Champaign 61825) : Project Gutenberg, [199u]. E-LOCATIONS: http://www.netLibrary.com/urlapi.asp?action=summary&v=1 &bookid=1085499 *McMaster only NOTES: Also available on the Internet. MODE OF ACCESS via web browser by entering the following URL: http://www.netLibrary.com/urlapi.asp?action=summary&v= 1&bookid=1085499 Electronic reproduction. Boulder, Colo. : NetLibrary, 2001. Available via World Wide Web. Access may be limited to NetLibrary affiliated libraries. NUMBERS: ISBN: 0585016860 (electronic bk.) : ISBN: 0585016860 (electronic bk.) CLASSIFICATION: LC Call no.: PS3503.U687 .M83 SUBJECTS: Electronic books Science fiction From Bowerbird at aol.com Thu Jun 15 12:43:24 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Jun 15 12:43:35 2006 Subject: [gutvol-d] PG text in library catalog Message-ID: <425.36aa8e5.31c3125c@aol.com> andrew said: > The only unfortunate thing is that it is presented via > NetLibrary, which limits and controls access to its texts. to my eyes, this is starting to look like an i.q. test for librarians. ironic, isn't it? for me, a library is a place where books that normally cost money can be borrowed for free. but with this, a library is becoming a place that pays for books that are free. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060615/64b240e2/attachment.html From desrod at gnu-designs.com Thu Jun 15 13:02:31 2006 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Thu Jun 15 13:09:28 2006 Subject: [gutvol-d] PG text in library catalog In-Reply-To: <425.36aa8e5.31c3125c@aol.com> References: <425.36aa8e5.31c3125c@aol.com> Message-ID: > ironic, isn't it? for me, a library is a place where books that > normally cost money can be borrowed for free. but with this, a > library is becoming a place that pays for books that are free. You must be new to this Internet thing ;) Joking aside, lots of common terms that we are used to are being redefined to mean precisely the exact opposite. "Free membership" (just enter your credit card number or user name here), or "Download these titles now" (as soon as we receive them in stock; 4-6 weeks minimum). Oh, and my favorite recent one... "Net Neutrality". The irony with the Doublespeak never ceases to amaze me. David A. 
Desrosiers desrod@gnu-designs.com http://gnu-designs.com From greg at durendal.org Thu Jun 15 13:01:23 2006 From: greg at durendal.org (Greg Weeks) Date: Thu Jun 15 13:30:04 2006 Subject: [gutvol-d] PG text in library catalog In-Reply-To: References: Message-ID: On Thu, 15 Jun 2006, Andrew Sly wrote: > Well! I've had my first experience of running into a > Project Gutenberg citation in a major "traditional" library catalog. > This was through Amicus, a collection of records from Canadian > libraries. The only unfortunate thing is that it is presented > via NetLibrary, which limits and controls access to its texts. I've run into a number of these citations via NetLibrary from the Carnegie library in Pittsburgh. They don't have the complete Gutenberg catalog. -- Greg Weeks http://durendal.org:8080/greg/ From Bowerbird at aol.com Thu Jun 15 15:03:58 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Jun 15 15:04:08 2006 Subject: [gutvol-d] the newest d.p. iteration Message-ID: <4ad.29fbf48.31c3334e@aol.com> the newest iteration over at distributed proofreaders will be _3_ proofing rounds and 2 formatting rounds, with provisions for skipping some of these rounds... with this new change, i think we can safely say that d.p. has wasted a lot of time studying its workflow and _still_ not come to the point of perfecting it... so it's time for me to once again interject my opinion. 1. pre-proofing clean-up programs could handle _many_ of the problems that are found in your o.c.r. (careful image handling could solve most of the rest.) 2. if d.p. used zen markup, it could save itself from the drudgery of those "formatting rounds". conversion from plain-ascii to html is now routine. (pushing out each page to check its formatting is a tremendous waste of bandwidth. but who cares?) 3. no matter how many rounds you add, it will _still_ be the case that some pages will have needed more. (some _pages_, *not* some _books_; it's silly to treat all of the pages in a book as being of equal difficulty.) d.p. needs to go "roundless", treat pages individually. 4. duplicate proofings by independent proofers can be crosschecked to quickly and easily spot any differences, which can then be dispatched with a minimum of effort. this double-key strategy can be used on individual pages. (see the sketch below.) again, these are all things that i've been saying for years. if all the energy that's been spent on "research" would've been used to implement these recommendations instead, it would've been a lot less work, and d.p. would now have a good workflow. as it is, it will probably take a year or so for the problems in the newest system to reveal themselves, and then more work after that to install all of my suggestions.
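to make point 4 concrete, here's a rough sketch of that double-key crosscheck -- two independent keyings of the same page, compared so a human only has to look at the spots where they disagree. this is just an illustration of the strategy in plain javascript, not code from any actual d.p. system:

// compare two independent keyings of the same page, word by word;
// returns only the positions where the two proofers disagree
function crosscheck(keyA, keyB) {
    var a = keyA.split(/\s+/);
    var b = keyB.split(/\s+/);
    var diffs = [];
    var n = Math.max(a.length, b.length);
    for (var i = 0; i < n; i++) {
        if (a[i] !== b[i]) {
            diffs.push({ position: i, first: a[i], second: b[i] });
        }
    }
    return diffs; // an empty list means the two keyings agree
}

// e.g. crosscheck("it was a dark night", "it was a dork night")
// yields [{ position: 3, first: "dark", second: "dork" }]

a real implementation would first align the two texts, since an inserted or dropped word shifts everything after it, but even this naive word-by-word version shows the core of the idea: two independent proofers rarely make the same mistake in the same place, so the differences are where the errors live.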
-bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060615/3379ee76/attachment.html From marcello at perathoner.de Thu Jun 15 15:17:55 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Thu Jun 15 15:18:03 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: <4ad.29fbf48.31c3334e@aol.com> References: <4ad.29fbf48.31c3334e@aol.com> Message-ID: <4491DC93.7070300@perathoner.de> Bowerbird@aol.com wrote: > if all the energy that's been spent on "research" would've > been used to implement these recommendations instead, > it would've been a lot less work, and d.p. would now have > a good workflow. Why don't you start your own distributed proofing project with all those nifty processes and tools you have by now devised? Seeing how superior all your ideas are, you should be able to churn out twice as many books as DP with no effort at all. -- Marcello Perathoner webmaster@gutenberg.org From Bowerbird at aol.com Thu Jun 15 17:57:11 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Thu Jun 15 17:57:23 2006 Subject: [gutvol-d] the newest d.p. iteration Message-ID: <31b.502718f.31c35be7@aol.com> marcello said: > Why don't you start your own distributed proofing project because i anticipate that with further research on my part, combined with ever-increasing o.c.r. progress from abbyy, we won't even need much human proofreading in the future. besides, i've already prototyped my "continuous proofreading", and i'll be putting that into place when google hands me their full pre-1923 library of page-scans, which i recently requested. and, to be honest with you, i've become more and more bored with these old books, which -- face it -- we focus on _mostly_ because their copyright has expired. i'd say that 4 out of 5 of the e-texts that are being posted these days are _not_ "classics". (not that nonclassics don't deserve to be preserved as well, but...) further, much of the copyright-constrained stuff of recent decades is merely pap the publishing industry thought might make money. much of it, i couldn't give a shit if it makes it to cyberspace or not... what really excites me now is our new possibility to let _everything_ that _anyone_ might write see the light of day and find its audience. we are finally free of the shackles of the past, meaning that we can free ourselves of the corporate mindset that's blinded us up to now. (and the government one before it, and the religious one before it.) we can now travel far past the edge of the envelope; that's exciting. so rather than converting old books from paper to electronic form, i want to help new born-digital works find their place in cyberspace. i want to encourage writers to see our imaginations can now be free, in a way that has _never_ been true before in all of our long history. in other words, the human race now has a truly unique opportunity! don't get me wrong, i am _really_happy_ old works are being rescued. it's just that, for my own self, the relevance of new works is more juicy. -bowerbird p.s. plus, as voice recognition improves over the next 5 years or so, i expect that o.c.r. will take a back seat to voice-transcribed books... -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060615/3bf97e6c/attachment.html From brad at chenla.org Fri Jun 16 06:58:30 2006 From: brad at chenla.org (Brad Collins) Date: Fri Jun 16 07:05:08 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: <31b.502718f.31c35be7@aol.com> (Bowerbird@aol.com's message of "Thu, 15 Jun 2006 20:57:11 EDT") References: <31b.502718f.31c35be7@aol.com> Message-ID: Bowerbird@aol.com writes: > p.s. plus, as voice recognition improves over the next 5 years or > so, i expect that o.c.r. will take a back seat to voice-transcribed > books... ROFL !!! -- Brad Collins , Banqwao, Thailand From kth at srv.net Fri Jun 16 08:13:02 2006 From: kth at srv.net (Kevin Handy) Date: Fri Jun 16 08:19:05 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: References: <31b.502718f.31c35be7@aol.com> Message-ID: <4492CA7E.7020408@srv.net> Brad Collins wrote: >Bowerbird@aol.com writes: > > > >>p.s. plus, as voice recognition improves over the next 5 years or >>so, i expect that o.c.r. will take a back seat to voice-transcribed >>books... >> >> > >ROFL !!! > > > Ewe no, he mite bee rite. Wee maybe waisting oar thyme. This voice recognition get off me you stupid cat. Stuff will obviously get off of me now! Have fewer problems than ow! Ow! OW! Get off me! What we are doing now. Yowl! Snarl! Growl! Ow! OW! OW! From kth at srv.net Fri Jun 16 08:13:02 2006 From: kth at srv.net (Kevin Handy) Date: Fri Jun 16 08:19:06 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: References: <31b.502718f.31c35be7@aol.com> Message-ID: <4492CA7E.7020408@srv.net> Brad Collins wrote: >Bowerbird@aol.com writes: > > > >>p.s. plus, as voice recognition improves over the next 5 years or >>so, i expect that o.c.r. will take a back seat to voice-transcribed >>books... >> >> > >ROFL !!! > > > Ewe no, he mite bee rite. Wee maybe waisting oar thyme. This voice recognition get off me you stupid cat. Stuff will obviously get off of me now! Have fewer problems than ow! Ow! OW! Get off me! What we are doing now. Yowl! Snarl! Growl! Ow! OW! OW! From hart at pglaf.org Fri Jun 16 09:41:52 2006 From: hart at pglaf.org (Michael Hart) Date: Fri Jun 16 09:41:54 2006 Subject: [gutvol-d] PG text in library catalog In-Reply-To: References: Message-ID: On Thu, 15 Jun 2006, Greg Weeks wrote: > On Thu, 15 Jun 2006, Andrew Sly wrote: > >> Well! I've had my first experience of running into a >> Project Gutenberg citation in a major "traditional" library catalog. >> This was through Amicus, a collection of records from Canadian >> libraries. The only unfortunate thing is that it is presented >> via NetLibrary, which limits and controls access to its texts. > > I've ran into a number of these citations via NetLibrary from the Carnegie > library in Pittsburgh. They don't have the complete Gutenberg catalog. NetLibrary has sold perhaps millions of PG eBooks for ~100 to college libraries. . .libraries, I might add, who wouldn't take them when I offered them free of charge. . . . Including my own local Big 10 University of Illinois. ;-) From Bowerbird at aol.com Fri Jun 16 10:38:41 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 16 10:38:50 2006 Subject: [gutvol-d] the newest d.p. iteration Message-ID: <504.251e86.31c446a1@aol.com> i said: > p.s.? plus, as voice recognition improves over the next 5 years or so, > i expect that o.c.r. will take a back seat to voice-transcribed books... then brad said: > ROFL !!! then kevin said: > Ewe no, he mite bee rite. 
Wee maybe waisting oar thyme. ya know, i never know what i'm gonna say that's gonna set people off. (but i should have learned by now that it'll probably be a throwaway line in the p.s. rather than the meat of the substance in the body of the message.) but hey, i don't mind the challenge. it helps me develop some logic that i might not have bothered with otherwise. obviously kevin here has never used voice-recognition, because no system would give us the line he gives us. that's _not_ to say voice-recognition is problem-free. there are a lot of problems with it. a ton of problems. but there used to be a ton of problems with o.c.r. too. and people still slogged through it anyway, didn't they? the reason people will slog through the problems anyway with voice recognition is because it will be a lot more _fun_ and _easy_ to just _read_ a book through rather than to sit inside an editing system, and that will make the difference. over the past 5 years, some 35,000 people signed up at d.p. roughly 10% -- about 3,500 -- were around when d.p. reset its subscription base a while back. those were the top 10%, so that wasn't a bad thing, but it does go to show that the o.c.r. route is just a little bit too trying for the average bear. even when you distribute out the work. but hey, if that other 90% could do their part to help out by simply recording a book -- they _did_ once express enough interest in the cause to sign up, remember -- then maybe they could have been retained as helpers... and maybe a whole order of magnitude of more helpers could be _recruited_ if the means of helping were so fun. with libre vox, people are already recording old books. audiobooks, always popular, are getting even more so. podcasting is growing the base of recording experience (and audience) in the user-population at a _huge_ rate. and, for those of us keeping track, there has already been a message posted on the distributed proofing forums from a person who reported using voice-recognition software _within_the_current_d.p._system_. now that's dedication. and as the form-factors of our machines continue to shrink, voice-recognition will become more and more important, and more ingrained, and some people will rely on it entirely. and speaking of libre vox, it's important to keep in mind that a _recording_ retains value even _after_ it has been turned into digital text. heck, many people will prefer the .mp3 to the .txt. there's sure a lot more player-hardware out here for the .mp3. moreover, when a person creates a recording, that product is _seeped_ with their contribution. with their own _voice_, for crying out loud. can it get much more personal than that? to some people, that will surely be more satisfying than the simple credit-line at the top of a project gutenberg e-text... and hey, it might mean a lot more to the _end-user_ as well! i can tell you that i've looked at a lot of texts from jon ingram. lots and lots of them. and they almost always look very nice. but none of them has had the impact of the bit he recorded for libre vox, where his accent had me muttering to myself, "hey, i forgot, that bloody bloke is from _england_, isn't he?" there's something very endearing and personal about a voice. even one with a heavy english accent. ;+) so a person who records a book is giving us _two_ products; one is a route to obtaining digital text via voice-recognition, and the other is a recording of that book in a human voice. it might be that down the line, the second dwarfs the first. 
it's also quite important to remind ourselves that these two products are complementary, not competing with each other. and it's not hard to imagine that the recording will become _especially_ useful when it gets combined with page-scans. a recording of each page playing when the scan is displayed might become the most typical kind of "book" in the future! likewise, it does _not_ have to be either/or between o.c.r. and voice-recognition; we can instead make the two work together. we could do o.c.r. on the scans, and then cross-check the o.c.r. against the voice-recognition results, then concentrate on the differences to intelligently remove errors from _both_ versions. we would expect homonym problems in the voice-recognition, for instance, and scannos in the o.c.r., so could control for that. anytime you combine two different methods for the same result, they can serve as a useful cross-check on each other. bingo. in case you didn't know, some of the people who are obtaining the highest accuracy in their e-texts use text-to-speech to get it. what i'm talking about here can be viewed as the flip-side of that. so, in summary, if you're "rolling on the floor laughing" about voice-recognition and the possibilities it offers to digitizers, you show your lack of vision. there's no other way to say it... of course, your loss is the lurkers' gain, because it gave me a reason to explain. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060616/23249cc6/attachment.html From greg at durendal.org Fri Jun 16 10:30:59 2006 From: greg at durendal.org (Greg Weeks) Date: Fri Jun 16 11:00:11 2006 Subject: [gutvol-d] PG text in library catalog In-Reply-To: References: Message-ID: On Fri, 16 Jun 2006, Michael Hart wrote: > > On Thu, 15 Jun 2006, Greg Weeks wrote: > >> On Thu, 15 Jun 2006, Andrew Sly wrote: >> >>> Well! I've had my first experience of running into a >>> Project Gutenberg citation in a major "traditional" library catalog. >>> This was through Amicus, a collection of records from Canadian >>> libraries. The only unfortunate thing is that it is presented >>> via NetLibrary, which limits and controls access to its texts. >> >> I've ran into a number of these citations via NetLibrary from the Carnegie >> library in Pittsburgh. They don't have the complete Gutenberg catalog. > > NetLibrary has sold perhaps millions of PG eBooks for ~100 to > college libraries. . .libraries, I might add, who wouldn't take > them when I offered them free of charge. . . . > > Including my own local Big 10 University of Illinois. NetLibrary gives credit also, so I can't claim to be unhappy with them. If that's what it takes to get our books into brick and mortar libraries ok. -- Greg Weeks http://durendal.org:8080/greg/ From marcello at perathoner.de Fri Jun 16 11:14:00 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Fri Jun 16 11:14:03 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: <31b.502718f.31c35be7@aol.com> References: <31b.502718f.31c35be7@aol.com> Message-ID: <4492F4E8.5000406@perathoner.de> Bowerbird@aol.com wrote: > p.s. plus, as voice recognition improves over the next 5 years or so, > i expect that o.c.r. will take a back seat to voice-transcribed books... It is a troot uneeferselly ecknooledged, thet a seengle-a mun in pussesseeun ooff a guud furtoone-a, moost be-a in vunt ooff a veeffe-a. 
Hooefer leettle-a knoon zee feeleengs oor feeoos ooff sooch a mun mey be-a oon hees furst intereeng a neeeghbuoorhuud, thees troot is su vell feexed in zee meends ooff zee soorruoondeeng femeelies, thet he-a is cunseedered zee reeghtffool pruperty ooff sume-a oone-a oor oozeer ooff zeeur dooghters. -- Marcello Perathoner webmaster@gutenberg.org From dixonm at pobox.com Fri Jun 16 12:39:07 2006 From: dixonm at pobox.com (Meredith Dixon) Date: Fri Jun 16 12:39:04 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: <504.251e86.31c446a1@aol.com> References: <504.251e86.31c446a1@aol.com> Message-ID: <449308DB.7000505@pobox.com> Bowerbird@aol.com wrote: > the reason people will slog through the problems anyway > with voice recognition is because it will be a lot more _fun_ > and _easy_ to just _read_ a book through rather than to sit > inside an editing system, and that will make the difference. Bowerbird, how often do you read books aloud? My grandmother, who grew up in a time when reading to others was an essential skill, taught me to read aloud as a child, and I spent many hours reading aloud to her and to my mother. I actually enjoyed doing it, and I often wish I had more opportunities to do so now. But reading a book aloud is an extremely slow and inefficient way to get text into electronic form. I could *type* a book in faster than I could read it aloud, much less scan it. Reading out loud is tiring, even when you're used to it. If you have only read, say, picture books to your children, you may not realize this. You need to rest your voice after an hour or so. And it takes hours and hours to read an ordinary book out loud, never mind something like The Lord of the Rings (and, yes, I have read the entire The Lord of the Rings out loud. Twice.). Scanning is boring, yes, but it is also fast. And it doesn't make your throat hurt at the end of a session. > > over the past 5 years, some 35,000 people signed up at d.p. > > roughly 10% -- about 3,500 -- were around when d.p. reset > its subscription base a while back. those were the top 10%, > so that wasn't a bad thing, but it does go to show that the > o.c.r. route is just a little bit too trying for the average bear. > even when you distribute out the work. > > but hey, if that other 90% could do their part to help out > by simply recording a book -- they _did_ once express > enough interest in the cause to sign up, remember -- > then maybe they could have been retained as helpers... > > and maybe a whole order of magnitude of more helpers > could be _recruited_ if the means of helping were so fun. > > with libre vox, people are already recording old books. > audiobooks, always popular, are getting even more so. > podcasting is growing the base of recording experience > (and audience) in the user-population at a _huge_ rate. > > and, for those of us keeping track, there has already been > a message posted on the distributed proofing forums from > a person who reported using voice-recognition software > _within_the_current_d.p._system_. now that's dedication. > > and as the form-factors of our machines continue to shrink, > voice-recognition will become more and more important, > and more ingrained, and some people will rely on it entirely. > > and speaking of libre vox, it's important to keep in mind that > a _recording_ retains value even _after_ it has been turned into > digital text. heck, many people will prefer the .mp3 to the .txt. > there's sure a lot more player-hardware out here for the .mp3. 
> > moreover, when a person creates a recording, that product > is _seeped_ with their contribution. with their own _voice_, > for crying out loud. can it get much more personal than that? > to some people, that will surely be more satisfying than the > simple credit-line at the top of a project gutenberg e-text... > > and hey, it might mean a lot more to the _end-user_ as well! > > i can tell you that i've looked at a lot of texts from jon ingram. > lots and lots of them. and they almost always look very nice. > but none of them has had the impact of the bit he recorded > for libre vox, where his accent had me muttering to myself, > "hey, i forgot, that bloody bloke is from _england_, isn't he?" > > there's something very endearing and personal about a voice. > even one with a heavy english accent. ;+) > > so a person who records a book is giving us _two_ products; > one is a route to obtaining digital text via voice-recognition, > and the other is a recording of that book in a human voice. > it might be that down the line, the second dwarfs the first. > > it's also quite important to remind ourselves that these two > products are complementary, not competing with each other. > > and it's not hard to imagine that the recording will become > _especially_ useful when it gets combined with page-scans. > a recording of each page playing when the scan is displayed > might become the most typical kind of "book" in the future! > > likewise, it does _not_ have to be either/or between o.c.r. and > voice-recognition; we can instead make the two work together. > > we could do o.c.r. on the scans, and then cross-check the o.c.r. > against the voice-recognition results, then concentrate on the > differences to intelligently remove errors from _both_ versions. > > we would expect homonym problems in the voice-recognition, > for instance, and scannos in the o.c.r., so could control for that. > > anytime you combine two different methods for the same result, > they can serve as a useful cross-check on each other. bingo. > > in case you didn't know, some of the people who are obtaining > the highest accuracy in their e-texts use text-to-speech to get it. > what i'm talking about here can be viewed as the flip-side of that. > > so, in summary, if you're "rolling on the floor laughing" about > voice-recognition and the possibilities it offers to digitizers, > you show your lack of vision. there's no other way to say it... > > of course, your loss is the lurkers' gain, > because it gave me a reason to explain. > > -bowerbird > > > ------------------------------------------------------------------------ > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d -- Meredith Dixon Check out *Raven Days* For victims and survivors of bullying at school. And for those who want to help. From Bowerbird at aol.com Fri Jun 16 12:55:15 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 16 12:55:29 2006 Subject: [gutvol-d] mark pilgrim Message-ID: <234.bddf23a.31c466a3@aol.com> mark pilgrim, an early open-source person, recently switched from apple over to linux... in a blog entry on this, he talks about archiving, and how it gets complicated by file-formats and _especially_ by d.r.m. (which hobbles it by design), and remarks open source does not always equate to open formats (using "gimp" as an example of it). 
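The o.c.r./voice-recognition cross-check proposed earlier in this thread is straightforward to prototype: align the two transcriptions and keep only the spans where they disagree. A minimal sketch in Python, assuming both inputs are plain-text transcriptions of the same page; the function name and the sample sentences are invented for illustration:

    import difflib

    def cross_check(ocr_text, asr_text):
        # Align two independent transcriptions of the same page and report
        # the spans where they disagree, so later effort can concentrate on
        # just those differences.
        ocr_words = ocr_text.split()
        asr_words = asr_text.split()
        matcher = difflib.SequenceMatcher(a=ocr_words, b=asr_words,
                                          autojunk=False)
        return [(" ".join(ocr_words[i1:i2]), " ".join(asr_words[j1:j2]))
                for op, i1, i2, j1, j2 in matcher.get_opcodes()
                if op != "equal"]

    # A scanno ("rn" read as "m") and a homonym ("their" heard as "there")
    # fail in different places, so each version corrects the other:
    print(cross_check("the modem world is their oyster",
                      "the modern world is there oyster"))
    # -> [('modem', 'modern'), ('their', 'there')]

Because the two methods make largely uncorrelated errors, the disagreement list is a far smaller haystack for a human than either raw transcript.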
even before he mentioned project gutenberg, i was thinking i'd share a pointer, so here it is: > http://diveintomark.org/archives/2006/06/16/juggling-oranges -bowerbird p.s. i highly recommend -- for guys -- pilgrim's blog entry before this, "howto make the perfect fruit salad and get laid." -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060616/e705c6d9/attachment.html From joshua at hutchinson.net Fri Jun 16 13:22:26 2006 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Fri Jun 16 13:22:37 2006 Subject: [gutvol-d] the newest d.p. iteration Message-ID: <20060616202226.BDCD510995B@ws6-4.us4.outblaze.com> > If you have only > read, say, picture books to your children, you > may not realize this. That has to be one of the scariest things I've read lately... bowerbird procreating? *shudder* Other than that, I would add one more reason that OCR is more convenient than Voice Recognition ... I can work on a page of typed text at my computer without annoying the crap out of people around me. Can you imagine trying to read a book while sitting at your local Starbucks? Josh From Bowerbird at aol.com Fri Jun 16 13:24:53 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 16 13:24:58 2006 Subject: [gutvol-d] the newest d.p. iteration Message-ID: <4f8.2aa03d.31c46d95@aol.com> meredith said: > But reading a book aloud is an extremely slow > and inefficient way to get text into electronic form. that's the point, though. people won't be doing it "to get text into electronic form". that will just be a pleasant side-effect, a tangent from their real aim, which will be "to share a book with the whole world". they'll be doing it because it's _fun_. sure it's work too. but people will do a whole lot of work if they enjoy what they're doing. you see that all the time. > I could *type* a book in faster than I could read it aloud, > much less scan it. but your typed version will be no different than anyone else's. your _recorded_ version, however, will be _uniquely_ yours, perfectly representing the one-of-a-kind snowflake you are, something that your grandchildren, and _their_ grandchildren, can listen to over and over whenever they want to think of you. don't you wish you could hear your grandmother's voice again? > Reading out loud is tiring, even when you're used to it. i agree, it is. but you also get used to it, the more you do it, until you can do it without straining yourself in the slightest. > If you have only read, say, picture books to your children, you may > not realize this. You need to rest your voice after an hour or so. i do performance poetry, so i'm sharply cognizant of voice training. i'm also acutely aware a large audience provides a lot of motivation. > And it takes hours and hours to read an ordinary book out loud, > never mind something like The Lord of the Rings the market for audiobooks has already asserted itself, quite loudly. i imagine that _free_ audiobooks will provide a _very_ large audience. and thus a lot of motivation. > (and, yes, I have read the entire The Lord of the Rings out loud.? Twice.). then i guess you must have had sufficient motivation of some kind. > Twice. ya know, if you would have recorded yourself the first time you did it, you wouldn't have had to read it out loud again the second time... ;+) > Scanning is boring, yes, but it is also fast.? > And it doesn't make your throat hurt at the end of a session. warm water. 
(for your throat, not for your scanner...) ;+) -bowerbird p.s. i see your signature-block promotes a book you've written. perhaps you heard that an author who was podcasting his novel recently got picked up by one of the major publishing houses? so lots of aspiring authors might think of becoming podcasters. voice training -- it's not just for performance poets any more! > If you ask me what I came to do in this world, > I, an artist, I will answer you: "I am here to live out loud.? > -- Emile Zola -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060616/5f375af8/attachment.html From dixonm at pobox.com Fri Jun 16 16:46:42 2006 From: dixonm at pobox.com (Meredith Dixon) Date: Fri Jun 16 16:46:43 2006 Subject: [gutvol-d] the newest d.p. iteration In-Reply-To: <4f8.2aa03d.31c46d95@aol.com> References: <4f8.2aa03d.31c46d95@aol.com> Message-ID: <449342E2.7070403@pobox.com> Bowerbird@aol.com wrote: > > Reading out loud is tiring, even when you're used to it. > > i agree, it is. but you also get used to it, the more you do it, > until you can do it without straining yourself in the slightest. All I can say is that I never managed to get so used to it that my throat didn't hurt when I'd finished reading for the day, and I read aloud almost every day for at least an hour a day for most of my childhood. Certainly there's a learning curve to learning to read aloud, but that's mostly neurological; you need to learn how to read ahead with your eyes, to plan emphasis, while your mouth is reading an earlier sentence, and to jump back smoothly to your place in time to start your mouth off on the next sentence. But mastering that doesn't help any with tiredness, or with your throat's getting sore. > then i guess you must have had sufficient motivation of some kind. Well, yes, I liked the book well enough to spend time reading it to my mother. > ya know, if you would have recorded yourself the first time you did it, > you wouldn't have had to read it out loud again the second time... I don't think my mother would have stood for listening to a tape recorder instead of listening to me, and I shudder to think how many 45-minutes-on-a-side tapes it would have filled. > p.s. i see your signature-block promotes a book you've written. No, it promotes one of my websites. No book is involved. -- Meredith Dixon Check out *Raven Days* For victims and survivors of bullying at school. And for those who want to help. From Bowerbird at aol.com Sat Jun 17 02:07:13 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Sat Jun 17 02:07:18 2006 Subject: [gutvol-d] "all of them?" Message-ID: <319.50cf8c9.31c52041@aol.com> > http://youtube.com/watch?v=veIU0Jwu54w no comment necessary... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060617/c18d5080/attachment.html From nwolcott2ster at gmail.com Sun Jun 18 09:31:41 2006 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Sun Jun 18 09:33:45 2006 Subject: [gutvol-d] http://www.ebooksgratuits.com/ Message-ID: <000c01c692f4$daed5420$650fa8c0@gw98> The web site http://www.ebooksgratuits.com/ which provided many pd french texts for PG and also had many other formats, has disappeared. Has anyone archived this site? Internet archive gets lost looking for individual books, although the home page is available until December 2005. 
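When browsing the Internet Archive "gets lost" like this, its index can be asked directly for every capture it holds under a domain, not just the front page. A sketch against the Wayback Machine's CDX query service (a facility archive.org added well after this thread; it is shown purely to illustrate the approach, and the field names are as that service defines them):

    import json
    import urllib.request

    def list_snapshots(domain, limit=25):
        # Ask the Wayback Machine's CDX index for captures anywhere under
        # the domain; the first row of the JSON response is a field header.
        url = ("http://web.archive.org/cdx/search/cdx"
               "?url=%s/*&output=json&fl=timestamp,original&limit=%d"
               % (domain, limit))
        with urllib.request.urlopen(url) as response:
            rows = json.load(response)
        return ["http://web.archive.org/web/%s/%s" % (ts, orig)
                for ts, orig in rows[1:]]

    for snapshot in list_snapshots("ebooksgratuits.com"):
        print(snapshot)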
nwolcott2@post.harvard.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060618/9eb747fa/attachment.html From ajhaines at shaw.ca Sun Jun 18 10:39:41 2006 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Sun Jun 18 10:39:45 2006 Subject: [gutvol-d] http://www.ebooksgratuits.com/ References: <000c01c692f4$daed5420$650fa8c0@gw98> Message-ID: <001c01c692fe$2de6d990$6401a8c0@ahainesp2400> Do a Google on "ebooksgratuits", and work through Google's "cached" links. Maybe you can extract material that way. ----- Original Message ----- From: Norm Wolcott To: 'Project Gutenberg Volunteer Discussion' Sent: Sunday, June 18, 2006 9:31 AM Subject: [gutvol-d] http://www.ebooksgratuits.com/ The web site http://www.ebooksgratuits.com/ which provided many pd french texts for PG and also had many other formats, has disappeared. Has anyone archived this site? Internet archive gets lost looking for individual books, although the home page is available until December 2005. nwolcott2@post.harvard.edu ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060618/b1616513/attachment.html From donovan at abs.net Sun Jun 18 11:13:10 2006 From: donovan at abs.net (D Garcia) Date: Sun Jun 18 11:13:25 2006 Subject: [dp-pg] Re: [gutvol-d] http://www.ebooksgratuits.com/ In-Reply-To: <001c01c692fe$2de6d990$6401a8c0@ahainesp2400> References: <000c01c692f4$daed5420$650fa8c0@gw98> <001c01c692fe$2de6d990$6401a8c0@ahainesp2400> Message-ID: <200606181413.10766.donovan@abs.net> On Sunday 18 June 2006 01:39 pm, Al Haines (shaw) wrote: > Do a Google on "ebooksgratuits", and work through Google's "cached" links. > Maybe you can extract material that way. It looks like the domain expired, it eventually resolves to a placholder page which has a bunch of link junk on it. Google doesn't even appear to have the front page cached. From marcello at perathoner.de Sun Jun 18 11:58:38 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Sun Jun 18 11:58:41 2006 Subject: [dp-pg] Re: [gutvol-d] http://www.ebooksgratuits.com/ In-Reply-To: <200606181413.10766.donovan@abs.net> References: <000c01c692f4$daed5420$650fa8c0@gw98> <001c01c692fe$2de6d990$6401a8c0@ahainesp2400> <200606181413.10766.donovan@abs.net> Message-ID: <4495A25E.20300@perathoner.de> D Garcia wrote: > It looks like the domain expired, it eventually resolves to a placholder page > which has a bunch of link junk on it. $ whois ebooksgratuits.com reveals: Domain Name: EBOOKSGRATUITS.COM Created on: 11-Dec-03 Expires on: 11-Dec-06 Last Updated on: 17-May-06 so the domain has NOT expired. -- Marcello Perathoner webmaster@gutenberg.org From fvandrog at scripps.edu Sun Jun 18 11:58:39 2006 From: fvandrog at scripps.edu (Frank van Drogen) Date: Sun Jun 18 11:58:43 2006 Subject: [gutvol-d] http://www.ebooksgratuits.com/ In-Reply-To: <000c01c692f4$daed5420$650fa8c0@gw98> References: <000c01c692f4$daed5420$650fa8c0@gw98> Message-ID: <7.0.1.0.0.20060618115736.01d21348@scripps.edu> You might try to contact Patrick Merlo. (pmerlo at yahoo dot fr). 
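Marcello's whois check below is also easy to script, which helps if you want to watch an ailing domain rather than re-run the lookup by hand. A minimal sketch that shells out to whois(1) and scrapes the expiry line; the regular expression is a loose heuristic, since registrar output formats vary:

    import re
    import subprocess

    def expires_on(domain):
        # Run the same lookup shown in this thread and pull out the
        # "Expires on:" line (field names differ between registrars).
        out = subprocess.run(["whois", domain],
                             capture_output=True, text=True).stdout
        match = re.search(r"Expir\w*[^:\n]*:\s*(.+)", out, re.IGNORECASE)
        return match.group(1).strip() if match else None

    print(expires_on("ebooksgratuits.com"))  # e.g. "11-Dec-06", as quoted below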
Frank From blondeel at clipper.ens.fr Sun Jun 18 12:56:50 2006 From: blondeel at clipper.ens.fr (Sebastien Blondeel) Date: Sun Jun 18 12:56:54 2006 Subject: [gutvol-d] http://www.ebooksgratuits.com/ In-Reply-To: <000c01c692f4$daed5420$650fa8c0@gw98> References: <000c01c692f4$daed5420$650fa8c0@gw98> Message-ID: <20060618195650.GA26987@clipper.ens.fr> The manager of the project tells me, in a nutshell: . ISP (inspirenetworks.com) has disappeared since Thu at noon . major DNS outage? . mirror site being set up on ebooksgratuits.org, should run as of Wed/Thu . mailing list reporting problems in real time at http://fr.groups.yahoo.com/group/ebooksgratuits/ From prosfilaes at gmail.com Sun Jun 18 22:02:40 2006 From: prosfilaes at gmail.com (David Starner) Date: Sun Jun 18 22:02:42 2006 Subject: [gutvol-d] Deleting Clearances Message-ID: <6d99d1fd0606182202n2f6b15bh89c3099a8617956c@mail.gmail.com> Is there any way we could add a way to delete clearances from the clearance page? I have six clearances on my clearance page I'd like to kill; common reasons were I couldn't get suitable scans from my source or I ceeded it to some other volunteer with their own copy. Besides cluttering up my already cluttered clearance page, it makes some projects look more live than they are. From sly at victoria.tc.ca Sun Jun 18 22:13:44 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Sun Jun 18 22:13:47 2006 Subject: [gutvol-d] Deleting Clearances In-Reply-To: <6d99d1fd0606182202n2f6b15bh89c3099a8617956c@mail.gmail.com> References: <6d99d1fd0606182202n2f6b15bh89c3099a8617956c@mail.gmail.com> Message-ID: Yes, I agree this would be nice. In case there is confusion, what is under discussion is the copyright clearance system as used at: http://copy.pglaf.org/ Looking through my list of items with status "Cleared", I see that I have three clearances which were submitted manually to a white-washer; one which was a small volume of poems which was combined with another similar volume for posting to PG; and two which are duplicates of items already in PG that I didn't check closely enough. I see that there is a "Cancelled" status, which could be suitable for some of these. However, there does not seem to be a way to use it. Andrew On Mon, 19 Jun 2006, David Starner wrote: > Is there any way we could add a way to delete clearances from the > clearance page? I have six clearances on my clearance page I'd like to > kill; common reasons were I couldn't get suitable scans from my source > or I ceeded it to some other volunteer with their own copy. Besides > cluttering up my already cluttered clearance page, it makes some > projects look more live than they are. > _______________________________________________ From prosfilaes at gmail.com Sun Jun 18 22:18:34 2006 From: prosfilaes at gmail.com (David Starner) Date: Sun Jun 18 22:18:40 2006 Subject: [gutvol-d] Deleting Clearances In-Reply-To: References: <6d99d1fd0606182202n2f6b15bh89c3099a8617956c@mail.gmail.com> Message-ID: <6d99d1fd0606182218k7923e196q7242750691d3fadf@mail.gmail.com> On 6/19/06, Andrew Sly wrote: > Yes, I agree this would be nice. 
> > In case there is confusion, what is under discussion is the > copyright clearance system as used at: http://copy.pglaf.org/ > > Looking through my list of items with status "Cleared", I see that I > have three clearances which were submitted manually to a white-washer; > one which was a small volume of poems which was combined with another > similar volume for posting to PG; and two which are duplicates of items > already in PG that I didn't check closely enough. I've got a few that were posted to PG--Widger particularly seems to directly post when PPVing. Those would be better transfered to status Submitted, I would think; they need to stick around in some form. If they aren't getting moved to Submitted, how are they linked to the books behind the scenes? From traverso at dm.unipi.it Mon Jun 19 00:30:03 2006 From: traverso at dm.unipi.it (Carlo Traverso) Date: Mon Jun 19 00:26:39 2006 Subject: [gutvol-d] Deleting Clearances In-Reply-To: <6d99d1fd0606182218k7923e196q7242750691d3fadf@mail.gmail.com> (prosfilaes@gmail.com) References: <6d99d1fd0606182202n2f6b15bh89c3099a8617956c@mail.gmail.com> <6d99d1fd0606182218k7923e196q7242750691d3fadf@mail.gmail.com> Message-ID: <200606190730.k5J7U3F29072@pico.dm.unipi.it> It would also be handy to be able to keep alive a clearance for which a book has been posted, and more will follow. This is mainly for multi-volume works that are submitted one at a time. Carlo From fvandrog at scripps.edu Mon Jun 19 07:20:53 2006 From: fvandrog at scripps.edu (Frank van Drogen) Date: Mon Jun 19 07:20:59 2006 Subject: [gutvol-d] Deleting Clearances In-Reply-To: <6d99d1fd0606182218k7923e196q7242750691d3fadf@mail.gmail.co m> References: <6d99d1fd0606182202n2f6b15bh89c3099a8617956c@mail.gmail.com> <6d99d1fd0606182218k7923e196q7242750691d3fadf@mail.gmail.com> Message-ID: <7.0.1.0.0.20060619071954.0365ccb8@scripps.edu> >I've got a few that were posted to PG--Widger particularly seems to >directly post when PPVing. Those would be better transfered to status >Submitted, I would think; they need to stick around in some form. You can change them to the submitted state by 'previewing' any dummy file under the clearance. Frank From gbnewby at pglaf.org Tue Jun 20 09:34:09 2006 From: gbnewby at pglaf.org (Greg Newby) Date: Tue Jun 20 09:34:11 2006 Subject: [gutvol-d] Fwd: Abbey Library of St. Gall, Switzerland: Online 100 manuscripts (fwd) Message-ID: <20060620163409.GA17431@pglaf.org> This might have some materials suitable for harvesting. -- Greg ----- Forwarded Message ---- From: Christoph Fl??eler To: christophe.flueler@unifr.ch Sent: Tuesday, June 20, 2006 10:10:11 AM Subject: Abbey Library of St. Gall, Switzerland: Online 100 manuscripts Abbey Library of St. Gall, Switzerland online - free access: www.cesg.unifr.ch - high resolution digital images: over 40'000 facsimile pages - regularly updated: now 100 complete manuscripts - manuscript descriptions and many search options - accessible in German, French, English and Italian Please recommend it to your colleagues and put a link to CESG on your homepage. ?? CESG - Codices Electronici Sangallenses ----- End forwarded message ----- From sly at victoria.tc.ca Tue Jun 20 21:28:28 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Tue Jun 20 21:28:34 2006 Subject: [gutvol-d] Fwd: Abbey Library of St. 
Gall, Switzerland: Online 100 manuscripts (fwd) In-Reply-To: <20060620163409.GA17431@pglaf.org> References: <20060620163409.GA17431@pglaf.org> Message-ID: Perhaps on the new wiki, we could try adding a page for a list of proposed sites to harvest material from. I know that I seem to keep finding more than I could ever deal with. Andrew On Tue, 20 Jun 2006, Greg Newby wrote: > This might have some materials suitable for harvesting. > -- Greg > From hart at pglaf.org Wed Jun 21 06:41:30 2006 From: hart at pglaf.org (Michael Hart) Date: Wed Jun 21 06:41:34 2006 Subject: [gutvol-d] !@! Just TWO Books Needed for 20,000!!! Message-ID: Anyone got anything coming in the next THREE hours??? ;-) Thanks!!! Give the world eBooks in 2006!!! Michael S. Hart Founder Project Gutenberg Blog at http://hart.pglaf.org From jon at noring.name Wed Jun 21 07:34:02 2006 From: jon at noring.name (Jon Noring) Date: Wed Jun 21 07:34:11 2006 Subject: [gutvol-d] 20000 (decimal) represented in other bases -- Impact on PG In-Reply-To: References: Message-ID: <106176712.20060621083402@noring.name> In reply to Michael's post asking for two more books to reach 20000 (yes, he must be itchy to reach another numerical milestone!), I was curious to see what 20000 (decimal) looks like in other numerical bases from 2-20:

 2: 100111000100000 (binary)
 3: 1000102202
 4: 10320200
 5: 1120000
 6: 232332
 7: 112211
 8: 47040 (octal)
 9: 30382
10: 20000 (decimal)
11: 14032
12: B6A8
13: 9146
14: 7408
15: 5DD5
16: 4E20 (hexadecimal)
17: 4138
18: 37D2
19: 2H7C
20: 2A00

Hmmmm, I am disappointed that 20000 in other bases is nothing special. No cool patterns -- no "Da Vinci" code stuff -- just "ordinary" sequences of numbers. There must be something wrong! 20000 (decimal) must be special in some way! It has to be special! Considering that base 10 (decimal) is also arbitrary in our modern world (why not 9 or 11 or ?), then I guess 20000 is nothing special either. That is, the current number of books, 19998, is only two less than 20000. Why aren't we celebrating over 19998? Why does a 0.01% change all of a sudden start a wild party? (Don't we wish -- It's "par-tay time!") But I guess people like to see the odometer on the ole' car turn over from all 9's back to 0's. It's like a rebirth of sorts. So it is human nature, I suppose, to ascribe special meaning to certain patterns in numbers. Therefore, I recommend to PG that if human nature is important, and bigger is better, then PG should report the number of books it has in a lower base. Now, doesn't 232332 (base 6) sound much more impressive? You can report the number of books in the collection as: "# of books in PG's collection: 232332 [*]" And at the bottom of the page: "[*] Note, this is base 6." Jon Noring From marcello at perathoner.de Wed Jun 21 11:15:09 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Wed Jun 21 11:15:14 2006 Subject: [gutvol-d] 20000 (decimal) represented in other bases -- Impact on PG In-Reply-To: <106176712.20060621083402@noring.name> References: <106176712.20060621083402@noring.name> Message-ID: <44998CAD.1090202@perathoner.de> Jon Noring wrote: > Therefore, I recommend to PG that if human nature is important, and > bigger is better, then PG should report the number of books it has in > a lower base. Wasn't it Donald E. Knuth who celebrated his 1,000,000th birthday?
(base 2) We should count our books in base t where t == 2^(12/18) That would make all those computations about our keeping up with Moore's Law much simpler: If we have to add a new digit each new year, we are on schedule. I hope the advertising industry won't wisen up to this: everything would start to cost $10? and you'll have to read the fine print to find out the number base. ?) base the real price -- Marcello Perathoner webmaster@gutenberg.org From joshua at hutchinson.net Wed Jun 21 11:46:18 2006 From: joshua at hutchinson.net (Joshua Hutchinson) Date: Wed Jun 21 11:46:29 2006 Subject: [gutvol-d] 20000 (decimal) represented in other bases -- Impact on PG Message-ID: <20060621184623.9C68F2F93E@ws6-3.us4.outblaze.com> And below is an example of true geek humor. Us geeks are having a good chuckle. Everyone else is scratching their heads, saying, "What the *bleep* are they talking about!?" Josh PS And the Google nerds are busy searching for the meanings of the esoteric phrases... ;) > ----- Original Message ----- > From: "Marcello Perathoner" > To: "Jon Noring" , "Project Gutenberg Volunteer Discussion" > Subject: Re: [gutvol-d] 20000 (decimal) represented in other bases -- Impact on PG > Date: Wed, 21 Jun 2006 20:15:09 +0200 > > > Jon Noring wrote: > > > Therefore, I recommend to PG that if human nature is important, and > > bigger is better, then PG should report the number of books it has in > > a lower base. > > Wasn't it Donald E. Knuth who celebrated his 1,000,000th birthday? > (base 2) > > > We should count our books in base t where t == 2^(12/18) > > That would make all those computations about our keeping up with Moore's > Law much simpler: If we have to add a new digit each new year, we are on > schedule. > > > I hope the advertising industry won't wisen up to this: everything would > start to cost $10? and you'll have to read the fine print to find out > the number base. > > > ?) base the real price > > > -- > Marcello Perathoner > webmaster@gutenberg.org > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From Bowerbird at aol.com Wed Jun 21 13:41:51 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Jun 21 13:42:00 2006 Subject: [gutvol-d] scoo bee doo bee bee doo Message-ID: <522.e90e5d.31cb090f@aol.com> jon said: > Now, doesn't 232332 (base 6) sound much more impressive? yes, it does. especially if you're special enough to know that -- in base 6 lingo -- 3 is articulated as "bee", and 2 is "doo" except when it occurs at the start of a "word" in which case it is pronounced "scoo", meaning this number is vocalized as "scoo bee doo bee bee doo". -bowerbird p.s. personally, i like base 7 -- 112211 -- a lot because it's great to have m.c. palindrome on the ones and twos... p.p.s. there are 10 types of people in this world -- those who understand base 2 and those who don't. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060621/9eed935c/attachment.html From Bowerbird at aol.com Wed Jun 21 14:13:58 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Jun 21 14:14:15 2006 Subject: [gutvol-d] scraping the p.g. default .txt files Message-ID: <4ee.125e96c.31cb1096@aol.com> well, i have scraped the p.g. default .txt files -- http://www.gutenberg.org/files/#####/#####.txt -- from #10000 up, and surprisingly _quickly_. text is indeed compact. even when not zipped. a few notes. 
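The scrape described here takes only a few lines to reproduce: walk the e-text numbers, fetch each default .txt, and chunk the results into folders of 1,000. A sketch under stated assumptions -- the URL template is the one quoted above, the folder scheme follows the post, and failures are skipped silently because a fair fraction of numbers have no default plain-text file:

    import os
    import urllib.request

    URL = "http://www.gutenberg.org/files/%d/%d.txt"

    def scrape(start, stop, root="pg-texts"):
        for n in range(start, stop):
            # Chunk into folders of 1,000, e.g. pg-texts/10000/10500.txt
            folder = os.path.join(root, "%05d" % ((n // 1000) * 1000))
            os.makedirs(folder, exist_ok=True)
            try:
                urllib.request.urlretrieve(
                    URL % (n, n), os.path.join(folder, "%05d.txt" % n))
            except Exception:
                pass  # a.w.o.l.: audio, data files, or no plain-text default

    scrape(10000, 18645)

At the roughly 300 megabytes per 1,000 texts measured in the notes that follow, a full run over 20,000 numbers is indeed the ~6 gigabytes the post arrives at.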
circa #18644 is the most recent? really? i thought we were up close to #20000? i take it .aus and .eur are in that count? please relabel human genome files! not really .txt! out of each 1,000 e-texts, about 150 are a.w.o.l. -- different types (e.g., mp3) or something or other, reducing these 8,644 down to some 7,000 or so. plus before i process further, i will toss out the non-english and other pesky variants... let's get it working on the simple ones first. which might take the 7,000 down to 6,000. i'd thought of it initially as a mere pilot-test, but it's looking more like split-half reliability. (i choose 10,000+ only because filenames were generated with a one-line template.) anyway, i chunked those files into folders of 1,000 e-texts each, because that was the size where my old machine starting choking, but o.s.x. seems to handle folders just fine even when the number of files inside is 5,000+... so i might consolidate the folders further, but in the meantime, the results lead to good news. each set of 1,000 e-texts takes roughly 300 megs, so the entire set of 20,000 would be about 6 gigs, meaning they will fit comfortably on today's dvd. and that is without any compression at all, baby. if we figure in compression, and tomorrow's dvd, we're talking an impressive library on a single disc. and a _huge_ library in a case containing 10 discs... -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060621/9b5b32ff/attachment.html From Bowerbird at aol.com Wed Jun 21 15:03:19 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Jun 21 15:03:29 2006 Subject: [gutvol-d] chapter-headings linked to the table-of-contents Message-ID: <26d.b2f3709.31cb1c27@aol.com> i see that carlo, one of the smarter p.g. people, has started doing one of the things i suggested some time back -- having each chapter header link to the table of contents. well-done, carlo... > http://www.gutenberg.org/files/18627/18627-h/18627-h.htm#table chalk up another "i told you so". -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060621/3f33ddb5/attachment.html From Bowerbird at aol.com Wed Jun 21 23:20:38 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Wed Jun 21 23:20:43 2006 Subject: [gutvol-d] scraping the p.g. default .txt files Message-ID: <270.b3ef12d.31cb90b6@aol.com> i said: > well, i have scraped the p.g. default .txt files -- > http://www.gutenberg.org/files/#####/#####.txt > -- from #10000 up, and surprisingly _quickly_. of course, the idea is to rework these e-texts into z.m.l. format. although i won't be able to get started on that for a few weeks, and it will probably take me about 6 months to finish them all, i did have a chance to do just a wee bit of experimentation... so, to see a review of the transformation of one such p.g. e-text: > http://snowy.arsc.alaska.edu/bowerbird/misc/screen1.html -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060622/59203d61/attachment.html From nwolcott2ster at gmail.com Thu Jun 22 09:05:58 2006 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Thu Jun 22 09:13:48 2006 Subject: [gutvol-d] David's in progress list Message-ID: <001001c69615$e72b6be0$650fa8c0@gw98> I have been unable to download david's in progress list. 
It stops at Abbott and just sits there and never finisihes. Is it posted anywhere else? nwolcott2@post.harvard.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060622/66638951/attachment.html From ajhaines at shaw.ca Thu Jun 22 10:23:58 2006 From: ajhaines at shaw.ca (Al Haines (shaw)) Date: Thu Jun 22 10:25:29 2006 Subject: [gutvol-d] David's in progress list References: <001001c69615$e72b6be0$650fa8c0@gw98> Message-ID: <002601c69620$a56f9130$6401a8c0@ahainesp2400> I just tried saving it to my Windows desktop, and checking that it was complete - no problem. Norm - if you want, and your e-mail has no problem with large zip files, I can forward it as a zip file. About 1.25M. Al ----- Original Message ----- From: Norm Wolcott To: 'Project Gutenberg Volunteer Discussion' Cc: harvard.edu@pglaf.org ; N Wolcott Sent: Thursday, June 22, 2006 9:05 AM Subject: [gutvol-d] David's in progress list I have been unable to download david's in progress list. It stops at Abbott and just sits there and never finisihes. Is it posted anywhere else? nwolcott2@post.harvard.edu ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060622/6fbb9b33/attachment.html From greg at durendal.org Thu Jun 22 10:14:14 2006 From: greg at durendal.org (Greg Weeks) Date: Thu Jun 22 10:30:03 2006 Subject: [gutvol-d] David's in progress list In-Reply-To: <001001c69615$e72b6be0$650fa8c0@gw98> References: <001001c69615$e72b6be0$650fa8c0@gw98> Message-ID: On Thu, 22 Jun 2006, Norm Wolcott wrote: > I have been unable to download david's in progress list. It stops at > Abbott and just sits there and never finisihes. Is it posted anywhere > else? I downloaded it ok. I don't know of any backup copies. Let me know and I'll mail you a copy. -- Greg Weeks http://durendal.org:8080/greg/ From nwolcott2ster at gmail.com Thu Jun 22 11:36:24 2006 From: nwolcott2ster at gmail.com (Norm Wolcott) Date: Thu Jun 22 11:36:50 2006 Subject: [gutvol-d] David's in progress list References: <001001c69615$e72b6be0$650fa8c0@gw98> <002601c69620$a56f9130$6401a8c0@ahainesp2400> Message-ID: <002b01c6962a$cf4c1320$650fa8c0@gw98> Thanks--I got it, it was just very slow, itis a 5 meg file I don't think the browser liked it too much. nwolcott2@post.harvard.edu ----- Original Message ----- From: Al Haines (shaw) To: Project Gutenberg Volunteer Discussion Cc: N Wolcott ; harvard.edu@pglaf.org Sent: Thursday, June 22, 2006 1:23 PM Subject: Re: [gutvol-d] David's in progress list I just tried saving it to my Windows desktop, and checking that it was complete - no problem. Norm - if you want, and your e-mail has no problem with large zip files, I can forward it as a zip file. About 1.25M. Al ----- Original Message ----- From: Norm Wolcott To: 'Project Gutenberg Volunteer Discussion' Cc: harvard.edu@pglaf.org ; N Wolcott Sent: Thursday, June 22, 2006 9:05 AM Subject: [gutvol-d] David's in progress list I have been unable to download david's in progress list. It stops at Abbott and just sits there and never finisihes. Is it posted anywhere else? 
nwolcott2@post.harvard.edu ---------------------------------------------------------------------------- _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d ------------------------------------------------------------------------------ _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060622/e624678a/attachment-0001.html From sly at victoria.tc.ca Thu Jun 22 13:27:17 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Thu Jun 22 13:27:22 2006 Subject: [gutvol-d] David's in progress list In-Reply-To: <001001c69615$e72b6be0$650fa8c0@gw98> References: <001001c69615$e72b6be0$650fa8c0@gw98> Message-ID: A while ago I found a page where someone had taken David's in progress list and broken it up into bite-sized pieces. They kept it periodically updated too. I thought I had saved the url, but now I can't find it. Andrew On Thu, 22 Jun 2006, Norm Wolcott wrote: > I have been unable to download david's in progress list. It stops at Abbott and just sits there and never finisihes. Is it posted anywhere else? From malcolm.farmer at gmail.com Thu Jun 22 15:44:45 2006 From: malcolm.farmer at gmail.com (Malcolm Farmer) Date: Thu Jun 22 15:51:55 2006 Subject: [gutvol-d] David's in progress list In-Reply-To: References: <001001c69615$e72b6be0$650fa8c0@gw98> Message-ID: <8baaac1d0606221544w538c851bkec2810cb9b621edf@mail.gmail.com> On 6/22/06, Andrew Sly wrote: > > > A while ago I found a page where someone had taken David's > in progress list and broken it up into bite-sized pieces. > They kept it periodically updated too. I thought I had > saved the url, but now I can't find it. Here's its index page, pointing to the individual lists by letter.: http://www.zuhause.org/dp/GutIP/ Done by Bruce Albrecht, who is user bgalbrecht at DP. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060622/f10c42be/attachment.html From Bowerbird at aol.com Fri Jun 23 13:06:54 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 23 13:07:01 2006 Subject: [gutvol-d] the end of the line Message-ID: <360.64af89c.31cda3de@aol.com> as i watch all these p.g. e-texts float across my screen, i just can't help but have some thoughts recur to me... in the old days, when -- for some very good reasons -- a p.g. e-text was considered to be an _amalgamation_ of different versions of a book (even when it really was not, a fiction advised by p.g. legal counsel at that early time), that gave a good reason to remove end-line hyphenation and reflow text (without hyphenation) to p.g. margination. after all, hyphenation mostly causes problems in e-books. in the current era, however, where most p.g. e-texts are pegged to a specific version of a book (and where, for the most part, the scans are now retained to cement this direct correspondence), it no longer makes sense to discard the line-breaks, or even the end-line hyphenation, to be frank. yes, end-line hyphenation should be _marked_ in some way, so it can be automatically eliminated, but the _default_action_ should be to retain it. 
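Once line breaks are retained, eliminating end-line hyphenation automatically is nearly a one-regex job, which is why marking it costs so little. A naive sketch: it drops every end-of-line hyphen, so genuine compounds that happen to break at their hyphen (the "wild-looking" case that comes up later in this thread) would need a dictionary check on top:

    import re

    def reflow(page):
        # Join a word split across a line end, dropping the hyphen...
        text = re.sub(r"-\n", "", page)
        # ...and turn the remaining line breaks into ordinary spaces.
        return re.sub(r"\s*\n\s*", " ", text).strip()

    print(reflow("I re-\nmembered what the\nconductor had said."))
    # -> I remembered what the conductor had said.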
it would defeat the purpose of saving the line-breaks if you didn't also retain end-line hyphenation, because the goal here would be to duplicate the print version. (don't bother arguing that there would never be such a desire; maybe you'd never have any need for it, but _someone_ might. i can think of half-a-dozen such reasons -- want to hear them?) if you want to see what the future of electronic-books looks like, see the "digital reprints" that jose menendez has been producing. > http://www.ibiblio.org/ebooks/Mabie/ > http://www.ibiblio.org/ebooks/Cather/ > http://www.ibiblio.org/ebooks/Einstein/ the deep links to the actual .pdf "digital reprints" are these: > http://www.ibiblio.org/ebooks/Mabie/Books_Culture.pdf > http://www.ibiblio.org/ebooks/Cather/Antonia/Antonia.pdf > http://www.ibiblio.org/ebooks/Einstein/Einstein_Relativity.pdf aside from the unfortunate fact that jose is using the .pdf format (a format which makes it far to difficult to repurpose the content), these "digital reprints" carve out an awesome model for e-books. they replicate the original paper-book to a high degree of fidelity, and do so using a small percentage of the disk-space of the scans. yet because it is an e-book, it gives all the benefits that they give. (at least it _would_, if it wasn't a .pdf. but that part can be fixed.) and the secret of these "digital reprints" is extremely simple, folks; all that jose has done is merely to retain the original line-breaks... so, once again, i recommend and request that you start retaining this valuable information, instead of intentionally tossing it away. (it is very ironic, to me, that distributed proofreaders _retains_ the line-breaks during their proofing -- because it makes that process so much easier -- but then they discard the line-breaks! hey, there might be some end-users out there who need 'em too!) honestly, folks, when i look at your p.g. e-texts, what i see is that they're gonna be thrown on the trashpile one day -- maybe soon. in a world that is awash in scans, and where o.c.r. is a commodity, it'll be trivial to convert those scans to text. so if someone needs to have the ability to duplicate the print version -- i.e., they _need_ to have the line-break information you are routinely discarding -- they'll simply o.c.r. the scans again. they will be required to do that, because your e-texts simply won't do the job that they want done... that's not to say that your p.g. e-texts will be _completely_ worthless. as an independent digitization, they'll go a long way toward helping to move any new o.c.r. effort up to an absurdly high level of accuracy. but since the absurd level of accuracy can be applied to either e-text, and since the new effort will have retained the line-break information, that will be the one that's retained. the p.g. e-text will be thrown away. and it would break my heart to see all your hard work just thrown away. on the other hand, if y'all started retaining that line-break information, then it'd be _your_ version which would be kept (because of its primacy), and the new o.c.r. effort would just be seen as a tool to increase accuracy. if project gutenberg wants to remain as the premiere library in cyberspace, you're going to have to fix this glitch, and do it quickly. mark my words... -bowerbird p.s. at some of you aren't good at reading between the lines, i'll tell you that i intend to mount such a massive o.c.r. effort, so the question about which version, p.g. or not, receives the higher accuracy is a very real one. 
i don't want to challenge the p.g. library, _unless_ you've made it deficient. i'm trying to help you by giving you this advice before it becomes crucial... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060623/181693a8/attachment.html From hart at pglaf.org Fri Jun 23 14:38:48 2006 From: hart at pglaf.org (Michael Hart) Date: Fri Jun 23 14:38:49 2006 Subject: [gutvol-d] the end of the line In-Reply-To: <360.64af89c.31cda3de@aol.com> References: <360.64af89c.31cda3de@aol.com> Message-ID: Bowerbird's lengthy essay is just one more example of how publishers, editors, etc., put their own needs ahead of those of the readers. While there might be some value in keeping references to arcane modes of pagination and margination for those who actually have reasons for opening books other than simply to read their contents, a certain respect for the reader, ostensibly for whom all this is being done by the publishers and editors, should clearly indicate that there is no longer any need for a slavish mentality to conserve the paper pages by introducing end of line hyphenation, or to create the appearance that there were actually the same number of characters on every line, when it is obvious to anyone who cares to look that there are not. And, as Mr. Bowerbird points out, end of line hyphenation can be a serious pain in the neck, depending on what programs you use to read, search, edit, etc. So, while I obviously agree that there are two camps in his model of a world of eBooks, I disagree as to which is primary. The reader is primary. Any effort to preserve items of interest only to publishers, editors, etc., should be invisible to the naked eye, with the option to bring them into view when desired; the defaults should not make millions of readers strip such items out for the sake of the few who prefer to see them. Thanks!!! Give the world eBooks in 2006!!! Michael S. Hart Founder Project Gutenberg Blog at http://hart.pglaf.org From Bowerbird at aol.com Fri Jun 23 16:36:59 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Fri Jun 23 16:37:04 2006 Subject: [gutvol-d] the end of the line Message-ID: <516.1689c6b.31cdd51b@aol.com> i said: > (a format which makes it far to difficult to repurpose the content), haha. "far to difficult". i made a boo-boo. > p.s. at some of you aren't good at reading between the lines "at some of you"...? wow. two mistakes (at least) in one post. good thing it's friday... ;+) -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060623/0d1bbbd9/attachment.html From jeroen.mailinglist at bohol.ph Fri Jun 23 16:43:12 2006 From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account)) Date: Fri Jun 23 16:40:33 2006 Subject: [gutvol-d] the end of the line In-Reply-To: References: <360.64af89c.31cda3de@aol.com> Message-ID: <449C7C90.8040208@bohol.ph> Although I agree with Michael that there is no need to preserve things as linebreaks in most texts -- if you really need to go to that level of detail, there is always the original or the scans to fall back upon -- I want to make a case for preserving page numbers, if not at least as recognisable anchors in text, and only for those books being referenced to regularly by other books.
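Page numbers as "recognisable anchors", as suggested here, have a cheap HTML realization: emit an identified element at each page turn, so that a paper-style citation ("p. 23") resolves to a URL fragment. A sketch assuming a [pg 23]-style marker convention in the master text; the marker syntax and class name are inventions for this example, though visible page-number spans of this general kind do appear in PG HTML editions:

    import re

    def anchor_pages(text):
        # Replace each page marker with a visible, linkable page anchor.
        return re.sub(r"\[pg (\d+)\]",
                      r'<span class="pagenum" id="page\1">[\1]</span>',
                      text)

    html = anchor_pages("webbed to the first knuckle, [pg 23] like a duck's foot")
    print(html)
    # A citation elsewhere can then point at some-text.html#page23.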
This excludes most fiction, but is particularly important for scientific works, which have constructed a kind of paper web with cross references mainly based on page numbers. In the long term, such references of course should give way to proper references to the actual paragraph or sentence being referenced, but as a practical ad-interim solution, staying with page numbers will increase the number of texts we can digitize with our limited means. This leads me to one place where further work could be done on the PG collection: turning it from a collection of static texts into an enriched web of knowledge. I've seen a lot of websites grabbing all of PG, and republishing it in a slightly modified form. I would, however, like to see the collection be incorporated in a kind of wiki-like system, where people can add -- without tampering with the static source texts -- annotations, add tagging and create live cross references: both for own use, smaller dissemination in a group or publicly. I've added a large number of texts related to the Philippines to PG, and many of these texts interact. Some criticise each other, others provide opposing views, and so forth. It would be great to build a system that makes that easy to follow for everybody, such that people can immediately see, when reading a text, where it has been cited or referenced in other works. It would be great also to provide study introductions or synopses, to give users a grasp of the material, and enable them to find what they really need within reasonable time. Search engines are a great tool, but only to a certain extent. Jeroen. From sly at victoria.tc.ca Fri Jun 23 17:16:44 2006 From: sly at victoria.tc.ca (Andrew Sly) Date: Fri Jun 23 17:16:46 2006 Subject: [gutvol-d] the end of the line In-Reply-To: <449C7C90.8040208@bohol.ph> References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> Message-ID: There are places such as wikisource.org, where you could add the texts and start providing links such as you mention here immediately. Andrew On Sat, 24 Jun 2006, Jeroen Hellingman (Mailing List Account) wrote: > This leads me to one place where further work could be done on the PG > collection: > turning it from a collection of static texts into an enriched web of > knowledge. > I've seen a lot of websites grabbing all of PG, and republishing it in a > slightly modified > form. I would, however, like to see the collection be incorporated in a > kind of wiki-like > system, where people can add -- without tampering with the static source > texts -- annotations, > add tagging and create live cross references: both for own use, smaller > dissemination in > a group or publicly. > > I've added a large number of texts related to the Philippines to PG, and > many of these > texts interact. Some criticise each other, others provide opposing views, > and so forth. It would > be great to build a system that makes that easy to follow for everybody, > such that > people can immediately see, when reading a text, where it has been cited > or referenced > in other works. It would be great also to provide study introductions or > synopses, to give > users a grasp of the material, and enable them to find what they really > need within > reasonable time. Search engines are a great tool, but only to a certain > extent. > > Jeroen.
> > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > From jon at noring.name Fri Jun 23 22:12:12 2006 From: jon at noring.name (Jon Noring) Date: Fri Jun 23 22:12:24 2006 Subject: [gutvol-d] the end of the line In-Reply-To: <449C7C90.8040208@bohol.ph> References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> Message-ID: <228656957.20060623231212@noring.name> [cc: Jose Menendez] Jeroen Hellingman wrote: > Although I agree with Michael that there is no need to preserve things > as linebreaks in most texts -- if you really need to go to that level > of detail, there is always the original or the scans to fall back upon > -- I want to make a case for preserving page numbers, if not at least > as recognisable anchors in text, and only for those books being > referenced to regularly by other books. First off, I agree with Bowerbird in the sense that it is a good thing to preserve both the line breaks and page breaks in the master marked- up texts converted from a source book. I assume with the DP work flow that this would not be that difficult of a thing to do, so why not do it if it could be done (mostly) automatically? For the OpenReader Publication Format, which is in an advanced stage of development, we're now putting together an OpenReader namespace set of elements to do various tasks. These elements may be used for all XML content documents which OpenReader now supports (an XHTML subset) and plans to support in the future (such as a subset of TEI). The namespaced elements include (attributes not described here): ... (simple hypertext linking) (embedding images, video and audio) (page break in a paper source) (line break in a paper source) (a generic marker) (both or:hlink and or:object will be defined using XLink.) With the permission of Jose Menendez, he is letting us use his copy of "My Antonia" (which is more accurate than the one I've been working on which hasn't yet been completely proofed), to put it into a demo of the OpenReader format. I've "diffed" it to my version and checked all differences found by consulting the original page scans, and it's been restored to the original 1918 edition (including textual errors -- the errors are specially marked however, including what the text should be based on both the Univ. of Nebraska online edition and Jose's edition), and have added precise line breaks and page breaks. For line breaks, I've placed the line breaks at the precise place of hyphenation. If the broken word does not have a natural hyphen, I use a ­ (a soft hyphen) to indicate that -- if the broken word does have a natural hyphen at the break, the hard hyphen character "-" is used. Here's an example paragraph (the 63rd paragraph in the text) which includes a page break, soft and hard hyphens: ****************************************************************************

The little girl was pretty, but Án-tonia — they accented the name thus, strongly, when they spoke to her — was still prettier. I re­membered what the conductor had said about her eyes. They were big and warm and full of light, like the sun shining on brown pools in the wood. Her skin was brown, too, and in her cheeks she had a glow of rich, dark color. Her brown hair was curly and wild-looking. The little sister, whom they called Yulka (Julka), was fair, and seemed mild and obedient. While I stood awkwardly confront­ing the two girls, Krajiek came up from the barn to see what was going on. With him was another Shimerda son. Even from a distance one could see that there was something strange about this boy. As he approached us, he began to make uncouth noises, and held up his hands to show us his fingers, which were webbed to the first knuckle, like a duck’s foot. When he saw me draw back, he began to crow delight­edly, “Hoo, hoo-hoo, hoo-hoo!” like a rooster. His mother scowled and said sternly, “Ma­rek!” then spoke rapidly to Krajiek in Bo­hemian.

***************************************************************************** If the above is rendered in plain text preserving the line breaks (ignore the page break), we have: (since this is an ASCII text email, I've converted the A-acute in "Antonia" to a unaccented A, em-dashes to "--", and curly quotes/apostrophes to the straight varieties.) ***************************************************************************** The little girl was pretty, but An-tonia -- they accented the name thus, strongly, when they spoke to her -- was still prettier. I re- membered what the conductor had said about her eyes. They were big and warm and full of light, like the sun shining on brown pools in the wood. Her skin was brown, too, and in her cheeks she had a glow of rich, dark color. Her brown hair was curly and wild- looking. The little sister, whom they called Yulka (Julka), was fair, and seemed mild and obedient. While I stood awkwardly confront- ing the two girls, Krajiek came up from the barn to see what was going on. With him was another Shimerda son. Even from a distance one could see that there was something strange about this boy. As he approached us, he began to make uncouth noises, and held up his hands to show us his fingers, which were webbed to the first knuckle, like a duck's foot. When he saw me draw back, he began to crow delight- edly, "Hoo, hoo-hoo, hoo-hoo!" like a rooster. His mother scowled and said sternly, "Ma- rek!" then spoke rapidly to Krajiek in Bo- hemian. ***************************************************************************** Of course, comments welcome on the above! Jon Noring From ke at gnu.franken.de Fri Jun 23 23:39:21 2006 From: ke at gnu.franken.de (Karl Eichwalder) Date: Fri Jun 23 23:39:35 2006 Subject: [gutvol-d] Re: the end of the line In-Reply-To: <228656957.20060623231212@noring.name> (Jon Noring's message of "Fri, 23 Jun 2006 23:12:12 -0600") References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> <228656957.20060623231212@noring.name> Message-ID: Jon Noring writes: > **************************************************************************** >

The little girl was pretty, but Án-tonia > — they accented the name thus, > strongly, when they spoke to her — was still prettier. I > re­membered what the conductor had said about her No result, if you grep for "remember". Consider to encode it as follows: remembered -- http://www.gnu.franken.de/ke/ | ,__o | _-\_<, | (*)/'(*) Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C From marcello at perathoner.de Sat Jun 24 07:10:40 2006 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat Jun 24 07:10:53 2006 Subject: [gutvol-d] the end of the line In-Reply-To: <228656957.20060623231212@noring.name> References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> <228656957.20060623231212@noring.name> Message-ID: <449D47E0.1070102@perathoner.de> Jon Noring grudgingly admits: > (page break in a paper source) > (line break in a paper source) > (a generic marker) Why not use , and ? Insisting on making your own when there are perfectly good elements in TEI is just plain ... sub-optimal. > he began to crow delight­edly, Sorry to rain on your parade but your (at best) half-baked proposal has following shortcomings: 1. Non-standard use of ­ The soft-hyphen is a "non-printable" character that may be replaced with a "printable" hyphen by processors before output. Your use is to record the place where an existent hyphen has been stripped. You got it backwards. You confuse the very different stages of text feature recording and text output. 2. Throws off grep An xml-grep could find "delightedly" if searching for "delighted", but it surely won't find "delight­edly". 3. Redundant text feature documentation All you are doing here is repeatedly "documenting" that the character used to hyphenate words in this text is the hyphen. You don't have to repeat that statement through all of your text. A single statement to that effect in the TEI header will suffice. 4. Incompatibility with LOTE Remember that in LOTE you have to deal with cases like the German "ck" and "fff" which got hyphenated this way: dachdecker dachdek-ker Schiffahrt Schiff-fahrt Also remember French and Italian elisions that don't happen at line breaks. 5. Dependance on one edition All those hard-coded ­'s will marry your electronic text to one edition. You have no provision to encode different editions of the very same text like hardcover and paperback (which may very well have different line endings). Conclusion My advice is: forget entirely about line breaks. They are random artefacts introduced by the person operating the typesetting machine and indirectly by the person who chose paper size and font. They have no raison d'?tre once you separate the ebook from the scans, ie. after it left DP. (That this suggestion was by "You Know Who" should have tipped you off immediately.) But if you belong to that fastidious class of people who can't throw away even the most useless random artefact, I suggest doing it this standard way: ... he began to crow delightedly, ... A standard XHTML browser (OpenReader ?) will simply throw away the unknown tags and render the normalized text. A special processor may be used to reconstruct the paper layout of the text. 
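The "throws off grep" objection is the sharpest of these, and the Unicode position (U+00AD is an invisible break opportunity that text processors may ignore) points at the fix: normalize before matching. A one-line sketch of what a search front-end would do:

    SHY = "\u00ad"  # U+00AD soft hyphen

    def searchable(text):
        # The soft hyphen is an invisible break opportunity and should
        # not defeat a search for the unbroken word, so strip it before
        # matching.
        return text.replace(SHY, "")

    sample = "he began to crow delight" + SHY + "edly"
    print("delightedly" in sample)              # False: a naive grep misses it
    print("delightedly" in searchable(sample))  # True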
--
Marcello Perathoner
webmaster@gutenberg.org

From jon at noring.name Sat Jun 24 09:05:28 2006
From: jon at noring.name (Jon Noring)
Date: Sat Jun 24 09:05:43 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <449D47E0.1070102@perathoner.de>
References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> <228656957.20060623231212@noring.name> <449D47E0.1070102@perathoner.de>
Message-ID: <1509458751.20060624100528@noring.name>

Marcello wrote:
> Jon Noring grudgingly admits:

>> (page break in a paper source)
>> (line break in a paper source)
>> (a generic marker)

> Why not use <pb/>, <lb/> and <milestone/>? Insisting on
> making your own when there are perfectly good elements in TEI is just
> plain ... sub-optimal.

Actually, a very good idea. We've not finalized the "custom" elements
yet. I'll have to look at the TEI-defined semantics of the use of the
TEI equivalents, but *if* reasonably close to what we need, will
likely embrace them. It will add to the list of namespace
declarations, but that downside is pretty minor. Thanks.

>> he began to crow delight&shy;edly,

> Sorry to rain on your parade but your (at best) half-baked proposal
> has the following shortcomings:

No, I'm submitting the idea for feedback, and your feedback is
valuable.

> 1. Non-standard use of &shy;
>
> The soft-hyphen is a "non-printable" character that may be replaced
> with a "printable" hyphen by processors before output.
>
> Your use is to record the place where an existing hyphen has been
> stripped.

Yes.

> You got it backwards. You confuse the very different stages of text
> feature recording and text output.

Actually, I've been debating whether or not to include the &shy; as
it is used.

> 2. Throws off grep
>
> An xml-grep could find "delightedly" if searching for
> "delighted", but it surely won't find "delight&shy;edly".

Well, with existing toolbases, this might be. I believe, however,
that Unicode itself implies that text processors should ignore &shy;
(U+00AD). One reference is:

http://www.unicode.org/unicode/reports/tr14/#SoftHyphen

In addition HTML discusses the use of the soft hyphen:

http://www.w3.org/TR/html401/struct/text.html#hyphenation

In summary, user agents, such as those doing word searching, should
ignore the soft hyphen character. That some don't is a real-world
issue that unfortunately has to be pragmatically considered.

> 3. Redundant text feature documentation
>
> All you are doing here is repeatedly "documenting" that the character
> used to hyphenate words in this text is the hyphen. You don't have to
> repeat that statement through all of your text. A single statement to
> that effect in the TEI header will suffice.

Two points (based on what I interpret you are saying):

1) We are not focusing on TEI documents, thus many XML documents will
not have a TEI header.

2) The Unicode annex statement on the use of the soft hyphen (see
above link) takes into account other characters used for word
breaking purposes. It does not imply a "hard hyphen", but some
character used for linebreaking depending upon the text's language
and country code (required for all OpenReader Content Documents).

> 4. Incompatibility with LOTE
>
> Remember that in LOTE you have to deal with cases like the German "ck"
> and "fff" which got hyphenated this way:
>
>   dachdecker    ->  dachdek-ker
>   Schiffahrt    ->  Schiff-fahrt
>
> Also remember French and Italian elisions that don't happen at line
> breaks.

Good points. I'll have to check the Unicode annex document (URL
above) to see what it talks about regarding this.
> 5. Dependence on one edition
>
> All those hard-coded &shy;'s will marry your electronic text to one
> edition. You have no provision to encode different editions of the very
> same text like hardcover and paperback (which may very well have
> different line endings).

Yes, this is an issue. I do plan to allow adding an attribute to both
the page break and line break elements pointing (via Binder
identifier) to the source work. So the markup may contain multiple
source works. Things get messy if in two works the same word is
broken, but in different places. But I think my system will work for
this.

Example of the identifier attribute (still using the OR namespace):
in the Binder document, in the "descriptions" section (now being
amended), we might have:

  Second Edition Issued in 1922

> My advice is: forget entirely about line breaks. They are random
> artefacts introduced by the person operating the typesetting machine and
> indirectly by the person who chose paper size and font. They have no
> raison d'être once you separate the ebook from the scans, i.e. after it
> left DP. (That this suggestion was by "You Know Who" should have tipped
> you off immediately.)

Disagreed. There may be a need, for example, to continue proofing
work in the future. Knowing where line breaks occurred makes it
easier with DP and similar processes. It also better correlates to
the "bounding box information" from OCR which is being preserved. And
*someone* may want to know this for formatting purposes. It is
information about the source which by and large is easy for
user-agents to ignore.

Regarding you-know-who, I think you know that I often have profound
disagreements with him, but when I agree with him, I agree. I don't
let personal issues get in the way of acknowledging when I think he
is right. Those who believe in objectivity evaluate what a person
says.

> But if you belong to that fastidious class of people who can't throw
> away even the most useless random artefact, I suggest doing it this
> standard way:
>
>   ...
>   he began to crow delightedly,
>   ...
>
> A standard XHTML browser (OpenReader ?) will simply throw away the
> unknown tags and render the normalized text. A special processor may be
> used to reconstruct the paper layout of the text.

Well, the real issue is dealing with the "fff", etc. issue of LOTE.
I'll have to reread the Unicode annex. In OpenReader we reference
that spec, and recommend user agents follow its guidelines. But it
might not cover the particular LOTE "exceptions" you brought up.

Thanks for your frank feedback. Definitely needed.

Jon Noring

From brad at chenla.org Sat Jun 24 19:06:35 2006
From: brad at chenla.org (Brad Collins)
Date: Sat Jun 24 19:03:57 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <449D47E0.1070102@perathoner.de> (Marcello Perathoner's message of "Sat, 24 Jun 2006 16:10:40 +0200")
References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> <228656957.20060623231212@noring.name> <449D47E0.1070102@perathoner.de>
Message-ID: 

Marcello Perathoner writes:

> My advice is: forget entirely about line breaks. They are random
> artefacts introduced by the person operating the typesetting machine and
> indirectly by the person who chose paper size and font. They have no
> raison d'être once you separate the ebook from the scans, i.e. after it
> left DP. (That this suggestion was by "You Know Who" should have tipped
> you off immediately.)

I agree.
Before encoding a text you have to decide if you are encoding the
expression of the text or the manifestation of the text.[1]

Marking up an expression is about the structure and the words of the
text. This is what the author has created and has handed over to a
publisher.

Marking up a manifestation is all about layout and presentation. This
is the realm of the publisher and this is where you get into fonts,
line breaks etc.

You can easily mark up a text as either one or the other, but it's
not practical to try to do both in the same markup.

There are a few examples of texts and manuscripts which would be
worth having an expression level markup and a second manifestation
markup, but these will be rare. I seriously doubt that any
manifestation of Willa Cather's work would fall into this category :)

Dead tree books fix a manifestation into a permanent arrangement.
Electronic manifestations, which use systems like CSS to mold the
manifestation to the moment and to the device on the fly, are liquid;
if you try to hold one in your hand it just escapes through your
fingers.

The world of print books puts the publisher and the manifestation at
the center. The manifestation is more important than the author, who
takes a back seat to the glorious manifestation that was made of the
expression of her work.

But when copying and distribution is for all practical purposes free
and the manifestation has been reduced to an algorithm which an
electronic reader interprets, the manifestation itself takes a back
seat to the expression.

The Age of the manifestation and the publisher is drawing to an end
and we are slowly seeing the emergence of the Age of the expression
and the author.

PG is well named. Gutenberg's press was the first instance of fixing
a manifestation so that millions of identical copies could be made.
Before Gutenberg, each copy of a text was a different manifestation.
Being able to make error free copies was a revolution, but came at
the expense of easily being able to mold manifestations for different
uses and environments.

But you can make an exact copy of an electronic text without it
depending on any one manifestation of it. This is just as significant
as Gutenberg's press.

Is it useful to include some information from some manifestations in
an expression level markup? Damn yes -- page breaks are the anchor
and hyperlink in the world of paper. Countless millions of references
to page numbers have been made over the last two centuries.
Preserving page breaks is an essential part of preserving all those
references which use them.

So if you want to create a markup of a text which preserves a
specific manifestation that's fine; there are whole sections of TEI
devoted to allowing you to pick the tiniest bit of navel lint and
preserve it for eternity.

But for most purposes page scans of the original manifestation will
provide enough of this information for most questions about a text,
as well as provide the source material for the lint pickers to encode
away to their heart's content for specific manifestations.

But electronic books will mostly be in the business of preserving the
expression of a work which can then be converted into other markup
languages like XML or OR for dynamically generating flexible,
ephemeral manifestations on the fly.

b/

Footnotes:
[1] I am using work, expression and manifestation as defined in the
FRBR (Functional Requirements for Bibliographic Records).

work :: the concept representing an intellectual or creative
creation.
expression :: includes the specific sequence of words, images and
structure of a work.

manifestation :: includes the specific layout, typography,
pagination, etc., of a specific expression.

--
Brad Collins <brad@chenla.org>, Banqwao, Thailand

From nwolcott2ster at gmail.com Sun Jun 25 06:45:53 2006
From: nwolcott2ster at gmail.com (Norm Wolcott)
Date: Sun Jun 25 06:59:27 2006
Subject: [gutvol-d] the end of the line
References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> <228656957.20060623231212@noring.name> <449D47E0.1070102@perathoner.de>
Message-ID: <004f01c6985f$8c855060$640fa8c0@gw98>

One could make the argument that the paragraph and perhaps the
chapter are useful tags. Poetry and sidenotes and footnotes seem
fairly established in PG without additional tagging. Also will we be
scanning 20 editions of Dickens, all with different line breaks and
page numbers?

nwolcott2@post.harvard.edu

----- Original Message -----
From: "Brad Collins"
To: "Project Gutenberg Volunteer Discussion"
Sent: Saturday, June 24, 2006 10:06 PM
Subject: Re: [gutvol-d] the end of the line

[snip]
From nwolcott2ster at gmail.com Sun Jun 25 06:58:58 2006
From: nwolcott2ster at gmail.com (Norm Wolcott)
Date: Sun Jun 25 06:59:31 2006
Subject: [gutvol-d] ebooks libre et gratuits
Message-ID: <005001c6985f$8d3df200$640fa8c0@gw98>

Ebooks libre et gratuits had an arrangement with MH apparently where
their books would appear on PG eventually. Now that the ebooks web
site is no more, what will happen to the ebooksgratuits which did not
make it to PG? Will all of this work have to be repeated by someone
else? Is there an archive anywhere of this enormous quantity of work?
Why did ebooksgratuits disappear? Pressure from Canadian publishers/
government? Is there an unknown story here?

nwolcott2@post.harvard.edu

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060625/33f9bbec/attachment.html

From jmk at his.com Sun Jun 25 07:25:19 2006
From: jmk at his.com (Janet Kegg)
Date: Sun Jun 25 07:36:37 2006
Subject: [gutvol-d] ebooks libre et gratuits
In-Reply-To: <005001c6985f$8d3df200$640fa8c0@gw98>
References: <005001c6985f$8d3df200$640fa8c0@gw98>
Message-ID: 

The site is now available again: http://www.ebooksgratuits.com/

See the front page of the Web site for what I believe (my French is
almost nonexistent) is an explanation of what happened.

On Sun, 25 Jun 2006 09:58:58 -0400, you wrote:

>Ebooks libre et gratuits had an arrangement with MH apparently where
>their books would appear on PG eventually. Now that the ebooks web
>site is no more, what will happen to the ebooksgratuits which did not
>make it to PG? Will all of this work have to be repeated by someone
>else? Is there an archive anywhere of this enormous quantity of work?
>Why did ebooksgratuits disappear? Pressure from Canadian publishers/
>government? Is there an unknown story here?
>
>nwolcott2@post.harvard.edu

From jon at noring.name Sun Jun 25 08:12:40 2006
From: jon at noring.name (Jon Noring)
Date: Sun Jun 25 08:12:50 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <004f01c6985f$8c855060$640fa8c0@gw98>
References: <360.64af89c.31cda3de@aol.com> <449C7C90.8040208@bohol.ph> <228656957.20060623231212@noring.name> <449D47E0.1070102@perathoner.de> <004f01c6985f$8c855060$640fa8c0@gw98>
Message-ID: <161854616.20060625091240@noring.name>

Norm Wolcott wrote:

> One could make the argument that the paragraph and perhaps the chapter are
> useful tags. Poetry and sidenotes and footnotes seem fairly established in
> PG without additional tagging. Also will we be scanning 20 editions of
> Dickens, all with different line breaks and page numbers?

Well, since I sort of initiated this sub-thread, let me note that the
addition of an optional "page break" element in OpenReader is
instigated mostly by the needs of modern educational books, where
there may be mixed use with co-existing paper and ebook versions.
And, yes, this feature has been asked for by a user agent vendor
working with the educational community.

Of course, this feature may be used to preserve page breaks for other
purposes and sources, such as PG/DP. Do note that there exist lots of
scholarly references which point to particular pages in particular
paper manifestations of a work, so having page break info may
eventually prove useful to interlink all the old stuff (provided, of
course, that the focus is on preserving "manifestation" information
in the master digital documents.)

I don't see as much use for the line break empty tag, but we plan to
include it so it's there for those who wish to use it. In the demo
OpenReader Publication of "My Antonia", the line break element will
be included. I'm still going over Marcello's suggestions, plus
rereading the Unicode annex about line breaking (which *does* cover,
in a general way, the unusual ways line breaks are done in LOTE, such
as older German and Dutch.)

The other part of this sub-thread, the discussion of FRBR, is also
interesting. I discovered the FRBR a few years ago, and find it very
useful to understand how to categorize textual works. I like to refer
to the system it describes as "WEMI", which rhymes with "hemi" (for
you auto buffs out there): Work -- Expression -- Manifestation --
Item

http://www.ifla.org/VII/s13/frbr/frbr.pdf

(WEMI is the mnemonic I use to remember the system!)

Regarding "expression" versus "manifestation" in the digitization of
public domain materials, such as done by DP and PG, I've made my
thoughts known the last couple years, so I'll refrain from getting
into that again at this time!

Jon Noring

From Bowerbird at aol.com Sun Jun 25 10:19:23 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Sun Jun 25 10:19:28 2006
Subject: [gutvol-d] the end of the line
Message-ID: <309.7649529.31d01f9b@aol.com>

jon said:
> Well, since I sort of initiated this sub-thread

um, you mean since you _hijacked_ the thread...

just had to talk about your shiny markup, didn't you?
what a debilitating distraction...

the need to retain line-breaks has nothing to do with markup.
(and your example, which shows the absurd lengths to which
a markup mentality will drive a person, was very illuminating,
as is all the technoid jargon-jabbering in this "sub-thread".)
p.g. introduces its own linebreaks into its plain-ascii e-texts,
all without ever entering the markup arena.

and i put in my
own line-breaks,
right here in these
posts to this listserve,
again without using
any markup at all,
just the return key.

i'll bring this thread back to relevance starting tomorrow...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060625/feaeb328/attachment.html

From jon at noring.name Sun Jun 25 12:25:44 2006
From: jon at noring.name (Jon Noring)
Date: Sun Jun 25 12:25:56 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <309.7649529.31d01f9b@aol.com>
References: <309.7649529.31d01f9b@aol.com>
Message-ID: <121351504.20060625132544@noring.name>

Bowerbird wrote:

> p.g. introduces its own linebreaks into its plain-ascii e-texts,
> all without ever entering the markup arena.
>
> and i put in my
> own line-breaks,
> right here in these
> posts to this listserve,
> again without using
> any markup at all,
> just the return key.
>
> i'll bring this thread back to relevance starting tomorrow...

Yes, if one doesn't care about internal word breaks in the original,
then the markup approach is *equivalent* to the plain text break
approach.

Using your example above, let's suppose we want to preserve internal
word breaks, then we might have (ignore starting spaces, simply used
to shift the left margin inward so it's easier to see in this
message):

  and i put in my
  own line-breaks,
  right here in these
  posts to this list-
  serve, again without using
  any markup at all,
  just the return key.

So the question is, is the word which is broken "listserve" or is it
"list-serve"? Does it matter? Yes, for word searching purposes, and a
few other purposes.

I surmise you don't think that preserving the actual internal word
break in the original is important, just shift the break to the
nearest intra-word break. Well, fine, but some people might want to
have the information preserved. The markup approach I presented gives
the optional capability to mark this up. (I'm evaluating Marcello's
feedback, so there are different markup approaches that may be
taken.)

Btw, if we don't care about internal word breaks, and place the break
at the nearest intra-word break, your original example in markup
becomes (using the or: namespace):

  and i put in my<or:lb/>
  own line-breaks,<or:lb/>
  right here in these<or:lb/>
  posts to this listserve,<or:lb/>
  again without using<or:lb/>
  any markup at all,<or:lb/>
  just the return key.

If the above is rendered in a web browser, and the end-user does not
care about where the line breaks occur and takes no action, the web
browser ignores the tags, and the text is displayed nicely to fit the
browser window parameters. But an ebook reading system, as well as
simple CSS, can be used to "activate" the <or:lb/> at user demand. Or
if there's a conversion script of the markup to plain text, such as
to ZML, then we know where the breaks are. (One advantage to using
<or:lb/> rather than <br />
is that browsers will ignore the tag by default -- it would take CSS
or a special user agent to activate them on demand.)

Another advantage with using <or:lb/> is that the markup document is
not restricted to exact plain text formatting. This allows a lot of
latitude for document authors to do what they want in their text
editor when editing the XML document. For example, the above markup
could be expressed in the document as:

  and i put in my<or:lb/> own line-breaks,<or:lb/> right here in
  these<or:lb/> posts to this listserve,<or:lb/> again without
  using<or:lb/> any markup at all,<or:lb/> just the return key.

Or as:

  and i put in my<or:lb/>
  own line-breaks,<or:lb/> right here in these<or:lb/> posts
  to this listserve,<or:lb/>
  again without using<or:lb/> any markup
  at all,<or:lb/> just the return key.

Same thing... XML parsing user agents normalize all three to the same
thing. But in plain text, if someone happens to edit your text, such
as to

  and i put in my own
  line-breaks, right here
  in these posts to this
  listserve, again without
  using any markup at all,
  just the return key.

The line breaks are changed and the original line breaks lost
forever. What if someone takes a PG text formatted in ZML, and didn't
understand it was ZML (see note below), did some line length
reformatting, and then redistributed that -- especially if it's
Bowerbird poetry?

Jon Noring

(Note: How would the user know the plain text they are working with
is ZML? And how would they know in a particular instance that text
line breaks *are* important? Is there going to be machine-readable
metadata to say that the document is ZML or that text breaks are
important? I recommended that a plain text document which conforms to
ZML should have some message or processing-instruction-like thing at
the beginning saying it is ZML, and which version, and possibly that
line breaks are important in this particular document and why. That's
the purpose of the <?xml ... ?> declaration at the beginning of an
XML document. It identifies it and even assists with determination of
the text encoding. Will ZML require UTF-8 or UTF-16? Or will it stick
to ASCII? Or will it allow ISO 8859-1? Or will it allow all of them?
Will it allow any text encoding? How would a user agent know the text
encoding of the ZML document, especially without having to process
the whole thing?)

From donovan at abs.net Sun Jun 25 13:22:35 2006
From: donovan at abs.net (D Garcia)
Date: Sun Jun 25 13:23:02 2006
Subject: [dp-pg] Re: [gutvol-d] ebooks libre et gratuits
In-Reply-To: 
References: <005001c6985f$8d3df200$640fa8c0@gw98>
Message-ID: <200606251622.35322.donovan@abs.net>

On Sunday 25 June 2006 10:25 am, Janet Kegg wrote:
> The site is now available again: http://www.ebooksgratuits.com/
>
> See the front page of the Web site for what I believe (my French is
> almost nonexistent) is an explanation of what happened.

The news item roughly translated is:

As you probably noted, the site was inaccessible for over a week; the
reason is that our ISP shut down following "a crippling DDOS attack"
which they were not able to successfully block. We changed ISPs, and
the site is once again available. We will take the necessary measures
so that this type of thing cannot happen again; I will say more about
it very soon.

From gbnewby at pglaf.org Sun Jun 25 14:59:23 2006
From: gbnewby at pglaf.org (Greg Newby)
Date: Sun Jun 25 14:59:25 2006
Subject: [gutvol-d] Automated readability scores for PG eBooks
Message-ID: <20060625215923.GA18811@pglaf.org>

Feedback/input would be valued. I've been corresponding with Simon
Ronald at RocketReader.com to see about integrating readability
scores into the main PG book catalog.

Because we don't have a lot of subject cataloging, one value of this
is that it does a good job of identifying children's eBooks (they
tend to be "easy"). This is also usable for people seeking to develop
literacy or provide literacy instruction, by providing a way of
reading something "harder" or "easier" as desired.

Take a look at the list below (ten hardest, then ten easiest). The
first score is overall, followed by a set of scores that made it up.
I had provided some earlier feedback on how "hard" books were not
necessarily prose, which is part of what Dr. Ronald is responding to.
If you have feedback on the results, or my idea for adding these
scores as an element of the catalog search results, please chime in!
-- Greg

----- Forwarded message from "Dr. Simon Ronald" -----

Subject: Further Readability Results
Date: Tue, 20 Jun 2006 03:34:26 +0930

Hello Greg,

Here are some further "hardest and easiest" books based on a recent
run. The run required 1 hour and 49 minutes to complete. This run
classified 15,099 books - being a full scan of the English books.

We incorporated an ordered list detection algorithm - some of the
books contained (sometimes very noisy) lists of items - we found 162
books in total that were list based. It should be noted that we
classified the entire book as list or "not list" based on a threshold
-> if the book was a list then each separate line was considered a
sentence for the purposes of readability. In time we will incorporate
intra-book list detection to allow the readability methodology to
vary depending on the context within the book. It should also be
noted that some of the HTML versions may well contain markup hints
such as the use of the <ol>
or <ul>
HTML tag; we could use these and other tags to improve the quality of
sentence chunking.

Each entry has a series of 12 percentiles listed after the main
readability percentile. These percentiles correspond to the 12
readability attributes in this order.

bigword density
short word density (-)
wordsPerSentences
syllablesPerWords
profainwordsPerWords
numbersPerWords
mostCommon1000WordsPerWord (-)
commascharsPerWords
wordsPerParagraphs
letterFrequencyDistributionError
adjacentLetterPairsFrequencyDistributionError
uniqueStemmedWordsPerWord;

99.914 95 90 95 97 0 79 86 84 79 88 85 90 Note on the Resemblances and Differences in the Structure and the Development of the Brain in Man and Apes (etext2354)
99.907 96 93 90 98 0 71 96 94 49 80 70 96 Original Letters and Biographic Epitomes (etext13203)
99.907 89 86 96 88 95 82 71 94 81 80 67 67 The Great Conspiracy, Volume 7 (etext7139)
99.904 85 87 93 86 0 86 97 75 78 88 78 99 A Biography of Edmund Spenser (etext6937)
99.897 92 90 95 92 88 69 76 90 80 48 60 78 Memoirs of the Court of St. Cloud (Being secret letters from a gentleman at Paris to a nobleman in London) -- Volume 1 (etext3892)
99.897 82 92 32 87 88 93 98 87 68 93 83 85 Graf von Loeben and the Legend of Lorelei (etext11066)
99.894 96 93 89 98 80 91 84 97 84 88 36 27 The Modern Regime, Volume 2 (etext2582)
99.887 92 88 73 90 92 82 77 89 45 97 67 76 An Enquiry Concerning the Principles of Taste, and of the Origin of our Ideas of Beauty, etc. (etext13485)
99.887 91 89 88 92 0 64 75 99 96 88 67 96 Giordano Bruno (etext4228)
99.884 99 95 92 99 88 79 84 34 87 66 70 74 Monism as Connecting Religion and Science (etext9199)
99.881 93 91 75 92 94 64 67 77 94 88 67 80 Rise of the Dutch Republic, the -- Volume 22: 1574-76 (etext4824)
99.874 91 92 23 93 80 80 98 95 70 80 91 74 The Principal Navigations, Voyages, Traffiques and Discoveries of the English Nation -- Volume 01 (etext7182)
99.868 97 95 74 99 97 81 91 77 64 27 36 78 Gilbertus Anglicus (etext16155)
99.868 86 95 63 86 0 98 90 89 50 93 89 80 Cessions of Land by Indian Tribes to the United States: Illustrated by Those in the State of Indiana (etext17148)
99.858 87 87 95 87 80 56 79 93 63 66 83 73 An Essay towards Fixing the True Standards of Wit, Humour, Railery, Satire, and Ridicule (1744) (etext16233)
99.858 91 96 71 90 95 99 99 99 3 2 85 65 Noteworthy Families (Modern Science) (etext17128)
99.858 99 99 5 99 95 92 99 99 68 0 81 87 Roget's Thesaurus of English Words and Phrases (etext10681)
99.854 97 92 89 98 0 83 81 79 95 97 60 57 Eighteenth Brumaire of Louis Bonaparte (etext1346)
99.844 99 99 84 99 80 98 93 14 61 88 83 55 Venereal Diseases in New Zealand (1922) (etext15352)
99.844 96 91 98 96 88 80 74 98 97 66 64 9 Act, Declaration, & Testimony for the Whole of our Covenanted Reformation, as Attained to, and Established in Britain and Ireland; Particularly Betwixt the Years 1638 and 1649, Inclusive (etext13200)
99.844 90 89 96 90 0 79 78 69 87 66 70 97 Dr. Bullivant (etext9249)
99.831 90 88 94 89 80 71 61 96 90 66 30 72 Memoirs of the Court of St. Cloud (Being secret letters from a gentleman at Paris to a nobleman in London) -- Volume 7 (etext3898)
99.831 99 99 87 99 99 92 88 27 54 93 94 34 Three Contributions to the Theory of Sex (etext14969)
99.831 95 88 91 92 80 61 67 72 83 93 64 70 Superstition Unveiled (etext15696)
99.831 95 93 45 97 80 89 97 94 43 13 72 80 Aboriginal American Authors (etext9188)
99.824 89 92 88 88 0 90 92 56 54 80 94 84 Transactions of the American Society of Civil Engineers, Vol. LXVIII, Sept. 1910 (etext18012)
99.824 87 99 40 90 0 99 95 94 3 88 92 97 On the Origin of Species (etext8205)
99.798 91 89 93 90 88 69 60 89 86 48 47 74 Memoirs of the Court of St. Cloud (Being secret letters from a gentleman at Paris to a nobleman in London) -- Volume 3 (etext3894)
99.798 90 87 85 88 0 81 64 96 84 48 81 94 The Lives of the Twelve Caesars, Volume 11: Titus (etext6396)
99.798 90 86 97 90 0 81 76 90 94 27 76 82 The evolution of English lexicography (etext11694)
99.798 73 81 96 78 98 56 54 95 72 66 89 96 A Modest Proposal (etext1080)
99.798 95 96 41 97 0 95 96 79 54 98 78 78 Webster's March 7th Speech/Secession (etext1663)
99.798 91 92 47 92 92 50 87 87 87 66 64 89 Rise of the Dutch Republic, the -- Volume 01: Introduction I (etext4801)
99.798 88 86 77 88 95 50 73 73 97 80 72 83 Rise of the Dutch Republic, the -- Volume 26: 1577, part III (etext4828)
99.785 95 85 98 90 88 75 71 92 94 66 56 32 The Auchensaugh Renovation of the National Covenant and (etext12381)
99.785 93 92 81 92 88 92 85 98 84 80 24 16 The Modern Regime, Volume 1 (etext2581)
99.785 70 78 92 75 92 81 79 94 87 48 52 82 The Mayflower and Her Log; July 15, 1620-May 6, 1621 -- Volume 5 (etext4105)

Easiest

4.176 2 1 11 2 0 0 1 36 9 48 78 11 The Song of the Blood-Red Flower (etext12935)
4.176 0 0 11 0 0 0 15 48 8 2 94 6 Six Little Bunkers at Grandma Bell's (etext14623)
4.176 14 8 27 12 0 0 9 15 10 27 41 7 Melbourne House, Volume 2 (etext12964)
4.176 7 3 6 4 0 0 5 3 9 66 86 23 The Romantic (etext13292)
4.176 17 5 14 10 0 0 1 12 48 48 6 23 The Girl from Montana (etext15274)
4.176 7 6 13 7 0 0 11 27 22 27 56 9 Jess of the Rebel Trail (etext15382)
4.176 5 3 8 3 0 0 0 2 45 66 60 25 Stories of American Life and Adventure (etext15597)
4.176 3 8 13 6 0 0 12 3 72 1 78 13 Kazan (etext10084)
4.176 11 5 8 8 0 0 12 1 9 48 94 11 The Second Honeymoon (etext17446)
4.176 13 9 10 12 0 0 13 8 3 48 36 23 The Circus Boys on the Plains : or, the Young Advance Agents Ahead of the Show (etext2478)
4.176 19 11 11 15 0 0 10 26 42 27 9 1 The Captives (etext3601)
4.176 0 1 12 1 0 0 2 3 49 27 90 29 Old Granny Fox (etext4980)
4.176 0 0 6 0 0 0 2 0 19 27 89 54 Sleepy-Time Tales: the Tale of Fatty Coon (etext5701)
4.176 0 0 12 1 0 0 10 6 31 2 93 40 The Adventures of Johnny Chuck (etext5844)
4.176 2 3 11 3 0 0 7 1 12 5 85 54 Tale of Brownie Beaver (etext6754)
4.176 12 7 19 8 0 0 9 7 55 5 52 13 The City of Fire (etext7008)
4.176 23 9 10 15 0 0 13 16 18 13 3 29 The Man with Two Left Feet (etext7471)
4.176 4 5 13 5 0 0 8 15 40 5 80 21 Way of the Lawless (etext9903)
3.931 11 7 11 8 0 0 14 7 36 2 85 9 The Hunted Woman (etext11328)
3.931 4 1 14 2 0 0 1 34 44 27 64 3 Mary Marie (etext11143)
3.931 12 8 22 11 0 0 10 20 15 13 41 13 Contrary Mary (etext17938)
3.931 0 0 3 0 0 0 0 1 0 98 97 29 The New McGuffey First Reader (etext1489)
3.931 4 7 18 6 0 0 7 33 21 13 56 9 Martin Pippin in the Apple Orchard (etext2032)
3.931 8 9 8 8 0 0 1 12 48 13 64 23 Twenty-Two Goblins (etext2290)
3.931 9 8 9 8 0 0 7 5 43 27 72 13 God's Country--And the Woman (etext4585)
3.931 12 11 11 13 0 0 6 12 60 27 13 14 The Valley of Silent Men (etext4707)
3.931 5 2 23 3 0 0 6 16 10 13 80 23 The Boy Scout Camera Club, or, the Confession of a Photograph (etext7356)
3.931 8 9 15 10 0 0 17 4 6 5 81 21 Bob Cook and the German Spy (etext9899)
3.676 16 6 8 10 0 0 25 3 3 13 76 13 The Three Sisters (etext11876)
3.676 7 2 10 3 0 0 3 17 43 27 78 11 His Second Wife (etext17259)
3.676 6 5 17 6 0 0 14 33 24 5 60 3 Michael O'Halloran (etext9489)
3.384 12 7 6 10 0 0 28 0 20 27 13 32 The Sheriff's Son (etext17043)
3.384 1 0 17 1 0 0 9 36 11 5 88 9 The Bobbsey Twins in the Great West (etext5952)
3.384 6 3 7 4 0 0 3 24 20 13 52 34 Pan (etext7214)
3.384 0 0 5 0 0 0 2 1 20 5 93 59 Five Little Friends (etext7801)
3.384 0 1 7 1 0 0 13 0 11 1 88 53 The Tale of Sandy Chipmunk (etext9462)
3.109 0 0 6 0 0 0 0 19 0 2 93 50 Boy Blue and His Friends (etext16046)
3.109 8 3 10 4 0 0 1 30 19 27 75 6 Wanderers (etext7762)
2.874 1 2 56 2 0 0 0 33 32 2 9 5 Twilight Land (etext1751)
2.874 0 0 13 0 0 0 11 38 9 1 91 6 The Curlytops on Star Island (etext5989)
2.666 0 0 10 0 0 0 6 44 9 13 76 7 The Bobbsey Twins at Home (etext18420)
2.666 0 0 34 0 0 0 0 1 43 48 64 5 The King of Ireland's Son (etext3495)
2.460 7 11 12 11 0 0 8 3 70 5 19 16 Baree, Son of Kazan (etext4748)
2.460 10 9 11 10 0 0 2 15 9 13 52 21 Samuel the Seeker (etext5961)
2.255 6 1 2 2 0 0 28 1 7 80 30 14 Plays (etext10623)
2.255 12 13 14 14 0 0 11 4 23 5 30 14 King of the Khyber Rifles (etext6066)
2.255 3 3 19 3 0 0 6 11 13 13 64 23 Riders of the Silences (etext9867)
1.917 4 2 7 3 0 0 11 4 7 48 89 6 Anne Severn and the Fieldings (etext10817)
1.785 0 0 15 0 0 0 0 9 13 5 72 36 Fifty Famous Stories Retold (etext18442)
1.507 8 4 17 6 0 0 21 0 24 13 19 16 The Light in the Clearing (etext14150)
1.507 9 7 21 8 0 0 3 2 30 27 13 14 Voyages of Dr. Dolittle (etext1154)
1.507 0 0 2 0 0 0 43 22 3 27 30 2 Six Plays (etext5618)
1.391 4 6 18 6 0 0 9 0 24 5 67 6 The Secret Garden (etext17396)
1.391 12 9 8 10 0 0 16 7 21 13 2 19 Black Jack (etext9925)
1.080 10 4 14 6 0 0 8 9 14 5 24 19 The Gay Cockade (etext16433)
0.742 4 4 14 4 0 0 5 3 55 2 6 16 Isobel : a Romance of the Northern Trail (etext6715)
0.440 0 0 5 0 0 0 2 2 0 1 97 16 Bunny Rabbit's Diary (etext16982)
0.281 6 3 6 4 0 0 22 2 8 5 19 6 Mary Olivier: a Life (etext9366)

Cheers,

Dr. Simon Ronald
CEO
The Leader in High Performance Reading
Level 2, 25 Gresham Street, Adelaide, SA, Australia, 5000
GPO Box 944, Adelaide SA 5001
Ph. +61 8 8410 2771 Fax. +61 8 8125 6679
1133 Broadway, Suite 706 New York, NY 10010
Ph: (646) 736 7673 (New York)
Ph: (415) 992 5412 (California)
Fax: (877) 731 4410 (toll free)
____________________________________

----- End forwarded message -----

From scott_bulkmail at productarchitect.com Sun Jun 25 18:11:22 2006
From: scott_bulkmail at productarchitect.com (Scott Lawton)
Date: Sun Jun 25 18:26:56 2006
Subject: [gutvol-d] Automated readability scores for PG eBooks
In-Reply-To: <20060625215923.GA18811@pglaf.org>
References: <20060625215923.GA18811@pglaf.org>
Message-ID: 

>If you have feedback on the results, or my idea for
>adding these scores as an element of the catalog search
>results, please chime in!

I think that a readability score on every book is a super good idea.
And, the N easiest/hardest would make good lists for the site (well,
formatted as an HTML table, and perhaps including author as well as
title).
And, there's probably no harm in including the sub-scores, though the
overall is certainly most important for public consumption.

Cheers,

Scott S. Lawton
http://blogsearch.com/ - a starting point
http://ProductArchitect.com/ - consulting

From phil at thalasson.com Sun Jun 25 17:35:22 2006
From: phil at thalasson.com (Philip Baker)
Date: Sun Jun 25 18:52:33 2006
Subject: [gutvol-d] ebooks libre et gratuits
In-Reply-To: <200606251622.35322.donovan@abs.net>
Message-ID: 

In article <200606251622.35322.donovan@abs.net>, D Garcia writes
>On Sunday 25 June 2006 10:25 am, Janet Kegg wrote:
>> The site is now available again: http://www.ebooksgratuits.com/
>>
>> See the front page of the Web site for what I believe (my French is
>> almost nonexistent) is an explanation of what happened.
>
>The news item roughly translated is:
>
>As you probably noted, the site was inaccessible for over a week; the reason
>is that our ISP shut down following "a crippling DDOS attack" which they were
>not able to successfully block. We changed ISPs, and the site is once again
>available. We will take the necessary measures so that this type of thing
>cannot happen again; I will say more about it very soon.

They are being rather optimistic but we will have to wait and see if
their "mesures nécessaires" work.

--
Philip Baker

From sly at victoria.tc.ca Sun Jun 25 21:40:55 2006
From: sly at victoria.tc.ca (Andrew Sly)
Date: Sun Jun 25 21:40:58 2006
Subject: [gutvol-d] ebooks libre et gratuits
In-Reply-To: 
References: 
Message-ID: 

I think that this shows the value of the PG approach (What MH likes
to call "Unlimited distribution"), where the whole collection is
mirrored in many different locations. So if the main server is down,
there are plenty of alternate sites available.

Andrew

On Mon, 26 Jun 2006, Philip Baker wrote:

> In article <200606251622.35322.donovan@abs.net>, D Garcia
> writes
> >As you probably noted, the site was inaccessible for over a week; the reason
> >is that our ISP shut down following "a crippling DDOS attack" which they were
> >not able to successfully block. We changed ISPs, and the site is once again
> >available. We will take the necessary measures so that this type of thing
> >cannot happen again; I will say more about it very soon.
>
> They are being rather optimistic but we will have to wait and see if
> their "mesures nécessaires" work.

From Bowerbird at aol.com Sun Jun 25 23:49:35 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Sun Jun 25 23:49:43 2006
Subject: [gutvol-d] Automated readability scores for PG eBooks
Message-ID: <520.1afbfc5.31d0dd7f@aol.com>

greg said:
> one value of this is that it does
> a good job of identifying children's eBooks
> (they tend to be "easy").

checklist said:
> bigword density
> short word density (-)
> wordsPerSentences
> syllablesPerWords
> profainwordsPerWords
> numbersPerWords
> mostCommon1000WordsPerWord (-)
> commascharsPerWords
> wordsPerParagraphs
> letterFrequencyDistributionError
> adjacentLetterPairsFrequencyDistributionError
> uniqueStemmedWordsPerWord;

aren't scientists silly? :+)

look, greg, if you want a list of children's e-books,
or a list of "easy" e-books, or any kind of list of books,
just ask the distributed proofreaders people for the list...

they'll give you a long list of books, any kind of list you want,
and you won't have to do one little bit of fancy-ass statistics...

i'm serious, they can give a list with p.g. e-text numbers and
meaningful notes, and funny little stories, and _everything_...
much more vivid than your boring-ass statistics... :+)

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060626/4374cb79/attachment-0001.html

From Bowerbird at aol.com Sun Jun 25 23:53:10 2006
From: Bowerbird at aol.com (Bowerbird@aol.com)
Date: Sun Jun 25 23:53:17 2006
Subject: [gutvol-d] ebooks libre et gratuits
Message-ID: <531.136b248.31d0de56@aol.com>

unlimited distribution rocks big time...

major fucking concept...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060626/60950a94/attachment.html

From prosfilaes at gmail.com Mon Jun 26 00:03:05 2006
From: prosfilaes at gmail.com (David Starner)
Date: Mon Jun 26 00:32:03 2006
Subject: [gutvol-d] Automated readability scores for PG eBooks
In-Reply-To: <20060625215923.GA18811@pglaf.org>
References: <20060625215923.GA18811@pglaf.org>
Message-ID: <6d99d1fd0606260003v1da9790ep3aed09dd6fc9414@mail.gmail.com>

On 6/25/06, Greg Newby wrote:
> Because we don't have a lot of subject cataloging, one
> value of this is that it does a good job of identifying
> children's eBooks (they tend to be "easy").

If the problem is that we don't have a lot of subject cataloging,
provide more subject cataloging. We could copy the LoC cataloging for
most of the catalog without too much work. If we're going to a
Wiki-type thing, lists of children's books, mysteries, sci-fi, etc.
will be made, and will be superior to this.

> This is also usable for people seeking to develop
> literacy or provide literacy instruction, by providing
> a way of reading something "harder" or "easier" as desired.

If the problem is literacy instruction, then we should work on a list
of books for literacy, not rely on some tool that can't tell the
difference between a 17th century children's book and a 20th century
one, or how much dialect is used. Again, a Wiki-tool is perfect for
this.

> If you have feedback on the results, or my idea for
> adding these scores as an element of the catalog search
> results, please chime in!

I think that these are somewhat interesting, but they are far from
the most interesting factoids. I've been drooling over Amazon's
Statistically Improbable Phrases, personally. I surely wouldn't have
them as prominent as on the search page; I don't think it's the most
important thing that most people look at.

> 0.281 6 3 6 4 0 0 22 2 8 5 19 6 Mary Olivier: a Life
> (etext9366)

This is surely a mistake; the second sentence in the book is "When
old Jenny shook it the wooden rings rattled on the pole and grey men
with pointed heads and squat, bulging bodies came out of the folds on
to the flat green ground." The numbers are too hard to decipher in
this form to really try and understand why.

I also wonder about "profainwordsPerWords"? The profanity of words
has little to do with the readability; they're just adjectives and
nouns from that perspective.
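(For the curious: attributes like wordsPerSentences and bigword
density reduce to simple counting. A rough Python sketch follows --
crude tokenization, a stubbed-in common-word list, and definitely not
RocketReader's actual algorithm:)

    import re

    COMMON = {"the", "of", "and", "a", "to", "in", "is", "it"}  # stub for a 1000-word list

    def readability_attributes(text):
        # Split on sentence-ending punctuation; count word tokens.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text.lower())
        n = max(len(words), 1)
        return {
            "wordsPerSentences": len(words) / max(len(sentences), 1),
            "bigwordDensity": sum(len(w) >= 7 for w in words) / n,
            "mostCommon1000WordsPerWord": sum(w in COMMON for w in words) / n,
        }

    print(readability_attributes(
        "The little girl was pretty. Her brown hair was curly."))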
From traverso at dm.unipi.it Mon Jun 26 00:38:48 2006
From: traverso at dm.unipi.it (Carlo Traverso)
Date: Mon Jun 26 00:34:29 2006
Subject: [gutvol-d] ebooks libre et gratuits
In-Reply-To: (message from Andrew Sly on Sun, 25 Jun 2006 21:40:55 -0700 (PDT))
References: 
Message-ID: <200606260738.k5Q7cm402654@pico.dm.unipi.it>

>>>>> "Andrew" == Andrew Sly writes:

  Andrew> I think that this shows the value of the PG approach (What
  Andrew> MH likes to call "Unlimited distribution"), where the
  Andrew> whole collection is mirrored in many different
  Andrew> locations. So if the main server is down, there are plenty
  Andrew> of alternate sites available.

  Andrew> Andrew

They do allow mirroring the collection, just nobody did. I think that
they cannot afford to pay several sites (but clearly they keep
copies). They work as Life+50, so a mirroring by PG is impossible
(but might be possible by a PG+50)

Carlo

From gbnewby at pglaf.org Mon Jun 26 01:30:40 2006
From: gbnewby at pglaf.org (Greg Newby)
Date: Mon Jun 26 01:30:42 2006
Subject: [gutvol-d] ebooks libre et gratuits
In-Reply-To: 
References: 
Message-ID: <20060626083040.GD26556@pglaf.org>

On Sun, Jun 25, 2006 at 09:40:55PM -0700, Andrew Sly wrote:
>
> I think that this shows the value of the PG approach
> (What MH likes to call "Unlimited distribution"), where
> the whole collection is mirrored in many different
> locations. So if the main server is down, there are
> plenty of alternate sites available.
>
> Andrew

I think Michael's approach to unlimited distribution is a little
different, but not that different. What you're actually talking about
is gbn's approach to belt+suspenders when it comes to server
resiliency.

Insert obligatory Linus Torvalds quote about mirroring, here.
  -- Greg

> On Mon, 26 Jun 2006, Philip Baker wrote:
>
> > In article <200606251622.35322.donovan@abs.net>, D Garcia
> > writes
> > >As you probably noted, the site was inaccessible for over a week; the reason
> > >is that our ISP shut down following "a crippling DDOS attack" which they were
> > >not able to successfully block. We changed ISPs, and the site is once again
> > >available. We will take the necessary measures so that this type of thing cannot
> > >happen again; I will say more about it very soon.
> >
> > They are being rather optimistic but we will have to wait and see if
> > their "mesures nécessaires" work.
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d@lists.pglaf.org
> http://lists.pglaf.org/listinfo.cgi/gutvol-d

From walter.van.holst at xs4all.nl Mon Jun 26 02:12:04 2006
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Mon Jun 26 02:12:09 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <309.7649529.31d01f9b@aol.com>
References: <309.7649529.31d01f9b@aol.com>
Message-ID: <449FA4E4.9010505@xs4all.nl>

Bowerbird@aol.com wrote:
>
> p.g. introduces its own linebreaks into its plain-ascii e-texts,
> all without ever entering the markup arena.
>
> and i put in my own line-breaks, right here in these posts to this
> listserve, again without using any markup at all, just the return key.

Line-breaks are mark-up. They don't add anything whatsoever to the
text itself and are completely arbitrarily decided, usually based on
the technology that is used to display the actual content. You can
deny the difference between structure, content and presentation all
you want, but it is perfectly possible to reformat a book using
columns instead of lines without changing the actual content.
And where will your precious line-breaks go in that case?

Greetings,

Walter

From gbnewby at pglaf.org Mon Jun 26 02:32:37 2006
From: gbnewby at pglaf.org (Greg Newby)
Date: Mon Jun 26 02:32:38 2006
Subject: [gutvol-d] New DVD ISO feedback sought
Message-ID: <20060626093237.GA27369@pglaf.org>

I've been working, slowly, on some new CD/DVD images (ISO files) for
our use. As many people know, we've given away many thousands of free
CDs and DVDs, and added the ISO images (along with BitTorrent, RAR
and other formats) to the main PG collection.

You can peruse the images I've been working on here:
http://snowy.arsc.alaska.edu/gbn/pgimages

actual ISOs are at:
ftp://snowy.arsc.alaska.edu/pub/gbn/isos

These are not completed... I'll be adding stuff like GUTINDEX.ALL,
donate-howto.txt, and a README.TXT

You can see the nifty tool for creating such images here:
http://snowy.arsc.alaska.edu/pgiso/

Here are the main two CD/DVD collections for you to consider:

1) "As many titles as possible." In the tool, I specified these
numbers:

1-2199,2225-3500,3525-11774,11800-20000

with "no copyrighted", "txt/zip" format, and any language. The result
is all of the zipped eBooks in plain text format, minus our copies of
the Human Genome. (No, we don't go up to #20000 in the main PG
collection, which the tool uses... only 18683 as of right now. I'm
just using a high enough number that I don't need to look up the
actual number.)

This should be similar to our eBook #11800, the PG 2003 "10k
special." For that, we tried to add as many as possible, resulting in
~9300 titles including .txt and .html (also Genome), all zipped.

Surprisingly, we can fit *all* 17454 of our non-copyrighted text/zip
titles with space to spare in a DVD: about 3.5GB. In case you're
wondering (I was!), including as many HTML titles as possible
(including their images) in html/zip, then filling in the rest with
text/zip, yields about 3.25 DVDs (14.5GB).

2) "Best of Redux." Our Best Of CD image was made by human selection
(on this list!), resulting in just under 600 titles. Many are HTML.
Since #11220, we've added lots of great stuff. So, what would go on
today's "Best Of"? I went ahead and recreated the image in the new
tool, and also made one emphasizing HTML (since some titles have been
moved to HTML that were previously just text). I've uploaded (to the
/pgimages URL) the list of the "Best Of" eBook numbers, as well as
the list of "best of" public domain that Amazon did last year
(remember that?).

GOALS:

- confirm viability/suitability of the "allzipnohgp" collection (#1
above); make any suggestions. This is basically the densest way of
getting people all of the PG collection, fitting easily on a single
DVD. (Yes, I plan on filling it up with some of our nice HTML &
multimedia. Your ideas are solicited.)

- consider ways forward for a new "Best of" - either CD or DVD. The
only thing I feel strongly about is showcasing some of our beautiful
HTML titles with nice images. Yes to all the classics, and yes to
plain text or HTML... but consider "best of" in terms of PG's best
work, not just the classic titles.

If anyone would really like to run with these ideas and create some
new images, go for it! The snowy tool makes it easy to share your own
collections, and we have many places to distribute ISOs you create. I
do think it's time to create some new "primary" giveaway images,
though, and appreciate any ideas you might have.
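For anyone who wants to script against the same selections: a
number-range spec like the one above expands the obvious way. A small
Python sketch (an assumption about the input format, not the pgiso
tool's actual code):

    def parse_ranges(spec):
        # Expand "1-2199,2225-3500,..." into the full set of
        # candidate ebook numbers.
        chosen = set()
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-", 1)
                chosen.update(range(int(lo), int(hi) + 1))
            else:
                chosen.add(int(part))
        return chosen

    nums = parse_ranges("1-2199,2225-3500,3525-11774,11800-20000")
    print(len(nums))  # how many numbers the selection covers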
-- Greg

From scott_bulkmail at productarchitect.com Mon Jun 26 04:19:07 2006
From: scott_bulkmail at productarchitect.com (Scott Lawton)
Date: Mon Jun 26 04:19:42 2006
Subject: [gutvol-d] Automated readability scores for PG eBooks
In-Reply-To: <6d99d1fd0606260003v1da9790ep3aed09dd6fc9414@mail.gmail.com>
References: <20060625215923.GA18811@pglaf.org> <6d99d1fd0606260003v1da9790ep3aed09dd6fc9414@mail.gmail.com>
Message-ID: 

>If the problem is that we don't have a lot of subject cataloging,
>provide more subject cataloging. We could copy the LoC cataloging for
>most of the catalog without too much work.

>If the problem is literacy instruction, then we should work on a list
>of books for literacy, not rely on some tool that can't tell the
>difference between a 17th century children's book and a 20th century
>one, or how much dialect is used.

While I agree that it would not be worth adding readability scores if
it had much impact on these and other worthy goals, I really don't
see it as either/or. Granting of course that adding scores will take
some time away from other projects (and, that it's not my personal
time at stake here), I still see this as relatively high gain for
relatively low investment. There are lots and lots of cool things
that could be done with the catalog.

And, any relatively "easy" (i.e. automated) method of adding
readability scores will inevitably miscategorize a whole bunch of
books. But, I think the 'signal' will far outweigh the 'noise'. Even
in the context of the above, the scores would provide a great
starting point for being improved with manual cataloging and literacy
labeling. Don't let the perfect stand in the way of the good.

Plus, I think the scores (and miscategorizations) are interesting in
and of themselves for those of us interested in words and language.

Cheers,

Scott S. Lawton
http://blogsearch.com/ - a starting point
http://ProductArchitect.com/ - consulting

From schultzk at uni-trier.de Mon Jun 26 03:04:18 2006
From: schultzk at uni-trier.de (Schultz Keith J.)
Date: Mon Jun 26 04:19:50 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <449FA4E4.9010505@xs4all.nl>
References: <309.7649529.31d01f9b@aol.com> <449FA4E4.9010505@xs4all.nl>
Message-ID: <2E862222-A145-4583-AA4C-261DCD0C39B6@uni-trier.de>

Hi All,

On 26.06.2006 at 11:12, Walter van Holst wrote:

> Bowerbird@aol.com wrote:
>>
>> p.g. introduces its own linebreaks into its plain-ascii e-texts,
>> all without ever entering the markup arena.
>>
>> and i put in my own line-breaks, right here in these posts to this
>> listserve, again without using any markup at all, just the return
>> key.
>
> Line-breaks are mark-up. They don't add anything whatsoever to the
> text itself and are completely arbitrarily decided, usually based
> on the technology that is used to display the actual content. You
> can deny the difference between structure, content and presentation
> all you want, but it is perfectly possible to reformat a book using
> columns instead of lines without changing the actual content. And
> where will your precious line-breaks go in that case?

In normal prose line breaks generally do not affect the actual
content, but in poetry it may be very meaningful. Especially in works
where the form of the text is important!!

But do not take my word for it.

Keith.
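One way to act on Keith's point in software is to guess, before
reflowing anything, whether the breaks carry meaning. A crude,
illustrative Python heuristic -- the 45-character threshold is an
arbitrary assumption, not a rule from this thread:

    def breaks_look_significant(lines):
        # Short, ragged lines suggest verse, where re-wrapping would
        # destroy the form; long, even lines suggest ordinary prose.
        text = [ln.rstrip() for ln in lines if ln.strip()]
        if not text:
            return False
        avg = sum(len(ln) for ln in text) / len(text)
        return avg < 45  # prose lines usually run longer

    poem = ["and i put in my", "own line-breaks,", "right here in these"]
    print(breaks_look_significant(poem))  # True: keep these breaks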
From walter.van.holst at xs4all.nl Mon Jun 26 05:37:34 2006
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Mon Jun 26 05:37:37 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <2E862222-A145-4583-AA4C-261DCD0C39B6@uni-trier.de>
References: <309.7649529.31d01f9b@aol.com> <449FA4E4.9010505@xs4all.nl> <2E862222-A145-4583-AA4C-261DCD0C39B6@uni-trier.de>
Message-ID: <449FD50E.90002@xs4all.nl>

Schultz Keith J. wrote:
>
> In normal prose line breaks generally do not affect the actual
> content, but in poetry it may be very meaningful. Especially in
> works where the form of the text is important!!
>
> But do not take my word for it.

I will take your word for it. In some poetry even the typeface is
part of the poem. Think about Paul van Ostaijen's Boem Paukenslag!

http://users.pandora.be/gaston.d.haese/paukenslag.html

Nonetheless, I wouldn't dare to call each and every e-mail I write
poetry.

Regards,

Walter

From jon at noring.name Mon Jun 26 06:15:28 2006
From: jon at noring.name (Jon Noring)
Date: Mon Jun 26 06:15:40 2006
Subject: [gutvol-d] the end of the line
In-Reply-To: <2E862222-A145-4583-AA4C-261DCD0C39B6@uni-trier.de>
References: <309.7649529.31d01f9b@aol.com> <449FA4E4.9010505@xs4all.nl> <2E862222-A145-4583-AA4C-261DCD0C39B6@uni-trier.de>
Message-ID: <1307373850.20060626071528@noring.name>

Walter van Holst wrote:
> Bowerbird@aol.com wrote:

>> p.g. introduces its own linebreaks into its plain-ascii e-texts,
>> all without ever entering the markup arena.
>>
>> and i put in my own line-breaks, right here in these posts to this
>> listserve, again without using any markup at all, just the return
>> key.

> Line-breaks are mark-up. They don't add anything whatsoever to the
> text itself and are completely arbitrarily decided, usually based
> on the technology that is used to display the actual content. You
> can deny the difference between structure, content and presentation
> all you want, but it is perfectly possible to reformat a book using
> columns instead of lines without changing the actual content. And
> where will your precious line-breaks go in that case?

Yes, line-breaks (CR/LF, etc.) are markup. They are text characters
used to communicate something besides the content. Paper books don't
need to include these characters (they'd be invisible anyway), thus
they are characters not part of the content, i.e. markup. Also, using
* and _ for highlighting purposes is also markup.

Of course, what Bowerbird means by markup is formalized and
comprehensive text markup systems such as TeX, SGML/XML, troff, etc.,
but then his ZML system is another markup system that has kept markup
characters to a minimum.

This brings up an interesting observation in that using line-breaks
in plain text has variable importance, from mildly important
(arbitrarily used in paragraphs simply to trim line lengths to
something manageable), to quite important (preserving poetry lines,
and as Bowerbird would attest, everything he writes, even when
prose-like in meaning.)

The problem is knowing the relative importance of line-breaks in a
plain text document, especially if one does not understand the
language to ascertain context. ZML tries to tackle this issue, and I
think somewhat succeeds, albeit at a loss of richness, like the Model
T Ford and black paint. And as noted previously, *how* does one know
a particular plain text is ZML and thus falls under strict and
unambiguous plain text formatting rules?
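By way of illustration only, such a sniffing pass could look like the
following Python sketch. The header syntax here is invented for the
example; ZML defines no such declaration:

    def sniff_declaration(first_line):
        # Hypothetical marker, e.g.:
        #   ~~ zml 1.0 encoding=utf-8 breaks=significant ~~
        # Analogous to an XML declaration: it identifies the format,
        # version, and encoding without reading the whole file.
        if not first_line.startswith("~~ zml"):
            return None
        fields = first_line.strip("~ \n").split()
        info = {"format": fields[0], "version": fields[1]}
        info.update(f.split("=", 1) for f in fields[2:] if "=" in f)
        return info

    print(sniff_declaration("~~ zml 1.0 encoding=utf-8 breaks=significant ~~\n"))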
Jon Noring From Bowerbird at aol.com Mon Jun 26 10:18:11 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Jun 26 10:18:19 2006 Subject: [gutvol-d] the end of the line Message-ID: <38f.548879d.31d170d3@aol.com> Line-breaks are mark-up. They don't add anything whatsoever to the text itself and are completely arbitrarily decided, usually based on the technology that is used to display the actual content. You can deny the difference between structure, content and presentation all you want, but it is perfectly possible to reformat a book using columns instead of lines without changing the actual content. And where will your precious line-breaks go in that case? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060626/9cd35cfc/attachment.html From gbnewby at pglaf.org Mon Jun 26 11:09:27 2006 From: gbnewby at pglaf.org (Greg Newby) Date: Mon Jun 26 11:09:29 2006 Subject: !@!Re: [gutvol-d] ebooks libre et gratuits (fwd) In-Reply-To: References: Message-ID: <20060626180927.GB4897@pglaf.org> We could probably run a mirror of this... is anyone in touch with the folks (perhaps in French)? It would take some cooperation from their end (such as an rsync server) to run a good mirror. -- Greg > ---------- Forwarded message ---------- > Date: Sun, 25 Jun 2006 10:25:19 -0400 > From: Janet Kegg > To: Project Gutenberg Volunteer Discussion > Subject: Re: [gutvol-d] ebooks libre et gratuits > > > The site is now available again: http://www.ebooksgratuits.com/ > > See the front page of the Web site for what I believe (my French is > almost nonexistent) is an explanation of what happened. > > On Sun, 25 Jun 2006 09:58:58 -0400, you wrote: > > >Ebooks libre et gratuits had an arrangement with MH apparently where their > >books would appear on PG eventually. Now that the ebooks web site is no > >more, what will happen to the ebooksgratuits which did not make it to PG? > >Will all of this work have to be repeated by someone else? Is there an > >archive anywhere of this enormous quantitiiy of work? Why did > >ebooksgratuits disappear? Pressure from Canadian publishers/ government? > >Is there an unknown story here? > > > >nwolcott2@post.harvard.edu > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From desrod at gnu-designs.com Mon Jun 26 11:38:20 2006 From: desrod at gnu-designs.com (David A. Desrosiers) Date: Mon Jun 26 11:39:17 2006 Subject: !@!Re: [gutvol-d] ebooks libre et gratuits (fwd) In-Reply-To: <20060626180927.GB4897@pglaf.org> References: <20060626180927.GB4897@pglaf.org> Message-ID: > We could probably run a mirror of this... is anyone in touch with > the folks (perhaps in French)? It would take some cooperation from > their end (such as an rsync server) to run a good mirror. I'd be more than happy to mirror it, if I knew what it was I was supposed to fetch and mirror ;) (Yes, my last name is Desrosiers, a French name, but my French is so rusty, you don't want me to translate that for you ;) David A. 
Desrosiers desrod@gnu-designs.com http://gnu-designs.com From Bowerbird at aol.com Mon Jun 26 12:07:24 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Jun 26 12:07:45 2006 Subject: [gutvol-d] the end of the line Message-ID: <37c.5655d2e.31d18a6c@aol.com> jeroen said: > Although I agree with Michael that there is no need > to preserve things as linebreaks in most texts -- ok, well you and michael agree. that's good. :+) but what do you say to end-users who want that info? somehow, "tough luck, kid, _we_ don't think it's necessary" doesn't sound like the kind of thing _i_ want to tell people. because that's the type of statement that makes people go off to a different cyberlibrary. that's my whole point. (and to all of the other people who responded on similar "theoretical" grounds, i'm truly sorry you missed the point.) > if you really need to go to that level of detail, there > is always the original or the scans to fall back upon well, neither of those gives you the flexibility of digital text. but yes, a tight coupling of the two forms is the best method. you will note that those "digital reprints" from jose menendez allow a reader to summon the scan of the page with one click. (since the page already looks like the scan anyway, there might be little reason to do it, though, except to verify that similarity. but this constant willingness to demonstrate the verisimilitude will be the proof that makes people comfortable with the use of the smaller-sized digital reprint, with its expanded functionality, as opposed to the bigger, slower, dumber collection of scans. anyone who has proofread a scan against reflowed text knows the reflowing makes that task immensely more difficult though, so you'll never attain the same confidence in the text's accuracy.) > I want to make a case for preserving page numbers, > if not at least as recognisable anchors in text, and only for > those books being referenced to regularly by other books. page-numbers are retained in many e-texts these days... but i'm sure you remember we all had this same argument about page-numbers. i'm confident that -- down the line -- sentiment will similarly change to be in favor of line-breaks. in general, i've just been content to wait it out until the change; but seeing all the e-texts as they cross my screen downloading made me realize again the sadness of the discarded line-breaks. > This excludes most fiction, but is particularly important for > scientific works, which have constructed a kind of paper web > with cross references mainly based on page numbers. there are plenty of cross-references made to works of fiction. and the concept of "books reading each other" would require that _all_ of our books are brought under the same umbrella... > In long term, such references of course should give way to proper > references to the actual paragraph or sentence being referenced good! you recognize the need for a finer-grained pointer than the page. because that's the kind of thinking that leads to line-break retention. you can narrow things down rather specifically when you point to the range that's represented from page-19-line-7 to page-21-line-14, or from page-87-line-6 to page-87-line-8, can't you? not only that, this kind of reference also works for the person who only has the paper copy of the book, not the e-book, if the two are duplicates of each other. and that's precisely the type of capability i'll have in my viewer-program. 
even in a traditional browser, it wouldn't be hard to implement something roughly equivalent, though. the user could specify some text with a link, and after going to the precise point of the link, the browser could then execute a "find" command for the specified text. it wouldn't be hard at all, and would seem to give a rather exact form of pointing to a specific place. it has the benefit of being implemented entirely outside of the document, as well, which i see as being tremendously important. if all our links need markup in the original document to be implemented, as is the present case, we're _never_ going to be able to quickly get to a point of profuse interlinks. we'll get thoroughly bogged down in the quicksand of heavy markup first... (for an example of that, take a look at the markup which jon noring posted, and then read through that particular diversion of this thread. the horrors!) > but as a practical ad-interim solution, staying with page numbers will > increase the number of texts we can digitize with our limited means. it doesn't cost anything to retain the line-break information. > I would however, like to see the collection be incorporated in a kind of > wiki-like system, where people can add -- without tampering with the static > source texts -- annotations, add tagging and create live cross references i've had a demo up for some time now showing "continuous proofreading". > http://users.aol.com/bowerbird/proof_wiki.html i also used a similar template in these demo-books: > http://www.greatamericannovel.com/mabie/mabiep001.html > http://www.greatamericannovel.com/myant/myantc001.html > http://www.greatamericannovel.com/ahmmw/ahmmwc001.html > http://www.greatamericannovel.com/sgfhb/sgfhbc001.html this system could easily be elaborated upon to build what you requested here. indeed, i will be pouring all of the p.g. texts that i'll be handling -- perhaps some 5000-6000, as near as i can tell -- into just this type of system, within the next 6 months, and i would be open to any ideas that you might have... heck, design a webpage to do what you want, and i will use it as the template. you know me, i don't even care if it "validates", as long as it's easy and it works. *** andrew said: > There are places such as wikisource.org, where you could add the texts > and start providing links such as you mention here immediately. i'll check out wikisource.org to see what kind of capabilities they offer. in the past, when i've looked at existing sites, it has seemed that wikis aren't geared to do things -- like populate pages -- on a massive scale. even rather fundamental things like batch f.t.p. are sometimes missing. and when you're dealing with thousands, or tens of thousands, of files, it becomes absolutely necessary to deal with them in a template fashion. i also think there's a good reason jeroen asked for a "wiki-like system", and not a wiki per se, as indicated by his concern about "tampering" with the static source texts. the thought is that the original source -- and indeed, the string of comments as well -- must be inviolate. that's because the idea is to build a body of thought around a text, of which links -- intrasystem, and outgoing and incoming -- are a very crucial aspect. and it's not possible to link into a wiki proper, because what was there yesterday might well be gone today, only to reappear in different form tomorrow. you can't link into a pile of sand. oh sure, you could instruct users to leave link markup untouched. 
and they might even follow your instructions. (yeah, right.) still, that will interfere with refactoring, and get very crufty before long. besides, a good part of the give-and-take of this kind of conversation involves letting all of the arguments stand, rather than editing them. (and especially rather than "editing them by deletion".) let the future examine all the arguments, and see which ones stand the test of time. so you need to have stability for the process itself, not just for the links. -bowerbird p.s. jeroen, if you want to provide me a template, i could use it sooner rather than later, the better to architect it into my overall work-flow... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060626/042334d3/attachment.html From prosfilaes at gmail.com Mon Jun 26 12:08:33 2006 From: prosfilaes at gmail.com (David Starner) Date: Mon Jun 26 12:08:37 2006 Subject: [gutvol-d] Automated readability scores for PG eBooks In-Reply-To: References: <20060625215923.GA18811@pglaf.org> <6d99d1fd0606260003v1da9790ep3aed09dd6fc9414@mail.gmail.com> Message-ID: <6d99d1fd0606261208ya731c40q665b5226b05359bc@mail.gmail.com> On 6/26/06, Scott Lawton wrote: > While I agree that it would not be worth adding readability score if it had much > impact on these and other worthy goals, But if it doesn't, then those goals aren't reasons _for_ adding it. > There are lots and lots of cool things that could be done with the catalog. We could start with the results of stripping the header and running wc on it. That strikes me at least as useful as this result. Also, the ten or twelve most common words in the book after stripping the ten or twelve most common words in the English language. > Even in the context of the above, the scores would provide a great starting point for > being improved with manual cataloging and literacy labeling. I don't think so. It's downright useless for manual cataloging, as it only handles that one dimension. I don't think it will help literacy labeling much, either, which is best done manually. > Don't let the perfect stand in the way of the good. But I don't think having these numbers anywhere prominent is good. Right now our pages only have a few pieces of important information; minutia like this should go to a page linked to a page linked only from the book page, which we can fill with various stats to our hearts content. It also seems a little weird to have some proprietary reading level numbers on the system, instead of the Fog index or the Flesch-Kincaid Readability tests. It feels like an advertisement. From tony at baechler.net Mon Jun 26 11:47:17 2006 From: tony at baechler.net (Tony Baechler) Date: Mon Jun 26 12:08:40 2006 Subject: [gutvol-d] ftp.archive.org Message-ID: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> Hi list, I know that often ftp.archive.org is down for a few days at a time, but it has been down now for almost all of June. Is this permanent? Is ftp access via ftp.archive.org ended? I prefer it over ftp.ibiblio.org for PG files because it is significantly faster. If ftp access is no longer available, can anyone recommend a fast mirror that is kept frequently up to date? I tried snowy.arsc.alaska.edu but it wasn't as current as I would like. I'm planning to download several thousand zip files so a fast mirror is appreciated. I'm sure http is faster but I would prefer ftp if possible. 
Besides http://www.gutenberg.org/dirs/ isn't really much faster than metalab.unc.edu, AKA ftp.ibiblio.org. Is there a chance that ftp.archive.org has moved to a different host or ip address? I'm running ncftp for Windows so I don't think it's a caching or dns problem. I think I tried under Linux as well with similar results. It tries for about a minute and times out. I tried with and without passive mode but it doesn't matter since I can't connect. I am in California, US. -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.1.394 / Virus Database: 268.9.4/375 - Release Date: 6/25/06 From traverso at dm.unipi.it Mon Jun 26 13:10:34 2006 From: traverso at dm.unipi.it (Carlo Traverso) Date: Mon Jun 26 13:06:09 2006 Subject: !@!Re: [gutvol-d] ebooks libre et gratuits (fwd) In-Reply-To: <20060626180927.GB4897@pglaf.org> (message from Greg Newby on Mon, 26 Jun 2006 11:09:27 -0700) References: <20060626180927.GB4897@pglaf.org> Message-ID: <200606262010.k5QKAYR09348@pico.dm.unipi.it> >>>>> "Greg" == Greg Newby writes: Greg> We could probably run a mirror of this... is anyone in touch Greg> with the folks (perhaps in French)? It would take some Greg> cooperation from their end (such as an rsync server) to run Greg> a good mirror. -- Greg >> ---------- Forwarded message ---------- Date: Sun, 25 Jun 2006 >> 10:25:19 -0400 From: Janet Kegg To: Project >> Gutenberg Volunteer Discussion Subject: >> Re: [gutvol-d] ebooks libre et gratuits >> >> >> The site is now available again: http://www.ebooksgratuits.com/ >> I have written to coolmicro, if I don't hear back shortly I'll ask to common friends that work with him. Carlo From Bowerbird at aol.com Mon Jun 26 13:31:43 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Mon Jun 26 13:31:56 2006 Subject: [gutvol-d] Automated readability scores for PG eBooks Message-ID: <55d.27a30.31d19e2f@aol.com> david said: > We could start with the results of stripping the header and the "footer", where most of the legalese is these days. does anyone here know the best way to strip both of them? > Also, the ten or twelve most common words in the book after > stripping the ten or twelve most common words in the English language. you'd need to strip more than a dozen. below is a list from wikipedia. there's a strong power-law in word usage. unless you strip 200-500 common words, it probably won't reveal anything very interesting... -bowerbird > http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists Here are the top 100 words (from Project Gutenberg texts) in alphabetical order: a about after all an and any are as at be been before but by can could did do down first for from good great had has have he her him his I if in into is it its know like little made man may me men more mr much must my no not now of on one only or other our out over said see she should so some such than that the their them then there these they this time to two up upon us very was we were what when which who will with would you your -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060626/f5d86011/attachment-0001.html From scott_bulkmail at productarchitect.com Mon Jun 26 13:36:13 2006 From: scott_bulkmail at productarchitect.com (Scott Lawton) Date: Mon Jun 26 13:53:03 2006 Subject: [gutvol-d] Automated readability scores for PG eBooks In-Reply-To: <6d99d1fd0606261208ya731c40q665b5226b05359bc@mail.gmail.com> References: <20060625215923.GA18811@pglaf.org> <6d99d1fd0606260003v1da9790ep3aed09dd6fc9414@mail.gmail.com> <6d99d1fd0606261208ya731c40q665b5226b05359bc@mail.gmail.com> Message-ID: >>Even in the context of the above, the scores would provide a great starting point for >>being improved with manual cataloging and literacy labeling. > >I don't think so. It's downright useless for manual cataloging, as it >only handles that one dimension. Isn't "useless" a bit strong? Sure, it's only one dimension; that's true of any single piece of information. Right now, a manual cataloger looking for children's books would probably look for known titles and authors, search for some likely keywords ... and then what? How will they surface children's books that they don't already know about? A list of the "most readable" (no matter how flawed the metric) is a MUCH better starting point than the complete list of books at PG. >I don't think it will help literacy >labeling much, either, which is best done manually. Actually, readability scores are widely used in education. I'm sure they have their detractors, but that's true of almost anything. Even with manual labelling (which hasn't been done to date and therefore I don't see how it's an argument against an automated solution), scores are also useful. >It also seems a little weird to have some proprietary reading level >numbers on the system, instead of the Fog index or the Flesch-Kincaid >Readability tests. It feels like an advertisement. I'm in favor of any and all readability scores. If these existing scores were already in place, I probably wouldn't have bothered to comment. Or, if the choice was Fog + F-K vs. some other score, I would choose the most common score. But I haven't seen anyone offer to add Fog or F-K, so I welcome useful info from any source. Just so it's clear: I have no connection with Rocket Reader. I'm not even sure if I ever heard of them before Greg's note. I've thought for a long time that it would be useful to include readability scores. Scott From JBuck814366460 at aol.com Mon Jun 26 14:12:17 2006 From: JBuck814366460 at aol.com (Jared Buck) Date: Mon Jun 26 14:12:25 2006 Subject: [gutvol-d] ftp.archive.org In-Reply-To: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> Message-ID: <44A04DB1.6070608@aol.com> Hi Tony, I am in California too - southern california to be exact. I don't know why it's not working for you because it works fine for me. Maybe your FTP program is not connecting correctly. Me, i use wget (avilable for windows as well as a default on Linux) for my Gutenberg downloading needs. I plan to get an external hard drive (preferably an Iomega drive) later, probably for my birthday next Thursday, which i can then use to store the Gutenberg etexts and save me some disk space on my current drive. 
I would be using rsync to do that (check the Mirroring FAQ on PG if you don't know what that is), apparently it's much faster than wget or even FTP because it doesn't check every single file for hours to find updates, it keeps a list of all files and only downloads the ones that specifically need updating, saves you a couple of hours of time. Or at least that's what Aaron (Cannon) told me. Jared Tony Baechler wrote on 26/06/2006, 11:47 AM: > Hi list, > > I know that often ftp.archive.org is down for a few days at a time, > but it has been down now for almost all of June. Is this > permanent? Is ftp access via ftp.archive.org ended? I prefer it > over ftp.ibiblio.org for PG files because it is significantly > faster. If ftp access is no longer available, can anyone recommend a > fast mirror that is kept frequently up to date? I tried > snowy.arsc.alaska.edu but it wasn't as current as I would like. I'm > planning to download several thousand zip files so a fast mirror is > appreciated. I'm sure http is faster but I would prefer ftp if > possible. Besides http://www.gutenberg.org/dirs/ isn't really much > faster than metalab.unc.edu, AKA ftp.ibiblio.org. Is there a chance > that ftp.archive.org has moved to a different host or ip address? > > I'm running ncftp for Windows so I don't think it's a caching or dns > problem. I think I tried under Linux as well with similar > results. It tries for about a minute and times out. I tried with > and without passive mode but it doesn't matter since I can't > connect. I am in California, US. > > > -- > No virus found in this outgoing message. > Checked by AVG Anti-Virus. > Version: 7.1.394 / Virus Database: 268.9.4/375 - Release Date: 6/25/06 > > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d > -- . .:. .:::. .:::::. ***.:::::::.*** *******.:::::::::.******* Dmitri Yalovsky ********.:::::::::::.******** ********.:::::::::::::.******** USS Authority *******.::::::'***`::::.******* ******.::::'*********`::.****** Asst. Chief of Engineering ****.:::'*************`:.**** *.::'*****************`.* .:' *************** . . From traverso at dm.unipi.it Mon Jun 26 14:16:54 2006 From: traverso at dm.unipi.it (Carlo Traverso) Date: Mon Jun 26 14:12:29 2006 Subject: !@!Re: [gutvol-d] ebooks libre et gratuits (fwd) In-Reply-To: <200606262010.k5QKAYR09348@pico.dm.unipi.it> (message from Carlo Traverso on Mon, 26 Jun 2006 22:10:34 +0200) References: <20060626180927.GB4897@pglaf.org> <200606262010.k5QKAYR09348@pico.dm.unipi.it> Message-ID: <200606262116.k5QLGsq10323@pico.dm.unipi.it> >>>>> "Carlo" == Carlo Traverso writes: >>>>> "Greg" == Greg Newby writes: Greg> We could probably run a mirror of this... is anyone in touch Greg> with the folks (perhaps in French)? It would take some Greg> cooperation from their end (such as an rsync server) to run Greg> a good mirror. -- Greg >>> ---------- Forwarded message ---------- Date: Sun, 25 Jun 2006 >>> 10:25:19 -0400 From: Janet Kegg To: Project >>> Gutenberg Volunteer Discussion Subject: >>> Re: [gutvol-d] ebooks libre et gratuits >>> >>> >>> The site is now available again: >>> http://www.ebooksgratuits.com/ >>> Carlo> I have written to coolmicro, if I don't hear back shortly Carlo> I'll ask to common friends that work with him. Carlo> Carlo Coolmicro answered, who thanks very much for our interest, but they have already planned a mirror, so a second one is not critical. I send to Greg an adddress for further contacts. 
Carlo From gbnewby at pglaf.org Mon Jun 26 15:35:14 2006 From: gbnewby at pglaf.org (Greg Newby) Date: Mon Jun 26 15:35:15 2006 Subject: [gutvol-d] ftp.archive.org In-Reply-To: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> Message-ID: <20060626223514.GA11041@pglaf.org> On Mon, Jun 26, 2006 at 11:47:17AM -0700, Tony Baechler wrote: > Hi list, > > I know that often ftp.archive.org is down for a few days at a time, > but it has been down now for almost all of June. Is this > permanent? Is ftp access via ftp.archive.org ended? I prefer it > over ftp.ibiblio.org for PG files because it is significantly > faster. If ftp access is no longer available, can anyone recommend a > fast mirror that is kept frequently up to date? I tried > snowy.arsc.alaska.edu but it wasn't as current as I would like. I'm > planning to download several thousand zip files so a fast mirror is > appreciated. I'm sure http is faster but I would prefer ftp if > possible. Besides http://www.gutenberg.org/dirs/ isn't really much > faster than metalab.unc.edu, AKA ftp.ibiblio.org. Is there a chance > that ftp.archive.org has moved to a different host or ip address? I'm surprised you can connect to ftp.archive.org. I can't. We stopped pushing the collection to them several weeks ago. They had a hardware failure, and were unresponsive. Today, there are three master collections where new eBooks are pushed: http://www.gutenberg.org on iBiblio....see this for direct access to the raw files: ftp://ftp.ibiblio.org/pub/docs/books/gutenberg http://gutenberg.readingroo.ms same as ftp://readingroo.ms/gutenberg http://snowy.arsc.alaska.edu/gutenberg same as ftp://snowy.arsc.alaska.edu/mirrors/gutenberg They all get new files immediately. The catalog at gutenberg.org is only updated daily, and of course mirrors have their own schedule. You can check "gutenberg.dcs" in the top-level mirror directory to see if they have updated in the past week (we update gutenberg.dcs Sunday mornings EST). I hope this helps. My guess is the readingroo.ms server will give you the best throughput (though it will have some brief downtime, then possibly be heavily loaded during the world ebook fair, http://www.worldebookfair.com). Are there any Debian whizzes on this list who might want to help look after the readingroo.ms server with me? -- Greg > I'm running ncftp for Windows so I don't think it's a caching or dns > problem. I think I tried under Linux as well with similar > results. It tries for about a minute and times out. I tried with > and without passive mode but it doesn't matter since I can't > connect. I am in California, US. > > > -- > No virus found in this outgoing message. > Checked by AVG Anti-Virus. > Version: 7.1.394 / Virus Database: 268.9.4/375 - Release Date: 6/25/06 > > > _______________________________________________ > gutvol-d mailing list > gutvol-d@lists.pglaf.org > http://lists.pglaf.org/listinfo.cgi/gutvol-d From joey at joeysmith.com Mon Jun 26 23:31:47 2006 From: joey at joeysmith.com (joey) Date: Mon Jun 26 23:47:34 2006 Subject: [gutvol-d] ftp.archive.org In-Reply-To: <20060626223514.GA11041@pglaf.org> References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> <20060626223514.GA11041@pglaf.org> Message-ID: <20060627063147.GB2650@joeysmith.com> On Mon, Jun 26, 2006 at 03:35:14PM -0700, Greg Newby wrote: > > Are there any Debian whizzes on this list who might want to help look > after the readingroo.ms server with me? How can I help? 
I've been a Debian admin for going on 6 years now. From pm40fr at yahoo.fr Tue Jun 27 01:21:17 2006 From: pm40fr at yahoo.fr (pat) Date: Tue Jun 27 01:28:00 2006 Subject: [gutvol-d] EbooksGratuits online Message-ID: <20060627082117.59885.qmail@web26809.mail.ukl.yahoo.com> Hi, I am Patrick from ebooksgratuits 1/ As Carlo told you, thank you very much for being concerned by what happened to us. Now, we have moved www.ebooksgratuits.com to a very robust provider, and we will have a mirror soon at www.ebooksgratuits.org , so that it does not happen again. We have now collected some funds through a paypal button and can afford to secure our website. Such a predicament is indeed quite painful. 2/ Moreover,as our clearance process (to use a PG word) is quite light, we are not in a position to justify the life +50 rule on some of our books (some translators especially are hard to find out), which is not something that PG would not want, to mirror help 3/ We have now transferred around 160 books to PG thanks to huge recent help (Chuck Greif) and the enduring patience of Tonya. We are now going on more slowly, as the problem is to find out the sources that can be cleared. 4/ Should we disappear again, be aware that you can also have access plenty of our files through P2P (edonkey/emule), search on "ebooksgratuits". --------------------------------- Yahoo! Mail r?invente le mail ! D?couvrez le nouveau Yahoo! Mail et son interface r?volutionnaire. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060627/e9659641/attachment.html From schultzk at uni-trier.de Tue Jun 27 03:42:19 2006 From: schultzk at uni-trier.de (Schultz Keith J.) Date: Tue Jun 27 03:42:25 2006 Subject: [gutvol-d] the end of the line In-Reply-To: <37c.5655d2e.31d18a6c@aol.com> References: <37c.5655d2e.31d18a6c@aol.com> Message-ID: Hi Everybody, I have to admit I have not followed this thread fully, believe I understand the arguements well enough. It all boils down to what you want and what is practical and practically possible. Linking and references are a problem of syncronisation. Using hard copies you always give the reference author, publisher, year, edition and page (optional line) when a reference is to another book, article. All this information is absolute necessary. The layout of the publication could change or even the text itsself!! The other aspect is that a reference is always made to text and not lines or pages ( blanks, and punctuation is also text in the wider sense)! What is needed is a method to keep all this information syncronized. For e-text(books) you need mark-up in one form or another and a system that keeps track of everything. That is all changes, links, references, changes in text and its position. As mentioned you need an umbrella to keep everything under control. In other words a sub set in which everthing is syncronized. There is no one method that is fool proof and many systems out there. As sugested one could use a method in which the critical edition are marked up and the user can state what he wants to see. That makes the files very large and sometimes difficult to find what you want. I will go with bowerbirds umbrella and take what I can get. To me you can not have your cake and eat it too. That is easy mark- up, easy to read without preprocessing the text or using a viewer ! Regards Keith. 
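[A minimal sketch, in Python, of the "reference the text, not the page
or line" idea in the last few messages: given a quoted passage, find it
in whatever copy of the e-text is at hand, ignoring how that copy
happens to be wrapped, and report a line/column in that copy. The
function name and behavior are illustrative, not part of any proposed
system.]

import re

def resolve_quote(text, quote):
    # Collapse whitespace differences so the same quote matches a
    # reflowed or re-wrapped copy of the text.
    pattern = r"\s+".join(re.escape(word) for word in quote.split())
    m = re.search(pattern, text)
    if m is None:
        return None
    # Translate the match back into a 1-based (line, column)
    # position in this particular copy.
    before = text[:m.start()]
    line = before.count("\n") + 1
    col = m.start() - (before.rfind("\n") + 1) + 1
    return line, col

text = "it was the best\nof times, it was\nthe worst of times"
print(resolve_quote(text, "best of times"))   # -> (1, 12)

[The same quote resolves correctly in a copy wrapped at a different
column, which is the point: the anchor travels with the text rather
than with any one layout.]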
From hart at pglaf.org Tue Jun 27 07:20:05 2006 From: hart at pglaf.org (Michael Hart) Date: Tue Jun 27 07:20:08 2006 Subject: [gutvol-d] ftp.archive.org In-Reply-To: <20060627063147.GB2650@joeysmith.com> References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net> <20060626223514.GA11041@pglaf.org> <20060627063147.GB2650@joeysmith.com> Message-ID: I just wanted to add my personal thanks! Thanks!!! Give the world eBooks in 2006!!! Michael S. Hart Founder Project Gutenberg Blog at http://hart.pglaf.org From Bowerbird at aol.com Tue Jun 27 14:41:27 2006 From: Bowerbird at aol.com (Bowerbird@aol.com) Date: Tue Jun 27 14:41:40 2006 Subject: [gutvol-d] a tool for your grandma to download p.g. e-texts en masse Message-ID: <2c5.a8525c6.31d30007@aol.com> what are the feelings here on releasing a tool for your grandma to download p.g. e-texts en masse? although it seems in line with "unlimited distribution", it will also mean people scraping texts indiscriminately. would someone who has a stake in the bandwidth used please give me a definite answer? thanks. -bowerbird -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060627/e51ce705/attachment.html From brad at chenla.org Tue Jun 27 17:34:59 2006 From: brad at chenla.org (Brad Collins) Date: Tue Jun 27 17:32:26 2006 Subject: [gutvol-d] the end of the line In-Reply-To: <38f.548879d.31d170d3@aol.com> (Bowerbird@aol.com's message of "Mon, 26 Jun 2006 13:18:11 EDT") References: <38f.548879d.31d170d3@aol.com> Message-ID: Bowerbird@aol.com writes: > Line-breaks are mark-up. They don't add anything whatsoever to the text > itself and are completely arbitrarily decided, usually based on the > technology that is used to display the actual content. You can deny the > difference between structure, content and presentation all you want, but > it is perfectly possible to reformat a book using columns instead of > lines without changing the actual content. And where will your precious > line-breaks go in that case? Perhaps it's better to think of line-breaks as an arbitrary part of layout, rather than as mark-up. In a markup language you can specify if the value of an element ignores whitespace and line breaks (like html
<p>) or preserves them (like html <pre>).
      
      But line breaks are treated very differently by text editors.
      
Some text editors and email clients will auto-insert soft line breaks
at column markers.  This gives the user the illusion of having line
breaks, but if they send the text to someone who doesn't have this
feature, the person on the other side will just see extremely long
lines which scroll faaaaar off the screen.
      
Older editors like Emacs allow you to auto-insert hard line-breaks as
you type.  Then, when you edit text or cut and paste, you use a
command to "re-fill" the line or paragraph, reformatting the text to
break lines at a defined column marker.
      
Different programming languages treat whitespace and line breaks
      completely differently.  Some languages require you to explicitly
      indicate line breaks with markup like "\n" or ";".
      
      Since everyone seems to have a different opinion on how to treat
      whitespace and line breaks, it's best to specify very clearly how your
      language or markup treats them.
      
But it would be foolish to rely on literal line-breaks as markup for
preserving line breaks, for the simple reason that a lot of software
out there will simply not respect them as such.
      
      b/
      
      -- 
      Brad Collins , Banqwao, Thailand
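[Brad's "re-fill" operation is easy to make concrete. A minimal Python
sketch using only the standard library: the old hard breaks are thrown
away and new ones inserted at the chosen column, which is exactly why
software cannot be relied on to preserve them.]

import textwrap

hard_wrapped = """Older editors like Emacs let you insert hard
line-breaks as you type, then re-fill a paragraph after editing
so that it breaks at a chosen column again."""

# Re-filling: join everything back into one long line, then
# re-break at the new width. The breaks are presentation, not content.
unwrapped = " ".join(hard_wrapped.split())
print(textwrap.fill(unwrapped, width=40))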
      From j.hagerson at comcast.net  Tue Jun 27 18:34:56 2006
      From: j.hagerson at comcast.net (John Hagerson)
      Date: Tue Jun 27 18:45:30 2006
      Subject: [gutvol-d] Daily progress reports missing from [posted]?
      Message-ID: <007401c69a53$1318e700$0200a8c0@sarek>
      
      I have not received a daily progress report through [posted] since 22-JUN.
      Have they been temporarily suspended or discontinued? Have they been moved
      to another list?
      
      Thank you.
      
      
      From Bowerbird at aol.com  Tue Jun 27 19:13:26 2006
      From: Bowerbird at aol.com (Bowerbird@aol.com)
      Date: Tue Jun 27 19:13:33 2006
      Subject: [gutvol-d] the end of the line
      Message-ID: <250.cbe5e9b.31d33fc6@aol.com>
      
      brad-
      
      the post you quoted, from me, was an errant send.
      it was actually a message posted by someone else.
      
      in general, i don't think much about how existing
      software will treat my files, because i consider it
      my job to deliver software that does what i want.
      
      non-programmers have to live within their apps,
      but as a programmer, i create the worlds i want...
      
      the question of how to "mark up" line-breaks is a
      non-starter for me.   the plain-ascii p.g. e-texts
      already have hard line-endings in them indicating
      line-breaks.   those are the ones inserted by p.g.
      my suggestion was that the original line-breaks
      should be used instead.
      
      i think the discussion has run its course this time.
      maybe it will come up again.   or maybe it will not.
      
      and maybe there will be demand from users for
      e-texts that mimic their hard-copy counterparts,
      or maybe there will not be.
      
      maybe jeroen will work up a template i can use.
      or maybe he won't.
      
      time will tell on all these things.   or maybe it won't.
      
      -bowerbird
      -------------- next part --------------
      An HTML attachment was scrubbed...
      URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060627/d9e49c3b/attachment.html
      From michael.p.may at earthlink.net  Tue Jun 27 21:07:59 2006
      From: michael.p.may at earthlink.net (Michael May)
      Date: Tue Jun 27 21:08:02 2006
      Subject: [gutvol-d] How to digitize SRR's Five Laws?
      Message-ID: <31763220.1151467680194.JavaMail.root@elwamui-royal.atl.sa.earthlink.net>
      
      Hi all,
      
      I am Michael May, new "Classics Editor" at dLIST, the Digital Library of Information Science and Technology: http://dlist.sir.arizona.edu/
      
dLIST has received written permission from the copyright owner of works by S.R. Ranganathan to post electronic copies of several of SRR's books at the dLIST site, including the original 1931 edition of The Five Laws of Library Science, the main premise of which is "Books are for use!" Despite being out of print (a reprint is planned for later this year by Ess Ess Publications of India), Five Laws is arguably the most important work in library science to date.
      
      We have experimented with PDF by posting the prefatory pages and Chapter 1 here:
      http://genie.sir.arizona.edu/1115/
      
      However, Five Laws is over 500 pages and includes numerous illustrations. I believe a text or html version would be much easier to access and preserve.
      
      What advice do you have about how to proceed? I was thinking about starting by recruiting volunteers from the LIS community to transcribe the text. What should I think about or plan for before asking people to help? Does Project Gutenberg already have resources available that could help us?
      
      I'd very much appreciate any suggestions or advice.
      
      Thanks.
      
      Mike
      From hyphen at hyphenologist.co.uk  Tue Jun 27 23:08:58 2006
      From: hyphen at hyphenologist.co.uk (Dave Fawthrop)
      Date: Tue Jun 27 23:09:10 2006
      Subject: [gutvol-d] a tool for your grandma to download p.g. e-texts en
      	masse
      In-Reply-To: <2c5.a8525c6.31d30007@aol.com>
      References: <2c5.a8525c6.31d30007@aol.com>
      Message-ID: 
      
      On Tue, 27 Jun 2006 17:41:27 EDT,  Bowerbird@aol.com wrote:
      
      |what are the feelings here on releasing a tool for 
      |your grandma to download p.g. e-texts en masse?
      
      What have you against Grandparents?
They/I, as grandparents, use the same tools that you do.
      
      -- 
      Dave Fawthrop  
      "Intelligent Design?" my knees say *not*. 
      "Intelligent Design?" my back says *not*.
      More like "Incompetent design". Sig (C) Copyright Public Domain
      
      From sly at victoria.tc.ca  Tue Jun 27 23:39:02 2006
      From: sly at victoria.tc.ca (Andrew Sly)
      Date: Tue Jun 27 23:39:05 2006
      Subject: [gutvol-d] How to digitize SRR's Five Laws?
      In-Reply-To: <31763220.1151467680194.JavaMail.root@elwamui-royal.atl.sa.earthlink.net>
      References: <31763220.1151467680194.JavaMail.root@elwamui-royal.atl.sa.earthlink.net>
      Message-ID: 
      
      
      Michael,
      
      Thanks for your message.
      
      Disclaimer: These comments are just my personal opinion,
      based on what I've seen from being involved with PG for
      a decent number of years.
      
      Yes, PG volunteers have found that, for many purposes,
      a text or html file can be preferable to a pdf.
      To start with, you have a smaller file size, which makes
      the file more accessible over slow connections.
      You can also run into extra difficulties if you
      want to update the file, or correct some errors that
      are found in a year or two's time.
      
In my own experience, lots of illustrations certainly do
add to the complexity of the task.
      
      One point to consider about looking for volunteers from the
LIS community is that you might be getting yourself into a
      big discussion of markup, encoding process, documentation,
      etc. before you get going.
      
      Have you done much work transcribing books before?
      If you were a new PG volunteer, I would gently suggest
      that a project of this nature is too much to tackle,
      and point you towards www.pgdp.net to start with some
      easy pages there.
      
      You ask "Does Project Gutenberg already have resources available
      that could help us?" Interesting question. By far the biggest
resource PG has is its many volunteers who directly (or indirectly)
      contribute to it. If you have any specific requests or problems,
      we could probably direct you to someone who has dealt with it
      before. (With 18,000 books, we've had plenty of issues to
      deal with.)
      For a general overview, you could try reading:
      http://www.gutenberg.org/faq/
      although some of the material there is slightly outdated now.
      
Of course the tempting possibility I could mention is requesting
      non-exclusive permission for PG to distribute this text, and then
      we could run it through Distributed Proofreaders.
      
      Andrew
      
      On Tue, 27 Jun 2006, Michael May wrote:
      
      > Hi all,
      >
      > I am Michael May, new "Classics Editor" at dLIST, the Digital Library of Information Science and Technology: http://dlist.sir.arizona.edu/
      >
      > dLIST has received written permission from the copyright owner of works by S.R. Ranganathan to post electronic copies of several of SRR's books at the dLIST site, including the original 1931 edition of The Five Laws of Library Science, the main premise of which is "Books are for use!" Despite being out of print (a reprint is planned for later this year by Ess Ess Publications of India ), Five Laws is arguably the most important work in library science to date.
      >
      > We have experimented with PDF by posting the prefatory pages and Chapter 1 here:
      > http://genie.sir.arizona.edu/1115/
      >
      > However, Five Laws is over 500 pages and includes numerous illustrations. I believe a text or html version would be much easier to access and preserve.
      >
      > What advice do you have about how to proceed? I was thinking about starting by recruiting volunteers from the LIS community to transcribe the text. What should I think about or plan for before asking people to help? Does Project Gutenberg already have resources available that could help us?
      >
      > I'd very much appreciate any suggestions or advice.
      >
      > Thanks.
      >
      > Mike
      From tony at baechler.net  Wed Jun 28 00:40:54 2006
      From: tony at baechler.net (Tony Baechler)
      Date: Wed Jun 28 00:40:52 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: <44A04DB1.6070608@aol.com>
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      Message-ID: <7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      
      Hi.  Yes, I'm vaguely familiar with rsync.  The problem is that I 
      don't want each and every file posted.  I don't download html and 
      8-bit files for example.  I only download the zipped plain text 
      files.  Also I don't want some religious works.  Therefore rsync 
      won't help me.  As far as the external drive, that's not a bad idea 
      but I think I prefer DVD instead.  Finally, http://www.archive.org/ 
      is fine, just ftp doesn't work.  I tried on two different computers 
      so I don't think it's my settings.  I also have wget but prefer ncftp 
      as it's a dedicated ftp client.  I am near San Diego, CA.
      
      At 02:12 PM 6/26/06 -0700, you wrote:
      >Hi Tony, I am in California too - southern california to be exact.  I
      >don't know why it's not working for you because it works fine for me.
      >Maybe your FTP program is not connecting correctly.  Me, i use wget
      >(avilable for windows as well as a default on Linux) for my Gutenberg
      >downloading needs.
      >
      >I plan to get an external hard drive (preferably an Iomega drive) later,
      >probably for my birthday next Thursday, which i can then use to store
      >the Gutenberg etexts and save me some disk space on my current drive.  I
      >would be using rsync to do that (check the Mirroring FAQ on PG if you
      >don't know what that is), apparently it's much faster than wget or even
      >FTP because it doesn't check every single file for hours to find
      >updates, it keeps a list of all files and only downloads the ones that
      >specifically need updating, saves you a couple of hours of time.  Or at
      >least that's what Aaron (Cannon) told me.
      
      
      
      
      From tony at baechler.net  Wed Jun 28 00:52:27 2006
      From: tony at baechler.net (Tony Baechler)
      Date: Wed Jun 28 00:52:24 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: <20060626223514.GA11041@pglaf.org>
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<20060626223514.GA11041@pglaf.org>
      Message-ID: <7.0.1.0.2.20060628004644.03fd5a20@baechler.net>
      
      Hi.  Thanks very much, the readingroo.ms server seems much 
      faster.  When I checked last, snowy.arsc.alaska.edu seemed to be a 
      few hours behind the other master sites.  I am no longer able to 
      connect to ftp.archive.org, it just times out.  I am not a Debian 
      expert but I do run a Debian server and know a reasonable amount 
      about it.  What needs doing?  I am not really a programmer but I know 
      how to install packages and set up things for the most part.  If 
      there is something that needs to be done, let me know and I'll see.
      
      At 03:35 PM 6/26/06 -0700, you wrote:
      
      >I hope this helps.  My guess is the readingroo.ms server will
      >give you the best throughput (though it will have some
      >brief downtime, then possibly be heavily loaded during the
      >world ebook fair, http://www.worldebookfair.com).
      >
      >Are there any Debian whizzes on this list who might want to help look
      >after the readingroo.ms server with me?
      >
      >   -- Greg
      
      
      
      
      From Bowerbird at aol.com  Wed Jun 28 01:43:45 2006
      From: Bowerbird at aol.com (Bowerbird@aol.com)
      Date: Wed Jun 28 01:43:55 2006
      Subject: [gutvol-d] a tool for your grandma to download p.g. e-texts en
      	masse
      Message-ID: <256.cf31f48.31d39b41@aol.com>
      
      dave said:
      >    What have you against Grandparents?
      
      it's just an expression, dave.
      it indicates "not technically inclined".
      
      and, like most such stereotypical shorthand,
      it's got a grain of truth, and not much more.
      
      hey, two of my best online e-book buddies
      -- nicholas hodson and meyer moldeven --
      are technically astute (nicholas especially),
      and they're both into their eighties now...
      
      i'm old enough to be a grandparent myself.         :+)
      
      -bowerbird
      -------------- next part --------------
      An HTML attachment was scrubbed...
      URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060628/dc2b03e8/attachment-0001.html
      From Bowerbird at aol.com  Wed Jun 28 02:36:08 2006
      From: Bowerbird at aol.com (Bowerbird@aol.com)
      Date: Wed Jun 28 02:36:13 2006
      Subject: [gutvol-d] How to digitize SRR's Five Laws?
      Message-ID: <2fd.7a682d2.31d3a788@aol.com>
      
      mike said:
      >    Despite being out of print 
      >    (a reprint is planned for later this year 
      >    by Ess Ess Publications of India 
      >    ), 
>    Five Laws is arguably the most important work in library science to date.
      
      that's quite a sad commentary, isn't it, that
      what is arguably _the_ most important work
      in library science to date is _out_of_print_...
      
      so congratulations on bringing it back to life.
      
      the .pdf versions you've made are not as useful
      as they could be, however, because you've just
      wrapped the scans into a .pdf.   that means that
      the text cannot be searched or copied out of it,
      and those are two of the big benefits of e-books.
      
      so yes, you are right they would be better with
      digital text.   but there's no need to transcribe.
      instead, o.c.r. the scans, correct the results, and
      then wrap that digital text into several formats:
      plain text could be one, .html could be another,
      and even .pdf (except this time with text that is
      searchable and could be copied out of the .pdf).
      further, the scans could also be used themselves.
      (but you should strive for higher-quality scans.)
      
      here's a rough sketch of how to proceed:
      1.   scan the book's pages.
      2.   clean up the scans.   (straighten, crop, etc.)
      3.   perform the o.c.r.
      4.   clean up the text.
      5.   proofread the text against the scans.
      6.   auto-convert the text to .html.
      7.   auto-convert the text to .pdf.
      
      i'd be willing to help guide you on any of the steps.
      (especially the last couple, which some people might
      try and tell you are "impossible".   don't believe them.)
      
      distributed proofreaders would probably also help
      if you could donate the text to project gutenberg.
      since there will be one copy in cyberspace anyway,
      tell the publisher there might as well be lots of 'em.
      (online copies don't really cannibalize print sales;
      indeed, there are some indications they feed 'em.)
      besides, _books_are_for_use_, are they not?          :+)
      
      for some pretty "digital reprint" examples to look at, see:
      >   http://www.ibiblio.org/ebooks/Mabie/Books_Culture.pdf
      >   http://www.ibiblio.org/ebooks/Cather/Antonia/Antonia.pdf
      >   http://www.ibiblio.org/ebooks/Einstein/Einstein_Relativity.pdf
      
      for a look at a system that enables volunteers to proofread, see:
      >   http://www.greatamericannovel.com/mabie/mabiep001.html
      >   http://www.greatamericannovel.com/myant/myantc001.html
      >   http://www.greatamericannovel.com/ahmmw/ahmmwc001.html
      >    http://www.greatamericannovel.com/sgfhb/sgfhbc001.html
      
      -bowerbird
      -------------- next part --------------
      An HTML attachment was scrubbed...
      URL: http://lists.pglaf.org/private.cgi/gutvol-d/attachments/20060628/8a73521e/attachment.html
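[Step 6 of the sketch above -- "auto-convert the text to .html" -- can
be surprisingly small for plainly formatted prose. A minimal Python
illustration, treating blank lines as paragraph breaks; a real
conversion of Five Laws would also need handling for headings, tables,
and the illustrations. The sample text and title here are made up for
the example.]

import html

def text_to_html(text, title):
    # Blank lines separate paragraphs; line-breaks inside a
    # paragraph collapse to spaces, as in ordinary prose.
    paras = [p for p in text.split("\n\n") if p.strip()]
    body = "\n".join("<p>%s</p>" % html.escape(" ".join(p.split()))
                     for p in paras)
    return ("<html><head><title>%s</title></head>\n<body>\n%s\n"
            "</body></html>" % (html.escape(title), body))

sample = "CHAPTER I\n\nBooks are for use!\nThat is the first law."
print(text_to_html(sample, title="The Five Laws of Library Science"))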
      From desrod at gnu-designs.com  Wed Jun 28 04:10:43 2006
      From: desrod at gnu-designs.com (David A. Desrosiers)
      Date: Wed Jun 28 04:11:43 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: <7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      	<7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      Message-ID: 
      
      
      > Hi.  Yes, I'm vaguely familiar with rsync.  The problem is that I 
      > don't want each and every file posted.  I don't download html and 
      > 8-bit files for example.  I only download the zipped plain text 
      > files.  Also I don't want some religious works.  Therefore rsync 
      > won't help me.
      
       	I'm sorry... what? You can rsync exactly what files you wish, 
      recursively or not, pick and choose, with rsync... using the right 
      options. I mirror Gutenberg here with rsync, skipping the DVD files, 
      .mp3 files, .rar files and a few others, getting only the useful 
      copies of books.
      
       	What part of rsync's usage is confusing you?
      
      
      David A. Desrosiers
      desrod@gnu-designs.com
      http://gnu-designs.com
      From tony at baechler.net  Wed Jun 28 08:53:34 2006
      From: tony at baechler.net (Tony Baechler)
      Date: Wed Jun 28 08:53:31 2006
      Subject: [gutvol-d] New DVD ISO feedback sought
      In-Reply-To: <20060626093237.GA27369@pglaf.org>
      References: <20060626093237.GA27369@pglaf.org>
      Message-ID: <7.0.1.0.2.20060628084333.0426a7b0@baechler.net>
      
      Hi Greg,
      
      At the risk of sounding uninformed, why not include the copyrighted 
      books on the first DVD with as many titles as possible?  My 
      understanding is that PG must be allowed to at least noncommercially 
      distribute copyrighted works before they are added.  You wouldn't be 
selling the DVDs so I don't see a problem.  Most CC licenses
allow at least free noncommercial distribution anyway.  Is it just a
      matter of not enough space after the 3.5 GB of public domain titles?
      
      What about a DVD of only html books and no plain text?  The books 
      could directly be viewed in a browser.  Maybe a "best of" collection 
      but only with uncompressed html files and illustrations and on a DVD 
      instead of a CD.
      
      As far as PG's best work in terms of illustrations, I suggest 
      searching through the "posted" list archives for the word 
      "illustration."  I've noticed that David W and Joe sometimes comment 
      on images which stand out.  This might be a good basis for the best 
      of DVD described above.  Also, what about including musical scores in 
      one of these sets?
      
      I'm unfamiliar with Amazon's best of public domain list so I can't 
      comment on that.  One slight concern I would have with showing off 
      PG's best work is that some people might not be interested.  For 
      example, David W just posted five volumes on the life of George 
      Washington.  I'm sure it's interesting (I haven't looked at it yet) 
      but might not interest non-US readers and might be advanced for some 
      people.  It isn't exactly light reading.  I'm sure the text has few 
      errors and the html looks good but maybe it isn't of interest to 
      many.  This could be where the readability scores come in useful 
though.  Pick the best PG books with the nicest html and images that
are the easiest to read.
      
      Those are my thoughts.  Another possibility in the future would be a 
      CD or DVD with Braille files.  National Braille Press in the US is 
      selling such a CD but it's expensive.  It would make more sens to 
      give it away.  The majority of blind people are unemployed so paying 
      for such a CD set is out of the reach of most of them, at least in the US.
      
      
      
      
      From nwolcott2ster at gmail.com  Tue Jun 27 06:43:03 2006
      From: nwolcott2ster at gmail.com (Norm Wolcott)
      Date: Wed Jun 28 09:48:31 2006
      Subject: [gutvol-d] ebooks libre et gratuits
      References: 
      Message-ID: <003301c69ad2$a66849e0$640fa8c0@gw98>
      
They are now instituting a "quota" system, apparently to avoid wholesale
downloads of their site. Interestingly, the Internet Archive does not get any of
their books, only the front page, and nothing since December 2005!
      
      The limit is a daily one, and you are invited to return tomorrow, for
      another quota apparently.
      nwolcott2@post.harvard.edu
      ----- Original Message -----
      From: "Philip Baker" 
      To: 
      Sent: Sunday, June 25, 2006 8:35 PM
      Subject: [gutvol-d] ebooks libre et gratuits
      
      
      > In article <200606251622.35322.donovan@abs.net>, D Garcia
      >  writes
      > >On Sunday 25 June 2006 10:25 am, Janet Kegg wrote:
      > >> The site is now available again: http://www.ebooksgratuits.com/
      > >>
      > >> See the front page of the Web site  for what I believe (my French is
      > >> almost nonexistent) is an explanation of what happened.
      > >
      > >The news item roughly translated is:
      > >
> >As you probably noted, the site was inaccessible for over a week; the reason
> >is that our ISP shut down following "A crippling DDOS attack" which they were
> >not able to successfully block. We changed ISPs, and the site is once again
> >available. We will take the necessary means so that this type of thing cannot
> >happen again; I will speak about it again very soon.
      >
      >
> They are being rather optimistic but we will have to wait and see if
> their "mesures nécessaires" (necessary measures) work.
      > --
      > Philip Baker
      
      From joey at joeysmith.com  Wed Jun 28 16:19:00 2006
      From: joey at joeysmith.com (joey)
      Date: Wed Jun 28 16:35:03 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: 
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      	<7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      	
      Message-ID: <20060628231900.GD2650@joeysmith.com>
      
      On Wed, Jun 28, 2006 at 07:10:43AM -0400, David A. Desrosiers wrote:
      > 
      > >Hi.  Yes, I'm vaguely familiar with rsync.  The problem is that I 
      > >don't want each and every file posted.  I don't download html and 
      > >8-bit files for example.  I only download the zipped plain text 
      > >files.  Also I don't want some religious works.  Therefore rsync 
      > >won't help me.
      
I have to echo what David said. Rather than chaining yourself to FTP,
you should look more deeply at what rsync is capable of. If you need it,
I could probably help you define an rsync line that gets what you want
and ONLY what you want (I already have one myself that pulls ONLY the
zip files).
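
For instance, a minimal sketch of such a line (the module name is the 
one used elsewhere in this thread; rsync applies the first matching 
filter, so the includes must come before the catch-all exclude):

 	rsync -av --include='*/' --include='*.zip' --exclude='*' \
 		ftp@ftp.ibiblio.org::gutenberg Gutenberg-zips
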
      From hart at pglaf.org  Wed Jun 28 17:35:30 2006
      From: hart at pglaf.org (Michael Hart)
      Date: Wed Jun 28 17:35:31 2006
      Subject: [gutvol-d] Michael Hart is on the Road
      Message-ID: 
      
      
I will be rather slow with my email responses for the next month,
and I presumed, as has already correctly been shown, that some
messages of the somewhat negative kind would come at such a time.
I appreciate the way our list members have allowed me the opportunity
for an easier time with such messages, as I don't have to respond alone.
      
      I really can't tell you how much I appreciate all the support for
      the work I have been doing, and hope will continue, with the very
      wonderful help of perhaps as many as 50,000 volunteers.
      
      
      Thank you!
      
      Thank you!
      
      Thank you!
      
      
      Give the world eBooks in 2006!!!
      
      Michael S. Hart
      Founder
      Project Gutenberg
      
      Blog at http://hart.pglaf.org
      
      From hart at pglaf.org  Wed Jun 28 17:43:15 2006
      From: hart at pglaf.org (Michael Hart)
      Date: Wed Jun 28 17:43:17 2006
      Subject: [gutvol-d] !@!Re: [BP] Re: EXTRA! Project Gutenberg Weekly
      	Newsletter 
      Message-ID: 
      
      
By the way, as we have continually offered, if anyone would like
to write up a different catalogue, counting system, or whatever,
we would be only too happy to include it in the Newsletters, and in
the various archives.
      
      I am sure that people could come up with counts both higher, and
      lower, than whatever method is chosen, and we would be very glad
      to repost those counts each week, each month, or even each year,
      if anyone would be willing to put them together in pretty much a
      free fashion, as long as there was some internal consistency.
      
Nothing like 100% accuracy would be required, and we should have
the capability of averaging all such counts, hopefully arriving at
an average count that reflects what people are used to seeing in
library catalogues.
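
The averaging step itself would be trivial; as a sketch, assuming one 
count per line in a plain text file:

 	# print the mean of a column of e-book counts
 	awk '{ s += $1; n++ } END { if (n) printf "average: %.0f\n", s/n }' counts.txt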
      
      As I will be away for a month, now would be a perfect time to do
      this sort of thing, and if it catches on, perhaps I won't have a
      Newsletter that I have to do so much of personally when I return
      after this trip. . .perhaps I won't have to do it at all. . . .
      
      
      Thanks!!!
      
      Give the world eBooks in 2006!!!
      
      Michael S. Hart
      Founder
      Project Gutenberg
      
      Blog at http://hart.pglaf.org
      From desrod at gnu-designs.com  Wed Jun 28 17:51:40 2006
      From: desrod at gnu-designs.com (David A. Desrosiers)
      Date: Wed Jun 28 17:52:45 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: <20060628231900.GD2650@joeysmith.com>
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      	<7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      	
      	<20060628231900.GD2650@joeysmith.com>
      Message-ID: 
      
      
> I have to echo what David said. Rather than chaining yourself to 
> FTP, you should look more deeply at what rsync is capable of. If you 
> need it, I could probably help you define an rsync line that gets 
> what you want and ONLY what you want (I already have one myself that 
> pulls ONLY the zip files).
      
       	Here's mine...
      
 	rsync -avzprlHtPS --delete --exclude='[0-9]*.txt'	\
 		--exclude='*.iso' --exclude='*.rar' --exclude='*.ISO'	\
 		--exclude='*.mp3' --exclude='pgdvd*'		\
 		ftp@ftp.ibiblio.org::gutenberg Gutenberg
      
       	This gives me ~34GiB of data... enough for me to use as a 
      viable mirror.
      
      
      David A. Desrosiers
      desrod@gnu-designs.com
      http://gnu-designs.com
      From jeroen.mailinglist at bohol.ph  Thu Jun 29 14:06:00 2006
      From: jeroen.mailinglist at bohol.ph (Jeroen Hellingman (Mailing List Account))
      Date: Thu Jun 29 14:03:21 2006
      Subject: [gutvol-d] a tool for your grandma to download p.g. e-texts en
      	masse
      In-Reply-To: <256.cf31f48.31d39b41@aol.com>
      References: <256.cf31f48.31d39b41@aol.com>
      Message-ID: <44A440B8.9090801@bohol.ph>
      
Bowerbird@aol.com wrote:
> dave said:
>> What have you against Grandparents?
> it's just an expression, dave.
> it indicates "not technically inclined".
>
My grandfather worked with one of the first computers to be
      installed here in the Netherlands in the fifties. Last year he bought a
      new PC, at ninety years old, and is still using it regularly to stay in
      touch with relatives who have settled down all across the globe.
      Although not a nerd, he certainly knows how to use the machine...
      
      Jeroen.
      
      From Bowerbird at aol.com  Thu Jun 29 14:14:12 2006
      From: Bowerbird at aol.com (Bowerbird@aol.com)
      Date: Thu Jun 29 14:14:24 2006
      Subject: [gutvol-d] a tool for your grandma to download p.g. e-texts en
      	masse
      Message-ID: <51e.25e11f5.31d59ca4@aol.com>
      
      jeroen said:
      >    Although not a nerd, he certainly knows how to use the machine...
      
      so, have you got him hard at work proofing for you?           ;+)
      
      -bowerbird
      From tony at baechler.net  Fri Jun 30 00:58:52 2006
      From: tony at baechler.net (Tony Baechler)
      Date: Fri Jun 30 00:58:46 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: 
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      	<7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      	
      Message-ID: <7.0.1.0.2.20060630005304.03354d60@baechler.net>
      
My understanding of rsync was that you had to mirror the entire PG 
archive.  That was based on the PG FAQ and my attempts to read the 
help and man page.  I couldn't figure out the command line options 
and the experiments I tried gave me errors.  I think the PG FAQ gives 
a sample command line but that's for everything, which isn't what I 
want.  Besides, it's nice to manually look at and download each 
file.  I often like to stop every few files and look at a book of 
interest.  So, to answer your question, all of rsync confuses me, 
since I never got it to work.
      
Also, another problem might be that I'm primarily on Windows.  I know 
rsync is common in Linux and I have it installed on the Debian server 
that I run but I'm not sure if it's available for Windows or not.  I 
have Cygwin so I might have it, but again I have no idea how to get 
it to only get the files I want.  That's what's nice about getting them 
manually: I can skip those I don't want as I see them in the newsletters.
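
That browse-then-fetch habit maps onto rsync too, at least as a 
sketch (the module name comes from David's line; the directory and 
file names below are only hypothetical examples of the layout):

 	# list a directory on the server without downloading anything
 	rsync ftp@ftp.ibiblio.org::gutenberg/etext05/ | less

 	# then pull just the one zip you decide you want
 	rsync -av ftp@ftp.ibiblio.org::gutenberg/etext05/somebook.zip .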
      
      At 07:10 AM 6/28/06 -0400, you wrote:
      >         I'm sorry... what? You can rsync exactly what files you wish,
      >recursively or not, pick and choose, with rsync... using the right
      >options. I mirror Gutenberg here with rsync, skipping the DVD files,
      >.mp3 files, .rar files and a few others, getting only the useful
      >copies of books.
      >
      >         What part of rsync's usage is confusing you?
      
      
      
      
      From JBuck814366460 at aol.com  Fri Jun 30 16:37:07 2006
      From: JBuck814366460 at aol.com (Jared Buck)
      Date: Fri Jun 30 16:37:19 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: <7.0.1.0.2.20060630005304.03354d60@baechler.net>
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      	<7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      	
      	<7.0.1.0.2.20060630005304.03354d60@baechler.net>
      Message-ID: <44A5B5A3.1010703@aol.com>
      
Rsync's available for Windows as part of the cygwin package.  Just like 
FTP or wget you can tell rsync to get only the stuff you want, and 
unlike FTP or wget it will only download the files that need updating, 
without you having to wait several hours for it to skip over every file 
that hasn't changed.

I admit it can be confusing since it's a very powerful tool.  I was 
talking about it with Aaron Cannon and he says it's a better way to make 
a "mirror" of PG (with or without specific files that you want).
      
      My suggestion for people who want to use rsync?  Have someone write a 
      more detailed FAQ on it, explain it in non-technical terms, and provide 
      some examples (using the PG archive) of commands you can run with it, 
      especially sample rsync lines like David has, explaining all the '-' 
      tags and what they mean in context with the line and what they will make 
      rsync do to the files you download/mirror.
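
As a first pass at what I mean, here is David's line from earlier in 
the thread with each switch glossed (meanings per the rsync man page; 
the redundant switches are folded into -a, which already implies 
-rlpt):

 	# -a  archive mode: recurse and preserve times, perms, links, etc.
 	# -v  verbose; -z  compress file data in transit
 	# -H  preserve hard links; -P  show progress and keep partial files
 	# -S  handle sparse files efficiently
 	# --delete  remove local files that have vanished upstream
 	rsync -avzHPS --delete \
 		--exclude='[0-9]*.txt' --exclude='*.iso' --exclude='*.ISO' \
 		--exclude='*.rar' --exclude='*.mp3' --exclude='pgdvd*' \
 		ftp@ftp.ibiblio.org::gutenberg Gutenberg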
      
      Jared
      
      Tony Baechler wrote on 30/06/2006, 12:58 AM:
      
 > My understanding of rsync was that you had to mirror the entire PG
 > archive.  That was based on the PG FAQ and my attempts to read the
 > help and man page.  I couldn't figure out the command line options
 > and the experiments I tried gave me errors.  I think the PG FAQ gives
 > a sample command line but that's for everything, which isn't what I
 > want.  Besides, it's nice to manually look at and download each
 > file.  I often like to stop every few files and look at a book of
 > interest.  So, to answer your question, all of rsync confuses me,
 > since I never got it to work.
       >
       > Also, another problem might be that I'm primarily on Windows.  I know
       > rsync is common in Linux and I have it installed on the Debian server
       > that I run but I'm not sure if it's available for Windows or not.  I
       > have Cygwin so I might have it, but again I have no idea how to get
 > it to only get the files I want.  That's what's nice about getting
 > them manually: I can skip those I don't want as I see them in the
 > newsletters.
       >
       > At 07:10 AM 6/28/06 -0400, you wrote:
       > >         I'm sorry... what? You can rsync exactly what files you wish,
       > >recursively or not, pick and choose, with rsync... using the right
       > >options. I mirror Gutenberg here with rsync, skipping the DVD files,
       > >.mp3 files, .rar files and a few others, getting only the useful
       > >copies of books.
       > >
       > >         What part of rsync's usage is confusing you?
      
-- 
Dmitri Yalovsky
USS Authority
Asst. Chief of Engineering
      
      From desrod at gnu-designs.com  Fri Jun 30 16:56:58 2006
      From: desrod at gnu-designs.com (David A. Desrosiers)
      Date: Fri Jun 30 16:58:01 2006
      Subject: [gutvol-d] ftp.archive.org
      In-Reply-To: <44A5B5A3.1010703@aol.com>
      References: <7.0.1.0.2.20060626114129.032ee4e0@baechler.net>
      	<44A04DB1.6070608@aol.com>
      	<7.0.1.0.2.20060628003711.03fdd800@baechler.net>
      	
      	<7.0.1.0.2.20060630005304.03354d60@baechler.net>
      	<44A5B5A3.1010703@aol.com>
      Message-ID: 
      
      
      > My suggestion for people who want to use rsync?  Have someone write 
      > a more detailed FAQ on it, explain it in non-technical terms, and 
      > provide some examples (using the PG archive) of commands you can run 
      > with it, especially sample rsync lines like David has, explaining 
      > all the '-' tags and what they mean in context with the line and 
      > what they will make rsync do to the files you download/mirror.
      
       	How about using Unison?
      
       	http://www.cis.upenn.edu/~bcpierce/unison/
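
 	A hedged sketch of what that might look like (the host and 
paths are purely hypothetical, and you'd be syncing against a copy 
you control rather than an official PG endpoint):

 	# two-way sync between a local tree and one on a machine you own
 	unison ~/Gutenberg ssh://mirror.example.net//srv/gutenberg -batch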
      
      
      David A. Desrosiers
      desrod@gnu-designs.com
      http://gnu-designs.com