From marcello at perathoner.de  Tue Sep  1 04:44:33 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 01 Sep 2009 13:44:33 +0200
Subject: [gutvol-d] Re: Grosly Broken browser or wiki
In-Reply-To: <Pine.LNX.4.63.0908311929090.32579@durendal.durendal.org>
References: <Pine.LNX.4.63.0908311929090.32579@durendal.durendal.org>
Message-ID: <4A9D0921.7040700@perathoner.de>

Greg Weeks wrote:
> 
> Something is messed up. Can someone undo the last edit I just did for 
> the Science fiction bookshelf wiki page. It wiped it clean.
> 

ibiblio is blocking posts longer than 64K.

Even the older edits are longer than that so I can't restore them from here.

I have contacted ibiblio to remove this limitation for pg.

Meanwhile nothing is lost. We have all edits in the history.





From marcello at perathoner.de  Tue Sep  1 09:31:58 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 01 Sep 2009 18:31:58 +0200
Subject: [gutvol-d] Re: Grosly Broken browser or wiki
In-Reply-To: <4A9D0921.7040700@perathoner.de>
References: <Pine.LNX.4.63.0908311929090.32579@durendal.durendal.org>
	<4A9D0921.7040700@perathoner.de>
Message-ID: <4A9D4C7E.3030102@perathoner.de>

Marcello Perathoner wrote:

> Greg Weeks wrote:
>>
>> Something is messed up. Can someone undo the last edit I just did for 
>> the Science fiction bookshelf wiki page. It wiped it clean.
>>
> 
> ibiblio is blocking posts longer than 64K.
> 
> Even the older edits are longer than that so I can't restore them from 
> here.
> 
> I have contacted ibiblio to remove this limitation for pg.
> 
> Meanwhile nothing is lost. We have all edits in the history.

Works now.

Maybe we should split that page into smaller chunks:

   http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)/A-L
   http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)/M-Z

or even more chunks.

From Bowerbird at aol.com  Thu Sep  3 11:36:30 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 3 Sep 2009 14:36:30 EDT
Subject: [gutvol-d] Re: =?utf-8?q?Everyone_Wants_a_Kindle=E2=80=93For_=2450?=
Message-ID: <cb4.5e0255a1.37d166ae@aol.com>

it ends up that people would like for the kindle to cost $50.

somebody should tell david rothman about this.

>    
http://mediamemo.allthingsd.com/20090903/study-everyone-wants-a-kindle-for-50/

kindle-shmindle.   i want the apple itablet to cost $50...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090903/94ce3d05/attachment.html>

From Bowerbird at aol.com  Fri Sep  4 13:55:05 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 4 Sep 2009 16:55:05 EDT
Subject: [gutvol-d] keeping up with cory
Message-ID: <d5d.473d8042.37d2d8a9@aol.com>

if you haven't kept up with cory lately,
he does a nice little history review here:
>    
http://www.locusmag.com/Perspectives/2009/09/cory-doctorow-special-pleading.html

money quote:
>    I don't give away downloads because I'm just a swell guy --
>    I do it because I'm a self-employed entrepreneur who 
>    needs to make as much as he can to support his family.

in other words, a free online copy doesn't _cost_ him money,
it _makes_ him money.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090904/75357a77/attachment.html>

From hart at pobox.com  Sun Sep  6 07:31:49 2009
From: hart at pobox.com (Michael S. Hart)
Date: Sun, 6 Sep 2009 07:31:49 -0700 (PDT)
Subject: [gutvol-d] !@! An urgent appeal to all Canadian supporters of
 Project Gutenberg
Message-ID: <alpine.DEB.2.00.0909060729490.15115@mail.pglaf.org>



Please forward as you feel appropriate.


Michael S. Hart
Founder
Project Gutenberg


---------- Forwarded message ----------
Date: Sun, 6 Sep 2009 01:03:59 -0700 (PDT)
From: Mark Akrigg <contact.gutenberg at yahoo.ca>
To: Michael Hart <hart at pobox.com>
Cc: hart at pglaf.org
Subject: An urgent appeal to all Canadian supporters of Project Gutenberg

Dear friends:

I am the founder of Project Gutenberg Canada, and would like to make
a special appeal to Canadian supporters of Project Gutenberg.

Our government is sponsoring a Copyright Consultation on future
changes to the Copyright Act. This is an unprecedented request from
the government for the people of Canada to express their views on
copyright law.

Please consider making your personal submission. In the submission
which I made on behalf of Project Gutenberg Canada, I made the
following five recommendations:

1. A "Safe Harbour" provision for works more than 75 years old where
the life dates of the authors are not known

2. No extensions of copyright durations

3. Explicit assignment to the Public Domain of those photographs
that were in the Public Domain in 1997

4. 75 year copyright for works with more than 15 authors

5. Enhanced protection of the Public Domain

You can read the full PG Canada submission at

http://www.ic.gc.ca/eic/site/008.nsf/eng/01390.html

Your own submission should be in your own words, and can be quite
short.  We don't want to bury the government in spam, and truly
individual submissions will have the greatest effect.  There is no
need to precisely mirror the recommendations I made.

You will find the main Copyright Consultations page here:

http://copyright.econsultation.ca/

You will find information on how to email your submission here:

http://copyright.econsultation.ca/topics-sujets/show-montrer/18

You might also wish to send a copy of your submission to your Member
of Parliament:

http://www2.parl.gc.ca/Parlinfo/Compilations/HouseOfCommons/MemberByPostalCode.aspx?Menu=HOC

The main Copyright Consultation page has information on how you can
participate in the forums being conducted by the government on
copyright issues, which naturally cover many issues which do not
affect PG Canada, but do affect your life in other respects.

The main thing is to make your submission sooner rather than later:
the Copyright Consultation ends on September 13th.

It appears possible that there will be a federal election in Canada
this fall.  Don't forget to tell your candidates that they are
answerable to you when it comes to copyright law, and that you
expect any future government to protect and promote the Public
Domain.


Thank you in advance for your help.  And don't forget to make your
submission!


Dr. Mark Akrigg Founder, Project Gutenberg Canada



From richfield at telkomsa.net  Mon Sep  7 01:00:42 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Mon, 07 Sep 2009 10:00:42 +0200
Subject: [gutvol-d] Re: !@! An urgent appeal to all Canadian supporters of
 Project Gutenberg
In-Reply-To: <alpine.DEB.2.00.0909060729490.15115@mail.pglaf.org>
References: <alpine.DEB.2.00.0909060729490.15115@mail.pglaf.org>
Message-ID: <4AA4BDAA.4080306@telkomsa.net>

Michael S. Hart wrote:
> Please forward as you feel appropriate.
>
>   
OK, so I am un-Canadian. So they don't have to read it. I sent them 
this. They won't do it of course, but I think some countries should 
consider the principle (among others of course).
============================

To: Copyright Consultations

Your initiative in consulting Canadians on copyright matters does Canada 
credit, especially during a period of world-wide confusion and bad-faith 
violation and manipulation of copyrights. I am an author of largely 
semitechnical material and a heavy user of published material in general 
and I hope that you will consider some of the following points during 
the consultations. There is no question of any one correspondent 
covering the entire field of course.

At the end of this document I address your questions as they were 
presented on your web page.

Please note that I have nothing to say that specifically addresses 
anything but reading matter and illustrations, whether in electronic, 
printed or written form. Music, films and the like are outside my line 
of intimate involvement.

To begin with we should understand that the entire matter is one of 
resolution of conflicts of interest. Realising this does not make the 
question simple, but trying to resolve it without clearly understanding 
that point would be simply futile, and only the lawyers would profit. At 
any point the question should be: "Whose interests would be furthered by 
such a measure?" If the answer is; "None in particular," then the 
measure should be considered no further. With due admiration for 
Mencken's "...there is always an easy solution to every human problem - 
neat, plausible, and wrong," I insist that the fewer and simpler the 
rules and regulations, the better.

Let us consider some of the interests in possible conflict, in no 
definitive sequence.

   1. The author or authors
   2. The authors' estates, dependents, and heirs
   3. Purchasers of copyrights
   4. The publishers
   5. The Canadian public who purchase the material
   6. The Canadian public who use the material
   7. The Canadian public image domestically and internationally
   8. The International public who purchase the material
   9. The International public who use the material
  10. Posterity

You will be well aware that there are emergent complications, both in 
good faith and very often in very bad faith, but as far as practical I 
am trying to stick to simple, commonsense lines of thought.

   1. It is largely common cause that it is good that authors can
      publish and that they may exercise reasonable copyright. I do not
      consider complications such as authorship under contract or
      employment.
   2. It similarly is good that an author that serves the public's
      desires suitably should be able to do so at sufficient profit to
      make it worth his own while and for adequate benefit to his
      dependents.
   3. In spite of certain parties' idealistic objections, there is no
      practical basis apart from normal taxation, for limiting an
      author's legitimate profit from his work. If he or his publishers
      become billionaires from a book, then so be it.
   4. It is in the public interest that an author's work be made
      available and that an author's productivity be nurtured for as
      long as public interests underwrite the published works through
      purchase or sponsorship or whatever arrangement suits the relevant
      parties.
   5. There is no cogent basis for nominating any particular time limit
      to the copyright. 75 years after a work or 50 years after the
      author's death or the like are simply thumb-sucks at figures that
      suited particular parties, or were as long as they thought they
      could get away with. For most books they are too long by far, for
      a few they are probably too short.
   6. It is a matter of the mildest concern one way or the other how
      long a book stays in copyright as long as it is sufficiently
      widely available in sufficient numbers and at reasonable cost if
      there is public demand. Publishing one copy a year in central
      Greenland at a price of a million dollars each as a bad-faith
      legalistic means of preventing public access would not meet the case.
   7. Conversely, there are thousands of books that seem unlikely to get
      back into commercial print again, but are unsung classics. I could
      mention quite a few off my own shelves, such as "A Sailor's Life"
      by de Hartog, the autobiographical works of Alexander King,
      "Nature is your Guide" by Gatty, "Short History of the Art of
      Distillation" by Forbes, and a number of others that I do not wish
      to check for being in print at present. Some are textbooks of
      great value or primary documentation of events of great interest,
      but without commercial appeal. Such books often are doomed because
      no commercial publisher in his right mind would touch them, but by
      the time that they are out of copyright even the libraries and
      second-hand shops will have pulped their copies. They are of no
      benefit to any of the categories of interests that I listed above.
      Consider "Mr Belloc Objects" by Wells; it went out of print
      immediately after being published in 1926 and I am not even sure
      of its status today. However, it is one of the greatest gems of
      polemics in the history of science, and if certain specially
      interested parties had not scanned it in, it might have been lost
      by now. His "Science of Life" might well follow. At the same time,
      as long as works in that twilight zone are technically in
      copyright, projects such as Gutenberg will not touch them.
   8. Any regulation that could be dispensed with without injustice, or
      could be substituted by a simpler or more self-regulatory
      convention is an imposition on both state and public and should be
      expunged or avoided.
   9. The following scheme should accommodate or alleviate most of the
      foregoing considerations.
         1. Copyright restrictions should apply according to some such
            scheme as those currently applicable. The exact terms and
            periods are not of major concern to this discussion.
         2. As long as the product remains in print and reasonably
            available to the public through normal commercial channels
            etc and no other cogent objection can be raised, there need
            be little material change to the arrangements.
         3. However, at any time after publication, any interested party
            could apply to some central national authority for
            non-exclusive copyright. He would have to give appropriate
            reasons why this should be granted. Such reasons would be of
            two basic types, firstly negative: lack of reasonable
            objections from interested parties. Examples of reasonable
            objections might include: the author might object to his
            publication being re-issued because of regret that he ever
            had published it. That would be valid. Conversely, the
            author might have no objection, but the publisher might wish
            to quash the book for competitive or personal reasons. That
            would not be valid. I cannot give a ranked list of negative
            considerations that the authority might consider, but it
            might be such things as that the author and family were
            deceased, that the book was out of print and that the former
            publishers had expressed lack of interest in re-commencing
            publication etc. Positive reasons might the public interest.
            A niche group might think the book of crucial value, but it
            might not at the present time be available. The appellant's
            own commercial interests would obviously not figure as
            strong arguments, and the author or his assignees would have
            claim for reasonable royalties.
         4. If the arguments for allocating non-exclusive copyright were
            seen as adequate, then the original copyright holders would
            be notified if possible, and given a reasonable period to
            respond (perhaps half a year?) and if they did not respond,
            the copyright would not be ceded, but would be extended to
            the appellant, possibly with certain restrictions fitting
            the case.
         5. The copyright, if granted to an appellant, would be
            non-exclusive; anyone else could concurrently ask for
            similar or different rights on the same or different
            grounds, and they might or might not succeed.
         6. Any such copyright would remain contingent on no valid
            objection emerging subsequent to its being granted during
            the normal period of copyright. There would explicitly be no
            assurance that the appellant either could rely on no one
            else being granted a similar copyright. Also, the copyright
            might be withdrawn (without penalty, but also without
            compensation) if the original copyright holder subsequently
            produced adequate reasons for regaining exclusive copyright.

The questions presented in the invitation to respond were as follows:

1. How do Canada?s copyright laws affect you? How should existing laws 
be modernized?

As long as Canada is a signatory to international copyright conventions, 
including those that constrain the general use of material out of print, 
but still within copyright, everyone, including myself, suffers 
pointless loss of access to valuable material. (Of course, an even 
larger volume of total rubbish gets lost as well, but none of my 
suggestions aggravate its retention!)

2. Based on Canadian values and interests, how should copyright changes 
be made in order to withstand the test of time

This is a little vague on two counts. Canadian values in context might 
at a guess include dignity, practicality, and fairness to all parties. 
The foregoing proposal seems to me to cover those. I should hope that 
the values would not assume slavishly unthinking adherence to 
traditional ways of doing things or to the NIH syndrome. As for Canadian 
interests, the scheme should entail no penalty whatever on any author or 
good-faith publisher, but should enable any party in Canada to avail 
themselves of resources that currently are being wasted pointlessly. 
Test of time? That is always hard to say antecedent to the test, but any 
scheme that puts the incentive to act constructively in the hands of the 
interested party, and permits correction in the event of error or 
justified objection, should not readily attract long-term resentment or 
annulment.

3. What sorts of copyright changes do you believe would best foster 
innovation and creativity in Canada?

As detailed above. It would leave creative individuals (authors etc) 
with absolutely no reduction of their rights and incentives (they need 
never pay a lawyer to say "no" on their behalf, or (re)commence 
publication and circulation within a reasonable time, or whatever 
similar action might prove appropriate), but it also would enable users 
among the public to avail themselves of valuable works that otherwise 
would go to waste. If anything, they might profit from extra royalties.

Possibly one also might wish to give attention to questions of 
unreasonable extensions of copyright in the hands of non-creators. 
Consider the case of the alleged behaviour of the copyright holders of 
"Gone With The Wind" in the US.

4. What sorts of copyright changes do you believe would best foster 
competition and investment in Canada?

Something along the lines of the foregoing suggestion, calculated as it 
is, not only to increase access to desirable works by rescuing them from 
stagnation and unfair competition, but also increasing their returns for 
the author by increasing the scope for keeping them in print, might well 
attract foreign authors to print their works in Canada instead of in 
more hidebound countries that do less to promote publication.

5. What kinds of changes would best position Canada as a leader in the 
global, digital economy?

I assume this refers to the current context only? After all, I am no 
economist! Canada is already a major leader in such fields. It is 
important to maintain flexibility, rather than to confuse rigidity with 
high standards. The most important thing is to ensure that laws 
accommodate the need to reward good faith and punish bad faith. A 
hypothetical illustration in the current international situation might 
be that whereas there need be no ceiling to the bonus that an executive 
of an enterprise could accept in the event of his delivering as 
contracted, it would have to be balanced by an ppropriately matching 
penalty in the event of non-delivery. A company, or even a third party 
should be able to invoke something of the type.

Another principle should be very rapid turnover in court cases. There 
should be no limit to the value of damages that could be handled in what 
are currently the "small claims courts" or ombudsmen. Instead the 
assumption should be that their verdicts are correct, rapid, informal, 
and paid for by the state. Anyone who thinks that he has been hard done 
by in such a verdict should be informed of the basis of the decision 
before he appeals. If he then appeals, it gets passed on together with 
the details of the decision and objections to a supervisory and 
re-evaluatory committee that must report back within say 24 hours. If 
anyone still is dissatisfied he has recourse to the more ponderous 
mechanisms of the courts. The appellant then pays for everything, 
lawyers on both sides etc. Only if he then wins does he get his 
investment back.

In general, keep things constructive, keep them brief, and concentrate 
on visible good faith and visible good sense.

But that wasn't the sort of question I was expecting. Did I 
misunderstand the intent?

Thank you for your attention. Feel welcome to contact me if in this 
hurried and incoherent note I left anything that seemed interesting but 
obscure.

Jon Richfield

=======================

Comments, up to and including horrified shrieks or bored yawns, as 
anyone prefers.

Jon



From Bowerbird at aol.com  Tue Sep  8 02:30:36 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 8 Sep 2009 05:30:36 EDT
Subject: [gutvol-d] labor day -- working for peace
Message-ID: <d15.51be01f9.37d77e3c@aol.com>

the rev. carl kabat has spent over 10 years in jail
as "punishment" for his _symbolic_ protests against
nuclear weapons.

>    http://www.nytimes.com/2009/09/07/us/07activist.html

what is wrong with our justice system?

what is wrong with us?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090908/2b7fa08f/attachment.html>

From bzg at altern.org  Mon Sep  7 20:36:00 2009
From: bzg at altern.org (Bastien)
Date: Tue, 08 Sep 2009 11:36:00 +0800
Subject: [gutvol-d] Re: labor day -- working for peace
In-Reply-To: <d15.51be01f9.37d77e3c@aol.com> (Bowerbird@aol.com's message of
	"Tue, 8 Sep 2009 05:30:36 EDT")
References: <d15.51be01f9.37d77e3c@aol.com>
Message-ID: <873a6yds67.fsf@bzg.ath.cx>

Bowerbird at aol.com writes:

> the rev. carl kabat has spent over 10 years in jail
> as "punishment" for his _symbolic_ protests against
> nuclear weapons.
>
>>   http://www.nytimes.com/2009/09/07/us/07activist.html
>
> what is wrong with our justice system?
>
> what is wrong with us?

Using uppercase letters in not completely useless...

-- 
 Bastien

From pterandon at gmail.com  Tue Sep  8 03:43:00 2009
From: pterandon at gmail.com (Greg M. Johnson)
Date: Tue, 8 Sep 2009 06:43:00 -0400
Subject: [gutvol-d] What is the intended use of TXT format-- why line breaks?
Message-ID: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>

Hi.
I have looked at PG's books in both HTML and TXT formats, on several
different devices, from large-screen laptops to netbooks to an Ipod Touch.

In just about every possible scenario, I had the line breaks creating an
irregular right margin down the screen that made for unpleasant reading.  I
also tried taking one of the raw TXT files to make "my own" HTML file, and
was tripped up by the line breaks.

In order to prevent me from making the suggestion of changing the whole
collection, can someone tell me why that number of characters on the screen
was chosen?




-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090908/cd0e07ec/attachment.html>

From ricardofdiogo at gmail.com  Tue Sep  8 07:29:45 2009
From: ricardofdiogo at gmail.com (Ricardo F Diogo)
Date: Tue, 8 Sep 2009 15:29:45 +0100
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
Message-ID: <9c6138c50909080729j1e3f90c7ha7f0cf0da9582ff8@mail.gmail.com>

2009/9/8 Greg M. Johnson <pterandon at gmail.com>
>
> Hi.

Hi Greg M.

> I have looked at PG's books in both HTML and TXT formats, on several different devices, from large-screen laptops to netbooks to an Ipod Touch.
>
> In just about every possible scenario, I had the line breaks creating an irregular right margin down the screen that made for unpleasant reading.? I also tried taking one of the raw TXT files to make "my own" HTML file, and was tripped up by the line breaks.
>
You can visit: http://www.gutenberg.org/wiki/Gutenberg:Readers%27_FAQ#R.30._When_I_print_out_the_text_file.2C_each_line_runs_over_the_edge_of_the_page_and_looks_bad.

First, all paragraphs and separate lines should be separated by two
HRs, so that you can see one blank line between them. Where they
aren't, as in the case of a table of contents or lines of verse, add
the extra HRs to make them so.
Replace All occurrences of two HRs with some nonsense character or
string that doesn't exist in the text, like ~$~.
Replace All remaining HRs with a space.
Replace your inserted string ~$~ with one HR.


> In order to prevent me from making the suggestion of changing the whole collection, can someone tell me why that number of characters on the screen was chosen?
>
>
You can visit: http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#About_the_formatting_of_a_text_file

The idea is: using a pure text format, with a number of lines per page
that can readable in most computers and preserved for the future to
come.

Ricardo F. Diogo

From ricardofdiogo at gmail.com  Tue Sep  8 07:31:14 2009
From: ricardofdiogo at gmail.com (Ricardo F Diogo)
Date: Tue, 8 Sep 2009 15:31:14 +0100
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <9c6138c50909080729j1e3f90c7ha7f0cf0da9582ff8@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<9c6138c50909080729j1e3f90c7ha7f0cf0da9582ff8@mail.gmail.com>
Message-ID: <9c6138c50909080731l396d32b9u17a88c3e2dfed472@mail.gmail.com>

(Of course, in my last message I meant "characters per line", not
"lines per page".

Ricardo F. Diogo)

From Bowerbird at aol.com  Tue Sep  8 09:22:23 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 8 Sep 2009 12:22:23 EDT
Subject: [gutvol-d] Re: labor day -- working for peace
Message-ID: <c06.5f263ab7.37d7debf@aol.com>

bastien said:
>    Using uppercase letters in not completely useless...

what is wrong with us americans?

and what is wrong with our justice system?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090908/0b87de64/attachment.html>

From lee at novomail.net  Tue Sep  8 09:41:40 2009
From: lee at novomail.net (Lee Passey)
Date: Tue, 08 Sep 2009 10:41:40 -0600
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
Message-ID: <4AA68944.2040108@novomail.net>

Greg M. Johnson wrote:

> In order to prevent me from making the suggestion of changing the whole 
> collection, can someone tell me why that number of characters on the 
> screen was chosen?

In 1985 virtually all interactions with computers were performed via 
"smart terminals," predominately the DEC VT-52 and VT-100, which 
presented only text in an 80 x 25 array, that is, 25 lines each having 
at most 80 characters. The characters could be highlighted by reversing 
the electron output on the CRT (i.e. using 'on' instead of 'off', and 
vice-versa) but any other manipulation of the font, such a italic, 
bolding, or even a different font, was simply not possible. Even most 
personal computers of that day used VT-100 emulation.

At that same time, we were being taught in typing class that left 
margins should be 66 characters; the bell would be set at 60, at which 
point the typist needed to decide whether the current word would fit in 
the 66 character limit, or whether it needed to be hyphenated.

In 1985 the principals at Project Gutenberg did not want to deal with 
hyphenation, so no words were hyphenated. The current line length of 
Project Gutenberg files was designed so no word in unhyphenated form 
would ever cause a line to exceed 80 characters and wrap to a new line 
on a typical 1985-era smart terminal.

In most ways, Project Gutenberg has not progressed beyond 1985.

From hart at pobox.com  Tue Sep  8 13:55:30 2009
From: hart at pobox.com (Michael S. Hart)
Date: Tue, 8 Sep 2009 13:55:30 -0700 (PDT)
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <4AA68944.2040108@novomail.net>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
Message-ID: <alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>


Too many search engines fail when words are hyphenated.

There are all sorts of ways to remove hard returns in one second.

It takes more time to complain than to actually find one of these
and use it. . . .

In many ways complainers have not evolved past Medieval Times.



On Tue, 8 Sep 2009, Lee Passey wrote:

> Greg M. Johnson wrote:
>
> > In order to prevent me from making the suggestion of changing the whole
> > collection, can someone tell me why that number of characters on the screen
> > was chosen?
>
> In 1985 virtually all interactions with computers were performed via "smart
> terminals," predominately the DEC VT-52 and VT-100, which presented only text
> in an 80 x 25 array, that is, 25 lines each having at most 80 characters. The
> characters could be highlighted by reversing the electron output on the CRT
> (i.e. using 'on' instead of 'off', and vice-versa) but any other manipulation
> of the font, such a italic, bolding, or even a different font, was simply not
> possible. Even most personal computers of that day used VT-100 emulation.
>
> At that same time, we were being taught in typing class that left margins
> should be 66 characters; the bell would be set at 60, at which point the
> typist needed to decide whether the current word would fit in the 66 character
> limit, or whether it needed to be hyphenated.
>
> In 1985 the principals at Project Gutenberg did not want to deal with
> hyphenation, so no words were hyphenated. The current line length of Project
> Gutenberg files was designed so no word in unhyphenated form would ever cause
> a line to exceed 80 characters and wrap to a new line on a typical 1985-era
> smart terminal.
>
> In most ways, Project Gutenberg has not progressed beyond 1985.
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From marcello at perathoner.de  Tue Sep  8 15:07:06 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 09 Sep 2009 00:07:06 +0200
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
Message-ID: <4AA6D58A.4070700@perathoner.de>

Michael S. Hart wrote:

> There are all sorts of ways to remove hard returns in one second.

But no way to decide which ones to drop and which one to keep.


> It takes more time to complain than to actually find one of these
> and use it. . . .

Actually nobody has yet come up with a satisfactory solution to this 
problem.


> In many ways complainers have not evolved past Medieval Times.

Here we go again. Blaming your customers is still cheaper than fixing 
your bugs.

I guess that's your only way out if you can't find a single argument to 
uphold the boneheaded plain text format that PG is still producing.

And you can't find one, because there ain't one.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From i30817 at gmail.com  Tue Sep  8 15:27:49 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Tue, 8 Sep 2009 23:27:49 +0100
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <4AA6D58A.4070700@perathoner.de>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com> 
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org> 
	<4AA6D58A.4070700@perathoner.de>
Message-ID: <212322090909081527m1ea8c8ddpec17b8709f853c64@mail.gmail.com>

I actually agree with the complaints if you don't mind my input (no
flaming please). I try to do some sort of correction for my ebook
reader, but its very primitive (and breakable) if the first alphabetic
character  in the new line is uppercase, keep the line, otherwise join
them. First i tried if the last character of the previous line before
a alphanumeric is a punctuation, keep the line, otherwise join it, but
hey, more false positives. The one i uses at least corrects normal
errors (Noun names non-withstanding) while keeping things like Chapter
headings mostly intact (except lowercase off course).

They can't be both applied i think. If some has a better algorithm,
please share hey? This is one of the reasons i prefer html formats. A
space is a space is not dozens of spaces and \n is nothing at all and
<p> is king.

From azkar0 at gmail.com  Tue Sep  8 15:54:46 2009
From: azkar0 at gmail.com (Scott Olson)
Date: Tue, 8 Sep 2009 16:54:46 -0600
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <212322090909081527m1ea8c8ddpec17b8709f853c64@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
	<212322090909081527m1ea8c8ddpec17b8709f853c64@mail.gmail.com>
Message-ID: <2362473e0909081554i5d16a98drafdc9923d8d540b7@mail.gmail.com>

On Tue, Sep 8, 2009 at 4:27 PM, Paulo Levi <i30817 at gmail.com> wrote:

> If some has a better algorithm, please share hey?


If the line begins with whitespace, don't re-wrap it unless you have to
(usually poetry, or something else where the linebreaks matter).

If there are blank lines (two or more \n's), don't wrap the stuff together
(paragraphs, section/chapter breaks).

Otherwise \n = space.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090908/2b4e7cee/attachment-0001.html>

From desrod at gnu-designs.com  Tue Sep  8 15:58:47 2009
From: desrod at gnu-designs.com (David A. Desrosiers)
Date: Tue, 8 Sep 2009 18:58:47 -0400
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <4AA6D58A.4070700@perathoner.de>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
Message-ID: <a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>

On Tue, Sep 8, 2009 at 6:07 PM, Marcello
Perathoner<marcello at perathoner.de> wrote:
> But no way to decide which ones to drop and which one to keep.

Given the previous example, negative lookahead assertions seem to fit well here:

s/(?<!\n)\n(?!\n)/.../g or perhaps ^\n(?!\n) if you need to anchor it
or /\n(?!\n)/ for zero width and /\n[^\n]/ for width=1 and so on.

Plenty of ways to skin that cat in most regex-capable languages.

From gbnewby at pglaf.org  Tue Sep  8 17:34:14 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 8 Sep 2009 17:34:14 -0700
Subject: [gutvol-d] Why plain text was, is,
	and always will be a huge mistake (Re:	 Re: What is the intended
	use of TXT format-- why line	breaks?)
In-Reply-To: <4AA6D58A.4070700@perathoner.de>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
Message-ID: <20090909003414.GA1609@pglaf.org>

On Wed, Sep 09, 2009 at 12:07:06AM +0200, Marcello Perathoner wrote:
> Michael S. Hart wrote:
>> In many ways complainers have not evolved past Medieval Times.
>
> Here we go again. Blaming your customers is still cheaper than fixing  
> your bugs.

Between David Widger & Al Haines, thousands of older text only titles
were updated and HTML added.  Nearly all new titles come with text and
HTML.  Many, many bugs were fixed.  Some people might characterize the
lack of HTML as a bug.

The discussion thread had turned to different ways of identifying
paragraph breaks in plain text.  As Marcello rightly points out, it's
tough to do automatically with complete accuracy (though many methods
can produce less than complete accuracy).

Nobody has argued that text is the master format, or should be.  Or that
it is somehow richer, or preserves more of the original formatting.
There are some advantages...there are many limitations.

PG's reasons for an emphasis on including plain text, if feasible, are
well documented.  One of the first responses in this thread pointed this
out.  I've adjusted the Subject line in my response, for people who want
to talk about why text sux, among themselves.  If there's enough
interest, I can set up a separate mailing list for you.  I won't be
joining, however.

  -- Greg



From Bowerbird at aol.com  Wed Sep  9 02:46:42 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 9 Sep 2009 05:46:42 EDT
Subject: [gutvol-d] why the plain-text format is the most useful format for
	eliciting beauty (and more)
Message-ID: <d4a.56f9fa3e.37d8d382@aol.com>

read the reviews for the iphone e-book viewer-app called "eucalyptus".
you will see that it gets credit for making p.g. e-texts look beautiful...
eucalyptus uses the plain-text format; it elicits beauty from that format.

greg notes that david widger and al haines are "updating" older e-texts
with .html equivalents.   do you know how they are accomplishing that?
they run a program that converts the plain-text format to an .html file.
to put it in other words, the .html file is elicited from the plain-text 
file.

i guess david and al want to "get the credit" for creating the .html files,
which is fine.   but if they really wanted to increase overall 
productivity,
they'd turn the conversion routine loose, so any end-user could run it,
without having to wait for david or al to get around to the file they want.
moreover, with more people using the routine, chances are that it would
be improved via open-source coding contributions, which would be cool.

but remember, it's the plain-text file that puts all of this action in 
play...

***

greg said:
>   Nobody has argued that text is the master format, or should be.

that's bull-shit, pure and simple.

i have argued -- at length, with good arguments, ones that nobody
has been able to counter -- that the plain-text format is the master.

the same conversion processes that enable eucalyptus to elicit beauty
from the plain-text file and which enable whitewashers to elicit .html
can be used to create any type of file-format we might want to elicit,
from the kindle to .epub to .pdf to .rtf to .lit to the-next-big-thing...

now, that's not to say that the current form of the plain-text files is
good enough to do the job, because it's not.   but that's simply because
the "powers that be" haven't accepted the modifications i've suggested.
but that's their stupidity, it's not an inherent weakness in the format...

in summary...

if you're not smart enough to see that i have won this particular debate,
step right up into the circle and i will be happy to knock you out, again.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090909/c8ec885e/attachment.html>

From marcello at perathoner.de  Wed Sep  9 05:12:57 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 09 Sep 2009 14:12:57 +0200
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>	<4AA68944.2040108@novomail.net>	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>	<4AA6D58A.4070700@perathoner.de>
	<a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>
Message-ID: <4AA79BC9.3040003@perathoner.de>

David A. Desrosiers wrote:

> On Tue, Sep 8, 2009 at 6:07 PM, Marcello
> Perathoner<marcello at perathoner.de> wrote:
>> But no way to decide which ones to drop and which one to keep.
> 
> Given the previous example, negative lookahead assertions seem to fit well here:
> 
> s/(?<!\n)\n(?!\n)/.../g or perhaps ^\n(?!\n) if you need to anchor it
> or /\n(?!\n)/ for zero width and /\n[^\n]/ for width=1 and so on.
> 
> Plenty of ways to skin that cat in most regex-capable languages.

ROTFL! Apply that algorithm to Hamlet and see.


See if you can come up with an algorithm that doesn't make mincemeat of 
the following small excerpt. The algorithm should at least:

1. Recognizes that "HAMLET, PRINCE OF DENMARK by William Shakespeare" is 
the title statement of the work. This should be marked up like:

   <h1>Hamlet, Prince of Denmark<br/><br/>
   by William Shakespeare</h1>

and NOT:

   <h1>Hamlet, Prince of Denmark</h1>
   <h2>by William Shakespeare</h2>


2. Not wrap the list of persons proper, BUT wrap <p>Lords, Ladies, 
Officers, Soldiers, Sailors, Messengers, and other Attendants.</p>

3. Recognize that <p>SCENE. Elsinore</p> is a stage direction, not the 
start of scene 1.

4. Recognize <h2>ACT I.</h2>

5. Recognize <h3>Scene I. Elsinore. A platform before the Castle.</h3> 
(Even if it lacks spacing.)
--- start excerpt from #1524 ----


HAMLET, PRINCE OF DENMARK

by William Shakespeare




PERSONS REPRESENTED.

Claudius, King of Denmark.
Hamlet, Son to the former, and Nephew to the present King.
Polonius, Lord Chamberlain.
Horatio, Friend to Hamlet.
Laertes, Son to Polonius.
Voltimand, Courtier.
Cornelius, Courtier.
Rosencrantz, Courtier.
Guildenstern, Courtier.
Osric, Courtier.
A Gentleman, Courtier.
A Priest.
Marcellus, Officer.
Bernardo, Officer.
Francisco, a Soldier
Reynaldo, Servant to Polonius.
Players.
Two Clowns, Grave-diggers.
Fortinbras, Prince of Norway.
A Captain.
English Ambassadors.
Ghost of Hamlet's Father.

Gertrude, Queen of Denmark, and Mother of Hamlet.
Ophelia, Daughter to Polonius.

Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and other
Attendants.

SCENE. Elsinore.



ACT I.

Scene I. Elsinore. A platform before the Castle.

[Francisco at his post. Enter to him Bernardo.]

Ber.
Who's there?

Fran.
Nay, answer me: stand, and unfold yourself.

Ber.
Long live the king!

Fran.
Bernardo?

Ber.
He.

Fran.
You come most carefully upon your hour.

--- end excerpt #1524 ----




-- 
Marcello Perathoner
webmaster at gutenberg.org

From desrod at gnu-designs.com  Wed Sep  9 06:03:05 2009
From: desrod at gnu-designs.com (David A. Desrosiers)
Date: Wed, 9 Sep 2009 09:03:05 -0400
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <4AA79BC9.3040003@perathoner.de>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
	<a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>
	<4AA79BC9.3040003@perathoner.de>
Message-ID: <a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>

On Wed, Sep 9, 2009 at 8:12 AM, Marcello
Perathoner<marcello at perathoner.de> wrote:
> ROTFL! Apply that algorithm to Hamlet and see.

> See if you can come up with an algorithm that doesn't make mincemeat of the
> following small excerpt. The algorithm should at least:

As you already know, parsing HTML is a much easier matter than parsing
semi-freeflow text (which was the original poster's request).

Also remember, I do this all the time for spiders we write for
Plucker. I slice, I dice, and I make beautiful, automated works of art
from the worst, most semantically-incorrect HTML out there. See some
examples here:

http://projects.plkr.org/

From hart at pobox.com  Wed Sep  9 06:31:33 2009
From: hart at pobox.com (Michael S. Hart)
Date: Wed, 9 Sep 2009 06:31:33 -0700 (PDT)
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
	<a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>
	<4AA79BC9.3040003@perathoner.de>
	<a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.0909090631001.26284@mail.pglaf.org>


Want the simple way?

Try unzipping to the Apple text format. . . .


On Wed, 9 Sep 2009, David A. Desrosiers wrote:

> On Wed, Sep 9, 2009 at 8:12 AM, Marcello
> Perathoner<marcello at perathoner.de> wrote:
> > ROTFL! Apply that algorithm to Hamlet and see.
>
> > See if you can come up with an algorithm that doesn't make mincemeat of the
> > following small excerpt. The algorithm should at least:
>
> As you already know, parsing HTML is a much easier matter than parsing
> semi-freeflow text (which was the original poster's request).
>
> Also remember, I do this all the time for spiders we write for
> Plucker. I slice, I dice, and I make beautiful, automated works of art
> from the worst, most semantically-incorrect HTML out there. See some
> examples here:
>
> http://projects.plkr.org/
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From marcello at perathoner.de  Wed Sep  9 06:45:28 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 09 Sep 2009 15:45:28 +0200
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>	<4AA68944.2040108@novomail.net>	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>	<4AA6D58A.4070700@perathoner.de>	<a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>	<4AA79BC9.3040003@perathoner.de>
	<a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>
Message-ID: <4AA7B178.9010009@perathoner.de>

David A. Desrosiers wrote:

> On Wed, Sep 9, 2009 at 8:12 AM, Marcello
> Perathoner<marcello at perathoner.de> wrote:
>> ROTFL! Apply that algorithm to Hamlet and see.
> 
>> See if you can come up with an algorithm that doesn't make mincemeat of the
>> following small excerpt. The algorithm should at least:
> 
> As you already know, parsing HTML is a much easier matter than parsing
> semi-freeflow text (which was the original poster's request).


Do you read a post before replying?

That's exactly what I requested you to do: To parse a plain text version 
of Hamlet into wrapped and non-wrapped paragraphs.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From desrod at gnu-designs.com  Wed Sep  9 08:20:55 2009
From: desrod at gnu-designs.com (David A. Desrosiers)
Date: Wed, 9 Sep 2009 11:20:55 -0400
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
	breaks?
In-Reply-To: <4AA7B178.9010009@perathoner.de>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
	<a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>
	<4AA79BC9.3040003@perathoner.de>
	<a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>
	<4AA7B178.9010009@perathoner.de>
Message-ID: <a82cdbb90909090820x14c22133p55d8ccb6fba1f3d4@mail.gmail.com>

On Wed, Sep 9, 2009 at 9:45 AM, Marcello
Perathoner<marcello at perathoner.de> wrote:
> Do you read a post before replying?

Of course... do you?

> That's exactly what I requested you to do: To parse a plain text version of
> Hamlet into wrapped and non-wrapped paragraphs.

You did? The following looks pretty much like HTML to me, not plain
ASCII text that wraps at 70 columns (like the original poster who
started this thread requested).

> See if you can come up with an algorithm that doesn't make mincemeat of the
> following small excerpt. The algorithm should at least:
>
> 1. Recognizes that "HAMLET, PRINCE OF DENMARK by William Shakespeare" is the
> title statement of the work. This should be marked up like:
>
>  <h1>Hamlet, Prince of Denmark<br/><br/>
>  by William Shakespeare</h1>
>
> and NOT:
>
>  <h1>Hamlet, Prince of Denmark</h1>
>  <h2>by William Shakespeare</h2>
>
>
> 2. Not wrap the list of persons proper, BUT wrap <p>Lords, Ladies, Officers,
> Soldiers, Sailors, Messengers, and other Attendants.</p>
>
> 3. Recognize that <p>SCENE. Elsinore</p> is a stage direction, not the start
> of scene 1.
>
> 4. Recognize <h2>ACT I.</h2>
>
> 5. Recognize <h3>Scene I. Elsinore. A platform before the Castle.</h3> (Even
> if it lacks spacing.)

From desrod at gnu-designs.com  Wed Sep  9 08:24:29 2009
From: desrod at gnu-designs.com (David A. Desrosiers)
Date: Wed, 9 Sep 2009 11:24:29 -0400
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
In-Reply-To: <d4a.56f9fa3e.37d8d382@aol.com>
References: <d4a.56f9fa3e.37d8d382@aol.com>
Message-ID: <a82cdbb90909090824g6d85ae4cn49df24baec6cb057@mail.gmail.com>

On Wed, Sep 9, 2009 at 5:46 AM, <Bowerbird at aol.com> wrote:
> read the reviews for the iphone e-book viewer-app called "eucalyptus".
> you will see that it gets credit for making p.g. e-texts look beautiful...
> eucalyptus uses the plain-text format; it elicits beauty from that format.

And thanks to Apple, another compelling reason NOT to get an iPhone to
read etexts:

http://www.blog.montgomerie.net/whither-eucalyptus

From hart at pobox.com  Wed Sep  9 09:47:35 2009
From: hart at pobox.com (Michael S. Hart)
Date: Wed, 9 Sep 2009 09:47:35 -0700 (PDT)
Subject: [gutvol-d] Re: What is the intended use of TXT format-- why line
 breaks?
In-Reply-To: <a82cdbb90909090820x14c22133p55d8ccb6fba1f3d4@mail.gmail.com>
References: <a0bf3e960909080343rf974219o45d05ca7b1bf3577@mail.gmail.com>
	<4AA68944.2040108@novomail.net>
	<alpine.DEB.2.00.0909081353020.22979@mail.pglaf.org>
	<4AA6D58A.4070700@perathoner.de>
	<a82cdbb90909081558n53f04a90j724d4c27010848ba@mail.gmail.com>
	<4AA79BC9.3040003@perathoner.de>
	<a82cdbb90909090603x478d9859id777c2e4ea9b5983@mail.gmail.com>
	<4AA7B178.9010009@perathoner.de>
	<a82cdbb90909090820x14c22133p55d8ccb6fba1f3d4@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.0909090947070.2565@mail.pglaf.org>

On Wed, 9 Sep 2009, David A. Desrosiers wrote:






...


Now THAT is the plainest text message I've ever seen!



From Bowerbird at aol.com  Wed Sep  9 12:12:14 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 9 Sep 2009 15:12:14 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
Message-ID: <d14.52f2e707.37d9580e@aol.com>

david said:
>   And thanks to Apple, another compelling reason

don't care to get into the ring itself, do you david?
so you point to a shiny distraction off to the side!

we're talking about the plain-text format, and how it is
the most useful format for eliciting beauty (and more)...

we're talking about how so many people -- like you --
argued with me over a period of several years about
this very point, and how i have emerged victorious...

that's what we're talking about...

-bowerbird

p.s.   and, since you mentioned it, the brouhaha with
apple gave _tons_ of additional exposure to eucalyptus,
so -- in the end -- it gave the program a tremendous lift.
i'm gonna do everything i can to make sure that my apps
are held up by apple the same way, to get public sympathy.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090909/18b28cb9/attachment.html>

From jimad at msn.com  Wed Sep  9 12:50:16 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 9 Sep 2009 12:50:16 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for	eliciting beauty (and more)
In-Reply-To: <d4a.56f9fa3e.37d8d382@aol.com>
References: <d4a.56f9fa3e.37d8d382@aol.com>
Message-ID: <SNT120-DS100053AB3CC0340C06AEE7AEE90@phx.gbl>

I don't see how one "elicits beauty" from something that isn't there.  Plain
text doesn't have enough power to encode even simple mainstream texts, which
frequently include the use of italic, for example.  Yes, one can fake it,
but then its not plain text anymore.

 

I'd like to see a format that at least allows unambiguous encoding of
mainstream texts, capturing the author's intent.  Yes, once again one can
fake it using HTML, but HTML contains SO MANY other weaknesses in the other
direction!

 

If we had an unambiguous encoding which captures authors intent, then it
would be easy to go the other direction and "throw away" author's intent
when it doesn't fit into plain jane text mode.

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090909/e4782342/attachment.html>

From Bowerbird at aol.com  Wed Sep  9 17:15:11 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 9 Sep 2009 20:15:11 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
Message-ID: <cde.52ed0b6e.37d99f0f@aol.com>

jim said:
>    I don?t see how one ?elicits beauty? from something that isn?t 
there.? 
>    Plain text doesn?t have enough power to encode even 
>    simple mainstream texts, which frequently include 
>    the use of italic, for example.? 

italics are indicated by surrounding _underscores_...


>    Yes, one can fake it, but then its not plain text anymore.

you have an archaic and incorrect notion of "plain text"... ?


>    If we had an unambiguous encoding which captures authors intent, 
>    then it would be easy to go the other direction and ?throw away? 
>    author?s intent when it doesn?t fit into plain jane text mode.

why would you want to "throw away" the author's intent?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090909/5545a5fc/attachment.html>

From hart at pobox.com  Wed Sep  9 22:15:59 2009
From: hart at pobox.com (Michael S. Hart)
Date: Wed, 9 Sep 2009 22:15:59 -0700 (PDT)
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
 for eliciting beauty (and more)
In-Reply-To: <cde.52ed0b6e.37d99f0f@aol.com>
References: <cde.52ed0b6e.37d99f0f@aol.com>
Message-ID: <alpine.DEB.2.00.0909092213270.2199@mail.pglaf.org>

On Wed, 9 Sep 2009, Bowerbird at aol.com wrote:

> jim said:
> >?? I don?t see how one ?elicits beauty? from something that isn?t there.?
> >?? Plain text doesn?t have enough power to encode even
> >?? simple mainstream texts, which frequently include
> >?? the use of italic, for example.?
>
> italics are indicated by surrounding _underscores_...
>
>
> >?? Yes, one can fake it, but then its not plain text anymore.
>
> you have an archaic and incorrect notion of "plain text"... ?
>
>
> >?? If we had an unambiguous encoding which captures authors intent,
> >?? then it would be easy to go the other direction and ?throw away?
> >?? author?s intent when it doesn?t fit into plain jane text mode.
>
> why would you want to "throw away" the author's intent?
>
> -bowerbird
>

Most of the authors I have interviewed on this subject,
perhaps all, told me they never wrote in italics, bold
or underscore, that this is only a publisher artifact,
nothing to do with "author's intent."


Thanks!!!


Michael S. Hart
Founder
Project Gutenberg
Inventor of ebooks


From user5013 at aol.com  Thu Sep 10 02:45:09 2009
From: user5013 at aol.com (Christa & Jay Toser)
Date: Thu, 10 Sep 2009 04:45:09 -0500
Subject: [gutvol-d] TXT format and hard line breaks
Message-ID: <CA527096-4850-492F-B669-7D56B749DF4E@aol.com>

I do not often post to this list. But the question is relevant.   
Unfortunately, I must state my preference for raw text files.

And, to a great extent, I agree with Bowerbird.  At least, those  
parts I can read.

You need to know that I live in America. Born here. But I currently  
"surf" the "internet" with a PowerPC 6100/66.  I use Mozilla ver.  
3.0.  Macintosh operating system 7.5.1.  Dial-up is mostly 56K.

Try downloading video with that set-up.  In fact, try reading this  
user list with that set-up.  I get SOME messages when I read  
directly. I get OTHER messages when I choose to read "raw source."  
And, when I go to my workplace, and read on their PC computers, I  
read more DIFFERENT messages. [Their internet connection is lots  
better. They download about 1/2 terabyte a day.]

And yet, of the three different places that I can read this list, I  
NEVER get ALL the messages. Each and every one is different. Oh yes,  
there is overlap, but I'm not really sure if I have really gotten all  
the messages.  Honestly!

So, I think the original question was a two parter: 1) Why text only?  
2) Why the hard line breaks?

I must first apologize if I offend anyone by answering question 1).  
Text was considered universal to the English speaking world -- way  
back when Project Gutenberg started. This was at a time when Unicode  
would not exist for about two decades.

I LOVE TEXT.

As I just said above, I will not/can not/am not allowed to/ read all  
of your messages. Even if I go through three different setups, and  
two different servers, I am still not certain that I have read  
everything you have sent. I feel that I am being censored by the  
internet. It is truly my opinion that, if e-mail were just sent in  
TEXT, then I would know more of this world.

Yes, a picture is worth a thousand words. No, I would rather read a  
thousand words than see a picture.  Especially in this modern day,  
when everyone and their mother have a better way of showing data.  
Every country on the entire planet (that's what? 300+ countries?)  
they all have a new and better way to format text. Every different  
language must somehow show their data somehow JUST ABSOLUTELY  
CORRECT. Their standard is right. This standard is right. That  
standard is right. No, everything is wrong. Let's re-invent the wheel  
from scratch. No, it doesn't "look right." It has to be "correct." It  
is wrong if the text lines "break" at the "wrong" place.

Errrm, got carried away there.

In my opinion, I think the raw text of each book in Project  
Gutenberg, is the ultimate in how a book should be delivered.

Again, I apologize if I have offended anyone on this list for writing  
my obvious opinion.

2) Why the hard line breaks?  Partially, this was covered by (I  
think) Bowerbird.  There was a time when there were no fonts.  
Specifically, there were no "variable-width" fonts. Way before the  
Macintosh existed, there was only one way to read text.  And it was  
only one width per each character, and there were only 80 characters  
per line. Max. Period.

And when Project Gutenberg was started, he set the standard at  
whatever existed at the time. Break it at never more than 80  
characters -- and break it between words.  No hyphenation.

Now, this problem of hard line breaks is a legitimate problem.  Now,  
several decades after Macintosh (and later, PC's); it is my  
considered opinion that there is no need for a hard line break.  Even  
way-back-when, in the early days -- there was question of whether a  
hard line break was just a <LF> (line feed) or <CR><LF> (carriage  
return, followed by line feed). Yep, there were format problems back  
before 1980.

Now-a-days, with all the wonderful formatting which is available; in  
so many different fonts;  in so many different platforms; with so  
many different programs; that can read so many different styles; well  
then -- what do we choose is right?

**sigh**

Above, I have described my computer system. I will tell you, that my  
computer system is more advanced than perhaps 2/3rds of the world.   
Most do not have the bandwidth for a .pdf. Or actually any kind of  
formatted book.  Maybe, they have an hour per week at an internet  
cafe. At 10-12K speed.

They don't care if the line breaks are wrong.  They care if they can  
read the books. And pretty much throughout that world, they can only  
read the books, ONLY if simple 8-bit ASCII text exists. No one on  
Project Gutenberg, NO ONE, can guarantee a more universal format, nor  
a faster format to download, than text only. (Except perhaps 7-bit  
ASCII, [capital letters only]; or OCTAL; but that diverges.]


My only recommendation in this debate is this: There is no longer a  
need for a hard line break at every 80 character line.  However, I  
believe there is still a need for a hard line break between paragraphs.

I believe the text versions of the books can be scanned for single  
<CR> or <CR><LF> groups, and be removed.  Double <CR><CR> or  
<CR><LF><CR><LF> should be maintained. And yes, I feel strongly this  
can be done to the ORIGINAL .txt files. It is my opinion that the  
technology of the world has truly advanced beyond the need for a hard  
line break at the end of every line. Paragraph breaks, yes. Line  
breaks, no.

As to how this would translate into .pdf or Kindle? Gagg me with a  
spoon. I'm not there.

Hope this helps,
Jay Toser

From pterandon at gmail.com  Thu Sep 10 16:28:57 2009
From: pterandon at gmail.com (Greg M. Johnson)
Date: Thu, 10 Sep 2009 19:28:57 -0400
Subject: [gutvol-d] In search of a more-vanilla vanilla TXT
Message-ID: <a0bf3e960909101628v4a1b5d7fx6cffeec7da691de9@mail.gmail.com>

Jay Toser wrote:

> 1) ... I LOVE TEXT ...

Me too.  My problem is that the HTML n.n (n's being really small numbers)
format I think is more universally viewable TODAY than the 80-character-line
TXT on more devices (my ipod touch and 12" wide laptop screen each giving
different readability problems: wacky wraparound and too fine a print in
Notepad, respectively).


> 2) My only recommendation in this debate is this: There is no longer a
need for a
> hard line break at every 80 character line.  However, I believe there is
still a need for a
> hard line break between paragraphs.

Completely agreed.  That was the gist of my original proposal.

Another question is whether today's most primitive TXT-reading softwares now
come with wraparound-- and by this I mean "terminal editors" like emacs and
vi.  Or what is the most primitive device in use today-- is it a 1980 Win
3.1 'puter, perhaps the itouch (in some regards)?

Another idea is whether we could tolerate another format. If we've already
got half a dozen, why not have another that is "non-80 plain text" (defined
above by Jay).




-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090910/124e93a8/attachment.html>

From hart at pobox.com  Thu Sep 10 18:32:48 2009
From: hart at pobox.com (Michael S. Hart)
Date: Thu, 10 Sep 2009 18:32:48 -0700 (PDT)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <a0bf3e960909101628v4a1b5d7fx6cffeec7da691de9@mail.gmail.com>
References: <a0bf3e960909101628v4a1b5d7fx6cffeec7da691de9@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.0909101829470.16447@mail.pglaf.org>



Trouble reading on a 12" screen?

I read on my 9" screen just fine.

Perhaps a font or resolution adjustment might help?

I was on my 9" for hours today, never noticed a problem,
surfing, reading, I use Notebook there all the time with
no problem, but I probably adjusted the font/resolution,
etc., the first day I had it and never worried again.

I do use $1 reading glasses with all my computers,
I must admit. . . .


Michael



On Thu, 10 Sep 2009, Greg M. Johnson wrote:

> Jay Toser wrote:
>
> > 1) ... I LOVE TEXT ...
>
> Me too.? My problem is that the HTML n.n (n's being really small numbers)? format I think is more universally viewable TODAY than the 80-character-line
> TXT on more devices (my ipod touch and 12" wide laptop screen each giving different readability problems: wacky wraparound and too fine a print in
> Notepad, respectively).
>
>
> > 2) My only recommendation in this debate is this: There is no longer a need for a
> > hard line break at every 80 character line. ?However, I believe there is still a need for a
> > hard line break between paragraphs.
>
> Completely agreed.? That was the gist of my original proposal.
>
> Another question is whether today's most primitive TXT-reading softwares now come with wraparound-- and by this I mean "terminal editors" like emacs and
> vi.? Or what is the most primitive device in use today-- is it a 1980 Win 3.1 'puter, perhaps the itouch (in some regards)?
>
> Another idea is whether we could tolerate another format. If we've already got half a dozen, why not have another that is "non-80 plain text" (defined
> above by Jay).
>
>
>
>
> --
> Greg M. Johnson
> http://pterandon.blogspot.com
>
>

From tb at baechler.net  Fri Sep 11 00:43:51 2009
From: tb at baechler.net (Tony Baechler)
Date: Fri, 11 Sep 2009 00:43:51 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <a0bf3e960909101628v4a1b5d7fx6cffeec7da691de9@mail.gmail.com>
References: <a0bf3e960909101628v4a1b5d7fx6cffeec7da691de9@mail.gmail.com>
Message-ID: <4AA9FFB7.2040304@baechler.net>

Hi,

I know of at least two people who still use DOS regularly and at least 
one who uses a 486.  Unfortunately, DOS doesn't handle very long lines 
as I know from personal experience.  I would ask PG to please continue 
using the same text format with line breaks.  Conversion of line endings 
can be done easily when unzipping the file or with any of several 
utilities on any OS.

Before people tell me how DOS is old and no one should use it in their 
right mind, I would like to say that the people I know of simply can't 
afford anything else and in most cases lack the computer skills.  Yes, 
Linux runs on a 486 but they don't want to learn a new OS.  Also, they 
are blind.  That in itself isn't relevant but a screen reader by itself 
costs at least $795 in most cases.  Most blind people have a very small 
income and can't afford a new Windows computer.  There are some free 
screen readers but they still require XP or better.  With that said, for 
most people, long lines aren't a problem and I realize that PG can't 
please all of the people all of the time.  Those same DOS users are also 
on dial-up.  For various reasons, html viewing in DOS isn't practical.

On 9/10/2009 4:28 PM, Greg M. Johnson wrote:
> > 2) My only recommendation in this debate is this: There is no longer 
> a need for a
> > hard line break at every 80 character line.  However, I believe 
> there is still a need for a
> > hard line break between paragraphs.
>
> Completely agreed.  That was the gist of my original proposal.
>
> Another question is whether today's most primitive TXT-reading 
> softwares now come with wraparound-- and by this I mean "terminal 
> editors" like emacs and vi.  Or what is the most primitive device in 
> use today-- is it a 1980 Win 3.1 'puter, perhaps the itouch (in some 
> regards)?
>
> Another idea is whether we could tolerate another format. If we've 
> already got half a dozen, why not have another that is "non-80 plain 
> text" (defined above by Jay).


From Bowerbird at aol.com  Fri Sep 11 01:44:38 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 11 Sep 2009 04:44:38 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <d4f.5c1c36fd.37db67f6@aol.com>

greg said:
>   Or what is the most primitive device in use today

the web-browser.

a web-browser won't wrap the lines on a .txt file, so 
if the hard-returns were removed from p.g. .txt files,
the lines would run off the screen of a web-browser.

try it if you don't believe me.

***

it's absolutely true that project gutenberg should have
given users a tool that would remove the hard returns,
and it should've done that years ago, but it's also true
that the .txt files _should_ have hard-returns in them.

now, i'd suggest that those hard-returns should mimic
the ones found in the print-books against which the text
was proofed, but that won't help the books already done.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090911/9c6e78eb/attachment.html>

From Bowerbird at aol.com  Fri Sep 11 02:01:25 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 11 Sep 2009 05:01:25 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c17.64a5efbd.37db6be5@aol.com>

i said:
>    a web-browser won't wrap the lines on a .txt file, so
>    if the hard-returns were removed from p.g. .txt files,
>    the lines would run off the screen of a web-browser.

i'm sorry.   i was wrong.   safari (at least) does wrap the lines.

i'm not sure where i got that idea...

at any rate, my apologies for the misinformation.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090911/cf7fb95e/attachment.html>

From traverso at posso.dm.unipi.it  Fri Sep 11 02:34:20 2009
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Fri, 11 Sep 2009 11:34:20 +0200 (CEST)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <c17.64a5efbd.37db6be5@aol.com> (Bowerbird@aol.com)
References: <c17.64a5efbd.37db6be5@aol.com>
Message-ID: <20090911093420.1072310141@cardano.dm.unipi.it>

>>>>> "Bowerbird" == Bowerbird  <Bowerbird at aol.com> writes:


    Bowerbird> i said:
    >> a web-browser won't wrap the lines on a .txt file, so if the
    >> hard-returns were removed from p.g. .txt files, the lines would
    >> run off the screen of a web-browser.

    Bowerbird> i'm sorry.  i was wrong.  safari (at least) does wrap
    Bowerbird> the lines.

    Bowerbird> i'm not sure where i got that idea...

    Bowerbird> at any rate, my apologies for the misinformation.

Most browsers (IE, firefox, opera, konqueror) don't wrap, at least in the
default configuration. Which makes sense, since wrapping may destroy
information. 


I agree that PG should provide several custom TXT file formats. One
might convert on the fly from one format to the other. Who cares to
tune manually lines that are shorter than 55 characters? Still, this
is one of the requirements, and one that often requires some time to
achieve.

One txt file in a sufficently rich encoding to allow correct
representation is sufficient, everything else might be generated on
the fly. And the best would be the format that carries most
information: unicode, with the original line breaks as much as
possible.

Consider also that many HTML files are now provided with the original
line breaks, and having the storage TXT file with the same lines would
greatly simplify maintenance. Especially if one derives the txt file
from the HTML automatically (or even better both from a common master).

Carlo Traverso




From hart at pobox.com  Fri Sep 11 05:03:01 2009
From: hart at pobox.com (Michael S. Hart)
Date: Fri, 11 Sep 2009 05:03:01 -0700 (PDT)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <d4f.5c1c36fd.37db67f6@aol.com>
References: <d4f.5c1c36fd.37db67f6@aol.com>
Message-ID: <alpine.DEB.2.00.0909110459590.6651@mail.pglaf.org>

On Fri, 11 Sep 2009, Bowerbird at aol.com wrote:

> greg said:
> >?? Or what is the most primitive device in use today
>
> the web-browser.
>
> a web-browser won't wrap the lines on a .txt file, so
> if the hard-returns were removed from p.g. .txt files,
> the lines would run off the screen of a web-browser.
>
> try it if you don't believe me.
>
> ***
>
> it's absolutely true that project gutenberg should have
> given users a tool that would remove the hard returns,
> and it should've done that years ago, but it's also true
> that the .txt files _should_ have hard-returns in them.
>
> now, i'd suggest that those hard-returns should mimic
> the ones found in the print-books against which the text
> was proofed, but that won't help the books already done.
>
> -bowerbird
>
>

We did do that years ago, and years before that.

We also had very similar discussions years ago.

I can't tell you how many times we posted info
about different ways to remove hard returns,
what they were, etc., etc., etc.

As along as there are people who want it all
done for them without any knowledge fo how a
computer works, this will be an issue, along
with background color, font, font size, long
or short pages or margins, refresh rates....


mh

From marcello at perathoner.de  Fri Sep 11 05:53:33 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Fri, 11 Sep 2009 14:53:33 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <alpine.DEB.2.00.0909110459590.6651@mail.pglaf.org>
References: <d4f.5c1c36fd.37db67f6@aol.com>
	<alpine.DEB.2.00.0909110459590.6651@mail.pglaf.org>
Message-ID: <4AAA484D.8020705@perathoner.de>

Michael S. Hart wrote:

> I can't tell you how many times we posted info
> about different ways to remove hard returns,
> what they were, etc., etc., etc.

Strawman.


The problem is to decide which LF to retain.

In Hamlet there are speeches that are verse and speeches that are prose. 
The LFs in verse need to be retained!

There has never been posted any info on how to achieve this.


OTOH there is much empirical evidence that the problem of restoring PG 
plain texts is intractable:

Very many people have tried to write tools that convert the plain text 
mess into something usable. GutenMark, Munseys, Manybooks etc. come to 
mind. But when you download some of their machine-made repackages of PG 
you see that they didn't get very far.

PG has a very high standard of accuracy for the words, thus an automatic 
conversion has to achieve the same high standard for the formatting.


Unless somebody can provide this tool, much information has been lost.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From jimad at msn.com  Fri Sep 11 08:26:46 2009
From: jimad at msn.com (Jim Adcock)
Date: Fri, 11 Sep 2009 08:26:46 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful
	format	for eliciting beauty (and more)
In-Reply-To: <cde.52ed0b6e.37d99f0f@aol.com>
References: <cde.52ed0b6e.37d99f0f@aol.com>
Message-ID: <SNT120-DS19BCF59165FDCBB666022CAEE70@phx.gbl>

>why would you want to "throw away" the author's intent?

I don't want to throw away author's intent.  But the reality is, in many cases DP and PG do so.

Leading and following underscores are not plain text.  It is an encoding to signal to the reader that something is missing -- namely italics.  One could have just as well -- or as badly -- used <i> and </i> as the signals to indicate to the reader that italics is missing.  I don't doubt that eventually the reader can get used to what they're missing -- but why should they have to?

If it were really that hard to much more closely follow author's intent then I could understand the trade-offs.  But with today's technology it really wouldn't be hard to do much better. And again, if you *want* plain text then it's easy enough to go backwards and throw away the italic information, etc.



From Bowerbird at aol.com  Fri Sep 11 16:00:06 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 11 Sep 2009 19:00:06 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <d56.36309ef9.37dc3076@aol.com>

i said:
>    it's absolutely true that project gutenberg should have
>    given users a tool that would remove the hard returns,
>    and it should've done that years ago

michael said:
>   We did do that years ago, and years before that.

oh really?   and just exactly where is that tool?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090911/db155892/attachment.html>

From Bowerbird at aol.com  Fri Sep 11 16:16:28 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 11 Sep 2009 19:16:28 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
Message-ID: <be0.6442fdd2.37dc344c@aol.com>

jim said:
>   Leading and following underscores are not plain text.? 

sure they are.

indeed, the underscore even falls in the 7-bit range.

so it's as plain-text as plain-text can be, and it has 
a long and glorious tradition of indicating emphasis.


>    It is an encoding to signal to the reader that 
>    something is missing -- namely italics.? 

actually, i think of it as an indicator to the "rendering agent"
-- a.k.a. the viewer-program -- that the surrounded text is
to be displayed with emphasis.   (which generally means italics.)


>    One could have just as well -- or as badly -- used
>    [i]and[/i] as the signals to indicate to the reader
>    that italics is missing.? 

i used square-brackets rather than angle-brackets in the quote,
but i could have used angle-brackets just like you did, jim...

and yes, sir, any of those will work.   indeed, .html uses the
angle-brackets, and many bulletin-board systems use the
square-brackets.   and this is fine, because they use those
brackets as _markup_, with no intention that the brackets
will actually be _seen_ by any human beings.   and likewise,
i don't intend my underscores to be seen by human beings.
just like .html, or forum markup, i expect that a viewer-app
will intercede and display the emphasis just as i had intended.

however -- and this is a very big _however_ -- in the case that
those underscores _are_ being seen by actual human beings,
it's not really all that much of a problem, because underscores
are relatively non-intrusive, and they seem to provide emphasis,
which is why they developed -- spontaneously -- for that purpose.

the brackets, on the other hand, are terribly intrusive, and only
obliterate the text to be emphasized, rather than emphasize it.

likewise, the other bracket commands all serve as _obstacles_
to a human being who happens to be reading the text, and even
to those human beings who have to work with the text in other
capacities, such as editing it.   z.m.l., on the other hand, is zen.
that's why light-markup systems are taking over the world now.


>    I don't doubt that eventually the reader can get used to 
>    what they're missing -- but why should they have to?

they shouldn't.   that's why i have programmed the viewer-apps
that ensure that people don't have to read z.m.l. in its raw form.


>    If it were really that hard to much more closely follow 
>    author's intent then I could understand the trade-offs.? 
>    But with today's technology it really wouldn't be hard 
>    to do much better.

i agree.   we can do much better than what we have been handed.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090911/2ee1e5d9/attachment.html>

From jimad at msn.com  Fri Sep 11 20:55:31 2009
From: jimad at msn.com (James Adcock)
Date: Fri, 11 Sep 2009 20:55:31 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAA484D.8020705@perathoner.de>
References: <d4f.5c1c36fd.37db67f6@aol.com>	<alpine.DEB.2.00.0909110459590.6651@mail.pglaf.org>
	<4AAA484D.8020705@perathoner.de>
Message-ID: <SNT120-DS200FF29A9AF4E055FE90BCAEE60@phx.gbl>

>PG has a very high standard of accuracy for the words, thus an automatic 
conversion has to achieve the same high standard for the formatting.

I would be happy to start with if the same standard for the accuracy of
punctuation 
was held as for the high standards expected of the words.  Of course for
poetry puncs
and LF are basically the same issue.



From jimad at msn.com  Fri Sep 11 21:26:23 2009
From: jimad at msn.com (James Adcock)
Date: Fri, 11 Sep 2009 21:26:23 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful
	format	for eliciting beauty (and more)
In-Reply-To: <be0.6442fdd2.37dc344c@aol.com>
References: <be0.6442fdd2.37dc344c@aol.com>
Message-ID: <SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>

jim said:
>   Leading and following underscores are not plain text.  

sure they are.

indeed, the underscore even falls in the 7-bit range.

so it's as plain-text as plain-text can be, and it has 
a long and glorious tradition of indicating emphasis.



There are many long and inglorious methods of indicating emphasis in
plain-text including *asterix* and SHOUTING and _underscore_ and <i></i> and
[i][/i] and they all suffer from the same problem: They are all not what the
author wrote, at least not as implemented by the typically concurrently
existing publisher.  Now say 100 years later PG says ignore those previous
efforts we as the publisher of this day knows better than the original
intent so we will substitute something else for what was actually printed.
Now if someone really only has a 7-bit teletype to print their PG on, then I
can understand this.  I can also understand PG's desire to continue to
support such teletypists [[I tried using one when I was in college which
tells you how old I am but it kept overheating and burning out based on my
demands]]

 

What I don't understand is why PG continues to be wedded to plain-text as an
*input* encoding format demanded of people submitting texts to PG.
Plain-text is too constrained to do the job well.  HTML is too ambiguous,
and too ill-matched to books to do well.  We need something else, something
that CAN be correctly and automagically converted "correctly" to one or
another formats including plain-text, and Unicode, and HTML, and  mobi, etc.
And something that allows the simple every day tasks of the encoder,
including italics and m-dash and poetry, titles and chapters and
subchapters, publisher info, dates, etc to be handled correctly and easily.

 

PS: Bit curious which blind reader handles _the underscore "convention"_
correctly - I've not seen _that_ one!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090911/8429a997/attachment.html>

From Bowerbird at aol.com  Fri Sep 11 23:10:27 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 12 Sep 2009 02:10:27 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
Message-ID: <c68.55f8b1ec.37dc9553@aol.com>

jim said:
>    What I don?t understand is why PG continues to be 
>    wedded to plain-text as an *input* encoding format 
>    demanded of people submitting texts to PG.? 

well, if you _honestly_ "don't understand" the reason, jim,
then i must say that you certainly aren't trying very hard...

the plain-text format is the most valuable to people because
it is the most pliable when it comes to reworking the content.


>    Plain-text is too constrained to do the job well.? 

first you want to constrain the format to an archaic definition...
then you want to complain about it because it's too constrained.
that's disingenuous.


>    HTML is too ambiguous, and too ill-matched to books to do well.

no, that's not the problem -- .html can do a fine job on books,
for the most part, but the problem is that's it's a pain to create.


>    We need something else, something that CAN be correctly 
>    and automagically converted ?correctly? to one or another formats 
>    including plain-text, and Unicode, and HTML, and ?mobi, etc.? 

that "something else" is z.m.l.


>    And something that allows the simple every day tasks 
>    of the encoder, including italics and m-dash and poetry, 
>    titles and chapters and subchapters, publisher info, dates, etc 
>    to be handled correctly and easily.

again, you're talking about z.m.l.

but, you know, you can invent your own equivalent, if you like...


>    PS: Bit curious which blind reader handles 
>    _the underscore ?convention?_ correctly ? 
>    I?ve not seen _that_ one!

i'll let tony answer that question.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/174290f8/attachment-0001.html>

From marcello at perathoner.de  Sat Sep 12 02:47:55 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 12 Sep 2009 11:47:55 +0200
Subject: [gutvol-d] Re: why the plain-text format is the most useful	format
 for eliciting beauty (and more)
In-Reply-To: <SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
Message-ID: <4AAB6E4B.4000004@perathoner.de>

James Adcock wrote:

> There are many long and inglorious methods of indicating emphasis in plain-text 
> including **asterix** and SHOUTING and _/underscore/_ and <i></i> and [i][/i] 
> and they all suffer from the same problem: They are all not what the author 
> wrote, at least not as implemented by the typically concurrently existing 
> publisher.

No author wrote italics before word processors became available to the 
end user. They _underlined_ the passages that they wanted the publisher 
to highlight. The publisher then choose an appropriate way of 
highlighting: /italics/ or  s p a c e o u t  or SMALLCAPS.

Mediaeval copysts usually rubricated passages they wished to highlight.


> Now say 100 years later PG says ignore those previous efforts we as 
> the publisher of this day knows better than the original intent so we will 
> substitute something else for what was actually printed.

So what? The brick-and-mortar publishers of yore ignored the previous 
efforts of the monastic scribes because it was too expensive to print 
twice with different inks.

They also ignored the underlining of the author and substituted an 
artifact of their choosing. Also that artifact was largely a function of 
the cultural environment: italics or spaceout.


> What I don?t understand is why PG continues to be wedded to plain-text as an 
> **input** encoding format demanded of people submitting texts to PG.

Nobody understands that. It is a waste of resources pure and simple.

Consider that:


* The bottleneck at DP is the post-processing stage.

* The post-processor is burdened with the creation of one surplus txt file.

* The whitewasher is burdened with one or more surplus txt files.

* Every error needs to be fixed in more than one place (in html and up 
to three txt files, plus as many zips)

* We could easily produce a (good enough) txt version from html on the 
fly with lynx in any encoding the user may want.







-- 
Marcello Perathoner
webmaster at gutenberg.org

From sankarrukku at gmail.com  Sat Sep 12 03:42:29 2009
From: sankarrukku at gmail.com (Sankar Viswanathan)
Date: Sat, 12 Sep 2009 16:12:29 +0530
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
In-Reply-To: <4AAB6E4B.4000004@perathoner.de>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
	<4AAB6E4B.4000004@perathoner.de>
Message-ID: <e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>

The final output from DP is a text. This is processed through Guiguts. Most
of the Post Processors in DP use Guiguts for post processing.   The html is
generated from this text file.

So no additional work is involved in producing a text file.

Again there is no additional work in White Washing because of the text file.

On Sat, Sep 12, 2009 at 3:17 PM, Marcello Perathoner <marcello at perathoner.de
> wrote:

> James Adcock wrote:
>
>  There are many long and inglorious methods of indicating emphasis in
>> plain-text including **asterix** and SHOUTING and _/underscore/_ and <i></i>
>> and [i][/i] and they all suffer from the same problem: They are all not what
>> the author wrote, at least not as implemented by the typically concurrently
>> existing publisher.
>>
>
> No author wrote italics before word processors became available to the end
> user. They _underlined_ the passages that they wanted the publisher to
> highlight. The publisher then choose an appropriate way of highlighting:
> /italics/ or  s p a c e o u t  or SMALLCAPS.
>
> Mediaeval copysts usually rubricated passages they wished to highlight.
>
>
>  Now say 100 years later PG says ignore those previous efforts we as the
>> publisher of this day knows better than the original intent so we will
>> substitute something else for what was actually printed.
>>
>
> So what? The brick-and-mortar publishers of yore ignored the previous
> efforts of the monastic scribes because it was too expensive to print twice
> with different inks.
>
> They also ignored the underlining of the author and substituted an artifact
> of their choosing. Also that artifact was largely a function of the cultural
> environment: italics or spaceout.
>
>
>  What I don?t understand is why PG continues to be wedded to plain-text as
>> an **input** encoding format demanded of people submitting texts to PG.
>>
>
> Nobody understands that. It is a waste of resources pure and simple.
>
> Consider that:
>
>
> * The bottleneck at DP is the post-processing stage.
>
> * The post-processor is burdened with the creation of one surplus txt file.
>
> * The whitewasher is burdened with one or more surplus txt files.
>
> * Every error needs to be fixed in more than one place (in html and up to
> three txt files, plus as many zips)
>
> * We could easily produce a (good enough) txt version from html on the fly
> with lynx in any encoding the user may want.
>
>
>
>
>
>
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>



-- 
Sankar

Service to Humanity is Service to God
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/afd7a474/attachment.html>

From marcello at perathoner.de  Sat Sep 12 05:04:22 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 12 Sep 2009 14:04:22 +0200
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
 for eliciting beauty (and more)
In-Reply-To: <e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>
References: <be0.6442fdd2.37dc344c@aol.com>	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>	<4AAB6E4B.4000004@perathoner.de>
	<e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>
Message-ID: <4AAB8E46.1080305@perathoner.de>

Sankar Viswanathan wrote:

> The final output from DP is a text. This is processed through Guiguts. Most of 
> the Post Processors in DP use Guiguts for post processing.   The html is 
> generated from this text file.

If this is true its all the more waste.

If you output a text file from the OCR and later use a human to 
re-create HTML this is more work than letting the OCR output the HTML 
directly.

And all this crooked workflow is needed because PG requires a txt file 
for hysterical reasons.

No wonder Google is eating our lunch ... they know how to put software 
to work instead of people.


> So no additional work is involved in producing a text file.

Nice sophism. Additional work is required to produce the HTML file. So what?


> Again there is no additional work in White Washing because of the text file.

I don't believe you.

Working 2 files (3, maybe 4) IS more work than working one file. Even if 
you just open the file to see if it is the right one, its work.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From pterandon at gmail.com  Sat Sep 12 05:13:21 2009
From: pterandon at gmail.com (Greg M. Johnson)
Date: Sat, 12 Sep 2009 08:13:21 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <a0bf3e960909120513n486a6216o7897c31dad7db9cd@mail.gmail.com>

Marcello wrote:

>The problem is to decide which LF to retain.

So you have to go to a program like Word, turn all instances of "^p^p" to
"QQQQ", then delete "^p", then turn "QQQQ" back to "^p^p".  This happened to
work for me fairly well just now with "Ten Nights in a Bar Room". The novel
itself looked okay but  the license at the end had some spacing
irregularites to it. However:

i)   There will be cases where folks are software limited. I cannot see
anyone being able to do this on the Ipod Touch. I've tried to look at 80-TXT
files  a couple times from an Apple store in cases where there was no HTML
version. This may be a silly example, but I think it's about making an
impression on such cursory  visitors to PG.
ii)  There will be cases where folks are skills limited. Would the
stereotypical impoverished child in Honduras be able to do that?
iii) What about Shakespeare?



Michael wrote:

> Trouble reading on a 12" screen?

Yes.  Why anyone ever came up with a 7.5 x 12.5 inch screen is beyond me,
but you sort of have to choose a small pixel size to get some things in your
workflow vertically all on the same page.  And there are font sizes that are
fine for reading things you never really need to **read**, like "File Edit
View," and then there are font sizes which you'd want if you're forcing your
eyes to actually read a whole book.  Notepad might be fine for a short
shopping list or work to-do list but not for an entire novel in monospace.
Hence also wanting to redirect the viewing experience into an HTML browser
with Ctrl + font-size-changing capability.  (Edit: I just learned just now
that Notepad has an option for changing font face!)




Okay, I'll stop stirring the pot (if yall'd prefer I do), but here are two
last ideas on this topic:

i) If you are producing a book, *please* consider making an HTML version to
be as important as the 80-TXT one, certainly more important than PDF, PUB,
and MOBI.  In my mind, the ones without HTML (and put the entire legalese at
the front of the doc) are in some sense "lost to history" because they
aren't nearly as readable.

ii) Rather than curse the darkness, someone should light a candle. My
response to my allegation of 80-TXT readability was to compile a DVD of 3850
books-- hopefully more books than any reasonable person would ever want to
read in a lifetime-- all in ****unzipped**** HTML format-- structured with
HTML which operates as I imagine the ideal book reading hand-held device
ought (if I were ever to see one in operation).  I've sent a workable draft
to Michael; I'm now looking at squeezing in a mite more books and maybe
setting up editor's picks.



-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/5f34df94/attachment.html>

From marcello at perathoner.de  Sat Sep 12 06:17:37 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 12 Sep 2009 15:17:37 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <a0bf3e960909120513n486a6216o7897c31dad7db9cd@mail.gmail.com>
References: <a0bf3e960909120513n486a6216o7897c31dad7db9cd@mail.gmail.com>
Message-ID: <4AAB9F71.2030701@perathoner.de>

Greg M. Johnson wrote:

> i) If you are producing a book, *please* consider making an HTML version to be 
> as important as the 80-TXT one, certainly more important than PDF, PUB, and 
> MOBI.  In my mind, the ones without HTML (and put the entire legalese at the 
> front of the doc) are in some sense "lost to history" because they aren't nearly 
> as readable.

That could easily be done. We have to make HTML on the way to producing 
EPUB. So technically we just could spew out the HTML before packaging 
the EPUB.

But I don't know if it *should* be done ...

The problem is: Nobody has ever been able to generate even barely 
palatable HTML from PG TXT.

For EPUB we can justify the ugly conversion because on most ebook 
readers and small screens ill-formatted EPUB is still better than TXT.

But HTML is supposed to be viewed on browsers and big screens, so 
ill-formatted HTML will be worse than TXT.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From sankarrukku at gmail.com  Sat Sep 12 08:31:04 2009
From: sankarrukku at gmail.com (Sankar Viswanathan)
Date: Sat, 12 Sep 2009 21:01:04 +0530
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
In-Reply-To: <4AAB8E46.1080305@perathoner.de>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
	<4AAB6E4B.4000004@perathoner.de>
	<e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>
	<4AAB8E46.1080305@perathoner.de>
Message-ID: <e45c9fe70909120831n656c8eeck2da20a6680c3d276@mail.gmail.com>

Most of the post processors in D.P depend on Guiguts for post processing.
More than 80% of the texts have been produced by using Guiguts. But for the
availability of the Guiguts program many of the post processors would have
never ventured to post process.

The Guiguts program has been written for the specific purpose of post
processing of DP books. It is well supported with additional programs like
Gutcheck and Jeebies.

Guiguts generates the html from the text automatically. Guiguts has been
written taking into account the DP process.

Most post processors in DP are not technical people.

Again the question is what do the users want? I am talking about people who
download books from PG and not producers of other formats. Most of the users
download text files. Just to quote an example the text only format of Alice
in Wonderland is downloaded more often than the illustrated html version.

The text version is the LCM.

Do we have statistics about downloading of html and text versions? I am sure
most users download the text version.

So even if we have put in additional effort to produce a text version it is
justified.

Do we have any feedback from the actual users? Letters from users who submit
detailed Errata shows that the text files are being used for teaching school
children in the remote areas of U.S. These are the people who make the
effort worthwhile. May be it also benefits people who are still on Dial Up.

Plain text can be read in any computer. HTML? With all the quirks of IE6 and
other browsers it is not easy to produce html which will render perfectly in
all the browsers.

The earlier discussion was about whether a ASCII text is necessary?

DP does produce TEI text. But there are very few post processors who can do
TEI format. The main reason is the absence of a software like Guiguts.

On Sat, Sep 12, 2009 at 5:34 PM, Marcello Perathoner <marcello at perathoner.de
> wrote:

> Sankar Viswanathan wrote:
>
>  The final output from DP is a text. This is processed through Guiguts.
>> Most of the Post Processors in DP use Guiguts for post processing.   The
>> html is generated from this text file.
>>
>
> If this is true its all the more waste.
>
> If you output a text file from the OCR and later use a human to re-create
> HTML this is more work than letting the OCR output the HTML directly.
>
> And all this crooked workflow is needed because PG requires a txt file for
> hysterical reasons.
>
> No wonder Google is eating our lunch ... they know how to put software to
> work instead of people.
>
>
>  So no additional work is involved in producing a text file.
>>
>
> Nice sophism. Additional work is required to produce the HTML file. So
> what?
>
>
>  Again there is no additional work in White Washing because of the text
>> file.
>>
>
> I don't believe you.
>
> Working 2 files (3, maybe 4) IS more work than working one file. Even if
> you just open the file to see if it is the right one, its work.
>
>
>
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>



-- 
Sankar

Service to Humanity is Service to God
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/b7816d2f/attachment-0001.html>

From marcello at perathoner.de  Sat Sep 12 10:30:47 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 12 Sep 2009 19:30:47 +0200
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
 for eliciting beauty (and more)
In-Reply-To: <e45c9fe70909120831n656c8eeck2da20a6680c3d276@mail.gmail.com>
References: <be0.6442fdd2.37dc344c@aol.com>	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>	<4AAB6E4B.4000004@perathoner.de>	<e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>	<4AAB8E46.1080305@perathoner.de>
	<e45c9fe70909120831n656c8eeck2da20a6680c3d276@mail.gmail.com>
Message-ID: <4AABDAC7.7080609@perathoner.de>

Sankar Viswanathan wrote:

> Most of the post processors in D.P depend on Guiguts for post processing.  More 
> than 80% of the texts have been produced by using Guiguts. But for the 
> availability of the Guiguts program many of the post processors would have never 
> ventured to post process.

That's the more water to my mill. You need a custom program to proof the 
txt file while any old editor can proof html.


> The Guiguts program has been written for the specific purpose of post processing 
> of DP books. It is well supported with additional programs like Gutcheck and 
> Jeebies.

Bad enough that a special program had to be written while many free 
editors excel at doing html.


> Guiguts generates the html from the text automatically. Guiguts has been written 
> taking into account the DP process.

Yeah, for suitably small values of `HTML?.


I installed guiguts and downloaded Hamlet #1524. Then I pushed the 
'Autogenerate HTML' button in guiguts. This is part of what I got:

<p>Ham.
To be, or not to be,&mdash;that is the question:&mdash;
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles,
And by opposing end them?&mdash;To die,&mdash;to sleep,&mdash;
No more; and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to,&mdash;'tis a consummation
Devoutly to be wish'd. To die,&mdash;to sleep;&mdash;
To sleep! perchance to dream:&mdash;ay, there's the rub;
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despis'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would these fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,&mdash;
The undiscover'd country, from whose bourn
No traveller returns,&mdash;puzzles the will,
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought;
And enterprises of great pith and moment,
With this regard, their currents turn awry,
And lose the name of action.&mdash;Soft you now!
The fair Ophelia!&mdash;Nymph, in thy orisons
Be all my sins remember'd.</p>


guiguts takes its place in the long file of products and services who 
tried to make something of PG plain text and failed.

Mind you, I'm not saying that guiguts is a bad program, I'm saying that 
it is impossible to recover the formatting once a text has been dumbed 
down to PG plain text.


> Again the question is what do the users want? 

Users want as many formats as possible to choose from.


> So even if we have put in additional effort to produce a text version it is 
> justified.

Not so. We can do that automatically with lynx --dump. lynx is free, so 
anybody can do that.

If you produce a `smart? version you can dumb it down with software. If 
you produce a `dumb? version, it is impossible to smart it up again with 
software.



> Do we have any feedback from the actual users? Letters from users who submit 
> detailed Errata shows that the text files are being used for teaching school 
> children in the remote areas of U.S. These are the people who make the effort 
> worthwhile. May be it also benefits people who are still on Dial Up.

Why do *those* people make the effort worthwile? Are you a bit 
prejudiced against better-off people?

"War and Peace" is 1.18M in HTML and 1.16M in TXT. How can that benefit 
people on dial-up?


> Plain text can be read in any computer. HTML? With all the quirks of IE6 and 
> other browsers it is not easy to produce html which will render perfectly in all 
> the browsers.

It is very easy indeed. Stick to the basic tags and even plucker on a 
cell phone will render perfectly.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From azkar0 at gmail.com  Sat Sep 12 10:56:28 2009
From: azkar0 at gmail.com (Scott Olson)
Date: Sat, 12 Sep 2009 11:56:28 -0600
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
In-Reply-To: <4AABDAC7.7080609@perathoner.de>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
	<4AAB6E4B.4000004@perathoner.de>
	<e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>
	<4AAB8E46.1080305@perathoner.de>
	<e45c9fe70909120831n656c8eeck2da20a6680c3d276@mail.gmail.com>
	<4AABDAC7.7080609@perathoner.de>
Message-ID: <2362473e0909121056va506760t46236c031dea0a43@mail.gmail.com>

On Sat, Sep 12, 2009 at 11:30 AM, Marcello Perathoner <
marcello at perathoner.de> wrote:

> Sankar Viswanathan wrote:
>
>  Most of the post processors in D.P depend on Guiguts for post processing.
>>  More than 80% of the texts have been produced by using Guiguts. But for the
>> availability of the Guiguts program many of the post processors would have
>> never ventured to post process.
>>
>
> That's the more water to my mill. You need a custom program to proof the
> txt file while any old editor can proof html.
>

Guiguts processes the output of the DP proofing process. That output is
neither raw text, nor raw HTML. It's a mix of different markups that
struggles to find a balance between unambiguous output, and ease of the
actual proofing process. The format is one that's relatively easy to
pick-up, as unobtrusive as possible to the proofing process, and one that
can be fairly automatically converted to both text and html by the tools
that have been designed.



> I installed guiguts and downloaded Hamlet #1524. Then I pushed the
> 'Autogenerate HTML' button in guiguts. This is part of what I got:
>
> <p>Ham.
> To be, or not to be,&mdash;that is the question:&mdash;
> Whether 'tis nobler in the mind to suffer
> The slings and arrows of outrageous fortune
> Or to take arms against a sea of troubles,
> And by opposing end them?&mdash;To die,&mdash;to sleep,&mdash;
> No more; and by a sleep to say we end
> The heartache, and the thousand natural shocks
> That flesh is heir to,&mdash;'tis a consummation
> Devoutly to be wish'd. To die,&mdash;to sleep;&mdash;
> To sleep! perchance to dream:&mdash;ay, there's the rub;
> For in that sleep of death what dreams may come,
> When we have shuffled off this mortal coil,
> Must give us pause: there's the respect
> That makes calamity of so long life;
> For who would bear the whips and scorns of time,
> The oppressor's wrong, the proud man's contumely,
> The pangs of despis'd love, the law's delay,
> The insolence of office, and the spurns
> That patient merit of the unworthy takes,
> When he himself might his quietus make
> With a bare bodkin? who would these fardels bear,
> To grunt and sweat under a weary life,
> But that the dread of something after death,&mdash;
> The undiscover'd country, from whose bourn
> No traveller returns,&mdash;puzzles the will,
> And makes us rather bear those ills we have
> Than fly to others that we know not of?
> Thus conscience does make cowards of us all;
> And thus the native hue of resolution
> Is sicklied o'er with the pale cast of thought;
> And enterprises of great pith and moment,
> With this regard, their currents turn awry,
> And lose the name of action.&mdash;Soft you now!
> The fair Ophelia!&mdash;Nymph, in thy orisons
> Be all my sins remember'd.</p>
>

Guiguts wasn't designed to convert existing texts. It's purpose is to help a
DP PPer turn the output of the DP rounds into the final product seen on PG.
In this case, the DP text for a piece of poetry would have had the poetry
wrapped in poetry markers, signifying to Guiguts that it had to treat the
block of text as non-wrappable poetry, and not just a straight paragraph of
prose.



>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/8af9341d/attachment.html>

From richfield at telkomsa.net  Sat Sep 12 11:04:11 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Sat, 12 Sep 2009 20:04:11 +0200
Subject: [gutvol-d] TXT and all that.
Message-ID: <4AABE29B.3030701@telkomsa.net>

Folks, I actually am up to the back teeth with the about it and 
abouting. A lot of intelligent people (sincerely meant, no sarcasm) 
saying mainly intelligent things about material problems and objectives 
and yet...
SO I was going to shut up.  ME!!! 
I'll try to make up for it by being short and to beside the point.
As a long-time hands-on, low-level, DP support and development man, I 
understand the importance of TXT files.  Really I do.
Anyway, as a reader with a PC, I can represent the TXT files fairly 
comfortably, even for reading big books, where Dick and Jane meet in 
Whore and Piece.
As a user of PCs that are mere years old, I understand the importance of 
more convenient formats. Trust me, I do.
As either I shall not argue the point. If you seek the reason why, 
circumspice!
Now, I am no major contributor to PG, but I have contributed some 
quick-and-dirty digitisations, using modest facilities. I use M$oft 
under protest, because nowadays as a user (read: bottom-feeder!) I don't 
have the time to learn decent stuff and instead I put up with all the 
triviality and inelegance.

Upshot:
* Scan the book (Actually, nowadays my scanner lies idle. Mainly I use 
my digital camera: faster, often better, far more flexible, more 
portable; (I can use it in libraries etc) and less harmful to the books 
too. That is nice!)
* After having for some years used the perfectly useful crippleware 
version that came in the cereal box  with my scanner I bought a decent 
omniscan on a $100 on-line special. That too is nice.
*Feed the output into Word.  (I actually have certain reservations about 
this, but note that Word has certain useful aspects: It deals fairly 
nicely with TXT AND with HTM. And it is programmable. I don't have to 
work with Courier or FTM Arial all the time.Useful font, but neither 
restful, nor comfortable for speed reading.) In fact, though I have not 
yet done anything with it, I am fairly sure that I have enough 
facilities at hand to produce PDF as well if I choose. But here the same 
prejudice that makes me appreciate TXT kicks in: If I lose my software 
or get stuck with moderately damaged files, I can easily edit HTM to 
make it readable, but I'll be blowed if I bother to learn more opaque 
formats. To be sure, the PG TXT format is, whatever its merits, not 
nice, but if that is what they want...
*In Word, format the whole caboodle fairly nicely, illustrations and 
all. Use all the nice features, including programmability,  for editing 
etc.  BB doesn't like the result of course, and i can see why, but I 
have only one life, and not much more of that, so...Nice chapter 
headings, page numbering etc. Also tables of contents, whatever is free 
or nearly. It helps with the editing anyway, so why not?
*What? PG doesn't like DOC?  Tsk Tsk! So I do a conversion to "filtered" 
HTML. Hard work that.  Takes dozens of keystrokes. Well, more than a 
dozen anyway.
*Ahaaah! Gotcha! PG doesn't like Word HTM either! How do you like that 
my buck???  Big deal! Someone steered me in the direction of HTMLKit! 
Now THERE is a useful product. May the commercial users make the company 
stinking rich! They deserve it big time. HTMLKit convert a Word file so 
easily to clean HTML that I don't bother to keep backups except of the 
DOC. There I have a working HTML with figures, tables, the full catastrophe.
*But what about TXT? Here is where word comes in useful. again. I 
steadfastly resist any hyphenation except where words are too long for 
the line, or where a word is hyphenated. This, apart from other 
virtues,  make it trivial to break lines with the help of a macro or 
two. (Actually, I have used other convenient TXT processors to break 
lines, but that is a matter of ad hoc convenience).
*Then there are the GUTCHKs and so on, bless their writers' hearts. 
Actually I use them before the HTML production to trap errors that slip 
through other processors. By the time I have done with the HTML, I 
usually have finished with the TXT as well. Pictures? No problem.  If 
anyone wants them they can get them from the HTM version.
*Seems a long way round? Maybe. But I already had most of the tools for 
other purposes, and knew how to use them. Apart from the necessary basic 
effort all it takes is largely automated document production and 
conversion. Two steps for two files when you come down to it.
*I realise that BB and a few others have better ways of doing it, but 
every time I get tempted to sniff down those alleys, I go and lie down 
for six seconds till the feeling passes. I'll re-evaluate new options as 
soon as everyone else agrees on all of them.  By that time books and 
computers both will be passe.

Cheers,

Jon
 

From marcello at perathoner.de  Sat Sep 12 11:13:24 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 12 Sep 2009 20:13:24 +0200
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
 for eliciting beauty (and more)
In-Reply-To: <2362473e0909121056va506760t46236c031dea0a43@mail.gmail.com>
References: <be0.6442fdd2.37dc344c@aol.com>	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>	<4AAB6E4B.4000004@perathoner.de>	<e45c9fe70909120342h6bbf67e9ya42251d7b5b61d54@mail.gmail.com>	<4AAB8E46.1080305@perathoner.de>	<e45c9fe70909120831n656c8eeck2da20a6680c3d276@mail.gmail.com>	<4AABDAC7.7080609@perathoner.de>
	<2362473e0909121056va506760t46236c031dea0a43@mail.gmail.com>
Message-ID: <4AABE4C4.4050704@perathoner.de>

Scott Olson wrote:

> Guiguts wasn't designed to convert existing texts. It's purpose is to help a DP 
> PPer turn the output of the DP rounds into the final product seen on PG. In this 
> case, the DP text for a piece of poetry would have had the poetry wrapped in 
> poetry markers, signifying to Guiguts that it had to treat the block of text as 
> non-wrappable poetry, and not just a straight paragraph of prose.

I see. I was told the output of DP was text and the html generated from it.

Now I gather DP uses some sort of proprietary internal markup and can 
produce HTML without having to produce TXT? Am I right?



-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Sat Sep 12 12:28:04 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 12 Sep 2009 15:28:04 EDT
Subject: [gutvol-d] Re: TXT and all that.
Message-ID: <bf6.3cb0d8fd.37dd5044@aol.com>

oh geez, how could marcello make it any more clear that
he doesn't know jack shit about distributed proofreaders?

***

jon said:
>    In Word, format the whole caboodle fairly nicely, illustrations 
>    and all. Use all the nice features, including programmability,?for
>    editing etc.? BB doesn't like the result of course, and i can see why

hey, wait, don't put words into my mouth, jon.
lots of times i would like a .doc file very much.
it'd be much more useful than an .html file or
a butchered .txt version, that's for sure...


>   Nice chapter headings, page numbering etc. Also tables of contents, 
>    whatever is free or nearly. It helps with the editing anyway, so why 
not?

indeed.

but then, of course, if you used one of my editing tools instead,
you'd find that you get all of those things "free" with it, as well...


>    So I do a conversion to "filtered" HTML. Hard work that.? 
>    Takes dozens of keystrokes. Well, more than a dozen anyway.

the conversion from .zml to .html takes just one button-click.


>    Ahaaah! Gotcha! PG doesn't like Word HTM either!

the .html from the .zml conversion is just fine, according to p.g.


>    HTMLKit convert a Word file so easily to clean HTML that 
>    I don't bother to keep backups except of the DOC. 

sounds good.

except, of course, that your .html files differ in their makeup
from other post-processor's .html files, so -- down the line --
it will be absolutely impossible for someone to understand the
various ripples underlying all of your different .html variants,
which means they won't be able to _maintain_ those .html files.

so instead, the person(s) maintaining the library will turn to the
.txt versions, and do the little bit of work necessary so that those
.txt versions can serve as the master from which .html is created.

which is what you should have done in the first place...

this all means you've all basically created something temporary,
instead of something that is able to be maintained a long time...


>    I realise that BB and a few others have better ways of doing it, 
>    but every time I get tempted to sniff down those alleys, I go 
>    and lie down for six seconds till the feeling passes. 

i do have a better way.   but if you want to keep doing it in your
clumsy way, it's ok.   that's the way most people do most things,
and the world didn't fall apart because of it.   (the world fell apart
because of greedy bankers.)   just so long as you realize that your
work on the .html files will be thrown out down the line, and you
are ok with that, everything's fine.   thanks for your generosity in
volunteering your time and energy to the cause of digitizing books.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/b215c7c1/attachment-0001.html>

From Bowerbird at aol.com  Sat Sep 12 12:34:33 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 12 Sep 2009 15:34:33 EDT
Subject: [gutvol-d] one more reminder about those .txt files
Message-ID: <cad.56f5cd71.37dd51c9@aol.com>

one more reminder that the iphone app "eucalyptus"
creates beautiful books by using the .txt files from p.g.

all you yahoos who continue to say that it can't be done
have been proven wrong by a person who went and did it.

-bowerbird

p.s.   the programmer has even done a pretty good job of
detecting the places where lines should not be rewrapped,
like in tables, address blocks, signature blocks, and so on.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090912/bbb25522/attachment.html>

From cloos at jhcloos.com  Sun Sep 13 08:49:38 2009
From: cloos at jhcloos.com (James Cloos)
Date: Sun, 13 Sep 2009 11:49:38 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAB9F71.2030701@perathoner.de> (Marcello Perathoner's message
	of "Sat, 12 Sep 2009 15:17:37 +0200")
References: <a0bf3e960909120513n486a6216o7897c31dad7db9cd@mail.gmail.com>
	<4AAB9F71.2030701@perathoner.de>
Message-ID: <m3d45u3kut.fsf@lugabout.jhcloos.org>

In case anyone really wants to do it right, what PG needs is to have
each book (and other documents) marked up semanticly.

Of all of the exsting SGML/XML applications, TEI seems best for what
PG is doing.  Combined with SVG and X3D for graphics, xcite for any
citations, etc.

The best way to mark up existing PG texts may be to put the docuemnts
in a wiki alongside scans and encourage the public to add the markup.
Wiki-style markup seems to be easier to comprehend for most of the
public.  (And with reason.)

In this model, incidently, each work could be served as a single file,
complete with images and the like included inline.  And the plain text
version can be readily extracted using a stylesheet.

TEI is at:

http://www.tei-c.org/

-JimC
-- 
James Cloos <cloos at jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

From cloos at jhcloos.com  Sun Sep 13 08:55:51 2009
From: cloos at jhcloos.com (James Cloos)
Date: Sun, 13 Sep 2009 11:55:51 -0400
Subject: [gutvol-d] Line Art
Message-ID: <m37hw23kkg.fsf@lugabout.jhcloos.org>

I see that the HTML versions of a number of PG works include images
of art from the original books.

Are higher-resolution scans of those images available anywhere?

I'd like to experiment with automated vectorization of scans of line
art, and providing SVG format vector files of the line art back to PG
would make the effort doubly useful.  (Encapsulated PS and PDF files
also could be made available, if desired.)

-JimC
-- 
James Cloos <cloos at jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

From sly at victoria.tc.ca  Sun Sep 13 11:22:25 2009
From: sly at victoria.tc.ca (Andrew Sly)
Date: Sun, 13 Sep 2009 11:22:25 -0700 (PDT)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <m3d45u3kut.fsf@lugabout.jhcloos.org>
References: <a0bf3e960909120513n486a6216o7897c31dad7db9cd@mail.gmail.com>
	<4AAB9F71.2030701@perathoner.de> <m3d45u3kut.fsf@lugabout.jhcloos.org>
Message-ID: <Pine.GSO.4.58.0909131108320.24111@vtn1.victoria.tc.ca>



Yes, TEI has been discussed in this group a number of times before.
And there are some contributors using it.

When I go to gutenberg.org and do an advanced search, looking for
TEI as filetype, I find 210 results.

One volunteer's guideline for using TEI can be found at:
http://pgtei.pglaf.org/marcello/0.4/doc/20000-h.html

In short, it is there, and is being used, but not by many
people.

Would you like to help contribute more TEI texts to
the project?

Thanks,
Andrew

On Sun, 13 Sep 2009, James Cloos wrote:

> In case anyone really wants to do it right, what PG needs is to have
> each book (and other documents) marked up semanticly.
>
> Of all of the exsting SGML/XML applications, TEI seems best for what
> PG is doing.  Combined with SVG and X3D for graphics, xcite for any
> citations, etc.
>

From traverso at posso.dm.unipi.it  Sun Sep 13 11:49:18 2009
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Sun, 13 Sep 2009 20:49:18 +0200 (CEST)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <Pine.GSO.4.58.0909131108320.24111@vtn1.victoria.tc.ca> (message
	from Andrew Sly on Sun, 13 Sep 2009 11:22:25 -0700 (PDT))
References: <a0bf3e960909120513n486a6216o7897c31dad7db9cd@mail.gmail.com>
	<4AAB9F71.2030701@perathoner.de> <m3d45u3kut.fsf@lugabout.jhcloos.org>
	<Pine.GSO.4.58.0909131108320.24111@vtn1.victoria.tc.ca>
Message-ID: <20090913184918.1EC95100F8@cardano.dm.unipi.it>


The problem with PGTEI (the PG dialect of TEI for which PG has
automatic conversion to several end-user formats) is that the final
output is considered ugly by many contributors. 

A second problem is that there is no automatic conversion tool to get
(almost) working PGTEI from DP internal markup. 

I believe that both problems could be solved with little effort. 

Carlo Traverso

From Bowerbird at aol.com  Sun Sep 13 12:32:57 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 13 Sep 2009 15:32:57 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <d4e.5ad2093d.37dea2e9@aol.com>

jim said:
>    In case anyone really wants to do it right, 
>    what PG needs is to have each book 
>    (and other documents) marked up semanticly.
>
>    Of all of the exsting SGML/XML applications, 
>    TEI seems best for what PG is doing.?

jim, first of all, you're wrong.

and second of all, you're about 5-8 years late
for this conversation.

i think the archives are still available, and will
give you a good idea of this long-raging debate.

but thanks for giving us all a blast from the past.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090913/0b364606/attachment.html>

From no.la at web.de  Sun Sep 13 13:36:19 2009
From: no.la at web.de (Norbert Langkau)
Date: Sun, 13 Sep 2009 22:36:19 +0200
Subject: [gutvol-d] Re: Line Art
Message-ID: <1258367676@web.de>

Hi James, 
please have a look at this book:
http://www.gutenberg.org/etext/23787
The "base drectory" directs you to 
http://www.gutenberg.org/files/23787/
where you find 
http://www.gutenberg.org/files/23787/23787-page-images/
These are 300 dpi, as far as I recall, but I could provide some 600 dpi's if need be.
I'm very curious on how the outcome will look like.
Best regards - Norbert

> -----Urspr?ngliche Nachricht-----
> Von: "James Cloos" <cloos at jhcloos.com>
> Gesendet: 13.09.09 17:56:42
> An: gutvol-d at lists.pglaf.org
> Betreff: [gutvol-d] Line Art


> I see that the HTML versions of a number of PG works include images
> of art from the original books.
> 
> Are higher-resolution scans of those images available anywhere?
> 
> I'd like to experiment with automated vectorization of scans of line
> art, and providing SVG format vector files of the line art back to PG
> would make the effort doubly useful.  (Encapsulated PS and PDF files
> also could be made available, if desired.)
> 
> -JimC
> -- 
> James Cloos <cloos at jhcloos.com>         OpenPGP: 1024D/ED7DAEA6
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 


______________________________________________________
GRATIS f?r alle WEB.DE-Nutzer: Die maxdome Movie-FLAT!
Jetzt freischalten unter http://movieflat.web.de


From jimad at msn.com  Sun Sep 13 15:47:32 2009
From: jimad at msn.com (Jim Adcock)
Date: Sun, 13 Sep 2009 15:47:32 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <d4e.5ad2093d.37dea2e9@aol.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
Message-ID: <SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>

Sigh. I don't know what the solution is, but for me as a content-provider it
is heart-breaking to do my best to try to "do the job right" and then see
the hard-won knowledge and effort I have put into "doing it right" thrown
away BOTH by the txt and the html as implemented by PG.  I'd love to see an
input format that preserves the hard-won effort I put into content creation,
AND which is NOT a "write once" format, such that future content producers
can easily build on the efforts I have already put into creating a correct
content creation, and NOT have to redo the work I have already done because
BOTH txt and html as implemented by PG throw away work effort I have already
done.  Yes, it is possible for future content producers to go over the text
front to back another three or four passes after I have done so already in
order to try to "catch" again the errors that txt and html have
re-introduced -- but why would anyone want that they should have to do so?

What I would like to see as an input-submission format is something that:

1) Preserves the hard-won effort I have already put into content creation,
such that a future volunteer can build on my work without having to "reverse
engineer" those gratuitous errors currently being introduced by the current
PG use of txt and html.

2) Works well-enough even with commonly available "bottom feeder" tools.
[[Personally I get tired of claims of "magic bullet" tools and then I spend
a day trying to get them to work on my computer and they don't even install
and run correctly.]]

3) Does simple common tasks in a simple transparent way.

4) Isn't ugly or ungainly for simple common everyday tasks.

5) Can be -- and is in practice -- transformed from input format to a
variety of end reader formats in an attractive manner which does not contain
common uglinesses for common book situations.

 




From marcello at perathoner.de  Sun Sep 13 16:11:00 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 14 Sep 2009 01:11:00 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
Message-ID: <4AAD7C04.5030806@perathoner.de>

Jim Adcock wrote:

> 1) Preserves the hard-won effort I have already put into content creation,
> such that a future volunteer can build on my work without having to "reverse
> engineer" those gratuitous errors currently being introduced by the current
> PG use of txt and html.

Please give some real-world examples.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From tb at baechler.net  Sun Sep 13 19:56:27 2009
From: tb at baechler.net (Tony Baechler)
Date: Sun, 13 Sep 2009 19:56:27 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful	format
 for eliciting beauty (and more)
In-Reply-To: <SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
Message-ID: <4AADB0DB.5000209@baechler.net>

I'm not sure what you mean by this. Most screen readers will read 
underlines or periods as underlines or periods, so there is no emphasis 
on bold or italics. If you mean something else, please accept my 
apologies and elaborate further. I personally turn off all punctuation 
when reading because I don't want to hear the periods and such. In 
Windows and MS Word, it will tell you if something is formatted 
differently. In English Braille which is also 7-bit, there is usually an 
accent mark (the equivalent to "`" to the sighted) to show any accented 
letter and a similar underline (or underscore if you prefer) convention 
for other emphasis.

On 9/11/2009 9:26 PM, James Adcock wrote:
>
> PS: Bit curious which blind reader handles _/the underscore 
> ?convention?/_ correctly ? I?ve not seen _/that/_ one!
>
>


From richfield at telkomsa.net  Mon Sep 14 02:17:17 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Mon, 14 Sep 2009 11:17:17 +0200
Subject: [gutvol-d] Re: Line Art
In-Reply-To: <m37hw23kkg.fsf@lugabout.jhcloos.org>
References: <m37hw23kkg.fsf@lugabout.jhcloos.org>
Message-ID: <4AAE0A1D.9080306@telkomsa.net>

Dunno really. When I scan a book I photograph the pages, usually at 
fairly low resolution for OCR. Any line art (or other)  pictures I may 
photograph over again with more care and higher resolution.  However, I 
don't know how relevant this is to your question.  My books are 
generally either technical or of historical interest and I have not yet 
had occasion to prepare one on art as such. Accordingly I am more 
interested in producing a functional representation than an artistically 
adequate one.
Cheers
Jon
> I see that the HTML versions of a number of PG works include images
> of art from the original books.
>
> Are higher-resolution scans of those images available anywhere?
>
> I'd like to experiment with automated vectorization of scans of line
> art, and providing SVG format vector files of the line art back to PG
> would make the effort doubly useful.  (Encapsulated PS and PDF files
> also could be made available, if desired.)
>
> -JimC
>   


From jimad at msn.com  Mon Sep 14 08:59:33 2009
From: jimad at msn.com (Jim Adcock)
Date: Mon, 14 Sep 2009 08:59:33 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAD7C04.5030806@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
Message-ID: <SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>

>> 1) Preserves the hard-won effort I have already put into content
creation,
>> such that a future volunteer can build on my work without having to
"reverse
>> engineer" those gratuitous errors currently being introduced by the
current
>> PG use of txt and html.
>
>Please give some real-world examples.

OK.  My point being that IF PG were to accept a "proper" book INPUT encoding
format that preserves the hard-won knowledge of the original encoding
volunteer, then there would be no need for a future volunteer to have to
completely scan that encoding against the original book scans in order to
make another pass looking for errors, etc.  So what all is "wrong" with TXT
and HTML in this regards as stored in the PG databases?

Both formats throw away the original volunteers' knowledge about the common
parts of books: TOC, author info, pub info, copyright pages, index,
chapters, etc.  Yes one can code this information in HTML but there is no
unambiguous way to do so which means that PG HTML encodings all take
different paths, as one rapidly discovers if one tries to automagically
convert PG HTML into other reflow file formats. You could follow common h1,
h2, h3 settings by convention -- if PG were to establish and require such --
but then you end up with really ugly rendered HTML on common displays.  You
can overcome this with style sheets -- but then you are defeating many tools
which automagically convert HTML into a variety of other reflow file formats
for the various e-readers.

Both formats as stored by PG gratuitous throw away hard-won line-by-line
alignments between scan text and hand-scanno corrected text.  These
alignments are needed if a future volunteer wanted to make another pass at
"fixing" errors in the text, for example by running through DP again, or
running it against a future automagic tool comparing a new scan to the PG
text. I submit my HTML to PG WITH the original line-by-line alignments --
because it doesn't in any way hurt the HTML and allows a future volunteer to
make another pass on my work -- but then PG insists on throwing this
information away anyway before posting their HTML files.

Both formats throw away page numbers and page breaks, which again are
necessary to make another volunteer pass against the original scans, and
also to make future passes against broken link info, etc.  Also would be
useful for some college courses, where you need page number refs, even if
reading on a reflow reader device.  I'm NOT suggesting that page numbers
should be typically be displayed in an OUTPUT reflow file format rendering,
rather that this represents hard-won information that ought to be retained
in a well-designed INPUT file format encoding.

TXT files seem to me to almost always have some glyphs outside of the 8-bit
char set.  Unicode text files would at least overcome this limitation.  HTML
in theory doesn't have this limitation, but in practice I find in submitting
"acceptable" HTML to PG running it through their battery of acceptance tools
I find some glyphs I can't get through so I end up punting and throwing away
"correct" glyph information dumbing down the representation of some glyphs.

PG and DP *in practice* have a dumbed-down concept of punctuation, such that
it's impossible to maintain and retain "original authors intent" as
expressed in the printed work.  For example, M-Dash is commonly found in
three contexts: lead-in, lead-out, and connecting, similar to how ellipses
are used at least in three different ways: ...lead in, lead out, and ...
connecting.  But in practice all one can get through PG and DP is connecting
M-dash.  Also consider all the [correct] variety of Unicode quotation marks
which needlessly get reduced in PG and DP to only U+0022 OR U+0027. In
general PG has a dumbed-down concept of punctuation, that near is near
enough, and is actively hostile to accurately encoding the punctuation as
rendered in the original print document.  Again, it is EASY to dumb down an
INPUT file format, for example if you need to output to a 7-bit or even a
5-bit teletypewriter, if that is what you want. So why insist that the input
file encoder get it wrong in the first place?  It is easy to throw away
information when going from an INPUT file encoding to an OUTPUT file
rendering.  It is VERY DIFFICULT to correctly fix introduced errors when
going back from a reduced OUTPUT file rendering to a correctly encoded input
file encoding.

What I am imagining is some simple-to-use file encoding format where a
volunteer can correctly and unambiguously code the common things and
conventions one commonly finds in every day books, such that another
volunteer can pick up and make another pass on the book some years hence --
without having to reinvent nor rediscover work that the previous volunteer
has already put into understanding and coding the book.  Such an INPUT file
encoding having little or nothing to do with how the output will be
displayed in an eventual OUTPUT file rendering.  DP already has much of this
distinction in their work flow.  Unfortunately, their page-by-page
conventions and simplifications "dumbing down" for the sake of the multiple
levels of volunteers guarantees loss of information.  Not to mention that
they also throw away the correctly encoded INPUT file hard-won knowledge for
more ambiguous OUTPUT file renderings in HTML and TXT.

The end result is that both PG and DP end up be "write once" efforts that
are hostile to future improvements by future volunteers -- instead of
encouraging on-going efforts to improve what we got.  Which is also
indicative of a general culture of quantity not quality.

PG pretends that part of why we do what we do is to protect and preserve
books in perpetuity.  This implies in exchange that information that is
gratuitously thrown away during input file encoding [directly in an output
file rendering] is potentially lost for eternity.  Why insist via policy
that volunteer input file encoders must throw away this information?




From jimad at msn.com  Mon Sep 14 09:31:47 2009
From: jimad at msn.com (Jim Adcock)
Date: Mon, 14 Sep 2009 09:31:47 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful	format
	for eliciting beauty (and more)
In-Reply-To: <4AADB0DB.5000209@baechler.net>
References: <be0.6442fdd2.37dc344c@aol.com>	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
	<4AADB0DB.5000209@baechler.net>
Message-ID: <SNT120-DS12943C6BC655A386052818AEE40@phx.gbl>

Yes, you understand me the way I understand the blind readers I know of,
namely that they will read _an ironic reference_ as "underscore an ironic
reference underscore" not read with prosodic emphasis "an ironic reference".
Thus when you turn off punctuation you also lose any representation of
prosodic emphasis that the author originally encoding in the original
printed text.  This is not a small deal, IMHO.  There are some books such as
"The Dove" by Henry James which are virtually impossible to even scan
without maintaining the author's original proper representation of prosodic
emphasis.

>I'm not sure what you mean by this. Most screen readers will read 
underlines or periods as underlines or periods, 



From marcello at perathoner.de  Mon Sep 14 10:47:08 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 14 Sep 2009 19:47:08 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
Message-ID: <4AAE819C.1040108@perathoner.de>

Jim Adcock wrote:

> OK.  My point being that IF PG were to accept a "proper" book INPUT encoding
> format that preserves the hard-won knowledge of the original encoding
> volunteer, then there would be no need for a future volunteer to have to
> completely scan that encoding against the original book scans in order to
> make another pass looking for errors, etc.  

There's a misconception here.

PG *does* allow you to post additional file formats *along* with TXT and 
HTML. TEI comes to mind as format perfectly suitable to preserve a lot 
that HTML cannot.

The reason that there isn't a TEI file posted along with *every* ebook 
is that most PPers at DP don't care to produce one.


> Both formats throw away the original volunteers' knowledge about the common
> parts of books: TOC, author info, pub info, copyright pages, index,
> chapters, etc.

TEI has elements for all these cases.


> TXT files seem to me to almost always have some glyphs outside of the 8-bit
> char set.  Unicode text files would at least overcome this limitation.

I don't see any problem here: Produce utf-8 files.

The whitewashers will create some work for themselves by converting the 
utf-8 to all sorts of embarrassing encodings and then waste more time at 
the helpdesk to explain to incredulous users what `encodings? are, but 
that need not be your problem.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From prosfilaes at gmail.com  Mon Sep 14 11:40:18 2009
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 14 Sep 2009 14:40:18 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAE819C.1040108@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
Message-ID: <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>

On Mon, Sep 14, 2009 at 1:47 PM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> PG *does* allow you to post additional file formats *along* with TXT and
> HTML. TEI comes to mind as format perfectly suitable to preserve a lot that
> HTML cannot.
>
> The reason that there isn't a TEI file posted along with *every* ebook is
> that most PPers at DP don't care to produce one.

And the reason for that is not only is it a lot more work than an HTML
edition, unsupported by any sort of tools, it's worthless to the end
user, as apparently no one at PG can get decent output from it.

-- 
Kie ekzistas vivo, ekzistas espero.

From hart at pobox.com  Mon Sep 14 14:22:02 2009
From: hart at pobox.com (Michael S. Hart)
Date: Mon, 14 Sep 2009 14:22:02 -0700 (PDT)
Subject: [gutvol-d] PG French eBook #1500
Message-ID: <alpine.DEB.2.00.0909141419510.4940@mail.pglaf.org>


Right now we are looking at Voltaire, de Toqueville's "Democracy,"
and a few others.

20 more and we are at 1500.

Please take a look for various copies of "Democracy" and anything
else you think we might be able to use, and let me know.


Thanks!!!



Michael S. Hart
Founder
Project Gutenberg

From jimad at msn.com  Mon Sep 14 14:34:00 2009
From: jimad at msn.com (Jim Adcock)
Date: Mon, 14 Sep 2009 14:34:00 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAE819C.1040108@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
Message-ID: <SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>

>TEI comes to mind as format perfectly suitable to preserve a lot 
that HTML cannot.

Um, this standard is 1350 pages long.  Tell me again why I should be reading
it?  I want to code books -- not the Sistine Chapel.

>I don't see any problem here: Produce utf-8 files.

But that would still leave all the other problems with txt files.  And the
reason we are required to produce txt is to support those with
teletypewriters.  Rhetorically speaking why not just produce as bad txt
files as one can and still get away with it and hope that someday soon both
Gut readers and Gut content produces will see the light and give txt up as
long gone dead?


From prosfilaes at gmail.com  Mon Sep 14 15:01:05 2009
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 14 Sep 2009 18:01:05 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
Message-ID: <6d99d1fd0909141501p2eb26b30wc73591fcf5998f6b@mail.gmail.com>

On Mon, Sep 14, 2009 at 5:34 PM, Jim Adcock <jimad at msn.com> wrote:
>>I don't see any problem here: Produce utf-8 files.
>
> But that would still leave all the other problems with txt files. ?And the
> reason we are required to produce txt is to support those with
> teletypewriters.

So? UTF-8 works just fine when viewed in a UTF-8 xterm, and can be
translated on the fly by many programs.

-- 
Kie ekzistas vivo, ekzistas espero.

From ajhaines at shaw.ca  Mon Sep 14 15:49:42 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Mon, 14 Sep 2009 15:49:42 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
Message-ID: <44414895897048CF95F99FE12CC881FD@alp2400>


----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Monday, September 14, 2009 2:34 PM
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT


> >TEI comes to mind as format perfectly suitable to preserve a lot
> that HTML cannot.
>
> Um, this standard is 1350 pages long.  Tell me again why I should be 
> reading
> it?  I want to code books -- not the Sistine Chapel.
>

Check out http://pgtei.pglaf.org/.  Marcello's PG-TEI manual is <200 pages. 
There's also TEI-Lite at 
http://www.tei-c.org/Guidelines/Customization/Lite/.


>>I don't see any problem here: Produce utf-8 files.
>
> But that would still leave all the other problems with txt files.  And the
> reason we are required to produce txt is to support those with
> teletypewriters.  Rhetorically speaking why not just produce as bad txt
> files as one can and still get away with it and hope that someday soon 
> both
> Gut readers and Gut content produces will see the light and give txt up as
> long gone dead?
>

Text will never be dead.  It's portable to all platforms, doesn't need a 
browser or a PDF-like reader, only the most basic editor.  In modern terms, 
it's the stem cell of ebook files--all else can be generated from it.  Maybe 
with greater or lesser prettiness, but as long as you get the words, who 
cares what the quote marks look like?



> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 



From jimad at msn.com  Mon Sep 14 20:31:08 2009
From: jimad at msn.com (Jim Adcock)
Date: Mon, 14 Sep 2009 20:31:08 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <44414895897048CF95F99FE12CC881FD@alp2400>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
Message-ID: <SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>

>....but as long as you get the words, who cares what the quote marks look
like?

There are a lot of texts where you cannot "get" the words from just the
words.  There are also texts with quotes within quotes, where if you don't
care what the quote marks look like _you cannot read it!_

Certainly a text like Tristram Shandy demonstrates there are books which are
NOT just about the words -- where rather, the artistry of representing word
on paper -- including careful choice of fonts, puncs, etc. is a central part
of the artistry -- as one can easily see by comparing a bad publication of
this work to a good one!  The good publications represent the work of the
artist, the bad one's clearly do not.  And a txt representation would be
just so many chicken scratchings in the mud.

I'm sure there are many here who would say "but I don't like Tristram
Shandy" -- and that would be my point.  By bringing a prejudice to the table
that only texts worth representing in txt are worth representing, you
prejudice what books PG is allowed to preserve, and you censor the choice of
artists that others are permitted to preserve.  You represent some artists,
and consign the others to oblivion.



From prosfilaes at gmail.com  Mon Sep 14 20:47:15 2009
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 14 Sep 2009 23:47:15 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
Message-ID: <6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com>

On Mon, Sep 14, 2009 at 11:31 PM, Jim Adcock <jimad at msn.com> wrote:
> Certainly a text like Tristram Shandy demonstrates there are books which are
> NOT just about the words -- where rather, the artistry of representing word
> on paper -- including careful choice of fonts, puncs, etc. is a central part
> of the artistry -- as one can easily see by comparing a bad publication of
> this work to a good one!

Those sculptors who choose to work in ice are rarely remembered well
by later ages. Sculptors who work in iron and bronze can easily be
remembered for several millennia. The choice is the artist's.

-- 
Kie ekzistas vivo, ekzistas espero.

From ajhaines at shaw.ca  Mon Sep 14 22:20:44 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Mon, 14 Sep 2009 22:20:44 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
Message-ID: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>


----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Monday, September 14, 2009 8:31 PM
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT


> >....but as long as you get the words, who cares what the quote marks look
> like?
>
> There are a lot of texts where you cannot "get" the words from just the
> words.  There are also texts with quotes within quotes, where if you don't
> care what the quote marks look like _you cannot read it!_
>

I think I, and any other followers of this thread, will need an example of 
"not getting the words from the words".

I've seen any number of instances of nested quotes, mostly nested 
doublequotes, lots of triple-nested double-single-double quotes, and some 
triple-nested single-double-single quotes (mostly in British-published 
books) and I have yet to encounter any that I couldn't read, either in the 
original source or when they've been etexted.



> Certainly a text like Tristram Shandy demonstrates there are books which 
> are
> NOT just about the words -- where rather, the artistry of representing 
> word
> on paper -- including careful choice of fonts, puncs, etc. is a central 
> part
> of the artistry -- as one can easily see by comparing a bad publication of
> this work to a good one!  The good publications represent the work of the
> artist, the bad one's clearly do not.  And a txt representation would be
> just so many chicken scratchings in the mud.
>

I've looked at PG's text and HTML version of Shandy, and several PDFed 
scansets in Internet Archive.  Unless I'm missing something, they all look 
like standard prose to me.

If you've got an edition as difficult to transcribe as you seem to indicate, 
and it's not in Internet Archive, you should scan it, and if you have no 
interest in producing it yourself, upload the zipped scanset via FTP to PG 
(I can give exact instructions to you privately).  As long as it's 
clearable, it may be possible to arrange for it to go into PG's Preprints 
page where it'll be available as a project for someone.


> I'm sure there are many here who would say "but I don't like Tristram
> Shandy" -- and that would be my point.  By bringing a prejudice to the 
> table
> that only texts worth representing in txt are worth representing, you
> prejudice what books PG is allowed to preserve, and you censor the choice 
> of
> artists that others are permitted to preserve.  You represent some 
> artists,
> and consign the others to oblivion.
>

Personally, I'm book-agnostic--as long as it's in English, a book is a book 
is a book.  I'm would assume that those who produce books for PG in other 
languages feel the same way about books in those languages.  Distributed 
Proofreaders, at least once, has produced a book in a language none of its 
proofers understood (#27120).


>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 



From marcello at perathoner.de  Tue Sep 15 01:01:57 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 15 Sep 2009 10:01:57 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <44414895897048CF95F99FE12CC881FD@alp2400>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
Message-ID: <4AAF49F5.4090206@perathoner.de>

Al Haines (shaw) wrote:

> Text will never be dead.  It's portable to all platforms, doesn't need a 
> browser or a PDF-like reader, only the most basic editor.


Its not portable to cellphones.

While every modern cellphone comes with a browser I have never seen one 
with an editor.




-- 
Marcello Perathoner
webmaster at gutenberg.org

From schultzk at uni-trier.de  Tue Sep 15 01:58:07 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 15 Sep 2009 10:58:07 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
Message-ID: <A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>

Hi Everybody,

	I will step in here for a moment.

	As Bowerbird has mentioned this discussion is as old as PG itself.

	The problems are:
		1) Plain Vanilla Texts can not reproduce books (It is not meant, too)
		2) PG does NOT have a comprehensive format for reproducing books.
		3) PG has not evolved with mopdern computer technology.
		4) Ecerybody wants thier pet formats for reading.
		5) PG does not have a consolidated following willing to build the  
resources needed
                     to solve the above.

	There are many various reason for the above problems. Yes, there ARE  
and have been efforts
	to solve the above. Yet, none of these have fruited much or have been  
able to satisfy needs of
	all its contributors or users.

	So what is needed:
		1) A single modular and extensible format for encoding the books
			a) the structures in the book (text) need to be represented
			b) it does not presume a particular output format
			c) does not care about the size of files
			d) does not need to be very readable easily

		2) a parser for creating output formats
			a ) use all information to create the best possible output for a  
particular format

		3) an editor
			a) display the book
			b) allow for changes in the representation of the book
			c) must be modular and extensible

		4) a parser for creating the representation of the book in the  
format from scans
			a) must be modular and extensible
			b) must be multi-pass
			c) flags possible conflicts with the format
			d) intelligent to do most markup by itself
			e) intelligent to correct common errors by itself

		5) parsers for converting older formats
			a) all of 4)
			b) does not expect particular information
			c) allows for presets injorder to same time and desirable  
representation.

		6) a proofing workflow


	So what do we have. We need a a format that is not based on an  
existing format, is modualr and extensible.
	Either we start from scratch or use a generic format. SGML or XML  
come to mind. We can then put in waht we
	want and need, have a well structured format, can extend it easily  
and it is modular. Plus, XML can handle all kind
	of information an data.

	Yes, we have to reinvent the wheel for markup, but we want a  
representation that contains as much information
	as possible. The question would be how much is needed. At least the  
markup will be a layout format.

	It should only take about a month to create such a format. The other  
parts will take a little longer. The important thing
	is everything has to be centered around the representation format and  
not the output. The output is handle
	by parsers. Where a particular output format can handle or represent  
a particular feature can be a concern of the
	PG internal representation. The developers of the output format can  
converted it to what ever the seem fittest.


	regards
		Keith. 

From schultzk at uni-trier.de  Tue Sep 15 02:09:05 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Tue, 15 Sep 2009 11:09:05 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAF49F5.4090206@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<4AAF49F5.4090206@perathoner.de>
Message-ID: <2BC18EEF-94F6-43C9-BB98-579E73E5298B@uni-trier.de>

Hi,


Am 15.09.2009 um 10:01 schrieb Marcello Perathoner:

> Al Haines (shaw) wrote:
>
>> Text will never be dead.  It's portable to all platforms, doesn't  
>> need a browser or a PDF-like reader, only the most basic editor.
	True text will never go away! Yet, the way it is represented will  
change.
	It also, depends on how you define TEXT.

>
>
> Its not portable to cellphones.
	Strange? I get text messages all the time ;-))

>
> While every modern cellphone comes with a browser I have never seen  
> one with an editor.
	I have a editor on mine. iPhones and Blackberries have them. Of  
course the are not modern. ;-))

	regards
		Keith.



From marcello at perathoner.de  Tue Sep 15 02:10:42 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 15 Sep 2009 11:10:42 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
Message-ID: <4AAF5A12.8040505@perathoner.de>

David Starner wrote:

>> The reason that there isn't a TEI file posted along with *every* ebook is
>> that most PPers at DP don't care to produce one.
> 
> And the reason for that is not only is it a lot more work than an HTML
> edition, unsupported by any sort of tools, it's worthless to the end
> user, as apparently no one at PG can get decent output from it.


Thats a lot of misinformation in such a short paragraph.


1. More work ...

Of course you have to learn TEI, as you had to learn HTML. No difference 
there.

Once you have mastered it, it is actually a lot less work, because TEI 
was designed for text preservation, while HTML was designed to bring 
scientific papers online.

It is also less work because from one master you get the HTML, the TXT 
and the PDF.

It is also less work fixing errata, because you fix the master instead 
of having to fix 2 or 3 different files.


2. Unsupported by tools ...

PG has an implementation of TEI. I know you don't like it because you 
haven't figured out how to produce pretty title pages. But you don't 
have to use that one, there are plenty other.

TEI is being used by many projects:

   http://www.tei-c.org/Activities/Projects/

and has a full suite of tools:

   http://wiki.tei-c.org/index.php/Category:Tools


3. Worthless to the end user ...

TEI is a master format. Its use is in producing formats suitable for 
end-user consumption.

Anf if we don't equate end user == reader but try: end user == librarian 
or end user == lunguistic researcher we find that TEI is many times as 
useful as HTML.


4. No decent output ...

`Decentness? is a matter of debate.

At DP some PPers think it is essential to use every CSS feature at least 
once in every text, having pictures float right and left and text 
flowing around them and having illuminated dropcaps and printers 
ornaments and page numbers all over the place. PGTEI cannot (yet) do that.

I very much prefer a simple layout, with only essential pictures smack 
in the middle of the text flow at the point they logically belong. A 
formatting that is easily ported to all existing devices. PGTEI excels 
at this.


Ironically `decent? DP output is already falling to pieces on ePub 
devices (not even to mention Mobipocket) because ePub does not support 
CSS position: absolute.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From marcello at perathoner.de  Tue Sep 15 02:20:08 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Tue, 15 Sep 2009 11:20:08 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
Message-ID: <4AAF5C48.1040201@perathoner.de>

Keith J. Schultz wrote:

> We need a a format that is not based on an existing format, ...

Why not?


> ... but we want a representation that contains as much information as
> possible.
> It should only take about a month to create such a format.

ROTFL




-- 
Marcello Perathoner
webmaster at gutenberg.org

From pterandon at gmail.com  Tue Sep 15 04:02:16 2009
From: pterandon at gmail.com (Greg M. Johnson)
Date: Tue, 15 Sep 2009 07:02:16 -0400
Subject: [gutvol-d] World's most heavily pirated books
Message-ID: <a0bf3e960909150402u76992831k6791a07fd4c06320@mail.gmail.com>

I was listening to some podcast (I have since forgotten exactly which one,
but it was about the celebration of sci-fi culture) where they talked about
"the world's most pirated books."   At first one of the hosts went into a
little tirade that *book  *piracy was the purest form of evil-- worse than
any other, I guess because the book publishers weren't as evil or something.

The list did include a recent book about Photoshop, which is unfortunate.
But the list also included *The Kama Sutra*, and a few other really old
classics.  Eventually it dawned on me, if not entirely to both hosts as
well, that they were conflating "piracy" with "downloading by bit torrent."

< s i g h >.



-- 
Greg M. Johnson
http://pterandon.blogspot.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090915/aacbcd2d/attachment.html>

From Bowerbird at aol.com  Tue Sep 15 14:34:17 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 15 Sep 2009 17:34:17 EDT
Subject: [gutvol-d] z.m.l. can do what you want
Message-ID: <cd9.5ae92b3a.37e16259@aol.com>

z.m.l. can do what you want.

as soon as you're ready to act, and not just yak.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090915/14f8821e/attachment.html>

From jimad at msn.com  Tue Sep 15 16:18:57 2009
From: jimad at msn.com (James Adcock)
Date: Tue, 15 Sep 2009 16:18:57 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
Message-ID: <SNT120-DS16F874FCB85B82D2E9B03EAEE30@phx.gbl>

>I think I, and any other followers of this thread, will need an example of 
"not getting the words from the words".

Okay, let's go over a number of simple examples:

Consider Michael's thesis of the "goodness" of viewing PG texts on
cellphones.  Which is a "good" submission format for submitting a
transcription of Shakespeare to be read on a cellphone, PG txt format, or
HTML?

Answer: Neither file format works worth a dang for specifying Shakespeare to
be read on cellphones. Yet both file formats contain the lists of the words.

-- Even 400 years ago authors understood the importance of formatting and
printing decisions to represent the meanings of words -- artistic writings
ARE NOT just lists of words -- even when those words are clearly intended to
be spoken out loud.

Here's a brief excerpt from Dove:

"'Go?'" he wondered. "Go when, go where?"

And another one:

She particularly likes you.

Yes, you can read these words and you will assign meanings to these words
but you will not get the author's intent because the author understood that
he needed to put additional information in the printing so that you can
understand his intent.  This is particularly important in the Henry James
because what he is writing is deliberately ambiguous and confusing in the
first place, so much so that he has to disambiguate in order to reduce the
degree of ambiguity in what he is writing -- while still deliberately
leaving the reader dazed and confused -- but not so confused as to think
(incorrectly) that they understand what is going on.

I guess I can put some txt representation of Tristram Shandy here, but what
would be the point? 

He's gone! said my uncle Toby Where? Who? cried my father My nephew, said my
uncle Toby What, without leave, without money, without governor? cried my
father in amazement No he is dead, my dear brother, quoth my uncle Toby
Without being ill? cried my father again I dare say not, said my uncle Toby,
in a low voice, and fetching a deep sigh from the bottom of his heart; he
has been ill enough, poor lad! I'll answer for him, for he is dead.

Yes, once again, you can read the words and you will assign meaning to them
-- but not the meaning intended by the author, because the txt is missing
information that the author found important to include so that you can
understand his meaning -- to the extent that he wanted you to understand his
meaning which again was partial in the first place.

I'm not saying that there is no place in the world for txt -- as archy
demonstrated clearly back in 1916:

expression is the need of my soul

And you can read this entire email and still come back and complain that you
don't understand what I am talking about and in making this complaint you
once again make my point for me:  The reason that you don't understand what
I am talking about is that I am writing this email using txt and the authors
given as examples above were writing in a style requiring representation
richer than mere PG txt. Go and find the author's original representations
and read them there because PG txt simply doesn't cut it to represent their
work. Read what they wrote and ask yourself what it takes to actually
implement the author's intent, either automagically, or even
semiautomagically, on a variety of differing reader devices -- including,
but not limited to -- teletypes and their software emulators [which is
essentially what txt devices are, including this email system and notepad,
etc

PS: If you can read this email at all please note that it's because I
*didn't* write it following PG txt conventions.



From jimad at msn.com  Tue Sep 15 16:38:54 2009
From: jimad at msn.com (James Adcock)
Date: Tue, 15 Sep 2009 16:38:54 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
Message-ID: <SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>

...So what is needed...

Yes, except I don't think it's as bad as you make it out to be. TEI and/or
PG-TEI could be a good intermediate formal file format.  DP markup [and
conventions] could be a good preliminary editing markup format. Editing
doesn't necessarily need to be WYSIWYG.  Input formatted files don't have to
be perfect since they are living documents, as opposed to current "write
once" output formatted files.  Conversion from an input format file to
output rendering formats such as txt or html or the various other reflow
formats doesn't have to be perfect -- as long as the input format to output
format rendering software does more work than the current tools for the job
-- which basically is none. You probably have to store CSS or other style
choices representation to help reconstruct how the original volunteers chose
to render the input file format to the output rendering file format. [Where
I am assuming here that html is simply being used as an output rendering
file format, so that we don't have to argue anymore about the "correct"
semantic use of html -- we would say that the semantics are being
represented in the input file format, not in the html]

Again, this is all trying to address at least three problems:

1) How do you represent the author's intention without deliberately throwing
away information?

2) How do you make the files submitted by volunteers be "living documents"
rather than "write once" documents -- which other volunteers can pick up and
improve on in the future without having to go back to original scans and
rework the work "from scratch" ?

3) How do you support as best as possible various output rendering file
formats most appropriate for various reader devices? -- of which PG
*already* "officially" recognizes literally about 80 different output file
formats of differing complexities!



From Bowerbird at aol.com  Tue Sep 15 19:05:25 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Tue, 15 Sep 2009 22:05:25 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <d0b.570bdcd9.37e1a1e5@aol.com>

jim, i don't think you're saying anything
that hasn't already been said here before.

many times.   over the course of years.

i just think you're saying it less clearly...

but hey, if anyone thinks jim _is_ saying
something that can use some attention,
please do tell us just exactly what it is...

thanks.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090915/8aac1c2f/attachment.html>

From gbuchana at teksavvy.com  Tue Sep 15 20:14:42 2009
From: gbuchana at teksavvy.com (Gardner Buchanan)
Date: Tue, 15 Sep 2009 23:14:42 -0400
Subject: [gutvol-d] Re: z.m.l. can do what you want
In-Reply-To: <cd9.5ae92b3a.37e16259@aol.com>
References: <cd9.5ae92b3a.37e16259@aol.com>
Message-ID: <4AB05822.2020906@teksavvy.com>

Bowerbird at aol.com wrote:
> z.m.l. can do what you want.
> 

I know BB talks a good deal about ZML and it sounds
pretty cool, but I've googled till my fingers bleed and
I cannot tell for sure what exact ZML he has in mind.

This ZML? http://rx4rdf.liminalzone.org/ZMLMarkupRules
Or this one? http://sourceforge.net/projects/zeitung-ml/
Or this? http://www.seas.gwu.edu/~bell/publications/zml-report.pdf
Or this? http://nt-appn.comp.nus.edu.sg/fm/zml/

There's a lot to choose from.  The first one includes
something resembling a specification and a Sourceforge
project, so I bet that's it.  Right?

============================================================
Gardner Buchanan                     <gbuchana at teksavvy.com>
Ottawa, ON             FreeBSD: Where you want to go. Today.

From jimad at msn.com  Tue Sep 15 19:49:13 2009
From: jimad at msn.com (Jim Adcock)
Date: Tue, 15 Sep 2009 19:49:13 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAF5A12.8040505@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
Message-ID: <SNT120-DS20267EC119DCA4447F6EF1AEE20@phx.gbl>

> PG has an implementation of TEI. 

How does one learn more and/or access the "PG implementation of TEI."  I
have seen PG TEI which looks to me to add some tags to the base TEI ?

Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers
than anything describing HTML tags that I have seen!  I'd say the DP tagging
documentation is already painful enough for most of us.  I am about 100
pages into the TEI documentation, so in maybe two weeks I can tell you more
about what I think about it....




From jimad at msn.com  Tue Sep 15 19:56:07 2009
From: jimad at msn.com (Jim Adcock)
Date: Tue, 15 Sep 2009 19:56:07 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAF5C48.1040201@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<4AAF5C48.1040201@perathoner.de>
Message-ID: <SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>

As an example, I just tried auto-magically unwrapping some PG txt because I
don't like the char count per line choices forced by PG and the assumed size
of the txt display that PG assumes -- which is NOT the size of MY txt
display.  This is what then ended up being displayed on MY choice of txt
display, once I applied the txt unwrapping algorithm:

Ham. To be, or not to be,--that is the question:-- Whether 'tis
nobler in the mind to suffer The slings and arrows of outrageous
fortune Or to take arms against a sea of troubles, And by 
opposing end them? --To die,--to sleep,--No more; and by a sleep
to say we end The heartache, and the thousand natural shocks That
flesh is heir to,--'tis a consummation Devoutly to be wish'd. To
die,--to sleep;-- To sleep! perchance to dream:--ay, there's the 
rub; For in that sleep of death what dreams may come, When we have
shuffled off this mortal coil, Must give us pause: there's the
respect That makes calamity of so long life; For who would bear
the whips and scorns of time, The oppressor's wrong, the proud
man's contumely, The pangs of despis'd love, the law's delay, The
insolence of office, and the spurns That patient merit of the 
unworthy takes, When he himself might his quietus make With a bare
bodkin? who would these fardels bear, To grunt and sweat under
a weary life, But that the dread of something after death,--
The undiscover'd country, from whose bourn No traveller returns,
--puzzles the will, And makes us rather bear those ills we have
Than fly to others that we know not of? Thus conscience does make
cowards of us all; And thus the native hue of resolution Is
sicklied o'er with the pale cast of thought; And enterprises of
great pith and moment, With this regard, their currents turn awry,
And lose the name of action.--Soft you now! The fair Ophelia!
--Nymph, in thy orisons Be all my sins remember'd.

Now maybe to some of you -- you consider this result to be a good thing, an
acceptable thing, a thing that well-represents the considerable efforts of
the PG volunteers.

But personally, I do not think so.



From jimad at msn.com  Tue Sep 15 15:05:52 2009
From: jimad at msn.com (Jim Adcock)
Date: Tue, 15 Sep 2009 15:05:52 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com>
Message-ID: <SNT120-DS2029830449090BDA7E449DAEE20@phx.gbl>

>Those sculptors who choose to work in ice are rarely remembered well
by later ages. Sculptors who work in iron and bronze can easily be
remembered for several millennia. The choice is the artist's.

Except when what we are talking about is transcribers scratching other
artist's works into mud tablets with (at best) a pointy stick.



From ajhaines at shaw.ca  Tue Sep 15 22:02:10 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 15 Sep 2009 22:02:10 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
	<SNT120-DS20267EC119DCA4447F6EF1AEE20@phx.gbl>
Message-ID: <6BB9B881D0244685AE0AF5E602017D03@alp2400>

http://pgtei.pglaf.org/



----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Tuesday, September 15, 2009 7:49 PM
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT


>> PG has an implementation of TEI.
>
> How does one learn more and/or access the "PG implementation of TEI."  I
> have seen PG TEI which looks to me to add some tags to the base TEI ?
>
> Again, TEI P5 is 1350 pages, which is a lot more inaccessible to 
> volunteers
> than anything describing HTML tags that I have seen!  I'd say the DP 
> tagging
> documentation is already painful enough for most of us.  I am about 100
> pages into the TEI documentation, so in maybe two weeks I can tell you 
> more
> about what I think about it....
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 



From ajhaines at shaw.ca  Tue Sep 15 22:13:09 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Tue, 15 Sep 2009 22:13:09 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<4AAF5C48.1040201@perathoner.de>
	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>
Message-ID: <DAD3C651E3644ED4B559863197320FD9@alp2400>

It's clearly stated in PG Volunteers' FAQ V.89 
(http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F)
that if you want to prevent unwanted wrapping, lines that should not be 
wrapped should be indented a space or two.

In PG's older etexts, that predated this standard, the technique was used 
only sporadically.  However, whenever an older text is cleaned up and 
reposted, it *is* applied where necessary.



----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Tuesday, September 15, 2009 7:56 PM
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT


> As an example, I just tried auto-magically unwrapping some PG txt because 
> I
> don't like the char count per line choices forced by PG and the assumed 
> size
> of the txt display that PG assumes -- which is NOT the size of MY txt
> display.  This is what then ended up being displayed on MY choice of txt
> display, once I applied the txt unwrapping algorithm:
>
> Ham. To be, or not to be,--that is the question:-- Whether 'tis
> nobler in the mind to suffer The slings and arrows of outrageous
> fortune Or to take arms against a sea of troubles, And by
> opposing end them? --To die,--to sleep,--No more; and by a sleep
> to say we end The heartache, and the thousand natural shocks That
> flesh is heir to,--'tis a consummation Devoutly to be wish'd. To
> die,--to sleep;-- To sleep! perchance to dream:--ay, there's the
> rub; For in that sleep of death what dreams may come, When we have
> shuffled off this mortal coil, Must give us pause: there's the
> respect That makes calamity of so long life; For who would bear
> the whips and scorns of time, The oppressor's wrong, the proud
> man's contumely, The pangs of despis'd love, the law's delay, The
> insolence of office, and the spurns That patient merit of the
> unworthy takes, When he himself might his quietus make With a bare
> bodkin? who would these fardels bear, To grunt and sweat under
> a weary life, But that the dread of something after death,--
> The undiscover'd country, from whose bourn No traveller returns,
> --puzzles the will, And makes us rather bear those ills we have
> Than fly to others that we know not of? Thus conscience does make
> cowards of us all; And thus the native hue of resolution Is
> sicklied o'er with the pale cast of thought; And enterprises of
> great pith and moment, With this regard, their currents turn awry,
> And lose the name of action.--Soft you now! The fair Ophelia!
> --Nymph, in thy orisons Be all my sins remember'd.
>
> Now maybe to some of you -- you consider this result to be a good thing, 
> an
> acceptable thing, a thing that well-represents the considerable efforts of
> the PG volunteers.
>
> But personally, I do not think so.
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 



From jimad at msn.com  Tue Sep 15 23:12:23 2009
From: jimad at msn.com (James Adcock)
Date: Tue, 15 Sep 2009 23:12:23 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>
Message-ID: <SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>

As an example of how much author semantic information is lost going from an
author's writing to PG txt format, I went and compared differences between a
recent HTML and PG TXT I did -- where after doing the TXT encoding I went
back and did three more passes over the images to add back in semantic
differences to the HTML that the PG TXT didn't represent.

Now the reality would be that it would take say TEI not HTML to represent
all of the author's intent.  But measuring the loss going from HTML back to
TXT gives an order of magnitude estimate of how much author information we
are throwing away by representing a work in PG TXT.  In the case of this
book, the answer was more than 1000 "losses" -- or an average of about 3
losses per page.  And this is NOT counting about an addition 1000 losses in
representation of emphasis.

Now, let's say we have a PG TXT and some volunteer in the future wants to go
back from that txt and say as correctly as possible represent that text
using PDF.  How many "errors" does that volunteer need to correctly find
where the TXT file loses author's semantic information by carefully
comparing the page images to the PG TXT file, reintroducing information
known to the original volunteer transcribers, but discarded as not being
representable in PG TXT?  The answer is that this volunteer has to find and
fix the txt in literally about 2000 places.  Want to place a bet on how many
of those 2000 places the volunteer trying to create an accurate PDF file is
actually going to "catch" ???  I can tell you in my efforts going from PG
TXT to HTML in the first place it's a good part of a week's work -- not to
imply *I* caught them all either!






From sankarrukku at gmail.com  Wed Sep 16 00:41:24 2009
From: sankarrukku at gmail.com (Sankar Viswanathan)
Date: Wed, 16 Sep 2009 13:11:24 +0530
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>
	<SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>
Message-ID: <e45c9fe70909160041r76a2eb8axc48005834c90534@mail.gmail.com>

The PG texts are produced by Volunteers. Individual producers and the post
processors of D.P. Text file is only the minimum requirement stipulated by
PG. It is upto the independent producers and the post-processors of DP to
decide in what formats the book should be submitted.

PG has no control over the format submitted. The White Washers check the
files and post them.

TEI is not popular either with the independent producers or the post
processors.

We could discuss this till the cows come home. But the solution is in the
hands of the independent producers and post processors of DP.

On Wed, Sep 16, 2009 at 11:42 AM, James Adcock <jimad at msn.com> wrote:

> As an example of how much author semantic information is lost going from an
> author's writing to PG txt format, I went and compared differences between
> a
> recent HTML and PG TXT I did -- where after doing the TXT encoding I went
> back and did three more passes over the images to add back in semantic
> differences to the HTML that the PG TXT didn't represent.
>
> Now the reality would be that it would take say TEI not HTML to represent
> all of the author's intent.  But measuring the loss going from HTML back to
> TXT gives an order of magnitude estimate of how much author information we
> are throwing away by representing a work in PG TXT.  In the case of this
> book, the answer was more than 1000 "losses" -- or an average of about 3
> losses per page.  And this is NOT counting about an addition 1000 losses in
> representation of emphasis.
>
> Now, let's say we have a PG TXT and some volunteer in the future wants to
> go
> back from that txt and say as correctly as possible represent that text
> using PDF.  How many "errors" does that volunteer need to correctly find
> where the TXT file loses author's semantic information by carefully
> comparing the page images to the PG TXT file, reintroducing information
> known to the original volunteer transcribers, but discarded as not being
> representable in PG TXT?  The answer is that this volunteer has to find and
> fix the txt in literally about 2000 places.  Want to place a bet on how
> many
> of those 2000 places the volunteer trying to create an accurate PDF file is
> actually going to "catch" ???  I can tell you in my efforts going from PG
> TXT to HTML in the first place it's a good part of a week's work -- not to
> imply *I* caught them all either!
>
>
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>



-- 
Sankar

Service to Humanity is Service to God
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/c15381cb/attachment.html>

From Bowerbird at aol.com  Wed Sep 16 01:09:11 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 04:09:11 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c98.56a8f41e.37e1f727@aol.com>

jim said:
>    Again, TEI P5 is 1350 pages, which is a lot more inaccessible 
>    to volunteers than anything describing HTML tags that I have seen!? 
>    I'd say the DP tagging documentation is already painful enough 
>    for most of us.? I am about 100 pages into the TEI documentation, so 
>    in maybe two weeks I can tell you more about what I think about it...

wow, jim looks to be a bit masochistic.   could be a perfect candidate.    
:+)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/d03dc5bc/attachment.html>

From Bowerbird at aol.com  Wed Sep 16 01:22:44 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 04:22:44 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c0b.651a9757.37e1fa54@aol.com>

jim said:
>    Now maybe to some of you -- you consider this result 
>    to be a good thing, an acceptable thing, a thing that 
>    well-represents the considerable efforts of the PG volunteers.

i doubt there is anyone who thinks that.

what you ended up with is pure shit.

that's because you did it wrong.

you rewrapped lines that weren't supposed to be rewrapped.

if you would have done it correctly, it would've come out right.

but you did it wrong.

now, your point is probably that we should make it easier
for our users to do it correctly.   and nobody will disagree...

we _should_ make it easier for our users to do it correctly.

and there's an (awfully) easy way to make it easier, which is to
mark all lines that should not be rewrapped with leading spaces.

but the whitewashers won't do it.

why won't the whitewashers do it?

i dunno.   you'll have to ask them.   i've certainly asked them.
i've asked them to do it, pretty please.   i've asked them again
to do it, pretty please.   i've asked 'em why they haven't done it.
i've said, repeatedly, that i think it's stupid they haven't done it.

and they still don't do it.   not all the time, anyway.

they do it some of the time.   i consider that a very slight victory.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/eaaed885/attachment-0001.html>

From Bowerbird at aol.com  Wed Sep 16 01:32:55 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 04:32:55 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c5f.5d59406e.37e1fcb7@aol.com>

jim said:
>    In the case of this book, the answer was more than 1000 "losses" -- 
>    or an average of about 3 losses per page.? And this is NOT counting 
>    about an addition 1000 losses in representation of emphasis.

what was the book?   i'd like to compare the versions myself.

the question is, "why are you having _any_ losses in the .txt files?"

the answer, i am sure, will once again be, "you're doing it wrong".
it sucks, yeah, but it will be important to fix your broken workflow.

seriously, if you are stripping out emphasis, you're making a mistake.
(unless the "emphasis" had no meaning, and was simply ornate decor.)

in the meantime, if you do things intentionally that harm the .txt file,
then yes, the .txt file is going to seem awfully incapable to you, so it's
not a big surprise that you keep wanting to insist that such is the case.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/bcb79a46/attachment.html>

From marcello at perathoner.de  Wed Sep 16 02:03:08 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 16 Sep 2009 11:03:08 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS20267EC119DCA4447F6EF1AEE20@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>	<4AAF5A12.8040505@perathoner.de>
	<SNT120-DS20267EC119DCA4447F6EF1AEE20@phx.gbl>
Message-ID: <4AB0A9CC.4020608@perathoner.de>

Jim Adcock wrote:

> How does one learn more and/or access the "PG implementation of TEI."  I
> have seen PG TEI which looks to me to add some tags to the base TEI ?

Adds some very few, restricts some others, specifies the usage of the 
rend attribute.


> Again, TEI P5 is 1350 pages, which is a lot more inaccessible to volunteers
> than anything describing HTML tags that I have seen!  I'd say the DP tagging
> documentation is already painful enough for most of us.  I am about 100
> pages into the TEI documentation, so in maybe two weeks I can tell you more
> about what I think about it....

Don't read the full TEI Guidelines. Read about TEI-Lite:

   http://www.tei-c.org/release/doc/tei-p5-exemplars/html/teilite.doc.html

which is a lot shorter than the HTML4 specs. You don't even have to use 
all of TEI Lite.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Wed Sep 16 02:14:24 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 05:14:24 EDT
Subject: [gutvol-d] Re: z.m.l. can do what you want
Message-ID: <be9.5c4a6668.37e20670@aol.com>

gardner said:
>   I know BB talks a good deal about ZML 
>    and it sounds pretty cool

wow, that almost sounds like it could be a compliment.


>    but I've googled till my fingers bleed and
>    I cannot tell for sure what exact ZML he has in mind.

oh, you just should have asked me, gardner.

google can't find everything, especially if something
ain't all that concerned with being found by google...

on the other hand, it's pretty easy to remember it:
>    http://z-m-l.com

and oh yeah, that stands for "zen markup language".

***

at the site, click "see some examples." which takes you to:
>    http://www.z-m-l.com/go/vl3.pl

you've come to a page that has filename links on the left,
and then a respective button for each one of those files...

let's say you click the link for the top file -- test-suite.zml:
>    http://www.z-m-l.com/go/test-suite.zml

make sure to click the text link -- don't click the button yet.

the file you're viewing -- the test-suite for project gutenberg
-- has been "marked up" with z.m.l. -- zen markup language.

(z.m.l. is "zen", so you won't actually "see" any markup,
at least not any anglebrackets with tags inside of them.
but you'll see that it's formatted regularly; that's z.m.l.)

i've appended the topmost lines from that file -- 
the cover page and some lines from the contents 
--   to this post.   once again, this is the .zml file...

***


now click the "back" button to go back to this page:
>    http://www.z-m-l.com/go/vl3.pl

this time, click on the "test-suite" button on the right side.

this will perform the conversion of the .zml file into .html, 
and take you to the resultant .html file right on the web...
text colors are used to signify different structural elements.

you can save this .html to your own machine to examine it.

or put it side-by-side with the .zml file to see the conversion.

this .html, with its c.s.s., certainly won't be your cup-of-tea,
but changing the c.s.s. is a mere matter of template editing.
it validates as is, to .html 4.01, which might work for kindle,
or might not, i haven't done any checking on that specifically,
but again, we modify the conversion by editing the template.

i pointed you to the test-suite first, because reading it will
give you an introduction to the overall philosophy of it all...

also, the second file is "the 11 rules of z.m.l.", which might
also give you a good orientation, even if the file is quite old.
(but as i'm reluctant to tie anything down, it's not outdated.)

i can also turn out a mean .pdf, with that conversion routine,
and since you can size the .pdf page (and all other variables,
like font, fontsize, leading, margins, etc.) however you want,
that ends up being quite usable on a fixed-size e-ink screen.

so we have .html (important in mounting a web version), and
.pdf (for those situations where it's the user-chosen solution),
but the best part of all is that zml-viewers are easy to program.

i've coded them in basic and perl, and python should be simple.
it's very fundamental coding, so it'll be portable to any language.
it also facilitates open-source efforts, because it's easy to hack...

zml-viewer-programs turn the .zml file into a beautiful e-book,
customized to the user's preference, and offering a wide variety
of high-functionality capabilities that make a .zml file powerful.

this combination of beauty and power make z.m.l. hard to beat.

-bowerbird

p.s.   the top of the test-suite file goes like this:


 the test-suite for
 project gutenberg


 a document containing the
 full range of features found
 in project gutenberg e-texts


 by bowerbird intelligentleman


 greetings, earthling...

 this is an e-text brought to you by project gutenberg,
 a 35-year-old volunteer effort to put literature online.

 please see the web-site for news and information on
 usage conditions for e-texts, volunteering, and more...

 http://www.gutenberg.org






 table of contents


 the test-suite for project gutenberg
 table of contents
 dedication
 chapter 1 -- welcome aboard
 chapter 2 -- the sections of the book
 chapter 3 -- text "styling"


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/5b79b0e5/attachment.html>

From marcello at perathoner.de  Wed Sep 16 02:19:03 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 16 Sep 2009 11:19:03 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <DAD3C651E3644ED4B559863197320FD9@alp2400>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>	<4AAF5C48.1040201@perathoner.de>	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>
	<DAD3C651E3644ED4B559863197320FD9@alp2400>
Message-ID: <4AB0AD87.4050704@perathoner.de>

Al Haines (shaw) wrote:

> It's clearly stated in PG Volunteers' FAQ V.89 
> (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F) 
> 
> that if you want to prevent unwanted wrapping, lines that should not be 
> wrapped should be indented a space or two.

This `markup? does not distinguish between poetry and a block quote. A 
block quote should be indented *and* rewrapped.


And the Rewrap Blues is only part of the problem ...

Another formidable challenge is to recover the chapter headings and 
other headings to make them stand out and to build a TOC.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From sankarrukku at gmail.com  Wed Sep 16 03:56:08 2009
From: sankarrukku at gmail.com (Sankar Viswanathan)
Date: Wed, 16 Sep 2009 16:26:08 +0530
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AB0AD87.4050704@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<4AAF5C48.1040201@perathoner.de>
	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>
	<DAD3C651E3644ED4B559863197320FD9@alp2400>
	<4AB0AD87.4050704@perathoner.de>
Message-ID: <e45c9fe70909160356m71af0793h2ac03dc4c02416a1@mail.gmail.com>

DP produces TEI text. But very few post processors take to the TEI route.

Why?

The software Guiguts automatically converts the formatted text to html. You
need not know much about HTML. The html output only needs to be tweaked at
times. Even that is not necessary in all cases.

Even with this scenario there has been a reluctance on the part of many post
processors to do a html version. DP does not insist on a html version. But
most of the Project Managers do insist on a html version. Even then there
are DP projects which are posted only in the text format.

For TEI to become popular we need a software which would automatically
convert the TEI text to a final TEI version. Is it possible?

I saw a software here. How good is it?

http://www.tei-c.org/Talks/Forli/2006/conversion.xml

On Wed, Sep 16, 2009 at 2:49 PM, Marcello Perathoner <marcello at perathoner.de
> wrote:

> Al Haines (shaw) wrote:
>
>  It's clearly stated in PG Volunteers' FAQ V.89 (
>> http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F)
>>
>> that if you want to prevent unwanted wrapping, lines that should not be
>> wrapped should be indented a space or two.
>>
>
> This `markup? does not distinguish between poetry and a block quote. A
> block quote should be indented *and* rewrapped.
>
>
> And the Rewrap Blues is only part of the problem ...
>
> Another formidable challenge is to recover the chapter headings and other
> headings to make them stand out and to build a TOC.
>
>
>
> --
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>



-- 
Sankar

Service to Humanity is Service to God
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/0adfe57d/attachment-0001.html>

From traverso at posso.dm.unipi.it  Wed Sep 16 04:45:00 2009
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Wed, 16 Sep 2009 13:45:00 +0200 (CEST)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <e45c9fe70909160356m71af0793h2ac03dc4c02416a1@mail.gmail.com>
	(message from Sankar Viswanathan on Wed, 16 Sep 2009 16:26:08 +0530)
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<4AAF5C48.1040201@perathoner.de>
	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>
	<DAD3C651E3644ED4B559863197320FD9@alp2400>
	<4AB0AD87.4050704@perathoner.de>
	<e45c9fe70909160356m71af0793h2ac03dc4c02416a1@mail.gmail.com>
Message-ID: <20090916114500.232CB10074@cardano.dm.unipi.it>


People use guiguts because it produces HTML, but mainly because it
includes gutcheck, aspell, wordcount routines, and integrates display
of the text and of the image corresponding to the text cursor
position. The only route to have more TEI submissions is to have a
version of guiguts producing TEI instead of HTML.

And of course improve the automatic conversion from TEI to HTML

Carlo

From marcello at perathoner.de  Wed Sep 16 05:40:47 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 16 Sep 2009 14:40:47 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <20090916114500.232CB10074@cardano.dm.unipi.it>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>	<4AAF5C48.1040201@perathoner.de>	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>	<DAD3C651E3644ED4B559863197320FD9@alp2400>	<4AB0AD87.4050704@perathoner.de>	<e45c9fe70909160356m71af0793h2ac03dc4c02416a1@mail.gmail.com>
	<20090916114500.232CB10074@cardano.dm.unipi.it>
Message-ID: <4AB0DCCF.2000205@perathoner.de>

Carlo Traverso wrote:

> People use guiguts because it produces HTML, but mainly because it
> includes gutcheck, aspell, wordcount routines, and integrates display
> of the text and of the image corresponding to the text cursor
> position. The only route to have more TEI submissions is to have a
> version of guiguts producing TEI instead of HTML.

That should be trivial.


> And of course improve the automatic conversion from TEI to HTML

In my copius free time ...


-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Wed Sep 16 08:27:45 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 11:27:45 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c26.65b50338.37e25df1@aol.com>

carlo said:
>   The only route to have more TEI submissions is 
>    to have a version of guiguts producing TEI instead of HTML.

yeah, right.   extend the rickety workflow right out the window.
that's the ticket.   that will make everything all better, for sure...

and thus again we learn to appreciate the open-source approach.


>    And of course improve the automatic conversion from TEI to HTML

and the automatic .pdf conversion.   and all the other conversions.

or, you know, just use all the "standard" routines already out there,
in this thoroughly-explored and well-documented standards arena.

or hey, maybe once we get an e-book into .tei form, we can just
stand back and admire it, and pat our backs on our achievement.

who needs to convert to .html, when we can just bask in our glory?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/65a98fc6/attachment.html>

From Bowerbird at aol.com  Wed Sep 16 08:55:17 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 11:55:17 EDT
Subject: [gutvol-d]  re: In search of a more-vanilla vanilla TXT
Message-ID: <c02.6608ed15.37e26465@aol.com>

jim said:
>   Which is a "good" submission format for submitting a
>    transcription of Shakespeare to be read on a cellphone, 
>    PG txt format, or HTML?
>    Answer: Neither file format works worth a dang for 
>    specifying Shakespeare to be read on cellphones. 
>    Yet both file formats contain the lists of the words.

again, jim, you've come up with the wrong answer...

as i've said, repeatedly, the iphone app "eucalyptus"
does a great job of rendering the p.g. e-texts in a way
that makes them quite beautiful, according to reviewers,
and i agree, for the most part.   (it ends up eucalyptus is
kind of flawed as an e-book program when it comes to
some capabilities that i consider vital, such as _search_.
but in terms of rendering the pages, it does that nicely.)

format wonks think the format needs to describe beauty...
it's far better, however, for a viewer-application to elicit it.
because in the end, everything really depends on the viewer.


>    Here's a brief excerpt from Dove:
>    "'Go?'" he wondered. "Go when, go where?"

well, that's on page 227 of this version in google.

>    "And proceed to my business under your eyes?"
>    "Oh dear no -- we shall go."
>    "'Go?'" he wondered. "Go when, go where?"
>    " In a day or two -- straight home. Aunt Maud wishes it now."

http://books.google.com/books?id=B9AOAAAAIAAJ&client=safari&pg=PA227&
ci=89%2C537%2C776%2C177&source=bookclip"

http://books.google.com/books?id=B9AOAAAAIAAJ&pg=PA227&img=1&zoom=3&hl=en&
sig=ACfU3U0L1tXMRlU83MbayfHNmmTl27CrXw&ci=89%2C537%2C776%2C177&edge=0"/

but i don't see anything special about that, anything
that would be missed or lost in the plain-text version.


>    And another one:
>    She particularly likes you.

ok, that's on page 18.   the "you" is italicized, so you should have
put underscores around it, so the viewer knows it's emphasized.

it's also embedded in dialog, so the way you have pulled it out
-- as if it was a paragraph by itself -- is rather misleading here.

again, if you purposely disfigure the plain-text version, then
yes, it will be inferior.   your solution is to stop disfiguring it...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/ce28dfb3/attachment.html>

From prosfilaes at gmail.com  Wed Sep 16 09:21:52 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 16 Sep 2009 12:21:52 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAF5A12.8040505@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
Message-ID: <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>

On Tue, Sep 15, 2009 at 5:10 AM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> Of course you have to learn TEI, as you had to learn HTML. No difference
> there.

But we know HTML, for one. We also have tools that help us with HTML,
for two. For three and the strike-out, I have a host of tools that
will help me edit, verify and view HTML, but there is no Debian
packages for PGTEI. Yes, yes, if I want to spend my hours mucking
around with stuff, I can in theory get it all installed.

> PG has an implementation of TEI. I know you don't like it because you
> haven't figured out how to produce pretty title pages.

Note: by "pretty title pages" Marcello means a title page that looks
like any title page in an actual book. Once again, I grabbed the
nearest books; I have ten books, by ten different publishers,
including two in Esperanto and one in a mixture of Esperanto and
Chinese, and with the exception of one of the English books which
right-justifies its title page, they all follow the basic format of
centered pages, title (new line) author (bottom of page) publisher.
None of them look a darn thing like the title pages PGTEI prints out.

> and has a full suite of tools:
>
> ?http://wiki.tei-c.org/index.php/Category:Tools

I see "To install the filter(s), start Open Office and and follow the
Tools / XML Filter Settings menu. Choose Open Package and locate the
.jar file(s)." Again, no difference at all from stuff that comes
preinstalled.

> 3. Worthless to the end user ...
>
> TEI is a master format. Its use is in producing formats suitable for
> end-user consumption.

Then prove it. If I saw a single document produced from PGTEI that was
suitable for end-user consumption, I might support it.

Look damnit, I was a fan of TEI until I realized that the people who
were going to bring it to PG didn't give a damn about making the
output something we wanted people to see.

> Anf if we don't equate end user == reader but try: end user == librarian or
> end user == lunguistic researcher we find that TEI is many times as useful
> as HTML.

The librarian is never the end-user. The librarian is the person who
makes it available to the end-user. Nobody around here cares about the
linguistic researcher as the end user, and we will never produce files
that are marked up with the type of information--like distinguishing
sentence ending punctuation from the same punctuation used other
ways--that they need. The end user we're targeting is the reader.

> 4. No decent output ...
>
> `Decentness? is a matter of debate.

Which is why you blow at selling this. Until you accept that PGTEI
needs to produce output that meets the standards of the people you're
trying to sell it to, nobody cares.

> At DP some PPers think it is essential to use every CSS feature at least
> once in every text, having pictures float right and left and text flowing
> around them and having illuminated dropcaps and printers ornaments and page
> numbers all over the place. PGTEI cannot (yet) do that.
>
> I very much prefer a simple layout, with only essential pictures smack in
> the middle of the text flow at the point they logically belong. A formatting
> that is easily ported to all existing devices. PGTEI excels at this.

Yes, in fact, some PPers do want to produce an etext that replicates
the original, includes the important illustrated dropcaps (that are
frequently as much a part of the illustration of the book as any other
illustration) and page numbers (that are crucial for much of the
non-fiction that we reproduce, especially if you want to follow the
web of references from one PG era book to another.)

> Ironically `decent? DP output is already falling to pieces on ePub devices
> (not even to mention Mobipocket) because ePub does not support CSS position:
> absolute.

And if you had produced TEI output that could do what people wanted to
do, it's possible that we would have better output on the ePub
devices. Right now, I would be surprised to find that PGTEI can output
at all to ePub, and I wouldn't be surprised if the people who produced
the DP output were happier with the results of their HTML translated
to ePub than your HTML translated to ePub.

-- 
Kie ekzistas vivo, ekzistas espero.

From prosfilaes at gmail.com  Wed Sep 16 09:25:04 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 16 Sep 2009 12:25:04 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS2029830449090BDA7E449DAEE20@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>
	<44414895897048CF95F99FE12CC881FD@alp2400>
	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>
	<6d99d1fd0909142047u5a192b4cv1220494663969e8a@mail.gmail.com>
	<SNT120-DS2029830449090BDA7E449DAEE20@phx.gbl>
Message-ID: <6d99d1fd0909160925l40a2da13t3d5f0d4ffea3227@mail.gmail.com>

On Tue, Sep 15, 2009 at 6:05 PM, Jim Adcock <jimad at msn.com> wrote:
>>Those sculptors who choose to work in ice are rarely remembered well
> by later ages. Sculptors who work in iron and bronze can easily be
> remembered for several millennia. The choice is the artist's.
>
> Except when what we are talking about is transcribers scratching other
> artist's works into mud tablets with (at best) a pointy stick.

Which is exactly what happened to Gilgamesh. I suppose the author
should have thrown a temper tantrum and demanded it be written only on
the finest silk, in which case we wouldn't have a copy.

-- 
Kie ekzistas vivo, ekzistas espero.

From prosfilaes at gmail.com  Wed Sep 16 09:26:27 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 16 Sep 2009 12:26:27 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AB0DCCF.2000205@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<4AAF5C48.1040201@perathoner.de>
	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>
	<DAD3C651E3644ED4B559863197320FD9@alp2400>
	<4AB0AD87.4050704@perathoner.de>
	<e45c9fe70909160356m71af0793h2ac03dc4c02416a1@mail.gmail.com>
	<20090916114500.232CB10074@cardano.dm.unipi.it>
	<4AB0DCCF.2000205@perathoner.de>
Message-ID: <6d99d1fd0909160926u639042d3mb908f5e44dff1453@mail.gmail.com>

On Wed, Sep 16, 2009 at 8:40 AM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> Carlo Traverso wrote:
>> And of course improve the automatic conversion from TEI to HTML
>
> In my copius free time ...

Stop ranting about what others are doing in their free time, then.

-- 
Kie ekzistas vivo, ekzistas espero.

From jimad at msn.com  Wed Sep 16 09:31:07 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 16 Sep 2009 09:31:07 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <e45c9fe70909160041r76a2eb8axc48005834c90534@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>	<SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>	<SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>
	<e45c9fe70909160041r76a2eb8axc48005834c90534@mail.gmail.com>
Message-ID: <SNT120-DS12D1D9FF50506C4C855F07AEE20@phx.gbl>

>PG has no control over the format submitted. 

Nonsense.  I have tried submitting ?other things? and have been told repeatedly that the ?minimum requirements in practice of PG? is that a TXT and an HTML file be submitted, and that these two files pass through a large number of fitness tests required by PG, which in practice includes restrictions on the choice of char sets used in the internal rep of the TXT and of the HTML file.  So, in fact PG DOES have control over the format submitted, and the way PG asserts that control is by refusing to accept submission of formats and details of those formats that they choose not to support.

As a simple counter-example of the above claim ?PG has no control over the format submitted? note that personally I would much rather be submitting TXT files which do not correspond to the PG requirements of including a gratuitous line wrap every 72 chars.  Or if I am required to submit TXT files with line wraps I would much prefer to retain the line wraps of the original text, because it is a royal pain for some future volunteer to have to ?fix? the position of line wraps back to the original text in order to do additional processing of the text file in the future, for example because they want to find and include additional semantic information that can be found in the original page scans, but not in the TXT.  And in practice it is impossible to do this visual analysis unless one matches line breaks to the original page scans ? as DP well knows.

 

Another example from a couple years ago is I asked PG how I could submit MOBI formatted texts of books they already had in other formats.  I was told that I was not allowed to do so.  So I set up an independent website to distribute PG books in MOBI format to my friends in the MOBI community -- retaining the PG licenses and legalese conditions.  Now, as hoped for, some years later PG has decided to support MOBI after all ? at least to some extent. But: what a pain!

 

Why is this important to me?  Well, I happen to like classes of reader machines that the internal mechanizers of PG do not like.  PG likes big teletype like display machines, capable of displaying more than 72 chars per line.  [Your standard PC or Mac still remains fundamentally a teletype emulator] And PG likes tiny machines with extremely limited displays, also known as cell phones.  I personally do not like either of those classes of machines, but rather machines that are middle sized ? small enough that I can pick them up and easily read them while lying in bed late at night for example.  But large enough that I can understand in context the ebb-and-flow of what the author wrote in some surrounding context. With these middle-sized machines issues of text reflow become a central issue in the pleasure (or lack thereof) of being able to use the machine.  And yes there are quite a number of tools one can use to help ?fix? at least partially ?broken? texts re these machines, including Calibre and say Mobipocket Creator.  But I?d rather not have to ?fix? a text each time before I can read it.  And I?d rather it not be  ?broken? in the first place.

 

So, in summary, as a ?volunteer? am I free to do what I want?  Yes, certainly ? but not if I want any of my efforts to ever show up on any PG website! As Bowerbird is only too happy to point out: ?Please feel free to go somewhere else!?

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/5fd33be4/attachment.html>

From jimad at msn.com  Wed Sep 16 09:34:32 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 16 Sep 2009 09:34:32 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <c98.56a8f41e.37e1f727@aol.com>
References: <c98.56a8f41e.37e1f727@aol.com>
Message-ID: <SNT120-DS1583B45AC09BE87687A62BAEE20@phx.gbl>

Well, therein lies the problem.  *I* [or rather I mean to say  _I_] am
masochistic enough to take on TEI, but none of the other volunteers are
willing to join in with me.


>wow, jim looks to be a bit masochistic.  could be a perfect candidate.
:+)



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/0fc36b9f/attachment.html>

From jimad at msn.com  Wed Sep 16 09:56:15 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 16 Sep 2009 09:56:15 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <c0b.651a9757.37e1fa54@aol.com>
References: <c0b.651a9757.37e1fa54@aol.com>
Message-ID: <SNT120-DS15C7B9CCB4E55190B77940AEE20@phx.gbl>

>you rewrapped lines that weren't supposed to be rewrapped.



You make my point for me.  When one relies on automagical tools to try to
recreate semantic information discarded by the PG TXT representation, more
or less often one ends up with something that looks like sh*t - your word
not mine.

 

When the results break one is told "Oh you did it wrong, you should have
done something else instead."

 

Yes of course, but once one relies on human intervention to "fix" the
problem when a particular algorithm breaks, then one does not have an
automatic algorithm.  Ultimately what one should do if one wants to "get it
right" is to abandon attempts at automagical tools which work sometimes and
end up looking like sh*t other times and instead take the PG TXT file, take
the original page scans, look at the page scans to figure out where the PG
TXT files gratuitously entered line breaks where the author didn't intend
line breaks, and take them back out.  After the gratuitous page breaks are
taken back out (the work of a few days - trust me on this!) then one can
either, if one has a machine, such as a teletype, incapable of reflow, run
the now gratuitous-line-break free TXT back through a simple unambiguous
algorithm to insert a line break at the appropriate point for your machine -
at a whitespace prior to char72 if you own a teletypewriter, at a whitespace
prior to char20 perhaps if you own a cellphone.  Or better, if you have a
more modern machine, which really, I think most of us DO have, a machine
capable of calculating reflow itself aka "word wrap" then you just feed the
machine the TXT that doesn't have the gratuitous line breaks and everything
works automagically.  Assuming one is willing to live with ragged right.  Or
tolerate slightly ugly word spacing on machines that force right justify
(sigh.)  Better yet, we should ask our technologist friends to include not
only reflow but also automatic hyphenation routines in our machines.

 

Is it too much, for example, to ask PG to provide the option to the rare
user who actually WANTS line breaks at char 72, or for that matter actually
wants line breaks at char 20, is it too much to ask PG to provide a filter
to insert such "gratuitous" line breaks?  Consider: PG *already* provides
literally 40 different such filter programs to help people with various
strange obscure legacy machines. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/eac3b5f9/attachment.html>

From jimad at msn.com  Wed Sep 16 10:13:40 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 16 Sep 2009 10:13:40 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <c5f.5d59406e.37e1fcb7@aol.com>
References: <c5f.5d59406e.37e1fcb7@aol.com>
Message-ID: <SNT120-DS2486F7D2C42C27D917D970AEE20@phx.gbl>

>what was the book?  i'd like to compare the versions myself.



The book is E-text #29452

 

And contrary to your complaints I didn't "strip out anything."  A more
accurate statement of your complaint is that I didn't waste a whole lot of
my time and energy inventing and manually inserting semantic markings in a
legacy file format that is a hopelessly broken representation of this book
in the first place. If you want to hack up the TXT file some way to make you
more happy feel free to do so - you're a "volunteer" too - I'm certainly not
going to be reading the TXT file, so personally I don't care what you do!

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/87f70d7b/attachment-0001.html>

From richfield at telkomsa.net  Wed Sep 16 09:26:09 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Wed, 16 Sep 2009 18:26:09 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS16F874FCB85B82D2E9B03EAEE30@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<SNT120-DS16F874FCB85B82D2E9B03EAEE30@phx.gbl>
Message-ID: <4AB111A1.2050703@telkomsa.net>

I have to agree with large parts of what James Adcock says.
A lot of it depends on the medium (media in fact), the message, and so 
on. When I write (without any interest in whether I should be writing or 
not, or whether anyone cares) there is considerable rumination, not to 
mention bellyaching, about punctuation, font, typeface, formatting and 
so on, in fact practically anything that could be done in more than one 
way. The fundamental thing is information. Alternative ways of 
representing the information require information to convey them, and 
offer opportunities for conveying the information. Well-conveyed 
information is in that respect at least, beautiful.  The reason that 
most of my presentations are fairly spare is that most of what I have to 
say is fairly directly factual. Conspicuous headings, distinct tables of 
contents, and clear meanings are usually enough for my purposes because 
I am no artist. The reason I struggle with punctuation is that I have my 
own rules, and bugger the grammarians.
My rules are:
If the punctuation doesn't matter, leave it out.
If  it changes the meaning, it does matter. Put it in.
If it does not really change the meaning, but the reader needs to read 
something twice to make sense of it, adapt the punctuation, the sentence 
structure, or even the wording, to provide unconscious, one-pass parsing.
If omitting (or inserting) logically unnecessary punctuation is likely 
to distract or confuse the reader, then don't or do, as the case might be.
Know something about common conventions and their significances so that 
you have some idea of what to flaunt and what to flout.
That about somes it up, sum of it anyway. If that is how simple it is, 
why is it so complicated?
Because I am lousy at noticing when I violate those rules. Many people 
are not that that is more of an excuse than an explanation.
Recently I helped I think a friend with a book that he had written in 
German and translated into English. The book was a straightforward work 
of philosophy, so it should have been easy. Unfortunately, though he is 
literate and intelligent, he had absent-mindedly retained a lot of the 
German commas. It rendered reading of the book such hard work that I 
could not read it in bulk. I was doing double-takes every few 
sentences,  which was more than was needed to ruin my concentration and 
wreck my attempts to remain coherently aware of the thread of 
significance. A sign of mine being a lesser intellect according to 
Whitehead? Definitely, but remember not only that the average intellect 
is less than lesser, but what is worse, it is less lesser than half the 
population. The lesser is who you are writing for more or lesser always. 
Consider "It is a long tail, certainly,' said Alice, looking down with 
wonder at the Mouse's tail; 'but why do you call it sad?' And she kept 
on puzzling about it while the Mouse was speaking, so that her idea of 
the tale was something like this: 'Fury said to a mouse, That he met in 
the house, "Let us both go to law: I will prosecute you. --Come, I'll 
take no denial; We must have a trial: For really this morning I've 
nothing to do." Said the mouse to the cur, "Such a trial, dear Sir, With 
no jury or judge, would be wasting our breath." "I'll be judge, I'll be 
jury," Said cunning old Fury: "I'll try the whole cause, and condemn you 
to death."' 'You are not attending!' said the Mouse to Alice severely. 
'What are you thinking of?' 'I beg your pardon,' said Alice very humbly: 
'you had got to the fifth bend, I think?' 'I had NOT!' cried the Mouse, 
sharply and very angrily." Then again, pace archy, how about something 
like: "Wenn hinter fliegen fliegen fliegen fliegen fliegen fliegen 
hinternach" or "smith who when jones had had had had had had had had had 
had had the judgement of the examiners in his favour" Or which would fit 
the writer's intention better: "You would be the lad for that." or "You 
would be the lad for that." or "You would be the lad for that." How many 
ways with more or less distinct meanings could one place the emphasis in 
"Two twenty-buck tickets for her show I should buy"? If anyone does not 
believe that punctuation matters, try reading "Eats shoots and leaves" 
by Lynne Truss. (If you haven't read it anyway, do yourself a favour and 
read it anyway.)

Now all that is great fun, compared to waiting for Godot with a hangover 
in a hot public lavatory at the terminus of a diesel trucking company in 
Houston, but if you actually wish to write (or convey someone else's 
writing) with efficiency and with respect for the information, the 
author, and the reader, then you will use all the channels of 
information that the medium (media, funiculi, funicula ) that assist 
without increasing the noise to signal ratio. The fact that some authors 
don't need or want it is irrelevant. The right amount is what works 
best, and if he wants nothing, that is the right amount.  It does not 
follow that it is the right amount elsewhere.  The fact that your reader 
can get no end of fun out of Joyce without punctuation, does not mean 
that the same must apply for figurate verse or calligraphic works.

The medium is rarely the message unless the message is about the medium, 
unless you are in one of the bottom-feeding niches, or a great artist, 
but to gird at more powerful notations because less can be made to do, 
for some people, mostly, with some exertion, is poorly persuasive, let 
alone cogent. I never did like Gertrude Stein.

I don't know when where or how anyone will come up with the generally 
universally and perfect notation. (I know when I think they will, but 
that is another story.) All I ask is that they please make it something 
that can be read with a vanilla text reader, no instruction manual, and 
some patience, even if the proper markup interpreter on a great 
audiovisual system or tiny cellphone can give a mind-blasting 
performance. Personally I would like it to start with the vanilla text 
and punctuation and have the markups follow as an appendix, to be 
ignored when unwanted or not understood.

Patience upon a rock

Smiling at grief

Because she is wearing ear plugs. (Of course?)

In  my case I am privileged because I am not dependent on pure txt. If 
necessary I can convert PDF, though I have never learned its internal 
format.

Cheers,

Jon







From jimad at msn.com  Wed Sep 16 10:27:48 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 16 Sep 2009 10:27:48 -0700
Subject: [gutvol-d] Re: z.m.l. can do what you want
In-Reply-To: <be9.5c4a6668.37e20670@aol.com>
References: <be9.5c4a6668.37e20670@aol.com>
Message-ID: <SNT120-DS18E10E4CAAF69E197C659AEE20@phx.gbl>

>   http://z-m-l.com

Well, if "the proof is in the pudding" then I invite the other readers to
compare the z.m.l results for "Scrooge" at:

http://www.z-m-l.com/go/vlconvert.html

to the PG TEI results for "My Antonia" at:

http://www.gutenberg.org/files/19810/19810-pdf.pdf

However, I will point out that neither markup has the proper support
necessary to render correct EPUB nor MOBI file formats.

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/e3701208/attachment.html>

From jimad at msn.com  Wed Sep 16 10:37:40 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 16 Sep 2009 10:37:40 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <c02.6608ed15.37e26465@aol.com>
References: <c02.6608ed15.37e26465@aol.com>
Message-ID: <SNT120-DS2034A76F766E349C91E654AEE20@phx.gbl>

So the "solution" is that all readers should buy an iphone and run
"eucalyptus"

 

I can't disagree with that - it IS relatively trivial to render attractive
text if one knows one is rendering to only one particular machine.

 

Help me out, now tell us readers WHICH cell phone company we ought to switch
to?

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/f99bc088/attachment.html>

From marcello at perathoner.de  Wed Sep 16 10:56:34 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Wed, 16 Sep 2009 19:56:34 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
Message-ID: <4AB126D2.2050706@perathoner.de>

David Starner wrote:


> But we know HTML, for one. We also have tools that help us with HTML,
> for two. For three and the strike-out, I have a host of tools that
> will help me edit, verify and view HTML, but there is no Debian
> packages for PGTEI.

Where's the debian package for guiguts? I had to actually edit the code 
to make it run on my debian/unstable.

nxml-mode in emacs is all you'll ever need to edit and validate xml.

Or use the TEI stylesheets in OpenOffice, if you must needs have 
WYSIWYG. Sheesh!


>> PG has an implementation of TEI. I know you don't like it because you
>> haven't figured out how to produce pretty title pages.
> 
> Note: by "pretty title pages" Marcello means a title page that looks
> like any title page in an actual book. Once again, I grabbed the
> nearest books; I have ten books, by ten different publishers,
> including two in Esperanto and one in a mixture of Esperanto and
> Chinese, and with the exception of one of the English books which
> right-justifies its title page, they all follow the basic format of
> centered pages, title (new line) author (bottom of page) publisher.
> None of them look a darn thing like the title pages PGTEI prints out.

Ohh. Pleeeeease!

Go here:

   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf

and tell me what you don't like about the title page.

And then go here:

   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html

to verify that it looks the same in HTML.

And then go here:

   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt

to see how it looks in TXT.


All from ONE and the same TEI master.


> If I saw a single document produced from PGTEI that was
> suitable for end-user consumption, I might support it.

   http://www.gnutenberg.de/pgtei/0.5/examples/


> The librarian is never the end-user. The librarian is the person who
> makes it available to the end-user. Nobody around here cares about the
> linguistic researcher as the end user, and we will never produce files
> that are marked up with the type of information--like distinguishing
> sentence ending punctuation from the same punctuation used other
> ways--that they need. The end user we're targeting is the reader.

(Distinguishing punctuation is very important for typesetters.)

YOU are targeting the reader that reads on a desktop browser.

I am targeting everybody on every platform of every size and every 
software that might want to use or convert our books in any way 
imaginable or not yet imaginable.


> Yes, in fact, some PPers do want to produce an etext that replicates
> the original, includes the important illustrated dropcaps (that are
> frequently as much a part of the illustration of the book as any other
> illustration) and page numbers (that are crucial for much of the
> non-fiction that we reproduce, especially if you want to follow the
> web of references from one PG era book to another.)

And while they are busy `replicating the original? they miss all 
opportunities of electronic text.


Eg. the index entries are still linked to the *page* they reference, 
while it was technically possible for decades now to go directly to the 
word. So if the reader clicks on an indexed term, she must read all the 
page until she finds the reference instead of going directly to the 
reference (and maybe have it highlighted like on Wikipedia).

This opportunity of making the books more accessible has been missed 
because DP is still producing electronic facsimiles instead of 
electronic books.


Eg. speaker tagging. In a few years when everybody will have speech 
syntesis on their cell phones ebook readers people may want to listen to 
their books while driving. If you have quotes marked up you can assign 
different voices to different speakers.


Eg. geografic tagging. While visiting someplace you may want to find all 
book references that refer to the place you are in.


DP misses out again and again.


But they make pretty facsimiles ...


> And if you had produced TEI output that could do what people wanted to
> do, it's possible that we would have better output on the ePub
> devices. 

If people had started using TEI instead of griping endlessly about minor 
shortcomings, we might have now a complete TEI workflow in place.


> Right now, I would be surprised to find that PGTEI can output
> at all to ePub, and I wouldn't be surprised if the people who produced
> the DP output were happier with the results of their HTML translated
> to ePub than your HTML translated to ePub.

PGTEI outputs just fine to ePub. Just take the HTML output and convert 
it in Calibre or whatever you are using.

Look here (this is PDF, not ePub):

   http://www.gnutenberg.de/pgtei/0.5/examples/pgtei-pdf-sony-reader.jpg


-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Wed Sep 16 11:31:13 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 14:31:13 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c91.48d074f8.37e288f1@aol.com>

jim said:
>   As Bowerbird is only too happy to point out: 
>    ?Please feel free to go somewhere else!?
 ?
jim, jim, jim.   i'm just about the only person here
who has any sympathy with what you are saying,
and who is interacting with you in good faith, so
why you be stabbin' me in da back like that?

don't put words in my mouth.
i never said anything like that.

indeed, i have often advocated that p.g. e-texts
should retain the linebreaks from the original,
just like you've advocated, for the same reason.

likewise, long before anyone else cared about it,
i argued that lines which should not be rewrapped
should be prefaced with one or more spaces, so
they could easily be treated correctly during rewrap.

i've also asked, repeatedly, for the plain-text files
to include the names of the graphics, so that my
viewer-apps could display them at the right place,
which i assume is a practice you would also support.

moreover, i am the only person here who has set up
a website that people can use to unwrap the e-texts:
>   http://z-m-l.com/unwrap.pl

so don't be raggin' on me, ok?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/6ae2fc2d/attachment-0001.html>

From Bowerbird at aol.com  Wed Sep 16 11:57:29 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 14:57:29 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c5e.613267c2.37e28f19@aol.com>

jim said:
>    You make my point for me.

yes, i do, jim.   and i make your point
_much_better_ than you do, because
i don't do the sabotage thing to it first.


>    When one relies on automagical tools to try to 
>    recreate semantic information discarded by 
>    the PG TXT representation, more or less often 
>    one ends up with something that looks like sh*t ? 
>    your word not mine.

actually, i got the word from the dictionary,
so you should feel free to use it without me.

your tools are neither "auto" nor "magical" enough.

it is only when you've improved them to the point
that they cannot be improved any more that you
earn the right to bitch about what others are doing.


>    When the results break one is told 
>    ?Oh you did it wrong, you should 
>    have done something else instead.?

i didn't say "you did it wrong" as some kind of
mystical power intended to make you go away.

i said that you did it wrong because you did it wrong.

it ends up that, with the right tool, and if you do it right,
the p.g. e-text format works perfectly well, or at least it
_can_, if the whitewashers really did what they say they do.

and sometimes they do.

for instance, if you would have taken your "hamlet" text
from the newest version in the library, and put it into my
unwrap site listed above, you'll see that it works just fine.

so there is no _shortcoming_ of the p.g. plain-text format
that needs to be "overcome".   there are only some _flaws_
-- a portion of which seem to be intentionally inflicted --
which need to be corrected, so that the format can shine...


>    once one relies on human intervention to ?fix? the problem 
>    when a particular algorithm breaks, then one does not have 
>    an automatic algorithm.   

i agree.   but that's not what is at issue here.


>    Ultimately what one should do if one wants to ?get it right? 
>    is to abandon attempts at automagical tools which work 
>    sometimes and end up looking like sh*t other times 
>    and instead take the PG TXT file, take the original page scans, 
>    look at the page scans to figure out where the PG TXT files 
>    gratuitously entered line breaks where the author didn?t 
>    intend line breaks, and take them back out.

see, jim, here's where you get things half-right-but-kinda-wrong.

you just haven't thought through these things well enough so that
you can explain them clearly, so it comes out in this mumbo-jumbo.


>    After the gratuitous page breaks are taken back out 
>    (the work of a few days ? trust me on this!) 

again, you're severely unclear here.

(and please, please, please, if anyone thinks that jim _is_ being
"clear", do jump in and say so and help provide an explanation.)


>    then one can either, if one has a machine, such as a teletype, 
>    incapable of reflow, run the now gratuitous-line-break free 
>    TXT back through a simple unambiguous algorithm to insert 
>    a line break at the appropriate point for your machine

ok, here's a relatively straightforward description of the process.

but, really, jim, there's no need for it.   we programmers _know_
how to do this.   it's not difficult.   the guy who coded "eucalyptus"
did a fine job on doing this, and he is using the p.g. text-files, so
they don't really present the insurmountable problem you think...


>    Or tolerate slightly ugly word spacing on machines that 
>    force right justify (sigh.)   Better yet, we should ask our 
>    technologist friends to include not only reflow but also 
>    automatic hyphenation routines in our machines.

again, not to beat a dead horse, but eucalyptus does ragged-right
or justification, whichever the user prefers, and hyphenation too,
so everything you're asking for has already been done at least once.

rather than harping about the format -- which does just peachy,
thank you very much -- you need to complain about the coders
who are not giving you the type of tools you would like to have...


 

>    Is it too much, for example, to ask PG to provide the option 
>    to the rare user who actually WANTS line breaks at char 72, 
>    or for that matter actually wants line breaks at char 20, 
>    is it too much to ask PG to provide a filter to insert such 
>    ?gratuitous? line breaks?   Consider: PG *already* provides 
>    literally 40 different such filter programs to help people 
>    with various strange obscure legacy machines.

more sloppy thinking, jim.   what do you really _mean_
when you say "provide the option" or "provide a filter"?

i take it to mean "give your users a tool that does that".

and if you started saying it that way, you'd come to realize that
your beef is not with the p.g. text-file format at all, but rather
the fact that p.g. isn't supplying users with the tools we need...

-bowerbird

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/ab3e6b88/attachment.html>

From Bowerbird at aol.com  Wed Sep 16 12:18:21 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 15:18:21 EDT
Subject: [gutvol-d] Re: z.m.l. can do what you want
Message-ID: <cdb.5a1f9861.37e293fd@aol.com>

jim, i'm losing patience with you, quickly, i warn you.

***

jim said:
>    I invite the other readers to compare 
>    the z.m.l. results for ?Scrooge? at:
>    http://www.z-m-l.com/go/vlconvert.html

first, the file at that address is volatile,
depending on which .html conversion
was last made via on button-click on
the page which is located over at:
>    http://www.z-m-l.com/go/vl3.pl

so, to ensure you've got "scrooge" at the
address listed above, click on that button.
(that'll be the "a christmas carol" button.)


>    I invite the other readers to compare 
>    the z.m.l results for ?Scrooge? at:
>    http://www.z-m-l.com/go/vlconvert.html
>    to the PG TEI results for ?My Antonia? at:
>    http://www.gutenberg.org/files/19810/19810-pdf.pdf

ok, do you understand that is comparing
a converted .html file to a converted .pdf?

if you want to compare, compare .html to .html,
and .pdf to .pdf, because that makes some sense.

oh, and by the way, picking "my antonia" was
not a wise move, because i've done extensive
research work on that particular digitization.

so if you want me to onslaught it at you, let me know.


>    However, I will point out that neither markup has 
>    the proper support necessary to render 
>    correct EPUB nor MOBI file formats.

show me the "proper" markup to accomplish the mobi,
and i will edit the template so you can get that markup.

then have the .tei guys do what they would need to do.
because i'd love to see the mobi they get from their .tei.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/ff5e45f0/attachment.html>

From gbnewby at pglaf.org  Wed Sep 16 12:20:43 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Wed, 16 Sep 2009 12:20:43 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AB126D2.2050706@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>
Message-ID: <20090916192042.GA9297@pglaf.org>

> Go here:
>
>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf

I'd like to support what Marcello wrote, below.  I've long believed that
having TEI be the native output from Distributed Proofreaders is
desirable.  My understanding is they just don't have the available
person-power to implement this.

When a TEI eBook is submitted to the whitewashers, we have a very nice
processing stream (with the pieces mentioned below) to very easily
produce .txt, .htm and anything else we might want.  If we had enough of
eBooks with TEI as the native format, we could add transformation
options to www.gutenberg.org's catalog pages, to truly provide "your
book, your way."

There's no lack of ability to produce, transform or otherwise work with
TEI files.  As someone pointed out, the DP proofreading is essentially
agnostic about the back-end encoding format.  The postprocessors might
see some variation in the workflow, but would not necessarily need to
work directly with TEI markup.

I think the existing software and examples are compelling.  If there was
an easier way of getting TEI embedded into the DP workflow, it would
have happened by now.

  -- Greg

On Wed, Sep 16, 2009 at 07:56:34PM +0200, Marcello Perathoner wrote:
> David Starner wrote:
>
>
>> But we know HTML, for one. We also have tools that help us with HTML,
>> for two. For three and the strike-out, I have a host of tools that
>> will help me edit, verify and view HTML, but there is no Debian
>> packages for PGTEI.
>
> Where's the debian package for guiguts? I had to actually edit the code  
> to make it run on my debian/unstable.
>
> nxml-mode in emacs is all you'll ever need to edit and validate xml.
>
> Or use the TEI stylesheets in OpenOffice, if you must needs have  
> WYSIWYG. Sheesh!
>
>
>>> PG has an implementation of TEI. I know you don't like it because you
>>> haven't figured out how to produce pretty title pages.
>>
>> Note: by "pretty title pages" Marcello means a title page that looks
>> like any title page in an actual book. Once again, I grabbed the
>> nearest books; I have ten books, by ten different publishers,
>> including two in Esperanto and one in a mixture of Esperanto and
>> Chinese, and with the exception of one of the English books which
>> right-justifies its title page, they all follow the basic format of
>> centered pages, title (new line) author (bottom of page) publisher.
>> None of them look a darn thing like the title pages PGTEI prints out.
>
> Ohh. Pleeeeease!
>
> Go here:
>
>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf
>
> and tell me what you don't like about the title page.
>
> And then go here:
>
>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html
>
> to verify that it looks the same in HTML.
>
> And then go here:
>
>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt
>
> to see how it looks in TXT.
>
>
> All from ONE and the same TEI master.
>
>
>> If I saw a single document produced from PGTEI that was
>> suitable for end-user consumption, I might support it.
>
>   http://www.gnutenberg.de/pgtei/0.5/examples/
>
>
>> The librarian is never the end-user. The librarian is the person who
>> makes it available to the end-user. Nobody around here cares about the
>> linguistic researcher as the end user, and we will never produce files
>> that are marked up with the type of information--like distinguishing
>> sentence ending punctuation from the same punctuation used other
>> ways--that they need. The end user we're targeting is the reader.
>
> (Distinguishing punctuation is very important for typesetters.)
>
> YOU are targeting the reader that reads on a desktop browser.
>
> I am targeting everybody on every platform of every size and every  
> software that might want to use or convert our books in any way  
> imaginable or not yet imaginable.
>
>
>> Yes, in fact, some PPers do want to produce an etext that replicates
>> the original, includes the important illustrated dropcaps (that are
>> frequently as much a part of the illustration of the book as any other
>> illustration) and page numbers (that are crucial for much of the
>> non-fiction that we reproduce, especially if you want to follow the
>> web of references from one PG era book to another.)
>
> And while they are busy `replicating the original? they miss all  
> opportunities of electronic text.
>
>
> Eg. the index entries are still linked to the *page* they reference,  
> while it was technically possible for decades now to go directly to the  
> word. So if the reader clicks on an indexed term, she must read all the  
> page until she finds the reference instead of going directly to the  
> reference (and maybe have it highlighted like on Wikipedia).
>
> This opportunity of making the books more accessible has been missed  
> because DP is still producing electronic facsimiles instead of  
> electronic books.
>
>
> Eg. speaker tagging. In a few years when everybody will have speech  
> syntesis on their cell phones ebook readers people may want to listen to  
> their books while driving. If you have quotes marked up you can assign  
> different voices to different speakers.
>
>
> Eg. geografic tagging. While visiting someplace you may want to find all  
> book references that refer to the place you are in.
>
>
> DP misses out again and again.
>
>
> But they make pretty facsimiles ...
>
>
>> And if you had produced TEI output that could do what people wanted to
>> do, it's possible that we would have better output on the ePub
>> devices. 
>
> If people had started using TEI instead of griping endlessly about minor  
> shortcomings, we might have now a complete TEI workflow in place.
>
>
>> Right now, I would be surprised to find that PGTEI can output
>> at all to ePub, and I wouldn't be surprised if the people who produced
>> the DP output were happier with the results of their HTML translated
>> to ePub than your HTML translated to ePub.
>
> PGTEI outputs just fine to ePub. Just take the HTML output and convert  
> it in Calibre or whatever you are using.
>
> Look here (this is PDF, not ePub):
>
>   http://www.gnutenberg.de/pgtei/0.5/examples/pgtei-pdf-sony-reader.jpg
>
>
> -- 
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

From Bowerbird at aol.com  Wed Sep 16 12:36:41 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 15:36:41 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <ce3.57246ebf.37e29849@aol.com>

jim said:
>    So the ?solution? is that all readers should 
>    buy an iphone and run ?eucalyptus?

no.

the "solution" is to understand that a viewer-program
like eucalyptus can be programmed on _any_ platform,
designed to take the project gutenberg plain-text files
and display them in a beautiful (and powerful) manner.

the "solution" is not to debunk that plain-text format
-- which is what you seem to be wanting to do here --
and _certainly_ not invent a new cockamamie format,
but rather to patch the small inconsistency problems
that haunt the library so that the plain-text files are
dependable and reliable in terms of delivering beauty.


>    I can?t disagree with that ? it IS relatively trivial 
>    to render attractive text if one knows one is 
>    rendering to only one particular machine.

have you ever done any programming, jim?   and,
in particular, have you ever done e-book coding?

even though eucalyptus is "rendering to only one
particular machine", it allows the end-user to pick
the font-size, which requires rewrapping the text.

in addition, many apps let you switch to landscape,
which means you must code for two screen-sizes...

it's not as easy as you make it sound to make an app,
even if it's just "for one particular machine".   however,
once you've made such an app, it's not that difficult to
port it to another machine, or to another language, or
to hack it for some specific purpose you only need today,

or to modify it to fit your own personal preferences, or...

but the important thing to remember as far as this thread
is that the "vanilla" .txt format used by project gutenberg
is extremely close to being totally sufficient as a file-format.
it just needs to have a few ambiguous situations cleaned up,
and then the whole library needs to undergo quality control
because there are some rather glaring inconsistencies there.

but we don't need a new format...   never have...   never will...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/f25276b4/attachment-0001.html>

From Bowerbird at aol.com  Wed Sep 16 12:41:27 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 15:41:27 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <c2a.4cc7854b.37e29967@aol.com>

greg wants the distributed proofreaders to jump through
all kinds of _unnecessarily_ difficult hoops, just so that
project gutenberg can _supposedly_ get the benefits that
i've already _proven_ can be obtained from the .txt format.

greg went to library school.   he's supposed to be smart.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/0905ba1b/attachment.html>

From Bowerbird at aol.com  Wed Sep 16 12:54:32 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 16 Sep 2009 15:54:32 EDT
Subject: [gutvol-d] david and marcello
Message-ID: <cba.3f847803.37e29c78@aol.com>

both david starner and marcello
are relegated to my spam folder,
but it's nice to see them fighting.

marcello has convinced himself
that pgtei is the next big thing,
and has been ever since 2001...

unfortunately for mr. marcello,
he hasn't persuaded d.p. people.
they tried .tei, and ran screaming
from the scene due to the stench.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090916/1c118166/attachment.html>

From joyce.b.wilson at sbcglobal.net  Wed Sep 16 13:30:52 2009
From: joyce.b.wilson at sbcglobal.net (Joyce Wilson)
Date: Wed, 16 Sep 2009 15:30:52 -0500
Subject: [gutvol-d] Ebook 30000!  And some sort of catalog milestone
Message-ID: <4AB14AFC.9040603@sbcglobal.net>

I saw today that ebook 30000 has been posted!  Congratulations!

And in other (more self-serving) good news, I see we have 14271 books 
with no subjects in their bib records and 12062 with no LoCC in their 
bib records, so I guess it's no longer the case that "more than half of 
our books don't have subject information added" and "more than one half 
of our books don't have LoC info" as the "Help on Bibliographic Record 
Page" says.

: -)
Joyce Wilson
PG Cataloging Team

From prosfilaes at gmail.com  Wed Sep 16 22:47:03 2009
From: prosfilaes at gmail.com (David Starner)
Date: Thu, 17 Sep 2009 01:47:03 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AB126D2.2050706@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>
Message-ID: <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com>

On Wed, Sep 16, 2009 at 1:56 PM, Marcello Perathoner
<marcello at perathoner.de> wrote:
> Where's the debian package for guiguts?

There's not one, but that's why it's called in-house code and we can
talk to the programers if we need help. But there are several Debian
packages for programs that can check and display HTML.

>>> PG has an implementation of TEI. I know you don't like it because you
>>> haven't figured out how to produce pretty title pages.
>
> Ohh. Pleeeeease!

So you attack people for having made complaints that were perfectly
valid when they were made? Unless you're going for the martyr award, I
hardly see how that's productive.

> Eg. the index entries are still linked to the *page* they reference, while
> it was technically possible for decades now to go directly to the word.

If they are still linked to the page instead of the word, it's because
the PPer looked at a 50 page index and decided that there was no way
they were going to wade through there and try and find where on the
page the link was intended to go to for 20,000 references. HTML and
TEI are no different here.

> Eg. geografic tagging. While visiting someplace you may want to find all
> book references that refer to the place you are in.

Maybe. There's a very real question whether it's worth the man-power
to mark this up, and it's really a bit of a gratuitous feature.

> If people had started using TEI instead of griping endlessly about minor
> shortcomings, we might have now a complete TEI workflow in place.

If you had listened to the needs of the people who you wanted to start
using TEI instead of bitching about them and their requirements, maybe
they would have started using TEI.

-- 
Kie ekzistas vivo, ekzistas espero.

From schultzk at uni-trier.de  Thu Sep 17 00:05:17 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 17 Sep 2009 09:05:17 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AAF5C48.1040201@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>
	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<4AAF5C48.1040201@perathoner.de>
Message-ID: <65729AE2-DCF6-4844-BF5D-49A8809242D9@uni-trier.de>

Hi There,

Am 15.09.2009 um 11:20 schrieb Marcello Perathoner:

> Keith J. Schultz wrote:
>
>> We need a a format that is not based on an existing format, ...
>
> Why not?
	Very simply. Basically, most formats have a particular output
	in mind! Furthermore they are far too complex. The idea is to
	markup the book text in a way that we can extract its structure
	and features. Then depending on the the output format
	is created.


>
>
>> ... but we want a representation that contains as much information as
>> possible.
>> It should only take about a month to create such a format.
>
> ROTFL
	I said to create such a format. I did not say create the tools for  
creating
	output formats. Which is the actual crux if you have been trying to  
follow
	this thread. Also, you need tools for getting the scan into this  
format from scans
	which should be done mostly by a computer inorder to save time.

	regards
		Keith.

From schultzk at uni-trier.de  Thu Sep 17 00:55:48 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 17 Sep 2009 09:55:48 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>
	<6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com>
Message-ID: <FF85895B-5A65-4923-BF6B-F72C5A23B663@uni-trier.de>

Hi Their,

	I have look at TEI, also the way things SHOULD be
	encoded and said NO WAY!!

	Fasr to complicated. As I have mention here time and time
	again a output format should not be presupossed.

	The layout of a page is not that hard to markup.

	regards
		Keith.


Am 17.09.2009 um 07:47 schrieb David Starner:

> On Wed, Sep 16, 2009 at 1:56 PM, Marcello Perathoner
> <marcello at perathoner.de> wrote:
>
> Maybe. There's a very real question whether it's worth the man-power
> to mark this up, and it's really a bit of a gratuitous feature.
>
>> If people had started using TEI instead of griping endlessly about  
>> minor
>> shortcomings, we might have now a complete TEI workflow in place.
>
> If you had listened to the needs of the people who you wanted to start
> using TEI instead of bitching about them and their requirements, maybe
> they would have started using TEI.
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090917/d43ed3d5/attachment.html>

From schultzk at uni-trier.de  Thu Sep 17 00:55:52 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 17 Sep 2009 09:55:52 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AB0AD87.4050704@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>	<4AAF5C48.1040201@perathoner.de>	<SNT120-DS20DA612BC506D0414E14B5AEE20@phx.gbl>
	<DAD3C651E3644ED4B559863197320FD9@alp2400>
	<4AB0AD87.4050704@perathoner.de>
Message-ID: <11EF7709-07DD-4278-9394-CDCCAD76D11D@uni-trier.de>


Am 16.09.2009 um 11:19 schrieb Marcello Perathoner:

> Al Haines (shaw) wrote:
>
>> It's clearly stated in PG Volunteers' FAQ V.89 (http://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V.89._Are_there_any_places_where_I_should_indent_text.3F 
>> ) that if you want to prevent unwanted wrapping, lines that should  
>> not be wrapped should be indented a space or two.
>
> This `markup? does not distinguish between poetry and a block quote.  
> A block quote should be indented *and* rewrapped.
	It depends on what is considered desirable
>
>
> And the Rewrap Blues is only part of the problem ...
	

>
> Another formidable challenge is to recover the chapter headings and  
> other headings to make them stand out and to build a TOC.
	I have to disagree here. Any fourth grader can do it. There are  
certain rules which one can follow.
	It will not handle all possible cases, yet most. But, then again that  
what proofers can handle easily.

regards
	Keith.

From schultzk at uni-trier.de  Thu Sep 17 00:55:55 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 17 Sep 2009 09:55:55 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>
	<SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>
	<SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>
Message-ID: <E9B368EB-8053-4EAE-9E4A-E2D8A94CF249@uni-trier.de>

Hi There,

Am 16.09.2009 um 08:12 schrieb James Adcock:

> As an example of how much author semantic information is lost going  
> from an
> author's writing to PG txt format, I went and compared differences  
> between a
> recent HTML and PG TXT I did -- where after doing the TXT encoding I  
> went
> back and did three more passes over the images to add back in semantic
> differences to the HTML that the PG TXT didn't represent.
	The problem is that there are very few systems that truely represent
	semantic content. Inorder to truely represent such information you
	have to know about it. This requires one to have aditional information
	which is know as "world knowledge". This information is provided  by  
the
	reader of books.
>
> Now the reality would be that it would take say TEI not HTML to  
> represent
> all of the author's intent.  But measuring the loss going from HTML  
> back to
> TXT gives an order of magnitude estimate of how much author  
> information we
> are throwing away by representing a work in PG TXT.  In the case of  
> this
> book, the answer was more than 1000 "losses" -- or an average of  
> about 3
> losses per page.  And this is NOT counting about an addition 1000  
> losses in
> representation of emphasis.
	This problem is a matter of complexity. That is even in pure Vanilla  
Text
	one can reprensent these intentions, but one loses readablity.  
Furthermore
	one has to make assumptions of the true intent of the author!!

regards
	Keith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090917/4b38471a/attachment-0001.html>

From marcello at perathoner.de  Thu Sep 17 04:25:12 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 17 Sep 2009 13:25:12 +0200
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>	<4AAF5A12.8040505@perathoner.de>	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>	<4AB126D2.2050706@perathoner.de>
	<6d99d1fd0909162247o504cc5bcy86ac80593ec2d809@mail.gmail.com>
Message-ID: <4AB21C98.4020901@perathoner.de>

David Starner wrote:

>> Eg. geografic tagging. While visiting someplace you may want to find all
>> book references that refer to the place you are in.
> 
> Maybe. There's a very real question whether it's worth the man-power
> to mark this up, and it's really a bit of a gratuitous feature.

Gratuitous to people who have no vision. People who think they are 
`preserving?, while they are only consigning to rot on a different medium.

You take a book from a dusty bookshelf, digitize it, and put it on a 
file server. You have taken content expressed in technology of 500 years 
ago and `updated? it to technology of 20 years ago.

Today its all mobile devices. Ebooks have to come along in your shirt 
pocket or die.


Wikipedia is doing it:

   http://en.wikipedia.org/wiki/File:Wikitude3.jpg

There are many travel books in PG that could be marked up like that.



-- 
Marcello Perathoner
webmaster at gutenberg.org

From joyce.b.wilson at sbcglobal.net  Thu Sep 17 04:50:36 2009
From: joyce.b.wilson at sbcglobal.net (Joyce Wilson)
Date: Thu, 17 Sep 2009 06:50:36 -0500
Subject: [gutvol-d] "PG volunteer lounge" list at Yahoo Groups
Message-ID: <4AB2228C.4010304@sbcglobal.net>

In the spirit of the world-famous DP spa, there is now a "Project 
Gutenberg volunteer lounge" list at Yahoo Groups:

http://groups.yahoo.com/group/PG_vol_lounge/

Description: A friendly, supportive, and civil forum for Project 
Gutenberg volunteers. Will be moderated as needed to keep it that way.

Group Email Addresses:
Post message: PG_vol_lounge at yahoogroups.com
Subscribe: PG_vol_lounge-subscribe at yahoogroups.com
Unsubscribe: PG_vol_lounge-unsubscribe at yahoogroups.com
List owner: PG_vol_lounge-owner at yahoogroups.com

From marcello at perathoner.de  Thu Sep 17 05:30:25 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 17 Sep 2009 14:30:25 +0200
Subject: [gutvol-d] Re: "PG volunteer lounge" list at Yahoo Groups
In-Reply-To: <4AB2228C.4010304@sbcglobal.net>
References: <4AB2228C.4010304@sbcglobal.net>
Message-ID: <4AB22BE1.9060600@perathoner.de>

Joyce Wilson wrote:

> In the spirit of the world-famous DP spa, there is now a "Project 
> Gutenberg volunteer lounge" list at Yahoo Groups:
> 
> http://groups.yahoo.com/group/PG_vol_lounge/
> 
> Description: A friendly, supportive, and civil forum for Project 
> Gutenberg volunteers. Will be moderated as needed to keep it that way.

... and you can bring your baby too:

   http://groups.yahoo.com/group/dpmoms/


-- 
Marcello Perathoner
webmaster at gutenberg.org

From jimad at msn.com  Thu Sep 17 10:25:20 2009
From: jimad at msn.com (James Adcock)
Date: Thu, 17 Sep 2009 10:25:20 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <c91.48d074f8.37e288f1@aol.com>
References: <c91.48d074f8.37e288f1@aol.com>
Message-ID: <SNT120-DS165A43E83C65DE9247D12FAEE10@phx.gbl>

I retract the words I put into Bowerbirds mouth ? I must have misinterpreted his intent in some of the things he emailed ? welcome to email.

 

How about as a practical suggestion to at least make *some* progress forward out of this morass, how about if we agree that HTML be *one* of the sufficient file formats by itself for inclusion into PG, without the need to also submit PG TXT format?  Since Bowerbird claims it is easy to go from PG TXT to HTML, then certainly it is as easy to go from HTML to PG TXT ? and Bowerbird, or someone -- can provide a ?Make PG Happy? automagical tool that will ?properly? encode PG TXT indentation for verse in order to encode ?please don?t wrap me? and can make PG-Happy decisions about how exactly to wrap at ?72 chars? without making the underlying text too ugly (not a trivial issue in my experience) and can decide what ligatures in what embedded languages should be broken into two chars, and what underlying PG TXT char encoding should be used to make the best tradeoffs between maintaining the glyphs the original author used vs. ?how low can you go? backwards compatibility with the various teletype emulator programs in use worldwide. Etc.


Because, frankly, as a volunteer, these issues nauseate me.  It is not my cup of tea.  I would much rather put my time and effort into trying to do ONE reasonably good encoding of a real-world honest to god book published by some real-world publisher preferably during the lifetime of the author so that hopefully Michael will not continuously make the argument that publishers never respect the intent of the author anyway. [In my experience the first and second editions publishers of John Muir ?First Summer? DID do to a very good job of representing in printed form the style and flavor of the hand-written camp notebooks Muir made during that summer and on the contrary it is the mechanizations of DP in trying to follow the coding conventions of PG that discards this intent? so I believe it is Michael who is making excuses for *PG* being the publisher who doesn?t respect the original intent of the author by requiring PG coding conventions be respected uber alle!]

 

And I do not pretend to be ?perfect? in my choices of encoding these books ? which is *precisely* why I would like to have an acceptable input acceptance format that is not ?write once? but could be picked up and improved by another volunteer in the future ? perhaps one who say has a photocopy of John Muir?s handwritten camp notes at hand, and can perform a Ph.D. thesis-level encoding of what Muir ?really meant to say? perhaps using the full power of say TEI.

 

>jim said:
>>   As Bowerbird is only too happy to point out: 
>>   ?Please feel free to go somewhere else!?
 
>jim, jim, jim.  i'm just about the only person here
>who has any sympathy with what you are saying,



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090917/6f7d6544/attachment.html>

From jimad at msn.com  Thu Sep 17 11:09:26 2009
From: jimad at msn.com (James Adcock)
Date: Thu, 17 Sep 2009 11:09:26 -0700
Subject: [gutvol-d] Re: z.m.l. can do what you want
In-Reply-To: <cdb.5a1f9861.37e293fd@aol.com>
References: <cdb.5a1f9861.37e293fd@aol.com>
Message-ID: <SNT120-DS12B2EA673DDC673FE1EDFEAEE10@phx.gbl>

>ok, do you understand that is comparing
>a converted .html file to a converted .pdf?



Yes, because in either case we are examining the possible *output* rendering file format *rendered results* currently accomplished by an advocate of a particular *input* encoding file format.  Assuming that a particular output formatting rendering software or hardware is available on a particular hardware machine.  My currently favorite hardware machine has built-in very good support for PDF, weak support for PG TXT, and little or no support for HTML (unless I read that HTML ?on line? using the machine?s weak web browser)  So in practice, for a given hardware machine that I choose to use then the choice of output file rendering format becomes a non-issue ? as long as the hardware machine supports it.  If the hardware machine doesn?t support it, then I have to find software that renders one output rendering file format to a different rendering file format [running that cross-rendering software on a difference machine which does support the cross-rendering software] ? which ALMOST ALWAYS in practice causes considerable semantic loss of author?s original intent, plus excessive ugliness.  Which is why we would like a strong input encoding file format, one which is NOT overly concerned with how the ink get rendered on the display, so that we can avoid the problem of having to cross-render output rendering file formats.  As a reader, I no more care if HTML or PDF is the output rendering file format than in the choice of rendering graphics language the computer display card or embedded graphics chip eventually runs.  As a reader I just care how readable vs. how ugly the resulting ink on the display ends up.

 

>show me the "proper" markup to accomplish the mobi,
>and i will edit the template so you can get that markup.



I think what the ?proper? markup is, is what we are trying to discuss.  MOBI, and EPUB, have a concept of a Spine, which requires information not typically included in current markups, such as proper identification of author first name, last name.  One possible markup from OPF showing one (pretty good imho) way this can be done is:

 

<dc:creator opf:file-as="King, Martin Luther Jr." opf:role="aut">

        Rev. Dr. Martin Luther King Jr.

</dc:creator>

 

This spine information is used, for example, to provide the reader the option of listing his/her books alphabetically by author last name.

 

And I am sure that someone is now sure to claim that automagical tools can be created to correctly extract this information, but I would hope that these famous author name examples would be enough to persuade you otherwise:

 

Sun Tzu

Miguel de Cervantes

Marquis de Sade

 

And I am sure someone else will claim that this information can automagically be provided by the PG database itself, but again perusal of how author names are currently being encoded in the PG database *ought* to be enough to dissuade people that spine information can be correctly provided automagically from that location.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090917/0d080f47/attachment-0001.html>

From jimad at msn.com  Thu Sep 17 11:54:00 2009
From: jimad at msn.com (Jim Adcock)
Date: Thu, 17 Sep 2009 11:54:00 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <ce3.57246ebf.37e29849@aol.com>
References: <ce3.57246ebf.37e29849@aol.com>
Message-ID: <SNT120-DS1177D04752E43611935A96AEE10@phx.gbl>

>the "solution" is to understand that a viewer-program like eucalyptus can be programmed on _any_ platform, designed to take the project gutenberg plain-text files and display them in a beautiful (and powerful) manner.

I challenge you to write such a program to run on _my_ choice of platform: Kindle DX




From jimad at msn.com  Thu Sep 17 12:53:34 2009
From: jimad at msn.com (Jim Adcock)
Date: Thu, 17 Sep 2009 12:53:34 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <4AB126D2.2050706@perathoner.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>	<4AAF5A12.8040505@perathoner.de>	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>
Message-ID: <SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>

I personally like Marcello's efforts pretty well, but let me accept his challenge and use his examples as examples of the problems that I *personally* find as a reader of PG texts -- that I *in reality* find with PG's current efforts -- as well as examples of the need for better input markup languages than we currently are using:

> Go here:

>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-pdf.pdf

>and tell me what you don't like about the title page.

What I don't like about the title page is that it doesn't show up correctly on my choice of machine, because my choice of machine assumes the existence of spine information.  Thus the "Title" shows up on my machine as "4650-pdf" and "Author" shows up as "4650-pdf"  So when I come back to my machine two weeks from now and search for this book by title, I cannot find it.  And when I search for it by author, I still cannot find it.

Other than that, this PDF text, to my surprise, shows up beautifully on my machine.  I would, in practice, be willing to read this text.  The choice of sans-serif font looks weird, and I would like to be able to change this choice of font, but of course I can't because this is PDF.  Other than that, I would be happy to read this as a book representing a good effort from PG.

Further, I would be able to download this file via the airwaves while waiting stuck at an airport, for example, and read this book there.

In my opinion these results well-represent PG as an electronic publishing house.

>And then go here:

>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-h.html

>to verify that it looks the same in HTML.

I can verify that it neither looks the same nor even shows up on my choice of machine at all, because my machine doesn't support HTML as a native file format.  I can, if I am lucky, access this file via the airwaves using the machine's built-in web browser while waiting stuck at the airport, but I cannot store the results as a file, because my machine doesn't support HTML as a built-in file type. So I can read it on the ground, but I probably won't be able to read it in the air, and if I use my browser to access some other web site then I will probably lose this book. [Well, I take that back -- when I actually TRY to read this file via the airwaves as described above, it crashes my machine, requiring a hard reboot]

Assuming I am not at an airport, but rather at home with my desktop computers, I can spend about 5 minutes of my time running an output-file-format to output-file-format cross-rendering software to change this HTML to MOBI format, which IS a native file format of my reader machine.

The results then show up on my machine pretty beautifully.  Except since HTML lacks spine information the Title now shows up as "4650-h" and the Author now shows up as "4650-h"  Which means again, if I come back to my machine in two weeks, I will not be able to find this book.

However, other than that, I like these results -- now that I have cross-rendered HTML to MOBI.  The results are attractive, I CAN change font size.  The font displayed is an attractive and appropriate sarif font.  The pages reflow correctly.  The links work for navigation.  I can switch the machine to landscape mode and everything reflows correctly, supporting the capabilities of my machine.  This file format would in practice be my favorite choice of file formats for my machine -- even though I can only access it initially from my house via a desktop machine and I have to waste five minutes of my time translating output file formats. In my opinion these results well represent PG as an electronic publishing house.

>And then go here:

>   http://www.gnutenberg.de/pgtei/0.5/examples/candide/4650-0.txt

>to see how it looks in TXT.

To my surprise, I CAN take this UTF-8 TXT formatted file, transfer it to my favorite machine, and it DOES open up correctly interpreting the UTF-8 encoding [You learn something new every day!]  This file also lacks spine information, so now Author information shows up as "4650-0" and Title shows up as "4650-0" which means once again, if I come back to this machine in two weeks, I will not be able to find this book. Since this file was rendered char72 under the assumption of a fixed pitch font, and since my machine doesn't use fixed pitch fonts, the end result looks silly and amateurish. The "Printers Ornament" renders as laughable junk.  The fixed char72 line breaks make the text in practice unreadable unless I choose an impossibly tiny font -- which then still makes the text in practice unreadable.  Gratuitous underscores are sprinkled liberally "everywhere" in the text making the text an unreadable hash.  I would not read this text if paid $100 to do so. If I paid good money for this text I would ask for double-my-money back. This is my least favorite file format.  Further, it also lacks spine information, meaning that again the Author now displays as "4650" and the Title displays as "4650" which means, again, that if I came back to this machine again in two weeks I will not be able to find this book -- which in this case would be a *blessing* !  In my opinion, if I were a first-time "customer" of PG who makes the mistake of choosing this file format to download to read on my brand of machine, I would conclude that PG consists of a bunch of clueless clowns and I would never return to the PG site again.

My Opinions Only -- but I would hope this illustrates how IN PRACTICE a real-world customer's opinion of PG will be filtered through the perception of their choice of reading machine -- and in turn how well WHICH choice of PG file formats they happen to choose to download matches the capabilities of their machine. And without the spine information, none of this really works well with my machine.



From i30817 at gmail.com  Thu Sep 17 13:13:24 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 17 Sep 2009 21:13:24 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl> 
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl> 
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> 
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
Message-ID: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>

Just a little input about text files and charsets (encodings), since i
had to use it for my program. Most browsers and applications open
these files in the correctly simply because someone (mostly mozilla)
did the hard work of making a fast guessing engine. I wouldn't be
amazed if it failed in some books.

From i30817 at gmail.com  Thu Sep 17 13:15:13 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 17 Sep 2009 21:15:13 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AAD7C04.5030806@perathoner.de> 
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de> 
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> 
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
Message-ID: <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com>

Also, i used the catalog information to get the title. Metadata in the
file name only is not a good way to encode this information, and
metadata inside the file, would require a special parser, everywhere.

From jimad at msn.com  Thu Sep 17 13:29:18 2009
From: jimad at msn.com (Jim Adcock)
Date: Thu, 17 Sep 2009 13:29:18 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <E9B368EB-8053-4EAE-9E4A-E2D8A94CF249@uni-trier.de>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>	<4AAD7C04.5030806@perathoner.de>	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>	<SNT120-DS25386B5D548426894BC3B4AEE40@phx.gbl>	<44414895897048CF95F99FE12CC881FD@alp2400>	<SNT120-DS140F3BC19C2962D04E64A7AEE30@phx.gbl>	<19FC003DA4EC4D75BE06767ACDE2ED17@alp2400>	<A25C3EF6-0ED9-4F11-A44A-580D94C896BB@uni-trier.de>	<SNT120-DS262D1CD6B1D1C64E0A329AEE30@phx.gbl>	<SNT120-DS14E4D1F62296090279D926AEE20@phx.gbl>
	<E9B368EB-8053-4EAE-9E4A-E2D8A94CF249@uni-trier.de>
Message-ID: <SNT120-DS16F180556C62EAF0D9DA00AEE10@phx.gbl>

>....Furthermore one has to make assumptions of the true intent of the
author!!

I'm not sure what the problem is if one has an <i>tag to indicate the
author's intent was rendered in the original book in italic</i> and a
<sc>tag to indicate the author's intent in the original book was rendered in
small-caps</sc> etc.?

On the contrary, the assumptions have to be made when the input markup
language and the output rendering file formats are required to be
one-and-the-same AND the rendering file format's power is less than that
used by real-world printers already 400 years ago.  Then the markup
transcriber is forced to interpret authors intent and how to compromise that
intent in order to make it fit within the constraints of the rendering
language -- which is being artificially constrained to be identical to the
input markup language.

If one had a input markup language that closely follows author's intent as
rendered by the original printer then the problem becomes how do you reduce
the strength of this markup to match the weaknesses of the output rendering
file format, and that in general is an issue of style that can be
represented in CSS for example.  Or hacked up by hand if and when absolutely
necessary.  But it still means that the previous round of volunteers efforts
are correctly and completely maintained in the input markup language text so
that the next round of volunteers can take another shot at the text some
time in the future.


From jimad at msn.com  Thu Sep 17 14:02:47 2009
From: jimad at msn.com (Jim Adcock)
Date: Thu, 17 Sep 2009 14:02:47 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com>
Message-ID: <SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>

PG catalog may be a reasonable way to get Title information.  As presently
implemented the PG catalog is not a reasonable source of Author Firstname,
Lastname information -- for multiple reasons!



From i30817 at gmail.com  Thu Sep 17 16:10:43 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 00:10:43 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AAE819C.1040108@perathoner.de> 
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> 
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
Message-ID: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com>

:) I had to make this loop for it (but then again i am indexing it,
and not seperating the names like you are, that takes a little bit
more work.)
The code tries to find a date in the last part of the string, and then
reorder all last name, first name multiple authors (i wanted normal
order)

Note the "possibles", and "hopefully" there, but i think it works for
all books i encountered. In fact the names i reported a while ago as
defective were found in errors in this method. It isn't that there is
no method, it's just extremely ... non-normalized.

    private final StringBuilder normalizeString = new StringBuilder();

    protected String normalizeName(String authorString) {
        int separator = authorString.lastIndexOf(',');
        //normal date seperator
        if (separator != -1) {
            String possibleDate = authorString.substring(separator + 1);
            for (int i = 0; i < possibleDate.length(); i++) {
                if (Character.isDigit(possibleDate.charAt(i))) {
                    //a date, hopefully...
                    return exchangeNames(authorString.substring(0, separator));
                }
            }
            //no date, but change the name anyway.
            return exchangeNames(authorString);
        }
        return authorString;
    }

    protected String exchangeNames(String authorString) {
        normalizeString.setLength(0);
        exchangeNamesAux(authorString);
        return normalizeString.toString();
    }

    private void exchangeNamesAux(String authorString) {
        int seperator = authorString.indexOf(',');
        if (seperator == -1) {
            normalizeString.append(authorString);
            return;
        }
        exchangeNamesAux(authorString.substring(seperator + 2));
        normalizeString.append(' ').append(authorString.substring(0,
seperator));
    }




On Thu, Sep 17, 2009 at 10:02 PM, Jim Adcock <jimad at msn.com> wrote:
> PG catalog may be a reasonable way to get Title information. ?As presently
> implemented the PG catalog is not a reasonable source of Author Firstname,
> Lastname information -- for multiple reasons!
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From i30817 at gmail.com  Thu Sep 17 16:23:02 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 00:23:02 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com> 
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com>
Message-ID: <212322090909171623y359a48b3xf5cdbdc59cc230f1@mail.gmail.com>

Correction, that is not multiple authors, but one per string + date.

From i30817 at gmail.com  Thu Sep 17 16:25:00 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 00:25:00 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171623y359a48b3xf5cdbdc59cc230f1@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AAF5A12.8040505@perathoner.de> 
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<212322090909171623y359a48b3xf5cdbdc59cc230f1@mail.gmail.com>
Message-ID: <212322090909171625v2e96f457n3efb52126244e6ff@mail.gmail.com>

BTW i found this in my catalog post processor / indexer.

                //can have \n stupidly...
                String titleString = title.stringValue().replaceAll("\n", " ");

:)

From jimad at msn.com  Thu Sep 17 16:30:02 2009
From: jimad at msn.com (James Adcock)
Date: Thu, 17 Sep 2009 16:30:02 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com>
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com>
Message-ID: <SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>

LOL -- and what pray tell do you get for the Author Lastnames in the
examples I gave using your algorithm?


>private final StringBuilder normalizeString = new StringBuilder();



From i30817 at gmail.com  Thu Sep 17 17:29:02 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 01:29:02 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AAF5A12.8040505@perathoner.de> 
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
Message-ID: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com>

Normaly a name tuple is like this :
Last names, first name , Date
It can also be like this
Last name, first name
or like this
Name
(for plato etc)

I strip out the optional date (I should change the deciding algorithm
to 2 non consecutive digits probably. Basically dates always seem to
have a digit there)
then exchange the first and last names if needed and join them again.

If you want to keep them separate you can make a domain object or a
list for that. BTW i just realized i don't need the recursion for
nothing. I might change it. It takes a good 1.5 m to index the
Gutenberg index even with a lot of hacks.

From sly at victoria.tc.ca  Thu Sep 17 17:33:13 2009
From: sly at victoria.tc.ca (Andrew Sly)
Date: Thu, 17 Sep 2009 17:33:13 -0700 (PDT)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS29C0FF764970413729C54AEE50@phx.gbl>
	<4AAD7C04.5030806@perathoner.de>
	<SNT120-DS19C2782A45FA93A003023EAEE40@phx.gbl>
	<4AAE819C.1040108@perathoner.de>
	<6d99d1fd0909141140m5cd64603t84a94d397699a813@mail.gmail.com>
	<4AAF5A12.8040505@perathoner.de>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com>
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
Message-ID: <Pine.GSO.4.58.0909171720580.25446@vtn1.victoria.tc.ca>


Are you talking about reading the files directly from gutenberg.org?
Files are served up with the encoding specified in the http header.
I don't know the technical details--Marcello set it all up.

--Andrew

On Thu, 17 Sep 2009, Paulo Levi wrote:

> Just a little input about text files and charsets (encodings), since i
> had to use it for my program. Most browsers and applications open
> these files in the correctly simply because someone (mostly mozilla)
> did the hard work of making a fast guessing engine. I wouldn't be
> amazed if it failed in some books.

From i30817 at gmail.com  Thu Sep 17 17:54:40 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 01:54:40 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<6d99d1fd0909160921m19175fa6td7b1f612c45360b2@mail.gmail.com> 
	<4AB126D2.2050706@perathoner.de>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com>
Message-ID: <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com>

Actually a little bit of disinformation there: I do need the
recursion. Titles (for instance) have an additional ",". Forgot.

On Fri, Sep 18, 2009 at 1:29 AM, Paulo Levi <i30817 at gmail.com> wrote:
> Normaly a name tuple is like this :
> Last names, first name , Date
> It can also be like this
> Last name, first name
> or like this
> Name
> (for plato etc)
>
> I strip out the optional date (I should change the deciding algorithm
> to 2 non consecutive digits probably. Basically dates always seem to
> have a digit there)
> then exchange the first and last names if needed and join them again.
>
> If you want to keep them separate you can make a domain object or a
> list for that. BTW i just realized i don't need the recursion for
> nothing. I might change it. It takes a good 1.5 m to index the
> Gutenberg index even with a lot of hacks.
>

From i30817 at gmail.com  Thu Sep 17 18:18:44 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 02:18:44 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com> <4AB126D2.2050706@perathoner.de> 
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com>
Message-ID: <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com>

Oh i see the obvious error now (finally).

How about a little different algorithm:
strip out the date, then take the
, suffix, prefix, sufix prefix until empty.

From i30817 at gmail.com  Thu Sep 17 18:33:09 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 02:33:09 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl> 
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com> 
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> 
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com>
Message-ID: <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com>

Possibly this? BTW thanks for spotting that.

    private String normalizeName(String authorString) {
        int separator = authorString.lastIndexOf(',');
        //normal date seperator
        if (separator != -1) {
            String possibleDate = authorString.substring(separator + 1);
            for (int i = 0; i < possibleDate.length(); i++) {
                if (Character.isDigit(possibleDate.charAt(i))) {
                    //a date, hopefully...
                    return exchangeNames(authorString.substring(0, separator));
                }
            }
            //no date, but change the name anyway.
            return exchangeNames(authorString);
        }
        return authorString;
    }

    private String exchangeNames(String authorString) {
        normalizeString.setLength(0);
        exchangeNamesAuxSuffix(authorString);
        return normalizeString.toString();
    }

    private void exchangeNamesAuxSuffix(String authorString) {
        int seperator = authorString.lastIndexOf(',');
        if (seperator == -1) {
            normalizeString.append(authorString);
            return;
        }
        normalizeString.append(authorString.substring(seperator +
2)).append(' ');
        exchangeNamesAuxPrefix(authorString.substring(0, seperator));
    }

    private void exchangeNamesAuxPrefix(String authorString) {
        int seperator = authorString.indexOf(',');
        if (seperator == -1) {
            normalizeString.append(authorString);
            return;
        }
        normalizeString.append(authorString.substring(0,
seperator)).append(' ');
        exchangeNamesAuxSuffix(authorString.substring(seperator + 2));
    }

From jimad at msn.com  Thu Sep 17 19:15:41 2009
From: jimad at msn.com (James Adcock)
Date: Thu, 17 Sep 2009 19:15:41 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS248803CAD2B284A6BD3321AEE10@phx.gbl>
	<212322090909171313v4d6feeb4g500f77efb320f4da@mail.gmail.com>
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com>
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com>
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com>
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com>
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com>
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com>
Message-ID: <SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>

Sorry Paulo, I'm not sure what you are up to, but again, what do your
algorithms actually find when applied to the author name examples I
presented earlier?

Sun Tzu
Miguel de Cervantes
Marquis de Sade



From i30817 at gmail.com  Thu Sep 17 19:24:45 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 03:24:45 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<212322090909171315j1d206e46h63d6182900d2fcf9@mail.gmail.com> 
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> 
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> 
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> 
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
Message-ID: <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com>

Sun Tzu apparently doesn't exist (it's probably as the original name.
Searching for Art of War gives Sunzi as one of the names)
Miguel de Cervantes - > Miguel de Cervantes Saavedra
Marquis de Sade -> marquis de Sade (marquis is lowercase for some
reason on the index).

From i30817 at gmail.com  Thu Sep 17 19:33:23 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 03:33:23 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171924u821f756qc64ef823c224663a@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS241B9B4366871FDF0BEFCBAEE10@phx.gbl> 
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> 
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> 
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> 
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
	<212322090909171924u821f756qc64ef823c224663a@mail.gmail.com>
Message-ID: <212322090909171933u660219f2u57df90004223717e@mail.gmail.com>

This is not applied to the names you gave themselves, but as they
appear on the index.

Marquis de Sade for instance appears on the index as : "Sade, marquis
de, 1740-1814".

From i30817 at gmail.com  Thu Sep 17 20:47:13 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 04:47:13 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909171933u660219f2u57df90004223717e@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<212322090909171610r23f48cdfo212cf715a7c11123@mail.gmail.com> 
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> 
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> 
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> 
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
	<212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> 
	<212322090909171933u660219f2u57df90004223717e@mail.gmail.com>
Message-ID: <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com>

Duh, still wrong. Wait a second, i will sort it out.

From i30817 at gmail.com  Thu Sep 17 20:57:47 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 18 Sep 2009 04:57:47 +0100
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl> 
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> 
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> 
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> 
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
	<212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> 
	<212322090909171933u660219f2u57df90004223717e@mail.gmail.com> 
	<212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com>
Message-ID: <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com>

You're right, it is inconsistent in some (apparently only titled) authors.
For example:
1 name, title broken up.
La Rochejaquelein, Marie-Louise-Victoire, marquise de, 1772-1857

versus:
2 names title intact (correct apparently since it is consistent with
most of the rest of the names).
Disraeli, Benjamin, Earl of Beaconsfield, 1804-1881

No way to recognize if it should be plain LIFO order or something else.

From sly at victoria.tc.ca  Thu Sep 17 21:54:08 2009
From: sly at victoria.tc.ca (Andrew Sly)
Date: Thu, 17 Sep 2009 21:54:08 -0700 (PDT)
Subject: [gutvol-d] Author names in catalog
In-Reply-To: <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com> 
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com> 
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com> 
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com> 
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
	<212322090909171924u821f756qc64ef823c224663a@mail.gmail.com> 
	<212322090909171933u660219f2u57df90004223717e@mail.gmail.com> 
	<212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com>
	<212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com>
Message-ID: <Pine.GSO.4.58.0909172146470.18198@vtn1.victoria.tc.ca>


There is a separate cataloger's mailing list if you are interested
in further discussion with the people who are editing the catalog.

However, it might help if I tell you that most of the
author headings follow the form used at the Library of Congress.
And _they_ follow rules and vaguries that have built up over
many decades.

I can tell you without uncertainty that you will not be able
to prepare a process which will give you 100% good results.

--Andrew

On Fri, 18 Sep 2009, Paulo Levi wrote:

> You're right, it is inconsistent in some (apparently only titled) authors.
> For example:
> 1 name, title broken up.
> La Rochejaquelein, Marie-Louise-Victoire, marquise de, 1772-1857
>

From jimad at msn.com  Thu Sep 17 22:11:43 2009
From: jimad at msn.com (James Adcock)
Date: Thu, 17 Sep 2009 22:11:43 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com>
References: <d4e.5ad2093d.37dea2e9@aol.com>	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com>
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com>
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com>
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com>
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>	<212322090909171924u821f756qc64ef823c224663a@mail.gmail.com>
	<212322090909171933u660219f2u57df90004223717e@mail.gmail.com>
	<212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com>
	<212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com>
Message-ID: <SNT120-DS1846AE7DF0A3362BA69D77AEE00@phx.gbl>

I hope you have figured out my point by now:

Namely, IF one wants to make "correct" e-book files in a number of formats, including EPUB and MOBI, it is not possible algorithmically to determine the "correct" encoding of Author Lastname, Firstname from data currently found in either the PG HTML encodings nor the PG TXT encodings.

It is also not possible to make "correct" encodings of Author Lastname, Firstname from the information currently recorded in the PG catalog.

One would like to have "correct" encodings of Author Lastname, Firstname so that if a customer adds a PG text in say EPUB or MOBI to their existing collection of e-book titles in their e-book library, it would be nice if the Author Lastname, Firstname sorts and displays correctly next to any other e-books they might already possess from other sources.

Sun Tzu

Sun is the author's family name, or what is represented as an authors "Lastname" in western cultures.  Tzu is a romanization of an honorarium such as "Sir" or "Mr"  

Sun Tzu ??; S?n Z?;

Which is listed in a westernized corrupted form in the PG catalog as "Sunzi" which shows lack of cultural respect -- combining the family name with the honorarium in a way to artificially form an apparent feminine.  However, I believe the transcriber needs to transcribe the book as written, including the spelling or representation of the author name found there, which means that the book transcription in HTML or PG TXT cannot be used as a reliable source of author name -- nor should the spelling given in transcription necessarily be how the author is listed in the PG catalog.  Nor can it algorithmically be thus possible to figure out what part therein is the "last name [family name]" So therefore in addition to the coding in the HTML or the PG TXT there also needs to be a "spine" representation that gives a correct canonical identification of author "Lastname: Sun Firstname: Tzu" where again Tzu isn't really the first name, but by traditional this slot gets used for that part of the canonical author name representation which isn't the lastname. "Art of War" also being known simply as "The Sun Tzu."

Miguel de Cervantes

Last name of author is actually most often canonically represented as "Cervantes Saavedra", with the "firstname" part typically represented as "Miguel de". Saavedra being mother's last name in a culture where children bear their mother's name but when the book is sold in other cultures that are uncomfortable with this convention then the Saavedra tends to get dropped -- but shouldn't be because it IS the author's last name.

Marquis de Sade

Last name of author = Sade.  First name part is "Donatien Alphonse Franc?ois".  But by tradition customers are probably expecting the firstname part to be represented as "Marquis de" -- they almost certainly will not recognize "Donatien Alphonse Franc?ois". So its not real clear how the firstname part ought be coded, but if the lastname part is coded as Sade then at least the book will show up about the right place in the possessor's library listing.

Again, the point being neither the PG catalog nor the literal transcription can be used as a reliable source of the author lastname, firstname information -- which DOES need to be reliably included in the e-book file so that the e-book will show up at correct location in the customer's e-book library sort.



From Bowerbird at aol.com  Fri Sep 18 01:49:57 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 18 Sep 2009 04:49:57 EDT
Subject: [gutvol-d] good news on the doorstep
Message-ID: <c0a.637c5c13.37e4a3b5@aol.com>

good news on the doorstep.

there's now a new listserve for "civil" p.g. volunteers.

that's right, no need to put up with the rudeness of this list.

>    http://groups.yahoo.com/group/PG_vol_lounge/message/1

here's the introduction, from founder joyce wilson:

>    Welcome to the Project Gutenberg volunteer lounge! 
>    This list is intended to provide a friendly, supportive, 
>    and civil forum for PG volunteers. I hope it will provide 
>    a sense of community connection for those of us 
>    whose PG volunteer jobs can seem kind of isolated. 
>    Problem-solve, discuss, ask questions, share good news, 
>    brag on yourself and others, liberally apply congratulations 
>    and back-pats, chit-chat about stuff. But don't be a jerk, 
>    or you'll be removed from the list.

so, now if you're feeling a need for some liberally-applied
congratulations and back-pats, or some shared good news,
without jerks pestering you, you'll know exactly where to go.

thanks for provided this much-needed service, joyce!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090918/4e9b4e07/attachment.html>

From richfield at telkomsa.net  Fri Sep 18 04:51:05 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Fri, 18 Sep 2009 13:51:05 +0200
Subject: [gutvol-d] Name lists and Big-endianism
Message-ID: <4AB37429.5090403@telkomsa.net>

Without due respect for the dead hand of history, or the dead heads of 
aesthetes trying to impose attractive schemes devoid of logic or 
practicality, it would be nice if we could agree on some scheme to 
sequence our author indexes. It won't happen of course, and I am not 
silly enough to think that this brief note contains anything conclusive, 
but give it a thunk, anyone interested.
Anyone uninterested is sternly forbidden to consider the matter or read 
this remark (it hardly hopes to attain the dignity of a suggestion.)

Let us assume that we have authors such as the famous 

Johanna Kakebeenwania van der Merwe                     O'Brien,
Jolien Gertina van der Poel                                           
O'Mally,
Paulette Marmorella Bridhedia                                     Paul-Ewen
Truupsvor Theooseov                                                  
Swizarminife
Neville McSnurtle Quentin                                           Urtel
Xavier Ypres Zulrich                                                    
?rtur
Aspoestertjie Sinnerella Katrina                                   van 
Aswagen
Gehardus Johannes Katwimpers Janse van Vuuren       van den Heever
Johannes Gehardus du Toit                                          van 
der Vyfer
Jakobus Johannes Joumoerus                                      Vandaaigoed
Lelie Belladonna Nerina                                               
Vanderker
Otto 
Werther                                                              von 
und zu Bismarkharing

The problem is notionally to sequence them  according to a 
comprehensible and totally unambiguous scheme, with the least 
sensitivity to uncertain spellling and concentrations of initial letters 
etc.
The best approach is to write each name, as much as desired in normal 
internal sequence as above, then split each name immediately after the 
last non-alphabetic character (including spaces). The bit at the end is 
what you sequence by, NOT the full name, NOT necessarily the full 
surname, and without consideration of case or diacritical signs.

In our by no means random, but hardly unrealistic example,several 
questions arise, including the role of various non-alphabetic 
characters, and the artificial concentration of surnames under the 
initial letters of prefixes such as de, der, du, van, van der, von den, 
and no end of etcs. By sorting by the terminal alphabetic string, we 
remove ambiguity and even out the spread of names through the alphabet. 
In simple information theory this optimises search time and sort 
efficiency. The above example becomes:

Aswagen Aspoestertjie Sinnerella Katrina van

Bismarkharing Otto Werther von und zu

Brien Johanna Kakebeenwania van der Merwe O'

Ewen Paulette Marmorella Bridhedia Paul-

Heever Gehardus Johannes Katwimpers Janse van Vuuren van den

Mally Jolien Gertina van der Poel O'

Swizarminife Truupsvor Theooseov

Urtel Neville McSnurtle Quentin

?rtur Xavier Ypres Zulrich

Vandaaigoed Jakobus Johannes Joumoerus

Vanderker Lelie Belladonna Nerina

Vyfer Johannes Gehardus du Toit van der


The head benefit is in the de tailing.

Not that anyone asked.

Cheers,

Jon




From walter.van.holst at xs4all.nl  Fri Sep 18 07:51:12 2009
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Fri, 18 Sep 2009 16:51:12 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB37429.5090403@telkomsa.net>
References: <4AB37429.5090403@telkomsa.net>
Message-ID: <4AB39E60.6030108@xs4all.nl>

Jon Richfield schreef:
> In our by no means random, but hardly unrealistic example,several 
> questions arise, including the role of various non-alphabetic 
> characters, and the artificial concentration of surnames under the 
> initial letters of prefixes such as de, der, du, van, van der, von den, 
> and no end of etcs. By sorting by the terminal alphabetic string, we 
> remove ambiguity and even out the spread of names through the alphabet. 
> In simple information theory this optimises search time and sort 
> efficiency. The above example becomes:
> 
> Aswagen Aspoestertjie Sinnerella Katrina van
> 
> Bismarkharing Otto Werther von und zu
> 
> Brien Johanna Kakebeenwania van der Merwe O'
> 
> Ewen Paulette Marmorella Bridhedia Paul-
> 
> Heever Gehardus Johannes Katwimpers Janse van Vuuren van den
> 
> Mally Jolien Gertina van der Poel O'
> 
> Swizarminife Truupsvor Theooseov
> 
> Urtel Neville McSnurtle Quentin
> 
> ?rtur Xavier Ypres Zulrich
> 
> Vandaaigoed Jakobus Johannes Joumoerus
> 
> Vanderker Lelie Belladonna Nerina
> 
> Vyfer Johannes Gehardus du Toit van der
> 
> 
> The head benefit is in the de tailing.
> 
> Not that anyone asked.

Since you've picked a bunch of mostly Dutch and German authors or at 
least authors whose ancestors happened to be Dutch or German, I'd like 
to point out that a rather common way in Dutch databases is to do it 
slightly different:

Sinerella Katrina van Aswagen Aspostertjie would become:

Aswagen Aspostertjie, van, Sinerella Katrina

This prevents alphabetically sorting all surnames from becoming a 
massive series of entries starting with a 'V'.

I'm rather sure Marcello can provide the answer on whether our Eastern 
brethren do it the same.

Regards,

  Walter van Holst


From richfield at telkomsa.net  Fri Sep 18 08:29:14 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Fri, 18 Sep 2009 17:29:14 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB39E60.6030108@xs4all.nl>
References: <4AB37429.5090403@telkomsa.net> <4AB39E60.6030108@xs4all.nl>
Message-ID: <4AB3A74A.6080103@telkomsa.net>

Dag Walter, bly te kenne!
In South Africa there are indeed strong Dutch as well as other Germanic 
influences, and nowhere more so than in our surnames (especially 
Afrikaans surnames of course).  Van, van der, von, van den, ter, ten, 
etc.  Van and van der are easily the leaders though. We do however have 
a strong Huguenot influences (de, du, even a few le etc) and don't 
forget the Irish O', though they are not as prominent as in say, the US. 
Also, for similar reasons some black names begin with U, N, or M. We 
also have Portuguese names (Del...)

And yes, the reason you mention is exactly the one I had in mind. 
Especially in certain districts where certain families settled and 
established a patronymic dominance that became a local source of 
pervasive inconvenience and perverse pride. (There sometimes are 
problems with the family forenames as well; schools and universities 
have been driven to distinguish between particular students by date of 
birth!)
And thereby hang various tales, variously amusing...

I am not quite certain of the DB convention you mention though. Are you 
sure that you didn't have some finger trouble?
"Aspoestertjie Sinnerella Katrina van Aswagen" becomes "Aswagen 
Aspostertjie, van, Sinerella Katrina"???  Isn't that a bit pointlessly 
arbitrary, devious, even obscure? If it is indeed the convention, then 
so be it, but I would think that the rotation scheme I proposed has 
major advantages. For one thing it puts the Driscols Benny O' in their 
places, along with the Drifters Benny Smith- and the Diemans Benny van.

Mooi bly!

Jon
> Jon Richfield schreef:
>> In our by no means random, but hardly unrealistic example,several 
>> questions arise, including the role of various non-alphabetic 
>> characters, and the artificial concentration of surnames under the 
>> initial letters of prefixes such as de, der, du, van, van der, von 
>> den, and no end of etcs. By sorting by the terminal alphabetic 
>> string, we remove ambiguity and even out the spread of names through 
>> the alphabet. In simple information theory this optimises search time 
>> and sort efficiency. The above example becomes:
>>
>> Aswagen Aspoestertjie Sinnerella Katrina van
>>
>> Bismarkharing Otto Werther von und zu
>>
>> Brien Johanna Kakebeenwania van der Merwe O'
>>
>> Ewen Paulette Marmorella Bridhedia Paul-
>>
>> Heever Gehardus Johannes Katwimpers Janse van Vuuren van den
>>
>> Mally Jolien Gertina van der Poel O'
>>
>> Swizarminife Truupsvor Theooseov
>>
>> Urtel Neville McSnurtle Quentin
>>
>> ?rtur Xavier Ypres Zulrich
>>
>> Vandaaigoed Jakobus Johannes Joumoerus
>>
>> Vanderker Lelie Belladonna Nerina
>>
>> Vyfer Johannes Gehardus du Toit van der
>>
>>
>> The head benefit is in the de tailing.
>>
>> Not that anyone asked.
>
> Since you've picked a bunch of mostly Dutch and German authors or at 
> least authors whose ancestors happened to be Dutch or German, I'd like 
> to point out that a rather common way in Dutch databases is to do it 
> slightly different:
>
> Sinerella Katrina van Aswagen Aspostertjie would become:
>
> Aswagen Aspostertjie, van, Sinerella Katrina
>
> This prevents alphabetically sorting all surnames from becoming a 
> massive series of entries starting with a 'V'.
>
> I'm rather sure Marcello can provide the answer on whether our Eastern 
> brethren do it the same.
>
> Regards,
>
>  Walter van Holst
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>


From prosfilaes at gmail.com  Fri Sep 18 10:53:11 2009
From: prosfilaes at gmail.com (David Starner)
Date: Fri, 18 Sep 2009 13:53:11 -0400
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS165A43E83C65DE9247D12FAEE10@phx.gbl>
References: <c91.48d074f8.37e288f1@aol.com>
	<SNT120-DS165A43E83C65DE9247D12FAEE10@phx.gbl>
Message-ID: <6d99d1fd0909181053s5ca3ae67sa21e4923e6ed4f21@mail.gmail.com>

On Thu, Sep 17, 2009 at 1:25 PM, James Adcock <jimad at msn.com> wrote:
> Since Bowerbird claims it is easy to go from PG
> TXT to HTML, then certainly it is as easy to go from HTML to PG TXT

That does not follow; it is easy to go from PG TXT to paper, but not
so easy to go the other way.

-- 
Kie ekzistas vivo, ekzistas espero.

From lee at novomail.net  Fri Sep 18 16:30:48 2009
From: lee at novomail.net (Lee Passey)
Date: Fri, 18 Sep 2009 17:30:48 -0600
Subject: [gutvol-d] Re: PG French text file #1500
In-Reply-To: <alpine.DEB.2.00.0909141419510.4940@mail.pglaf.org>
References: <alpine.DEB.2.00.0909141419510.4940@mail.pglaf.org>
Message-ID: <4AB41828.2000908@novomail.net>

Michael S. Hart wrote:

> Right now we are looking at Voltaire, de Toqueville's "Democracy,"
> and a few others.
>
> 20 more and we are at 1500.
>
> Please take a look for various copies of "Democracy" and anything
> else you think we might be able to use, and let me know.
>   

http://fr.wikisource.org/wiki/De_la_d%C3%A9mocratie_en_Am%C3%A9rique

Wikisource claims to have over 50,000 French works, although I notice a 
fair number of them are works translated from other languges (e.g., H.G. 
Wells' classic _La Guerre des Mondes_).

Happy Harvesting!





From sly at victoria.tc.ca  Fri Sep 18 23:28:42 2009
From: sly at victoria.tc.ca (Andrew Sly)
Date: Fri, 18 Sep 2009 23:28:42 -0700 (PDT)
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <SNT120-DS1846AE7DF0A3362BA69D77AEE00@phx.gbl>
References: <d4e.5ad2093d.37dea2e9@aol.com>
	<SNT120-DS149857B4784F3980DEF807AEE10@phx.gbl>
	<212322090909171729l613a79a3l300826cc8eb52f81@mail.gmail.com>
	<212322090909171754q2ca5d156m339ccc6023bf7a79@mail.gmail.com>
	<212322090909171818q1859acf9kb646d5bc72fe7687@mail.gmail.com>
	<212322090909171833m71b042f9j865f12135ffc93a9@mail.gmail.com>
	<SNT120-DS23FDDCCA24831FB6E4116FAEE00@phx.gbl>
	<212322090909171924u821f756qc64ef823c224663a@mail.gmail.com>
	<212322090909171933u660219f2u57df90004223717e@mail.gmail.com>
	<212322090909172047m65a9edaatdf683dc7884d45c1@mail.gmail.com>
	<212322090909172057q5a2dc32cu76d49efd24680a16@mail.gmail.com>
	<SNT120-DS1846AE7DF0A3362BA69D77AEE00@phx.gbl>
Message-ID: <Pine.GSO.4.58.0909182326370.8567@vtn1.victoria.tc.ca>


I think that with the few examples you have given, you have shown that
it is not possible to do so with _any_ library catalog, because the
usage of names has so many variations and exceptions.

--Andrew

On Thu, 17 Sep 2009, James Adcock wrote:

> I hope you have figured out my point by now:

Namely, IF one wants to make "correct" e-book files in a number of formats, including EPUB and MOBI, it is not possible algorithmically to determine the "correct" encoding of Author Lastname, Firstname from data currently found in either the PG HTML encodings nor the PG TXT encodings.


From sly at victoria.tc.ca  Fri Sep 18 23:39:37 2009
From: sly at victoria.tc.ca (Andrew Sly)
Date: Fri, 18 Sep 2009 23:39:37 -0700 (PDT)
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB37429.5090403@telkomsa.net>
References: <4AB37429.5090403@telkomsa.net>
Message-ID: <Pine.GSO.4.58.0909182329550.8567@vtn1.victoria.tc.ca>

And don't forget that other national traditions you can have
more confusion.

For example:

For Hungarian names, the preferred order is [Family name] [Given name]
So that in the main text of PG#19433 the author's name is given as:
Balazs Bela, with the understanding that the first name that appears
is the one we alphbetize by.

And in Icelandic names, what looks to us as a "last name" is
not actually a family name, but a patrynomic. It is incorrect
to alphabetize by that, so the given name is used instead.

--Andrew

On Fri, 18 Sep 2009, Jon Richfield wrote:

> Without due respect for the dead hand of history, or the dead heads of
> aesthetes trying to impose attractive schemes devoid of logic or
> practicality, it would be nice if we could agree on some scheme to
> sequence our author indexes. It won't happen of course, and I am not
> silly enough to think that this brief note contains anything conclusive,
> but give it a thunk, anyone interested.
> Anyone uninterested is sternly forbidden to consider the matter or read
> this remark (it hardly hopes to attain the dignity of a suggestion.)
>
> Let us assume that we have authors such as the famous
>
> Johanna Kakebeenwania van der Merwe                     O'Brien,
> Jolien Gertina van der Poel
> O'Mally,
> Paulette Marmorella Bridhedia                                     Paul-Ewen
> Truupsvor Theooseov
> Swizarminife
> Neville McSnurtle Quentin                                           Urtel
> Xavier Ypres Zulrich
> ?rtur
> Aspoestertjie Sinnerella Katrina                                   van
> Aswagen
> Gehardus Johannes Katwimpers Janse van Vuuren       van den Heever
> Johannes Gehardus du Toit                                          van
> der Vyfer
> Jakobus Johannes Joumoerus                                      Vandaaigoed
> Lelie Belladonna Nerina
> Vanderker
> Otto
> Werther                                                              von
> und zu Bismarkharing
>
> The problem is notionally to sequence them  according to a
> comprehensible and totally unambiguous scheme, with the least
> sensitivity to uncertain spellling and concentrations of initial letters
> etc.
> The best approach is to write each name, as much as desired in normal
> internal sequence as above, then split each name immediately after the
> last non-alphabetic character (including spaces). The bit at the end is
> what you sequence by, NOT the full name, NOT necessarily the full
> surname, and without consideration of case or diacritical signs.
>
> In our by no means random, but hardly unrealistic example,several
> questions arise, including the role of various non-alphabetic
> characters, and the artificial concentration of surnames under the
> initial letters of prefixes such as de, der, du, van, van der, von den,
> and no end of etcs. By sorting by the terminal alphabetic string, we
> remove ambiguity and even out the spread of names through the alphabet.
> In simple information theory this optimises search time and sort
> efficiency. The above example becomes:
>
> Aswagen Aspoestertjie Sinnerella Katrina van
>
> Bismarkharing Otto Werther von und zu
>
> Brien Johanna Kakebeenwania van der Merwe O'
>
> Ewen Paulette Marmorella Bridhedia Paul-
>
> Heever Gehardus Johannes Katwimpers Janse van Vuuren van den
>
> Mally Jolien Gertina van der Poel O'
>
> Swizarminife Truupsvor Theooseov
>
> Urtel Neville McSnurtle Quentin
>
> ?rtur Xavier Ypres Zulrich
>
> Vandaaigoed Jakobus Johannes Joumoerus
>
> Vanderker Lelie Belladonna Nerina
>
> Vyfer Johannes Gehardus du Toit van der
>
>
> The head benefit is in the de tailing.
>
> Not that anyone asked.
>
> Cheers,
>
> Jon
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From Bowerbird at aol.com  Sat Sep 19 00:47:44 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 19 Sep 2009 03:47:44 EDT
Subject: [gutvol-d]  re: In search of a more-vanilla vanilla TXT
Message-ID: <cb8.5613a3d2.37e5e6a0@aol.com>

jim said:
>   I challenge you to write such a program 
>    to run on _my_ choice of platform: Kindle DX

i wonder, jim, if you are being disingenuous on purpose?
or are you just incapable of having an honest discussion?

because i'm sure you know you "challenged" me to do
something that cannot be done.   and as much as you
might like to say "that's the point", it's really _not_...

because nobody else can meet that challenge either.

therefore, the obvious response to this "challenge" is
to do what one _can_ do, which is to custom-tailor
a beautiful e-book which _can_ be read on the dx...

and since the dx reads .pdf files, that's very simple!

so your big "challenge" is whisked away immediately,
in yet another testament to the power of plain-text...

z.m.l. can create very beautiful (and powerful) .pdf,
fully customized to your own personal preferences,
all with just the click of a button.   from a z.m.l. file.
(or a p.g. plain-text file you've modified into z.m.l.)
by using a program on your own personal computer.
you can choose any font you like from your computer.
and the leading you want.   and the margins you want.
and any font-size.   and any font-color.   and so on...
it's the ultimate in personal control.   people like that.

and if you prefer native kindle format instead of .pdf?

just do the zml-to-html conversion instead, where the
.html is generated to your own personal preferences,
and then convert that .html file to the kindle format...

you don't have to "put up with" the settings from some
online website, and hope that they'll be "good enough",
since you probably cannot change them if they aren't...

you get it the way you want, using your own machine,
with software that you know will keep working on it...

what's not to like?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090919/ac21ae0b/attachment.html>

From Bowerbird at aol.com  Sat Sep 19 02:08:55 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 19 Sep 2009 05:08:55 EDT
Subject: [gutvol-d] a case of deliberate sabotage by a p.g. volunteer
Message-ID: <d45.52e15a49.37e5f9a7@aol.com>

jim, i just took a look at "wings of the dove"
-- pg#29452, which you post-processed --
and i'm troubled by what i discovered there.

ok, maybe "troubled" is a bit melodramatic,
much like the subject-line on this post, but
i don't really think it's all _that_ farfetched...

what i found is that, in the .html version of
the book, you showed the italics properly...

good job.

in the .txt version, however, you deliberately
deleted the italics markers which formatters
had invested considerable work in inserting...

bad job!

the text version -- this was the 8-bit file --
was also missing a handful of diacritics in it:

>    Seen at a foreign table d'h?te, he suggested
>    Br?nig (several cases of this one)
>    word--? bient?t!--across
>   You're blas?, but you're not enlightened.
>    wasn't it, ? peu pr?s, what all
>    Matcham were inesp?r?es, were pure manna

in some books, those missing diacritics would
be a big issue.   here, they're fairly uncommon,
and thus do not really constitute a very big deal.

but the missing italics?   they are a major problem.

and it's a problem that _you_ introduced yourself.

you're supposed to use underscores for the italics.
(go ahead, read the instructions, it says it clearly.)

you're definitely _not_ supposed to remove them!

and i must say, it takes a lot of gall for you to do
this deliberate sabotage of the plain-text file and
_then_ come here to complain because that file is
inferior...   of course it's inferior!   you made it so!

you went out of your way to make it substandard.
if i would've done that, i'd be ashamed of myself.

i'm also disturbed that the whitewashers allowed
this intentionally-disfigured file into the library...

but that's another matter, a fight for another day.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090919/9c44f800/attachment-0001.html>

From Bowerbird at aol.com  Sat Sep 19 02:34:59 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 19 Sep 2009 05:34:59 EDT
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
Message-ID: <cc7.5e30d415.37e5ffc3@aol.com>

ok, it looks like the diacritic problem
was an encoding glitch on my end so 
i apologize for that, and take it back.

but the missing italics still loom large.

i might also throw in a few other notes.

first off, there's no table of contents in
these files, either the .html or the .txt...

i consider that to be plain unacceptable.
it's easy enough to make, and it's useful.
besides, it orients the reader to the book.

i even like backlinks from chapter heads to 
the table of contents, for quick navigation,
and previous/next chapter links are nice too.

and, since there has been some discussion
about title-pages, i think the title-page in
the .html version is done poorly, because
it's too widely spaced, which means that
it needs about two screens on my monitor.
(and i have a 23-inch cinema-screen here.)
the whole thing needs much tighter leading.
it certainly doesn't feel like a "real" title-page.

oh yeah, and the .epub version of the book?
no side margins at all...   it looks freakish...
which means that if you don't like that look,
you're gonna need a reader-program which
"allows" you to adjust the margins yourself.
just so you know...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090919/33c87703/attachment.html>

From Bowerbird at aol.com  Sat Sep 19 13:36:57 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 19 Sep 2009 16:36:57 EDT
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
Message-ID: <cdf.58020bbe.37e69ae9@aol.com>

so, jim, i'm gonna give "wings of the dove"
the whole z.m.l. treatment, so you can see it.

i'm assuming that you used the copy from
the university of california, at archive.org?
>    http://ia310110.us.archive.org/1/items/wingsofthedove01jamerich/

the other copy, from university of toronto,
appears to have a 1909 publication date...

by the way, do you have a copy of your file
_before_ you rewrapped it, with the original
linebreaks? because that would help me lots.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090919/1fb2d705/attachment.html>

From richfield at telkomsa.net  Sun Sep 20 04:19:44 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Sun, 20 Sep 2009 13:19:44 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <Pine.GSO.4.58.0909182329550.8567@vtn1.victoria.tc.ca>
References: <4AB37429.5090403@telkomsa.net>
	<Pine.GSO.4.58.0909182329550.8567@vtn1.victoria.tc.ca>
Message-ID: <4AB60FD0.3060009@telkomsa.net>

Yes, I have a book by one Peter Rosza.  It took me some time to realise 
that Peter was in fact a woman, and a well-known mathematician at that, 
whom we might have called Rose Peter.
Tsk!  These Magyars...! You'd think they would have come to us for advice.

As for the Icelandic convention, I knew that there was something funny 
about all their terminal "-sons" and "-dotters" (sp?) but don't they 
have any family name at all?
Some of the Slavic names might be troublesome too, because they vary the 
suffix of what I take to be the family name, according to gender: -ski 
vs -ska and so on. But maybe I have that mixed up as in the Icelandic 
names.
Could it be that the Icelandic convention derives from the fact that 
they are dealing with a smallish population?
Anyway, It seems to me that the indexing convention I proposed would 
still be easy to apply by anyone that understands the naming convention 
of the language and the population in question. Simply write the 
complete name (or whatever part suits the DB in question)  in the 
lexically normal way according to the favoured convention, then rotate 
it till the first letter after the last non-alphabetic character is 
first in the string, and voila!

Go well,

Jon

Andrew Sly wrote:
> And don't forget that other national traditions you can have
> more confusion.
>
> For example:
>
> For Hungarian names, the preferred order is [Family name] [Given name]
> So that in the main text of PG#19433 the author's name is given as:
> Balazs Bela, with the understanding that the first name that appears
> is the one we alphbetize by.
>
> And in Icelandic names, what looks to us as a "last name" is
> not actually a family name, but a patrynomic. It is incorrect
> to alphabetize by that, so the given name is used instead.
>
> --Andrew
>
>
>   



From publiek.devos at skynet.be  Sun Sep 20 11:35:40 2009
From: publiek.devos at skynet.be (Frits Devos)
Date: Sun, 20 Sep 2009 20:35:40 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB60FD0.3060009@telkomsa.net>
References: <4AB37429.5090403@telkomsa.net>
	<Pine.GSO.4.58.0909182329550.8567@vtn1.victoria.tc.ca>
	<4AB60FD0.3060009@telkomsa.net>
Message-ID: <5C1815AD-1652-47D3-9C63-BA5F4F03B646@skynet.be>

Whole can of worms.  In the Netherlands Walter van Holst (If you  
allow me to use your name, Walter) would be sorted by "Holst" . In  
Belgium, using the same language, it would be sorted by "van  
Holst" (and the "van" would be capitalised).

Frits

Op 20-sep-09, om 13:19 heeft Jon Richfield het volgende geschreven:

> Yes, I have a book by one Peter Rosza.  It took me some time to  
> realise that Peter was in fact a woman, and a well-known  
> mathematician at that, whom we might have called Rose Peter.
> Tsk!  These Magyars...! You'd think they would have come to us for  
> advice.
>
> As for the Icelandic convention, I knew that there was something  
> funny about all their terminal "-sons" and "-dotters" (sp?) but  
> don't they have any family name at all?
> Some of the Slavic names might be troublesome too, because they  
> vary the suffix of what I take to be the family name, according to  
> gender: -ski vs -ska and so on. But maybe I have that mixed up as  
> in the Icelandic names.
> Could it be that the Icelandic convention derives from the fact  
> that they are dealing with a smallish population?
> Anyway, It seems to me that the indexing convention I proposed  
> would still be easy to apply by anyone that understands the naming  
> convention of the language and the population in question. Simply  
> write the complete name (or whatever part suits the DB in  
> question)  in the lexically normal way according to the favoured  
> convention, then rotate it till the first letter after the last non- 
> alphabetic character is first in the string, and voila!
>
> Go well,
>
> Jonm
>
> Andrew Sly wrote:
>> And don't forget that other national traditions you can have
>> more confusion.
>>
>> For example:
>>
>> For Hungarian names, the preferred order is [Family name] [Given  
>> name]
>> So that in the main text of PG#19433 the author's name is given as:
>> Balazs Bela, with the understanding that the first name that appears
>> is the one we alphbetize by.
>>
>> And in Icelandic names, what looks to us as a "last name" is
>> not actually a family name, but a patrynomic. It is incorrect
>> to alphabetize by that, so the given name is used instead.
>>
>> --Andrew
>>
>>
>>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From Bowerbird at aol.com  Sun Sep 20 14:52:24 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 20 Sep 2009 17:52:24 EDT
Subject: [gutvol-d] Re: Name lists and Big-endianism
Message-ID: <c8d.50ac845f.37e7fe18@aol.com>

and we see yet another excellent example of how
the "metadata" b.s. is such an unproductive path.

the o.c.d. people love to focus on these minute
details, which make very little difference at all
-- who cares how "van holst" is sorted?, or if the
"van" is capitalized or not?, or indeed whether
it is "capitalised" or not?, because a search for
"holst" is gonna find it no matter what you do --
and, as if this insignificance wasn't bad enough,
such compulsiveness usually causes full paralysis.

you can tie yourself up worrying about that crap...
or you can cut the gordian knot and be productive.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090920/d8e0299a/attachment.html>

From Bowerbird at aol.com  Sun Sep 20 15:50:19 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sun, 20 Sep 2009 18:50:19 EDT
Subject: [gutvol-d] the wings of the dove -- 001
Message-ID: <d3b.4da6eacd.37e80bab@aol.com>

hey jim, if you're still around (and you should be),
then you'll want to know i've started on that book,
"the wings of the dove", by henry james.

i've mounted it on my website; you can find it here:
>    http://z-m-l.com/go/wotdj/wotdjp123.html

that's a page-by-page rendering, with a form at
the bottom of each page for _reporting_errors_...

you will see that the text for each page is given,
as well as the scan for that page, for verification.

this is in keeping with your excellent suggestion
that p.g. should mount a book in such a way that
other people can come along later to improve it.

both the text and the scans are from archive.org.
(i used the scan from the university of california.)

as many people here undoubtedly know, the o.c.r.
done by the o.c.a. is dreadful.   most people likely
think it's dreadful in the way that much o.c.r. is,
namely, that it's filled with misrecognition errors.

but the o.c.r. from the o.c.a. is worse.   much worse.

that's because their tech people there mishandle it.

specifically, they _lose_ the em-dashes in the text!

oh, the o.c.r. recognizes the em-dashes, but then
-- somewhere in their file-handling workflow --
the o.c.a. "tech people" there lose the em-dashes!

for example, look at page 9 from the book:
>    http://z-m-l.com/go/wotdj/wotdjp009.html

you'll see that the em-dashes in the last paragraph
have been dropped from the text.   it's unbelievable!

this problem _alone_ is enough to make the o.c.r. 
totally unworkable.   but this isn't the only problem.
(i tried to restore the em-dashes programmatically,
by coding a tool, but it's less work to redo the o.c.r.)

that's not all.   there are more problems...

if you look at the text more closely, you will see that
the techies also lost the apostrophes in contractions!

i did a few global changes to restore _some_ of 'em,
like in the contraction "i'm", but i didn't fix them all...

this is not a problem that is _common_ to o.c.a. books,
but it's not a _rare_ occurrence either.   stunning idiocy.

and further, the hyphens on end-line hyphenates are
missing as well!   this sometimes happens in the o.c.r.,
so i'm not sure if that's what happened with this book
or if end-line hyphens were lost in the o.c.a. workflow,
but whatever it was, damage to the text is considerable.

and like i said, these problems are rather pervasive...

it's really ridiculous -- and quite sad -- that the people
in charge of the technology over at the o.c.r. are idiots
who have built a workflow that actually damages text...

what's even worse is that -- when i've brought this to
their attention -- they've responded with ad hominem
attacks on _me_, as if _i_ were the guilty perpetrator...

eventually, when i persisted, they finally consented to
solve the worst of the problems -- the em-dashes --
but i don't know if they ever did solve the problem...

meanwhile, they've banned me from their listserves,
so they wouldn't have to listen to my persistent posts.

talk about killing the messenger!   it's appalling...

the main reason this is so troubling is that the o.c.a.
are supposed to be "the good guys", who are the only
competitor to google.   and they are badly incompetent.

plus they have thin skins to boot, and they would rather
_silence_ the people who point out their problems than
do the work that will solve their self-induced problems.
this does not bode well for our future.   not well at all...

anyway...

the main benefit of the o.c.r. from the o.c.a. is that
it has retained the structure from the original book.

that means we can use a clever mash-up tool (requiring
lots of elbow-grease) to use the cleaned-up text from jim
and hang it on the structural scaffolding of the o.c.a. text.

first, let's look at the o.c.a. text:
>    http://z-m-l.com/go/wotdj/wotdj.zml

this is the single-file version of the .zml text for this book,
the file that was used to generate the page-by-page view...

now, in a separate window side-by-side with the above,
let's load in (a slightly reworked version of) jim's text:
>    http://z-m-l.com/go/wotdj/wotdj.txt

you'll see that you're able to match of the paragraphs...

for instance, do a search for "she looked about her and"
to find that paragraph in both windows, to sync them...

so essentially, what we want to do is take the linebreaks
and pagebreaks from the o.c.a. file and inject them into
the (clean) e-text.   we want to reintroduce the structure.
(which p.g. should have never stripped in the first place.)

or, to look at it in the other direction, we want to replace
all of the incorrect lines of text in the o.c.a. version with 
the good, cleaned equivalent text from the p.g. version...

once we do that, we'll have a good clean structured e-text.

we'll get to that this week.   gotta get some vitamin d now...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090920/185ff299/attachment-0001.html>

From kkloos at dodo.com.au  Sun Sep 20 16:07:26 2009
From: kkloos at dodo.com.au (Keith Kloosterman)
Date: Mon, 21 Sep 2009 09:07:26 +1000
Subject: [gutvol-d] Mailing list
Message-ID: <4AB6B5AE.5080006@dodo.com.au>

Hi,

Please remove me from this mailing list.

Thank you.

Keith Kloosterman


-- 
I am using the free version of SPAMfighter.
We are a community of 6 million users 
fighting spam.
SPAMfighter has removed 10 of my spam emails 
to date.
Get the free SPAMfighter here: 
http://www.spamfighter.com/len

The Professional version does not have this 
message


---
avast! Antivirus: Outbound message clean.
Virus Database (VPS): 090920-0, 20/09/2009
Tested on: 21/09/2009 9:07:27 AM
avast! - copyright (c) 1988-2009 ALWIL Software.
http://www.avast.com




From ajhaines at shaw.ca  Sun Sep 20 16:19:09 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Sun, 20 Sep 2009 16:19:09 -0700
Subject: [gutvol-d] Re: Mailing list
References: <4AB6B5AE.5080006@dodo.com.au>
Message-ID: <55EE0A3DC0444E169B1B2F96DA899AB9@alp2400>

Keith, you can remove yourself from this, or any other, PG list:

- go to http://lists.pglaf.org/mailman/listinfo
- click on the link to the list you want to remove yourself from, gutvol-d, 
in this case
- at the bottom of the resulting page, you'll see an "Unsubscribe or edit 
options" button
- enter your email address at the prompt to its left, and click the button.

Al


----- Original Message ----- 
From: "Keith Kloosterman" <kkloos at dodo.com.au>
To: <gutvol-d at lists.pglaf.org>
Sent: Sunday, September 20, 2009 4:07 PM
Subject: [gutvol-d] Mailing list


> Hi,
>
> Please remove me from this mailing list.
>
> Thank you.
>
> Keith Kloosterman
>
>
> -- 
> I am using the free version of SPAMfighter.
> We are a community of 6 million users fighting spam.
> SPAMfighter has removed 10 of my spam emails to date.
> Get the free SPAMfighter here: http://www.spamfighter.com/len
>
> The Professional version does not have this message
>
>
> ---
> avast! Antivirus: Outbound message clean.
> Virus Database (VPS): 090920-0, 20/09/2009
> Tested on: 21/09/2009 9:07:27 AM
> avast! - copyright (c) 1988-2009 ALWIL Software.
> http://www.avast.com
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 



From joyce.b.wilson at sbcglobal.net  Sun Sep 20 20:39:35 2009
From: joyce.b.wilson at sbcglobal.net (Joyce Wilson)
Date: Sun, 20 Sep 2009 22:39:35 -0500
Subject: [gutvol-d]  Re: Name lists and Big-endianism
Message-ID: <4AB6F577.2030105@sbcglobal.net>

The change I would like is to have spaces taken into account in the name 
sort.

So we would have something like this:

Green, Alice
Green, Robert
Greenacre, Janet
Greenjeans, Mr.

instead of like this:

Greenacre, Janet
Green, Alice
Greenjeans, Mr.
Green, Robert

--Joyce

From schultzk at uni-trier.de  Mon Sep 21 00:14:20 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 21 Sep 2009 09:14:20 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <c8d.50ac845f.37e7fe18@aol.com>
References: <c8d.50ac845f.37e7fe18@aol.com>
Message-ID: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>

Hi There,

Am 20.09.2009 um 23:52 schrieb Bowerbird at aol.com:

> and we see yet another excellent example of how
> the "metadata" b.s. is such an unproductive path.
	Not true. It is how the metedata is use or structured.
	See Below.

>
> the o.c.d. people love to focus on these minute
> details, which make very little difference at all
> -- who cares how "van holst" is sorted?, or if the
> "van" is capitalized or not?, or indeed whether
> it is "capitalised" or not?, because a search for
> "holst" is gonna find it no matter what you do --
> and, as if this insignificance wasn't bad enough,
> such compulsiveness usually causes full paralysis.
	Here BB is right on the point.

	Basically, the metadata is a dataabase. so we have the field
	for the name and then one or several fields of indexing
	that field. Furthermore in a typical library cataloge you wil
	find "Walter van Holst" under "Walter van Holst", "van Hols, Walter"
	and "Holst, van, Walter". So where doe sit leave us?

	With the development of a structured databese. Which means
	that we will have to comprise, that is cover the basic cases and
	in certain cases hand edit the fields involved. These special cases
	will be harder to find, but there will be a set of rules which will
	help us look for them. To make things easier we could use cross-
	references as in library catalogues.

	There is no magic bullet. As aexample take look at iTunes.
	It has field for sorting Artist. they use a db and for my own
	CDs the information is gotten from a diferent DB. I have my own
	notion how things should be sorted. So I edit the "sort for Artist"  
field.
	The only problem here is that for classical music sorting/ indexing by
	Artist is not viable. I prefer to use the Komposer field. So I have to
	use a different index.

	So what should be done is say our index follow these rules for names.
	If you cannot find a name where you expect it to be search do a full  
text search
	of the field X and you should find what you are looking for if not  
use the full
	name field !!!


	regards
		Keith.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/01bfb397/attachment.html>

From walter.van.holst at xs4all.nl  Mon Sep 21 00:30:33 2009
From: walter.van.holst at xs4all.nl (Walter van Holst)
Date: Mon, 21 Sep 2009 09:30:33 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>
References: <c8d.50ac845f.37e7fe18@aol.com>
	<30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>
Message-ID: <2aef272c5d38121baf7069b06bc76554@xs4all.nl>

On Mon, 21 Sep 2009 09:14:20 +0200, "Keith J. Schultz"
<schultzk at uni-trier.de> wrote:

>> the o.c.d. people love to focus on these minute
>> details, which make very little difference at all
>> -- who cares how "van holst" is sorted?, or if the
>> "van" is capitalized or not?, or indeed whether
>> it is "capitalised" or not?, because a search for
>> "holst" is gonna find it no matter what you do --
>> and, as if this insignificance wasn't bad enough,
>> such compulsiveness usually causes full paralysis.
> 	Here BB is right on the point.

Not quite. If I am looking for a book written by a particular author, I
want to be able to search for his or her name and not for all books about
that particular author. Therefore metadata has a, albeit in this era of
sophisticated search algorithms, somewhat reduced, purpose.

And to that particular bird that is usually relegated to my spambox: I
really do care whether the 'van' part in my family names is capitalised or
not. I'm rather proud of it and do not need beastly pseudonyms to cower
behind.

Regards,

 Walter

From schultzk at uni-trier.de  Mon Sep 21 00:51:00 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Mon, 21 Sep 2009 09:51:00 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB6F577.2030105@sbcglobal.net>
References: <4AB6F577.2030105@sbcglobal.net>
Message-ID: <EE5C6FBB-3CB2-4486-90A1-8F16F61898A9@uni-trier.de>

Hi There,

Am 21.09.2009 um 05:39 schrieb Joyce Wilson:

> The change I would like is to have spaces taken into account in the  
> name sort.
>
> So we would have something like this:
>
> Green, Alice
> Green, Robert
> Greenacre, Janet
> Greenjeans, Mr.
>
> instead of like this:
>
> Greenacre, Janet
> Green, Alice
> Greenjeans, Mr.
> Green, Robert
	Duhhhh !! If this is true there are some people
	that ougth to take a course in 101 programming or db
	design. It takes about 5 minutes to write the code.

	IsEntrySmallertThan(X, Y) :-
		Pos := 0;
		If (Length(X) < Length (Y))
			then MaxPos = Length(X) -1;

			else MaxPos = Length(Y) - 1;
		end if

		While  ((IsSmaller := CharSmaller(X[Pos], Y[Pos]) == 0) and Pos !=  
MaxPos )
			Pos := Pos +1;
		end While

		return IsSmaller;
	 end EntrySmallerThan

	CharAtSmaller(X, Y) :-
		If (Cardinal(X) <  Cardinal(Y) )
			return 1
			else
				If Cardinal(X) > Cardinal(Y)
					then return -1;
					else return 0;
				end if
		end if
	end CahrAtSmaller

	Cardinal(X) :-
		If (X in set of standard Chars)
			then return X
			else return -1;
		end if
	end Cardinal

	Put this ipseudo code what language you want and voila. Cardinal can  
be made as
	complex as you want if you needed finer distinctions.

	regards
		Keith.


From marcello at perathoner.de  Mon Sep 21 09:06:27 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 21 Sep 2009 18:06:27 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB6F577.2030105@sbcglobal.net>
References: <4AB6F577.2030105@sbcglobal.net>
Message-ID: <4AB7A483.5000606@perathoner.de>

Joyce Wilson wrote:

> The change I would like is to have spaces taken into account in the name 
> sort.
> 
> So we would have something like this:
> 
> Green, Alice
> Green, Robert
> Greenacre, Janet
> Greenjeans, Mr.
> 
> instead of like this:
> 
> Greenacre, Janet
> Green, Alice
> Greenjeans, Mr.
> Green, Robert


We can't do that because our database server at ibiblio uses POSIX 
collation. We cannot ask ibiblio to change that because the server is 
shared between multiple sites hosted at ibiblio and POSIX is the most 
general collation. Maybe the next software upgrade will allow us to set 
collation per database.

Re-sorting database output on the web server is impracticable because it 
would add considerable overhead to the database and web server load.

But the most important argument against changing anything is that we 
dont want to impose the preference of any one user over the rest of the 
world. There are just too many collation strategies:

Classic Spanish treasts 'ch' and 'll' as single letters.

Norwegian sorts 'aa' to the top or bottom according to pronunciation.

German phonebooks sort '?' as 'oe', but Austrian phonebooks sort '?' 
after 'o'.

Dutch phonebooks sort 'ij' as 'y', but Belgian phonebooks do not.

Now, which one should we prefer?




-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Mon Sep 21 09:58:36 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 12:58:36 EDT
Subject: [gutvol-d] Re: Name lists and Big-endianism
Message-ID: <d5f.57c9a74c.37e90abc@aol.com>

walter said:
>    If I am looking for a book written by 
>    a particular author, I want to be able to 
>    search for his or her name and not for 
>    all books about that particular author.

i agree.   but that's not the point at issue.
the point here is that an unhealthy focus
on metadata usually makes one catatonic.


>    I really do care whether the 'van' part 
>    in my family names is capitalised or not

i'm sure you do, walter, and bully for you,
which is why it would be a shame if some
o.c.d. cataloger made it uppercase for the
simple purpose of fitting their sort method.

sorting was very important in the old days
where we had a _physical_ card-catalog,
but it's silly to get bogged down in it today.
the o.c.d. people, however, just love to get
bogged down with any subject that they can.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/a0af955f/attachment-0001.html>

From Bowerbird at aol.com  Mon Sep 21 10:08:43 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 13:08:43 EDT
Subject: [gutvol-d] Re: Name lists and Big-endianism
Message-ID: <c30.62ceea34.37e90d1b@aol.com>

walter said:
>    And to that particular bird that is usually relegated to my spambox

oh, and by the way, there's an important object lesson here...

it's fine to put someone in your spam folder -- i do it myself --
but you should then remember that you are _not_ hearing the
whole conversation, and therefore probably should refrain from
commenting on any bits and pieces of text from the person who
you are ignoring, because you're likely to miss something vital,
and end up making yourself look silly.

and what makes this doubly ironic is that -- when that other
person who you are ignoring corrects you -- you won't hear
that correction.   but everybody else will, so you won't know
that the other person made you look silly, but everyone else
will know, and then -- down the line -- when everyone else
knows that the person had made you look silly, you will make
yourself look even sillier when you say you are ignoring them.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/3c4d78b4/attachment.html>

From jimad at msn.com  Mon Sep 21 10:13:03 2009
From: jimad at msn.com (James Adcock)
Date: Mon, 21 Sep 2009 10:13:03 -0700
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <c8d.50ac845f.37e7fe18@aol.com>
References: <c8d.50ac845f.37e7fe18@aol.com>
Message-ID: <SNT120-DS162637A5C5691B09E89569AEDD0@phx.gbl>

>and we see yet another excellent example of how
>the "metadata" b.s. is such an unproductive path.

>the o.c.d. people love to focus on these minute
>details, which make very little difference at all
>-- who cares how "van holst" is sorted?, .

 

You make great big assumptions about the nature of the machines that people
are reading on, and then make incorrect conclusions based on those
assumptions.  Yes, if all readers are reading on desktop computers running
some flavor of *nix then your conclusions may be correct.

 

But, not all readers of PG books are running *nix, or even desktops.  Many
of these machines have a very different notion of "sorting" than you have in
mind.

 

Which is why we just had this conversation a couple days ago, but, I guess
many people didn't get it.

 

On my favorite class of machine, which something like a million+ other
readers are reading on, and more every day, "sorts" are typically done on
authorlastname, where authorlastname is something provided within the book
file.  That part which does not correspond to authorlastname is stored by
convention in authorfirstname.  This sort information is displayed to the
reader in one of two ways, both of which ought to appear sensible:

 

Authorlastname, authorfirstname

 

And

 

Authorfirstname authorlastname

 

In either case the actual sort should be on authorlastname

 

This class of machine has no notion of the idea that you can type in part of
an authors' name and search on that. Rather all the books on the machine are
sorted and displayed in order by authorlastname, and you find a book by
scrolling for the authorlastname in sort order within that list.

 

Why does this matter?  Consider the famous author name Sun Tzu

 

What is the last name?  Sun

 

What is the first name?  Well, no one actually knows, but historically "Tzu"
which is actually an honorarium is stuck in the authorfirstname slot.  But
now look what happens:

 

In the authorlastname, authorfirstname case you get:

 

Sun, Tzu

 

Which is not a bad result

 

In the Authorfirstname authorlastname case you get:

 

Tzu Sun

 

Which is an error.  Thus, perhaps, one concludes with names where family
name needs to display first the encoding has to be:

 

Authorlastname: Sun Tzu

Authorfirstname: null

 

In which case both displays work out right.

 

How does one write an automatic algorithm to figure these things out from an
existing gut authorlist?

 

Answer, again, is that one can not write an automatic algorithm to figure
these things out because currently there isn't enough information stored
about author names, and further, how author names are sorted and displayed
are based in part on library tradition, perhaps best found by researching
Library of Congress for a particular author.

 

Another way of saying this is, let's say you make the mistake of wandering
into a Barnes and Noble when you were actually trying to enter the Starbucks
next door.  But while in there you decide to look at the fiction stacks just
for fun to see if they have your favorite author.  Where in the stacks do
you look? Well, that depends on how B&N sorts on your favorite author, which
in turn is based on library tradition for that particular author.

 

Yes you can try to write an algorithm to do this but then you will find that
surprisingly often it breaks, because it seems that having an unusual family
name is a prereq for writing a book. You can then say "oh well this is PG we
really don't care why be o.c.d.?"  But then you are producing books that
work inferior, in practice, for customers, on customer's machines, compared
to the other publishing houses, making PG look like amateur hour.  You might
say "well then they shouldn't have bought that machine rather they should
buy my favorite choice of machine."  But customers tend to consider that
attitude towards their choice of machine a sign of hostility towards the
customer by PG - which I guess is why PG already provides literally about 80
different file formats for customers.  I believe PG needs to remain agnostic
towards the customers' choice of machine if PG wants to retain the customer,
which means that PG needs to understand how the differing classes of
machines actually work, and what their constraints are.

 

Getting authors, titles, and sort orders "correct" IS pretty basic.  Not
easy, but basic.



 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/f83725b8/attachment.html>

From Bowerbird at aol.com  Mon Sep 21 10:26:14 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 13:26:14 EDT
Subject: [gutvol-d] the wings of the dove -- 002
Message-ID: <d3a.4f385cf4.37e91136@aol.com>

as i said, one of the first steps in doing our
little mash-up is to sync up the paragraphs.

in doing so, i found a half-dozen mistakes
that jim had made in the paragraphing...

for others who want to verify these errors,
i suggest you look directly at jim's #29452:
>    http://www.gutenberg.org/files/29452/29452-8.txt

in addition, i'll give you the u.r.l. to see the
actual scans for each page up on my site...

***

in 3 cases, jim missed a paragraph break:

>    Her response, when it came, was cold but
>    http://z-m-l.com/go/wotdjp026.html 

>    She put it as to his caring to know
>    http://z-m-l.com/go/wotdjp263.html 

>    There was a finer
>    http://z-m-l.com/go/wotdjp317.html 

***

in another 3 cases, jim incorrectly broke 
an existing paragraph into two paragraphs.

>    This was, fortunately for her
>    http://z-m-l.com/go/wotdjp123.html 

>    What queerer consequence
>    http://z-m-l.com/go/wotdjp181.html 

>    It just faintly rankled in her
>    http://z-m-l.com/go/wotdjp201.html 

***

6 paragraphing mistakes is not bad performance.
with a book containing some 330 pages, like this,
i would say that it's probably about an average job.

***

besides, my point is never to say "gotcha! errors!"

as super-proofer jose menendez has proven,
i make my fair share of book-digitizing errors.

so that's not the point.

there are several big issues that _are_ the point:

1.   comparing digitizations is a great way to
pinpoint errors so that they can be corrected.
i have made this point in repeated examples.

2.   most of the books in the library have errors.
even the best ones, which were done recently...
if you're convinced there are no errors there,
you just don't know how to find them, and i
strongly suggest you return to the first point.

3.   most of the p.g. e-texts will be used only to
proof scan-sets that retain the book's structure,
and then the p.g. e-text will simply be discarded,
since it doesn't contain that important structure.

4.   the p.g. plain-text format has a lot of power
and beauty inside it, if it's merely extended a bit,
which is precisely what i did when i created z.m.l.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/be615fc5/attachment-0001.html>

From jimad at msn.com  Mon Sep 21 10:30:20 2009
From: jimad at msn.com (James Adcock)
Date: Mon, 21 Sep 2009 10:30:20 -0700
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
In-Reply-To: <cb8.5613a3d2.37e5e6a0@aol.com>
References: <cb8.5613a3d2.37e5e6a0@aol.com>
Message-ID: <SNT120-DS6EB37A853D5991FA9DFC1AEDD0@phx.gbl>

>by using a program on your own personal computer.



You assume the reader has a personal computer.  Some do not.  More
importantly, many do not at that point in time when they decide they want to
choose a new book to read, such as while sitting at an airport waiting for
the plane to take off.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/a5ad90ef/attachment.html>

From richfield at telkomsa.net  Mon Sep 21 08:39:40 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Mon, 21 Sep 2009 17:39:40 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>
References: <c8d.50ac845f.37e7fe18@aol.com>
	<30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>
Message-ID: <4AB79E3C.7000306@telkomsa.net>

Hi Keith
>
>
> With the development of a structured databese. Which means
> that we will have to comprise, that is cover the basic cases and
> in certain cases hand edit the fields involved. These special cases
> will be harder to find, but there will be a set of rules which will
> help us look for them. To make things easier we could use cross-
> references as in library catalogues. 
>
> There is no magic bullet. As aexample take look at iTunes.
> It has field for sorting Artist. they use a db and for my own
> CDs the information is gotten from a diferent DB. I have my own
> notion how things should be sorted. So I edit the "sort for Artist" field.
> The only problem here is that for classical music sorting/ indexing by 
> Artist is not viable. I prefer to use the Komposer field. So I have to 
> use a different index. 

I take your point, but I reckon that with a bit of definition of 
canonical fields and formats one should be able to clean the lot up with 
the exception of cases where previous manual record entry had violated 
sensible rules. Most of the problems could be cleaned up automatically, 
and only the horrible examples (basically errors) need get special 
manual treatment. Trying to construct special rules for your data base 
to negotiate, would fall foul of the ingenuity of fools.

Whether you really need a "formal data base" or not is an open question. 
Some direct access to properly sorted and indexed files can be 
startlingly effective.

Jon


From richfield at telkomsa.net  Mon Sep 21 08:29:48 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Mon, 21 Sep 2009 17:29:48 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <c8d.50ac845f.37e7fe18@aol.com>
References: <c8d.50ac845f.37e7fe18@aol.com>
Message-ID: <4AB79BEC.4010606@telkomsa.net>

Really BB! You of all people! You can do better than that! I had assumed
that you were IT-savvy. What you say suggests that you may be a DB user,
but you sure as Sherridan don't talk like a DB designer, much less a
systems designer. Capitalised or not?  Weeeellll... maybe if the
distinction is built into your software and hard to leave out.  Have you
considered what difference it makes to the mechanics of sorting,
classification,  or access?  Whether you see it happen or not? For
little toy kilo-record files it might be trivial, but we don't all work
on those all the time.
How anything is sorted? Oh boy...
BB, in a certain large corporation which here shall be nameless, I got
lumbered with a job of indexing the world-wide email and phone list
after some other people repeatedly failed to do it. (Their software 
tools kept dying when fed the full files.) I wrote the application from
scratch with no pain in an unfamiliar language in a few days, partly 
because I saw to it that a temp got hired to re-format all the names 
canonically. A year or two later Global HQ decreed a new, 
commercial-DB-based (Again no names of which large corporation's DB 
package it was based on!) package, and so we used that instead.
Except that the savvy seniors clandestinely loaded and retained my
version for years afterward because it was easier to use, more often
successful in searching, and faster than the off-the-shelf even when 
there was a first-time hit.
Canonically formatted files are VERY efficiently handleable.
But you knew that BB, didn't you?
How about this?
A certain file-checking job involved cross-checking two files against
each other. (Again, never mind which international corporation's files
those were!)  The job had been manual, but rapidly became infeasible as
the files grew. Someone wrote a quick-and-dirty to help, but it took a
week to run (5-day week, but still!) and only partly did the job.
Someone (maybe the same guy; I don't remember) did the job better, and
it ran in a day, still partly successfully. Someone else did a totally
different job and it ran in a couple of hours, almost successfully, but
it didn't work. Then to get me out of someone's hair I got the job. I
began by reformatting the input file every run. Stupid, but whoever
expected anything else. Run time, including the sort (Which I also had
to write myself) and selection match pass: 49 seconds. Several orders of
magnitude improvement in performance plus perfect results.
And best of all, it didn't take a lot of sexy programming, just
competent design. I probably cold have halved the times for both jobs if
I had written in low level code, but it wasn't really necessary.

Now BB, I reckon that when proper attention changes a job from not worth
running, to so trivial that at first the user thinks that the job hadn't
run, it is not a "minute detail, which makes very little difference at
all", but a very important detail, which makes enough difference to get
management respect -- till the next toughie comes along!
You see BB, 'who cares how "van holst" is sorted? --a search for "holst"
is gonna find it no matter what you do' is exactly the sort of detail
that made the difference in the real life cases.

Would you believe, BB, that I could go on for some time in this vain
vein? My Gordian (Note the Caps BB!) gnot was nicely productive once I
kut it with proper knit-picking design (as in untangling rather than
depediculotic activity). Not a louse-egg of "full paralysis" in sight,
or in anyone's hair!

It is not a matter of bottom-up vs top-down; it is knowing when and why
which is appropriate.

Cheers,

Jon

> and we see yet another excellent example of how
> the "metadata" b.s. is such an unproductive path.
>
> the o.c.d. people love to focus on these minute
> details, which make very little difference at all
> -- who cares how "van holst" is sorted?, or if the
> "van" is capitalized or not?, or indeed whether
> it is "capitalised" or not?, because a search for
> "holst" is gonna find it no matter what you do --
> and, as if this insignificance wasn't bad enough,
> such compulsiveness usually causes full paralysis.
>
> you can tie yourself up worrying about that crap...
> or you can cut the gordian knot and be productive.
>
> -bowerbird
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d




From richfield at telkomsa.net  Mon Sep 21 08:45:24 2009
From: richfield at telkomsa.net (Jon Richfield)
Date: Mon, 21 Sep 2009 17:45:24 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <2aef272c5d38121baf7069b06bc76554@xs4all.nl>
References: <c8d.50ac845f.37e7fe18@aol.com>	<30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>
	<2aef272c5d38121baf7069b06bc76554@xs4all.nl>
Message-ID: <4AB79F94.3060507@telkomsa.net>

Hi Walter

Generally I agree, though I don't think that most of the extant search 
algorithms are so sophisticated. Most packages use brute force, relying 
on fast hardware. "Throwing silicon at the problem."

It works to a point, but in data sets that grow far enough to run into 
exponential problems (even large quadratic problems ftm) a decent design 
relying on an appropriate algorithm can do nice things for nice people.

CU

Jon
>
> Not quite. If I am looking for a book written by a particular author, I
> want to be able to search for his or her name and not for all books about
> that particular author. Therefore metadata has a, albeit in this era of
> sophisticated search algorithms, somewhat reduced, purpose.
>
> And to that particular bird that is usually relegated to my spambox: I
> really do care whether the 'van' part in my family names is capitalised or
> not. I'm rather proud of it and do not need beastly pseudonyms to cower
> behind.
>
> Regards,
>
>  Walter
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>
>
>   



From marcello at perathoner.de  Mon Sep 21 10:40:51 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Mon, 21 Sep 2009 19:40:51 +0200
Subject: [gutvol-d] Once every 7 years a post of monumental stupidity comes
	along ...
In-Reply-To: <EE5C6FBB-3CB2-4486-90A1-8F16F61898A9@uni-trier.de>
References: <4AB6F577.2030105@sbcglobal.net>
	<EE5C6FBB-3CB2-4486-90A1-8F16F61898A9@uni-trier.de>
Message-ID: <4AB7BAA3.5080706@perathoner.de>

... which isn't even funny. Here it is:

Keith J. Schultz wrote:

>> The change I would like is to have spaces taken into account in the 
>> name sort.
>>
>> So we would have something like this:
>>
>> Green, Alice
>> Green, Robert
>> Greenacre, Janet
>> Greenjeans, Mr.
>>
>> instead of like this:
>>
>> Greenacre, Janet
>> Green, Alice
>> Greenjeans, Mr.
>> Green, Robert


>     Duhhhh !! If this is true there are some people
>     that ougth to take a course in 101 programming or db
>     design. It takes about 5 minutes to write the code.


And it took the writer of that post no longer than that to ruin his 
reputation forever.

Bowerbird, meet Keith, Keith, meet Bowerbird.


Obviously the writer's ignorance about modern web serving infrastructure 
is complete. Even a single afternoon class about database programming 
would have taught him enough to keep his mouth shut. The writer of this 
nonsense obviously does not know that:

  - To sort a dataset locally on a web server, like the writer proposes, 
you have to request the whole dataset from the database server. This 
induces a considerable load on the database server and on the wire.

  - Sorting on the web server is much slower than sorting on the 
database server because the database server uses precomputed tables 
(indexes) which are already sorted, but the web server needs to sort 
from scratch.

So instead of asking the database server to:

   give me 100 authors sorted by name starting at offset 4500

which the server could almost instantly satisfy out of the pre-sorted 
index tables, you have to ask the server to

   give me all authors

which are 12800 at present.


Instead of reading 100 rows from the disk and passing them over the wire 
to the web server, you'll end up reading from the disk 12800 rows and 
transmitting them. Already a factor of 128 times slower.

Then comes the gratuitous sorting of 12800 rows on the web server. After 
which sort we throw away 12700 rows and present the user with the 100 
rows she requested.


But the ignorance of the writer is not only colossal regarding present 
day database systems, it becomes even more surrealistic when the writer 
tries to apply himself to programming.

The writer wastes 30 lines of code to re-implement a function that every 
programming language carries out-of-the-box. That alone would have 
sufficed to demonstrate that the writer's notions about programming are 
extremely vague at best.

We will furthermore see that the writer used pseudo-code not only to 
hide his ignorance of any actual programming language, but also to avoid 
having to test his absurd concoction, which test would have immediately 
revealed its uttermost bullshittiness even to himself.

The absurd proposal of the writer runs thus (feel free to skip to the 
beef, the irksomeness of this code is just good enough for a smile):


>     IsEntrySmallertThan(X, Y) :-
>         Pos := 0;
>         If (Length(X) < Length (Y))
>             then MaxPos = Length(X) -1;
> 
>             else MaxPos = Length(Y) - 1;
>         end if
> 
>         While  ((IsSmaller := CharSmaller(X[Pos], Y[Pos]) == 0) 
 >                 and Pos != MaxPos )
>             Pos := Pos +1;
>         end While
 >
 >         return IsSmaller;
 >      end EntrySmallerThan
 >
 >     CharAtSmaller(X, Y) :-
 >         If (Cardinal(X) <  Cardinal(Y) )
 >             return 1
 >             else
 >                 If Cardinal(X) > Cardinal(Y)
 >                     then return -1;
 >                     else return 0;
 >                 end if
 >         end if
 >     end CahrAtSmaller
 >
 >     Cardinal(X) :-
 >         If (X in set of standard Chars)
 >             then return X
 >             else return -1;
 >         end if
 >     end Cardinal
 >
 >     Put this ipseudo code what language you want and voila.
 > Cardinal can be made as
 >     complex as you want if you needed finer distinctions.
 >
 >     regards
 >         Keith.


For the sake of playing let us call:

   IsEntrySmallertThan ('a', 'ab').

MaxPos would then be set to 0.

The While loop will call CharSmaller, which does not exist, because the 
function is called CharAtSmaller. First Bug.

CharAtSmaller would then return 0 because it compares 'a' to 'a', which 
two are equal. The iteration will then stop because Pos == MaxPos == 0. 
The function will then return.

Conclusion:

   IsEntrySmallertThan ('a', 'ab') returns 0

According to this guy's wisdom, 'a' is not `smallert? than 'ab'.

QED

Moreover: this code would dump core on you the moment you call it with 
an empty string, Cardinal (X) returns X or -1 so you'll end up comparing 
characters with -1, which will not work on machines with unsigned 
characters ... and so on.


Throwing even one line of code over the wall without testing it, is the 
hallmark of the utter clueless beginner. Even people less full of 
themselves fall for it sometimes.




-- 
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com  Mon Sep 21 10:43:03 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 13:43:03 EDT
Subject: [gutvol-d] Re: Name lists and Big-endianism
Message-ID: <cc6.58e73415.37e91527@aol.com>

look, jim, you raised some important issues....
which i am willing to talk about.

and you raised some unimportant "issues",
which i am not all that eager to talk about.

so i should just tell you "no" right now, but
i'm gonna say it one more time, just for you.


>   You make great big assumptions about the 
>    nature of the machines that people are reading on

i assume you can search the "metadata", yes.

(and if you cannot, you need to take that up
with someone else, because that is a basic.)

so if you want to find "sun tzu", you'd search
for "sun tzu", and if that didn't work, then
you'd search for "sun" and "tzu" separately...

so it wouldn't matter where in the sort order
that this record fell, because you could find it.

same with marquis de sade and walter van holst
and any other name you want to come up with...

if you want to read more on this general idea,
i would suggest "everything is miscellaneous".


>    But while in there you decide to 
>    look at the fiction stacks just for fun 
>    to see if they have your favorite author.

you're still carrying around a physical mindset -- 
one which has always been riddled with problems 
-- when the world has moved to an electronic one.

which is why i won't bother to discuss this any more.

but those missing italics of yours?   i'll discuss those.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/80b4151e/attachment-0001.html>

From Bowerbird at aol.com  Mon Sep 21 10:49:07 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 13:49:07 EDT
Subject: [gutvol-d] Re: In search of a more-vanilla vanilla TXT
Message-ID: <be1.6541da55.37e91693@aol.com>

jim said:
>    You assume the reader has a personal computer.? 

yes i do.

and i consider the iphone to be a computer.   
(it has a chip inside it, you know.)

even the kindle has a computer chip inside it.
it's just too bad that you can't program for it.
of course, even if amazon keeps castrating
the kindle, it's entirely possible they would
put a reader-program on the thing which
was capable of rending z.m.l. beautifully...


>    Some do not.

if you don't have a computer, then i simply
can't tell you how you would use an e-book.


>    More importantly, many 
>    do not at that point in time when they decide 
>    they want to choose a new book to read, 
>    such as while sitting at an airport 
>    waiting for the plane to take off.

paper books remain delightful, in my eyes...

***

jim, you seem to want to argue about all these
unimportant things...   what about those italics?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/31c625b8/attachment.html>

From Bowerbird at aol.com  Mon Sep 21 10:56:32 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 13:56:32 EDT
Subject: [gutvol-d] Re: Name lists and Big-endianism
Message-ID: <c7d.55782769.37e91850@aol.com>

jon said:
>   Canonically formatted files are VERY efficiently handleable.

i know that.


>    But you knew that BB, didn't you?

yes i did.

what i do _not_ know is this:   who is going to hire the temp
whose job it will be to format the p.g. metadata canonically?

will it be _you_, jon?          ;+)

-bowerbird

p.s.   again, the book is called "everything is miscellaneous"...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/e2e949c2/attachment.html>

From jimad at msn.com  Mon Sep 21 11:47:06 2009
From: jimad at msn.com (James Adcock)
Date: Mon, 21 Sep 2009 11:47:06 -0700
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
In-Reply-To: <d45.52e15a49.37e5f9a7@aol.com>
References: <d45.52e15a49.37e5f9a7@aol.com>
Message-ID: <SNT120-DS2254E89C3E1A5A3A68136DAEDD0@phx.gbl>

I assure you Bowerbird, that contrary to your comments I did not
"deliberately" disfigure the text file, and I would appreciate it if you
retract your comments. In any case the "formatters" you refer to would be
myself. An army of one.

 

Also, I do not ever rewrap books of my own volition but only as required in
order to be accepted for submission by PG. What you see posted by PG is not
necessarily the same thing as I would choose to submit to PG, [nor
identically that which I did in fact submit to PG] which in my case would
probably at this point in time be an HTML, although I can imagine at some
point in time with good tools TEI might be more interesting to me.  If you
are unhappy with HTML as an input submission format then I recommend writing
a simple parser for HTML that changes the HTML choice of tags to the tags
you prefer. If you wrote such a parser I suspect you could contribute it to
PG where it would represent a positive contribution to the many volunteers
like myself who would prefer to be submitting in HTML format in the first
place.  In practice HTML encodes most of what I as a volunteer would choose
to spend my time and energy transcribing, but I wish it had a little more
power, such as the ability to unambiguously encode authorfirstname,
authorlastname, chapter divisions, etc. 

 

What I do do for PG represents considerable sacrifice to myself and my
family, as I am sure my wife and children would be only too happy to attest.
If you think you have something positive to contribute to PG, please do so.
Abusing me for my choice of which sacrifices I am willing to make, or not
willing to make, does not represent a contribution to PG, nor does it
encourage my continuing contributions to PG.

 

The EPUB was not generated by me nor do I have any great knowledge of the
EPUB format.  I assume that some other volunteer at PG has written a tool to
automatically generate EPUB from HTML and that volunteer did so with some
choice of margins you do not prefer, or which doesn't work well with your
choice of machine. I don't know how to fix this problem, but it does point
out the advantages of TEI which allows the encoding in one document the
various "hints" necessary for attractive rendering of the one TEI input file
into various output rendering language targets.

 

I also did not generate the MOBI, but I use MOBI files all the time with my
favorite reader machine.  The MOBI that some volunteer at PG, not me, has
generated, looks beautiful on my choice of machine, which also allows me to
change the size of the font and the margins to my liking, which tends to
depend on the time of day - by midnight my eyes get tired and then I tend to
like a larger font and smaller margins. Which is why I like reflow formats
and reader machines - they allow me to easily "fix" many of the day-to-day
"poor choices" that some one else has made which would otherwise get in the
way of MY being able to enjoy the book the way *I* want.  Presumably this
other volunteer DID generate the MOBI file in a way that looked attractive
to him or her on his or her choice of machines, which needn't be identical
to my preferences - especially since my preferences tend to change with the
time of day!

 

My machine also works well with PDF files except I can't fix issues like
when the person or process generating the PDF uses a "poor" choice of font,
or poor choice of margins when read on my machine.  I can sometimes work
around these problems by holding my machine in landscape mode,  and
displaying only half a page of PDF at a time, but it tends to be awkward and
painful to hold the machine sideways for a length of time, and PDF often
doesn't like to be read a half a page at a time - since it is a page layout
language, not a half page layout language. Which is why I tend to prefer
reflow formats like MOBI or HTML over PDF.

 

However, at the very least the acidity of Bowerbirds remarks reaffirms my
contention that PG needs to allow volunteers like myself to submit files in
the volunteer's choice of file formats, NOT Bowerbirds.  In which case I
could have offered PG my efforts in one file format, and PG could have
chosen to accept or reject that offering.  If PG chose to accept that
offering then hopefully neither Bowerbird nor any other volunteer would
abuse me of my efforts which PG has then already acknowledged.  Rather, that
volunteer would (hopefully) acknowledge that PG had already accepted my
contribution, and in turn if they felt they could make further positive
contributions to this book, or any other book, in that file format or in any
other file format, then they would be free to do so.  Unfortunately, there
is not a universal sense within the PG community as to what does or does not
represent a positive contribution, which in turn leads to that unhappy state
of affairs to which Bowerbird is only too aptly demonstrating today.

 

Again I ask consideration that PG seriously consider allowing volunteers to
be able to submit books using only ONE file format if they choose to do so,
not requiring multiple file formats since that leads to that unhappy state
of affairs that Bowerbird is today only too well demonstrating.

 

Better yet, pick YOUR OWN book to transcribe and contribute to PG, rather
than abusing ME of MY efforts on MY choice of books!

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/055d055d/attachment.html>

From lee at novomail.net  Mon Sep 21 12:13:06 2009
From: lee at novomail.net (Lee Passey)
Date: Mon, 21 Sep 2009 13:13:06 -0600
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
 for eliciting beauty (and more)
In-Reply-To: <SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
Message-ID: <4AB7D042.5010407@novomail.net>

James Adcock wrote:

[snip]

> What I don't understand is why PG continues to be wedded to plain-text as an
> *input* encoding format demanded of people submitting texts to PG.
> Plain-text is too constrained to do the job well.  

I find that you are generally correct in everything you have said to 
date. But the reality is that PG <em>does</em> continue to be wedded to 
plain (impoverished) text. This topic has come up regularly over the 
years, and in every case has ended without any improvement to PG. While 
I hesitate to say that your advocacy is futile, your advocacy is futile.

> HTML is too ambiguous,
> and too ill-matched to books to do well.  We need something else, something
> that CAN be correctly and automagically converted "correctly" to one or
> another formats including plain-text, and Unicode, and HTML, and  mobi, etc.

HTML, <i class="foreign>per se</i>, is indeed too ambiguous, although I 
have successfully developed a fairly complete set of standard usages and 
class definitions (encapsulated in a CSS file) that allow me to do 
lossless translation back and forth between HTML and TEI. For PG to 
adopt such a scheme, however, would require that PG adopt a set of 
standards, and Mr. Hart has been adamant that PG will <em>never</em> 
adopt <em>any</em> standard, fearing that it may alienate or intimidate 
some speculative volunteer that would otherwise contribute an 
as-yet-unarchived impoverished text file. (Obviously, the implicit 
standards developed and enforced by the Council Of Whitewashers cannot 
be considered as <i class="socalled">true</i> standards.)

I have concluded that Project Gutenberg is impervious to improvement. 
While Bowerbird rejects the notion, I am not afraid to say that for what 
you are attempting to do Project Gutenberg may not be the correct 
archive. I would suggest, rather, perfecting your HTML file, uploading 
it to the Internet Archive (http://www.archive.org/create/) and then 
posting a message here indicating where it can be found if any other 
volunteer wants to create a degraded version of your master copy.

From prosfilaes at gmail.com  Mon Sep 21 12:27:42 2009
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 21 Sep 2009 15:27:42 -0400
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
In-Reply-To: <SNT120-DS2254E89C3E1A5A3A68136DAEDD0@phx.gbl>
References: <d45.52e15a49.37e5f9a7@aol.com>
	<SNT120-DS2254E89C3E1A5A3A68136DAEDD0@phx.gbl>
Message-ID: <6d99d1fd0909211227q765dd187n77dc76cbf85d9dd6@mail.gmail.com>

On Mon, Sep 21, 2009 at 2:47 PM, James Adcock <jimad at msn.com> wrote:
> If you think you have something positive to contribute to PG, please do so.
> Abusing me for my choice of which sacrifices I am willing to make, or not
> willing to make, does not represent a contribution to PG, nor does it
> encourage my continuing contributions to PG.

Which is why I've killfiled Bowerbird, and I believe that PG should
permanently eject him for their mailing lists.

> However, at the very least the acidity of Bowerbirds remarks reaffirms my
> contention that PG needs to allow volunteers like myself to submit files in
> the volunteer?s choice of file formats, NOT Bowerbirds.

That's not a rational argument. Whatever the base file formats are,
Project Gutenberg, like most archives, needs to pick one or a small
set of them so that the people who use Project Gutenberg can know what
they need to read the files. A PG text reader can't be demanded to
understand any file that anyone cares to use, and nobody can be
expected to understand Word 95 files, and similar garbage that infects
indiscriminate archives.

?--
Kie ekzistas vivo, ekzistas espero.

From Bowerbird at aol.com  Mon Sep 21 14:03:42 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 21 Sep 2009 17:03:42 EDT
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
Message-ID: <bd9.3f425e97.37e9442e@aol.com>

jim said:
>    I assure you Bowerbird, that contrary to your comments 
>    I did not ?deliberately? disfigure the text file, and 
>    I would appreciate it if you retract your comments. 

how come the .txt file is missing the italics which are
right there, big as day, in the .html version of the file?

did that happen accidentally?

did someone else cause them to disappear?

if i would have said that you did it "on purpose",
would that have made it better?   i did say that
it was a rather "dramatic" subject-line, but still,
since you're the person who submitted the work,
i'm not sure how else we can explain the fact that
the .html file has italics, but the .txt file does not.

and you knew full well that the .txt version was
missing the italics information, because that was
the impetus that led you here to complain that
the .txt format was inferior.   it was, in your case,
but _only_ because you deliberately made it so...

(or, if you prefer, tell me what word you want me
to use instead of "deliberately", if that's inaccurate.)


>    In any case the ?formatters? you refer to would be myself. 
>    An army of one.

so you threw away your own work.   i guess that
that doesn't carry the same moral baggage that
throwing away the work of other people might...

still, the fact remains that some of the formatting
which you included in the .html version of the book
was _not_ included in the .txt version, so people who
use the .txt version have been deprived of some utility,
which does indeed carry some moral baggage, sadly...

so i don't think you can excuse the fact that you've
thrown away utility, just because you did the work of 
providing that utility in the .html version of the book.


>    Also, I do not ever rewrap books of my own volition but 
>    only as required in order to be accepted for submission

i didn't complain about your rewrapping of this book.


>    If you are unhappy with HTML as an input submission 
>    format then I recommend writing a simple parser for HTML 
>    that changes the HTML choice of tags to the tags you prefer.

don't try to make this about _me_.   or about the _.html_.

this thread started because _you_ came here to complain
about the _.txt_ format, claiming that it was substandard.

it's not.   at least not unless it is deliberately sabotaged...

(or, if you prefer, tell me what word you want me
to use instead of "deliberately", if that's inaccurate.)


>    If you wrote such a parser I suspect you could contribute 
>    it to PG where it would represent a positive contribution 
>    to the many volunteers like myself who would prefer to be 
>    submitting in HTML format in the first place.? 

well, it's easy enough to create a .txt file from an .html file
-- you simply copy the text out of the browser's window...

you'll need to do clean-up on it, since the browser doesn't
copy fully-formatted text to the clipboard in most cases...

however, you can minimize the work needed by using safari
-- which retains the text-styling info -- or internet explorer.

it also helps if you colorize blocks, since that will make it
easier to reintroduce the indentation lost on those blocks.

but really, you're doing it ass-backwards if you're trying to
get a .txt file out of an .html file.   instead, you should be
formatting your .txt file as z.m.l., so you can auto-generate
an .html file out of the z.m.l. file -- much less work that way.

all the work you spend doing .html is just a waste of energy...
furthermore, everyone's .html is different, there's no way that
future maintainers of the p.g. library will be able to update it;
so they will scrap the .html and use the .txt files as their base.
you'd be ahead of the game if you adopted that approach now.


>    In practice HTML encodes most of what I as a volunteer 
>    would choose to spend my time and energy transcribing

well, jim, in your "the wings of the dove" book, there was
very little to encode in the first place, so it hardly matters.


>    but I wish it had a little more power, such as the ability to 
>    unambiguously encode authorfirstname, authorlastname, 
>    chapter divisions, etc.

well, i'm not gonna indulge any more o.c.d. on author-names.

but as far as "chapter divisions", that's easy to do in .html...
indeed, the default understanding of e-book .html markup
is that the title and chapter-headers are tagged with "h#"...

in z.m.l., the title is assumed to be the file's first paragraph.

headers below that are marked with 4 or more empty lines
preceding them, and 2 empty lines following them, so that
gives you an unambiguous outline of the book's structure...

indeed, in "the wings of the dove", there is a 2-level outline,
with "book" as the first level and "chapter" as the next level...
so you will notice that i indicated that by having 8 empty lines
above "book" headers, 5 empty lines above "chapter" headers.


>    If you think you have something positive to contribute to PG, 
>    please do so.? 

well, jim, i _do_ think i have "something positive to contribute".

and i _am_ contributing it, right now...   you're soaking in it...

i'm showing _you_ -- and anyone else who wants to read it --
how you could be making yourself much more efficient, _and_
how you could create more beautiful and powerful e-books too.


>    Abusing me for my choice of which sacrifices I am willing 
>    to make, or not willing to make, does not represent a 
>    contribution to PG, nor does it encourage my continuing 
>    contributions to PG.

back off.   i'm not "abusing" you at all.   i'm pointing out how
the choices you've made have resulted in an inferior product.

surely you're not going to try and argue that the .txt file that
lacks its italics is an acceptable digitization, are you?   really?

and surely you're not suggesting that i simply _ignore_ that?
are you?   if you can't take the balmy breeze, get off the patio!


>    The EPUB was not generated by me

and i don't hold you responsible for the .epub.


>    it does point out the advantages of TEI which allows the 
>    encoding in one document the various ?hints? necessary 
>    for attractive rendering of the one TEI input file into 
>    various output rendering language targets.

except that's _not_ a "benefit" of .tei in particular, jim.
it's a benefit of _any_ "master" format, including .zml.


>    I also did not generate the MOBI

and i don't hold you responsible for the .mobi.

indeed, since mobipocket has never supported the mac,
i have absolutely no interest in that format, thank you...


>    I use MOBI files all the time with my favorite reader machine.

good.   i'm glad you like it.

my z.m.l. workflow calls for output to .html, which can then be
converted easily to .mobi, so i have that base covered well enough.


>    by midnight my eyes get tired and then I tend to like 
>    a larger font and smaller margins. Which is why I like 
>    reflow formats and reader machines ? they allow me to 
>    easily ?fix? many of the day-to-day ?poor choices? that
>    some one else has made which would otherwise get in 
>    the way of MY being able to enjoy the book the way *I* want.

yes, that's the good things about reflowable formats,
which is why we like reflowable formats best of all...


>    My machine also works well with PDF files except I can?t 
>    fix issues like when the person or process generating the PDF 
>    uses a ?poor? choice of font, or poor choice of margins

that's the problem with a nonreflowable format like .pdf...

of course, if you have the _master_ file, such as a .zml file,
and you can customize the .pdf to your _own_ preferences,
then the .pdf you generate will be _exactly_ to your liking...
(of course, that won't help your time-of-day considerations.)


>    I can sometimes work around these problems by 
>    holding my machine in landscape mode,?and 
>    displaying only half a page of PDF at a time, but 
>    it tends to be awkward and painful to hold the machine 
>    sideways for a length of time, and PDF often doesn?t 
>    like to be read a half a page at a time ? since it is 
>    a page layout language, not a half page layout language. 

i can't do much for the uncomfortable sideways position...

but i can tell you that, if you're generating the .pdf yourself,
from a .zml master, then you could make the pagesize _fit_
the landscape display, so you were reading _full_ pages on it,
and not _half_ pages.   just another benefit of customized .pdf.


>    Which is why I tend to prefer reflow formats like MOBI or HTML

right. 


>    However, at the very least the acidity of Bowerbirds remarks

oh please jim.   does everyone coddle your precious identity?
aren't you used to anyone being frank with you in the slightest?
i haven't called you any names, or cast any aspersions on you...
even my claim that you had "deliberately sabotaged" the .txt file
was something that i myself said was "dramatic", even if accurate,
and it was a description of your _behavior_, not your _personality_.


>    reaffirms my contention that PG needs to allow volunteers like 
>    myself to submit files in the volunteer?s choice of file formats, 
>    NOT Bowerbirds.

i'm not trying to tell p.g. what to do.   and neither should you, jim...


>    In which case I could have offered PG my efforts in one file format, 
>    and PG could have chosen to accept or reject that offering.? 

you can do that now.   they will choose not to accept it.   live with it.


>    If PG chose to accept that offering then hopefully neither Bowerbird 
>    nor any other volunteer would abuse me of my efforts 
>    which PG has then already acknowledged.? 

look, if you omit the italics from a book, i'm gonna call you on it...

(and if you choose to see that as "abuse", then that's your problem.)

i don't give a flying burrito if p.g. has "acknowledged" your work
or not; if you left out the italics from the book, i'll call you on it...


>    Rather, that volunteer would (hopefully) acknowledge that PG 
>    had already accepted my contribution, and in turn if they felt 
>    they could make further positive contributions to this book, 
>    or any other book, in that file format or in any other file format, 
>    then they would be free to do so.

you're registered over at distributed proofreaders, jim, so why
don't you see if anyone over there will collaborate with you and
do the parts of the job that you don't want to do?   that would be
far more effective than coming here and asking p.g. to accept half
of what they want, because you just don't wanna do the other half.


>    Unfortunately, there is not a universal sense within the PG community 
>    as to what does or does not represent a positive contribution, which 
>    in turn leads to that unhappy state of affairs to which Bowerbird 
>    is only too aptly demonstrating today.

what is this "unhappy state of affairs" to which you make reference?

i'm unhappy because you left out the italics on a book you digitized,
and p.g. accepted it anyway, probably because they didn't notice it,
probably because they didn't think anyone would do something so
stupid as to put the italics in the .html file and not in the .txt file...


>    Again I ask consideration that PG seriously consider allowing 
>    volunteers to be able to submit books using only ONE file format 

you _can_ submit _one_ file format.   they'll accept a .txt file alone.
no need for the .html file, or for any other format, for that matter.


>    if they choose to do so, not requiring multiple file formats since 
>    that leads to that unhappy state of affairs that Bowerbird is today 
>    only too well demonstrating.

not only are you spouting nonsense about "unhappy state of affairs",
you're _repeating_ it, and in the very next paragraph no less!   weird!


>    Better yet, pick YOUR OWN book to transcribe and contribute to PG, 
>    rather than abusing ME of MY efforts on MY choice of books!

you can repeat that "abuse" line all you want, jim, but as long as
your book is missing those italics, you are the one in the wrong.

but hey, no problem, i'm gonna fix your work -- correct your flaws --
and submit a _corrected_ version of your .txt file, with all the italics...

but don't expect me to fix _all_ of your books!

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/1b6a650a/attachment.html>

From joshua at hutchinson.net  Mon Sep 21 14:40:46 2009
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Mon, 21 Sep 2009 21:40:46 +0000 (GMT)
Subject: [gutvol-d] Re: PG French text file #1500
Message-ID: <39275292.87509.1253569246873.JavaMail.mail@webmail11>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090921/9a8a3736/attachment-0001.html>

From lee at novomail.net  Mon Sep 21 15:15:21 2009
From: lee at novomail.net (Lee Passey)
Date: Mon, 21 Sep 2009 16:15:21 -0600
Subject: [gutvol-d] Re: PG French text file #1500
In-Reply-To: <39275292.87509.1253569246873.JavaMail.mail@webmail11>
References: <39275292.87509.1253569246873.JavaMail.mail@webmail11>
Message-ID: <4AB7FAF9.8050309@novomail.net>

Joshua Hutchinson wrote:

> Just at a quick glance, it looks like any harvesters would need to track down 
> original scans to clear it through PG's normal clearing routines.

If your interested, you could contact Yann Forget, 
http://www.forget-me.net/; He's the one that did the original post to 
wikimedia. As Alexis de Tocqueville died in 1859, I'm guessing the work 
is out of copyright.

From traverso at posso.dm.unipi.it  Mon Sep 21 22:29:46 2009
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Tue, 22 Sep 2009 07:29:46 +0200 (CEST)
Subject: [gutvol-d] Re: PG French text file #1500
In-Reply-To: <39275292.87509.1253569246873.JavaMail.mail@webmail11> (message
	from Joshua Hutchinson on Mon, 21 Sep 2009 21:40:46 +0000 (GMT))
References: <39275292.87509.1253569246873.JavaMail.mail@webmail11>
Message-ID: <20090922052946.BF28C10138@cardano.dm.unipi.it>


Currently Tocqueville is in proof at DP, in 4 parts between P2 and
P3. It might be fast-tracked if PG wants it, but dozens of french
projects might come before anyway. 

Carlo





From hart at pobox.com  Tue Sep 22 02:20:15 2009
From: hart at pobox.com (Michael S. Hart)
Date: Tue, 22 Sep 2009 02:20:15 -0700 (PDT)
Subject: [gutvol-d] Re: PG French text file #1500
In-Reply-To: <20090922052946.BF28C10138@cardano.dm.unipi.it>
References: <39275292.87509.1253569246873.JavaMail.mail@webmail11>
	<20090922052946.BF28C10138@cardano.dm.unipi.it>
Message-ID: <alpine.DEB.2.00.0909220218460.29510@mail.pglaf.org>



There are still 14 more French eBooks to go,
so I should hope we can get this one done in
time to be #1500, please give it a go.


Thanks!!!


Michael




On Tue, 22 Sep 2009, Carlo Traverso wrote:

>
> Currently Tocqueville is in proof at DP, in 4 parts between P2 and
> P3. It might be fast-tracked if PG wants it, but dozens of french
> projects might come before anyway.
>
> Carlo
>
>
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
>

From schultzk at uni-trier.de  Wed Sep 23 01:38:59 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 23 Sep 2009 10:38:59 +0200
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <4AB79E3C.7000306@telkomsa.net>
References: <c8d.50ac845f.37e7fe18@aol.com>
	<30C62690-C9E8-4642-B41E-FE8474DEB5FB@uni-trier.de>
	<4AB79E3C.7000306@telkomsa.net>
Message-ID: <9F4CE293-9C76-4C50-BE5F-E5EA319CD048@uni-trier.de>

Hi Jon,


Am 21.09.2009 um 17:39 schrieb Jon Richfield:

> Hi Keith
>>
>
> I take your point, but I reckon that with a bit of definition of  
> canonical fields and formats one should be able to clean the lot up  
> with the exception of cases where previous manual record entry had  
> violated sensible rules. Most of the problems could be cleaned up  
> automatically, and only the horrible examples (basically errors)  
> need get special manual treatment. Trying to construct special rules  
> for your data base to negotiate, would fall foul of the ingenuity of  
> fools.
>
> Whether you really need a "formal data base" or not is an open  
> question. Some direct access to properly sorted and indexed files  
> can be startlingly effective.
	Basically, I was not saying we need a "formal database" or system.  
The fact is information in the files basically constitute
	a database, albeit the information is structured. As I mentioned due  
to restrisction defined for the metadata the desired
	features are not possible in the present form and could be easily  
overcome.

	regards
		Keith.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090923/d73bc935/attachment.html>

From schultzk at uni-trier.de  Wed Sep 23 02:11:34 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 23 Sep 2009 11:11:34 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity
	comes along ...
In-Reply-To: <4AB7BAA3.5080706@perathoner.de>
References: <4AB6F577.2030105@sbcglobal.net>
	<EE5C6FBB-3CB2-4486-90A1-8F16F61898A9@uni-trier.de>
	<4AB7BAA3.5080706@perathoner.de>
Message-ID: <49E5C390-DDAA-4D2C-93E3-963D7A34FCE0@uni-trier.de>

Hi Marcello,

	NOW who is laughing an has more egg in his face.

	 1) A very piss poor DB-server that has to reindex for every request
	 2) An even more piss poor programmer that can NOT do it!
	
	ALL your argumernts are MUTE. It is interesting that companies have  
costumer
	databases, databases for thier employees changing, data requesting  
data and all
	this is based a databases.

	Furthermore, what kind of database system are you using that has to  
role in all that information.

	Last and not least If you could understand have I said you would have  
understood that the
	information retrieved beiing presented to the user can be sorted!!

	You have a complete lack of programming and system engineering.

	I figured you would complain about my pseudo code (you do know what  
that is?)
	I simply just wrote it done without any correction or checking for  
typos. I put in more
	than I wanted to so that the simplest minds could unsterstand the  
basic simplicity
	of doing the task.

	True, I should have use "or" instead of "and" .  I also, admit that I  
did forget the length check.
	But, again it was just to show how easy it can be done. But, I you  
consider i just wrote this down
	whithout any afterthought or thought at it took me less that a minute  
leaving me four minutes to
	clear up the rest.

	I could have been abstract about it and left the code out. Besides, I  
would for myself make it far more
	elaborate to account for encodings and different languages.

	
Am 21.09.2009 um 19:40 schrieb Marcello Perathoner:

> ... which isn't even funny. Here it is:
>
> Keith J. Schultz wrote:
>
>>> The change I would like is to have spaces taken into account in  
>>> the name sort.
>>>
>>> So we would have something like this:
>>>
>>> Green, Alice
>>> Green, Robert
>>> Greenacre, Janet
>>> Greenjeans, Mr.
>>>
>>> instead of like this:
>>>
>>> Greenacre, Janet
>>> Green, Alice
>>> Greenjeans, Mr.
>>> Green, Robert
>
>
>>    Duhhhh !! If this is true there are some people
>>    that ougth to take a course in 101 programming or db
>>    design. It takes about 5 minutes to write the code.
>
>
> And it took the writer of that post no longer than that to ruin his  
> reputation forever.
>
> Bowerbird, meet Keith, Keith, meet Bowerbird.
>
>
> Obviously the writer's ignorance about modern web serving  
> infrastructure is complete. Even a single afternoon class about  
> database programming would have taught him enough to keep his mouth  
> shut. The writer of this nonsense obviously does not know that:
>
> - To sort a dataset locally on a web server, like the writer  
> proposes, you have to request the whole dataset from the database  
> server. This induces a considerable load on the database server and  
> on the wire.
>
> - Sorting on the web server is much slower than sorting on the  
> database server because the database server uses precomputed tables  
> (indexes) which are already sorted, but the web server needs to sort  
> from scratch.
>
> So instead of asking the database server to:
>
>  give me 100 authors sorted by name starting at offset 4500
>
> which the server could almost instantly satisfy out of the pre- 
> sorted index tables, you have to ask the server to
>
>  give me all authors
>
> which are 12800 at present.
>
>
> Instead of reading 100 rows from the disk and passing them over the  
> wire to the web server, you'll end up reading from the disk 12800  
> rows and transmitting them. Already a factor of 128 times slower.
>
> Then comes the gratuitous sorting of 12800 rows on the web server.  
> After which sort we throw away 12700 rows and present the user with  
> the 100 rows she requested.
>
>
> But the ignorance of the writer is not only colossal regarding  
> present day database systems, it becomes even more surrealistic when  
> the writer tries to apply himself to programming.
>
> The writer wastes 30 lines of code to re-implement a function that  
> every programming language carries out-of-the-box. That alone would  
> have sufficed to demonstrate that the writer's notions about  
> programming are extremely vague at best.
>
> We will furthermore see that the writer used pseudo-code not only to  
> hide his ignorance of any actual programming language, but also to  
> avoid having to test his absurd concoction, which test would have  
> immediately revealed its uttermost bullshittiness even to himself.
>
> The absurd proposal of the writer runs thus (feel free to skip to  
> the beef, the irksomeness of this code is just good enough for a  
> smile):
>
>
>>    IsEntrySmallertThan(X, Y) :-
>>        Pos := 0;
>>        If (Length(X) < Length (Y))
>>            then MaxPos = Length(X) -1;
>>            else MaxPos = Length(Y) - 1;
>>        end if
>>        While  ((IsSmaller := CharSmaller(X[Pos], Y[Pos]) == 0)
> >                 and Pos != MaxPos )
>>            Pos := Pos +1;
>>        end While
> >
> >         return IsSmaller;
> >      end EntrySmallerThan
> >
> >     CharAtSmaller(X, Y) :-
> >         If (Cardinal(X) <  Cardinal(Y) )
> >             return 1
> >             else
> >                 If Cardinal(X) > Cardinal(Y)
> >                     then return -1;
> >                     else return 0;
> >                 end if
> >         end if
> >     end CahrAtSmaller
> >
> >     Cardinal(X) :-
> >         If (X in set of standard Chars)
> >             then return X
> >             else return -1;
> >         end if
> >     end Cardinal
> >
> >     Put this ipseudo code what language you want and voila.
> > Cardinal can be made as
> >     complex as you want if you needed finer distinctions.
> >
> >     regards
> >         Keith.
>
>
> For the sake of playing let us call:
>
>  IsEntrySmallertThan ('a', 'ab').
>
> MaxPos would then be set to 0.
>
> The While loop will call CharSmaller, which does not exist, because  
> the function is called CharAtSmaller. First Bug.
>
> CharAtSmaller would then return 0 because it compares 'a' to 'a',  
> which two are equal. The iteration will then stop because Pos ==  
> MaxPos == 0. The function will then return.
>
> Conclusion:
>
>  IsEntrySmallertThan ('a', 'ab') returns 0
>
> According to this guy's wisdom, 'a' is not `smallert? than 'ab'.
>
> QED
>
> Moreover: this code would dump core on you the moment you call it  
> with an empty string, Cardinal (X) returns X or -1 so you'll end up  
> comparing characters with -1, which will not work on machines with  
> unsigned characters ... and so on.
>
>
> Throwing even one line of code over the wall without testing it, is  
> the hallmark of the utter clueless beginner. Even people less full  
> of themselves fall for it sometimes.
>
>
>
>
> -- 
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From Bowerbird at aol.com  Wed Sep 23 02:46:29 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Wed, 23 Sep 2009 05:46:29 EDT
Subject: [gutvol-d] tuesday, september 22, the first day of fall
Message-ID: <bd0.5e5faa63.37eb4875@aol.com>

sometimes you have to timestamp a development...

or two...   or three...

***

new reader-machines are being announced _daily_,
it seems.   today's model is a joint venture between
best buy and verizon.   yes, i know, i know.   how are
best buy and verizon gonna wrangle book-buyers
away from amazon dot com, the web super-retailer,
and the bookstore named after the world-class river?

i have no idea.   and you can bet they have no idea either.

look for two dozen reader-machines to debut in 2009.
look for one dozen of them to be dead by the year-end.

the big winner, certainly, will be adobe and its d.r.m.
the big loser, ironically, will be adobe and its d.r.m.,
because you know adobe ain't gonna be able to pull off
a d.r.m. scheme with dozens of no-experience partners,
and the resultant fiasco will be endlessly entertaining...
plus it's bound to piss off all kinds of paying customers,
and the righteous indignation promises to be amusing...

the kindle, of course, won't be impacted in the slightest
by all this downmarket warfare amongst the ranks, but
second-runner sony might get hit by some of the gunfire.

***

speaking of sony...

saw it the first time on saturday night.
an advertisement for the sony reader,
on broadcast television in prime time.
with justin tumberlank _and_ payton manning.
not to mention the world's fastest speedreader.

saw it again the next time on sunday night.
on network television during the emmy awards.
with justin tomberland _and_ payton manning.
not to mention the world's fastest speedreader.

saw it again the next time on monday night.
during some very big season premiere shows.
with justin timberlake _and_ payton manning.
not to mention the world's fastest speedreader.

these are some serious media buys at high prices,
folks, and a sign that sony isn't just fooling around.
they're fooling around _and_ blowing an ad budget.
but at least we get a sense they think it's important.

ads note the (introductory model) price of $199,
as well as product highlight that the machine can
store hundreds of books.   who woulda thunk that
these machines woulda hit the television waves?

***

in other news, a guy name "brown" sold some books.

***

happy autumnal equinox, folks...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090923/8949c69c/attachment.html>

From joshua at hutchinson.net  Wed Sep 23 05:37:22 2009
From: joshua at hutchinson.net (Joshua Hutchinson)
Date: Wed, 23 Sep 2009 12:37:22 +0000 (GMT)
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity
 comes along ...
Message-ID: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>

An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090923/75575894/attachment-0001.html>

From prosfilaes at gmail.com  Wed Sep 23 10:17:09 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 23 Sep 2009 13:17:09 -0400
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <SNT120-DS162637A5C5691B09E89569AEDD0@phx.gbl>
References: <c8d.50ac845f.37e7fe18@aol.com>
	<SNT120-DS162637A5C5691B09E89569AEDD0@phx.gbl>
Message-ID: <6d99d1fd0909231017v24382285iaa082094c1624821@mail.gmail.com>

On Mon, Sep 21, 2009 at 1:13 PM, James Adcock <jimad at msn.com> wrote:
> Why does this matter?? Consider the famous author name Sun Tzu

Let's consider it; why do you think the general audience will search
for Sun Tzu and not Tzu, Sun? A system that just gives an unsearchable
list of names and doesn't have Tzu, Sun, even if only as an alias, is
unusable, correct or not. Not to mention that his name is S?n Z?, or
??, or ?? or Sunzi, and that doesn't even start to approach the
problem of spelling questions.

-- 
Kie ekzistas vivo, ekzistas espero.

From jimad at msn.com  Wed Sep 23 11:30:19 2009
From: jimad at msn.com (Jim Adcock)
Date: Wed, 23 Sep 2009 11:30:19 -0700
Subject: [gutvol-d] Re: the wings of the dove -- 002
In-Reply-To: <d3a.4f385cf4.37e91136@aol.com>
References: <d3a.4f385cf4.37e91136@aol.com>
Message-ID: <SNT120-DS6EE88A290EC4E24BD90D0AEDB0@phx.gbl>

Thank you Bowerbird, again, for making my points for me:

1) If I had submitted this book instead to DP there would have been a much
larger number of punc errors introduced as "required" by the DP process.

2) We would all still be waiting for this book, because I prior submitted
two books to DP after a considerable amount of work on my part and they have
still to see the light of day. Someone with a practical knowledge of queuing
theory needs to go over these issues with DP.

3) I know perfectly well that errors remain unseen, which is why I would
like an input file format that easily allows another motivated volunteer to
pick up where I left off when my children start complaining that they are
unfed and unclothed and "reality calls" -- besides which by the time I am
"done" with a book like "Dove" I am splitting blood and ready to do
something else for a while -- rather than listening to Bowerbird insult my
efforts and insult my integrity simply because I do not support his favored
hack markup schemes -- which no one else wants to support either.



From prosfilaes at gmail.com  Wed Sep 23 12:17:03 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 23 Sep 2009 15:17:03 -0400
Subject: [gutvol-d] Re: the wings of the dove -- 002
In-Reply-To: <SNT120-DS6EE88A290EC4E24BD90D0AEDB0@phx.gbl>
References: <d3a.4f385cf4.37e91136@aol.com>
	<SNT120-DS6EE88A290EC4E24BD90D0AEDB0@phx.gbl>
Message-ID: <6d99d1fd0909231217s30902b15jc589f865ce184bc1@mail.gmail.com>

On Wed, Sep 23, 2009 at 2:30 PM, Jim Adcock <jimad at msn.com> wrote:
> Thank you Bowerbird, again, for making my points for me:

No, he's not, because nobody is listening to him. I considering
killfilling you, too, because I no more want to hear from Bowerbird by
proxy than directly. Stop complaining about what he does; he's a
troll, he enjoys it. Just stop reading his messages.

-- 
Kie ekzistas vivo, ekzistas espero.

From jimad at msn.com  Wed Sep 23 12:56:07 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 23 Sep 2009 12:56:07 -0700
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <cc6.58e73415.37e91527@aol.com>
References: <cc6.58e73415.37e91527@aol.com>
Message-ID: <SNT120-DS1931497CA7F3404CCCBF8DAEDB0@phx.gbl>

>so if you want to find "sun tzu", you'd search
for "sun tzu", and if that didn't work, then
you'd search for "sun" and "tzu" separately...

 

Okay, let's get specific.  My favorite machine lists these things
alphabetical by authorlastname, authorfirstname and I currently on my
favorite machine have about 100 books on it whereas on the previous
generation machine which I use less nowadays I have about 500 books.  So I
get to scroll through the list of books three times to perform your "search
algorithm" example.

 

But, more importantly, in the case of a reader to picks up e-books from PG
and from other publishing houses, say someone who wants to collect and read
everything ever written by Sir Arthur Conan Doyle, finds that his or her
e-book library instead of being correctly sorted and cataloged by author now
finds instead Sir Arthur Conan Doyle spread out at about five factorial
locations on his or her e-book bookshelf.  Or more likely, Sherlock ends up
all in one place if from one of a variety of professional publishing house,
and at another location if the e-book is coming from PG. Or god knows where
if purchased from Amazon "published" there by one of an infinite number of
bottom feeding garage shops.

 

And why am I "o.c.d." on these issues?  Because I have converted a few tens
of thousands of PG books to e-book format and have found, *in practice*,
that the issue of author names and how to "correctly" extract them from the
data PG provides - or doesn't provide - in practice, not in theory, ends up
being one of the real stumbling blocks.  Certainly an extensible format like
TEI, if it contained correctly coded authorlastname, authorfirstname
information, would make extraction of correct "spine" information trivial.
Then the problem reduces to how in the PG system to get a "correct"
canonical form of authorlastname, authorfirstname, and the answer is some
real human being has to do that research --  which is perhaps most
appropriately done as part of the copyright clearance process which I think
frequently refers to LoC in the first place?

 

Or as another simply example of these issues based on an author I have
recently worked on, enter "James Henry" in the PG home page author slot, and
compare what you get to when you enter "Henry James" and then try "Henry,
James" and then try "James, Henry" and then rationalize to the readers of
this list your results and why those results are the "correct" result ???

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090923/00e7144d/attachment.html>

From jimad at msn.com  Wed Sep 23 13:24:20 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 23 Sep 2009 13:24:20 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
In-Reply-To: <4AB7D042.5010407@novomail.net>
References: <be0.6442fdd2.37dc344c@aol.com>	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
	<4AB7D042.5010407@novomail.net>
Message-ID: <SNT120-DS21183B17C05A4B88EF15C7AEDB0@phx.gbl>

>> What I don't understand is why PG continues to be wedded to plain-text as
an
>> *input* encoding format demanded of people submitting texts to PG.
>> Plain-text is too constrained to do the job well.  
>
>I find that you are generally correct in everything you have said to 
>date. But the reality is that PG <em>does</em> continue to be wedded to 
>plain (impoverished) text. 

I have heard reasonable rational (whether one agrees or not) why PG remains
wedded to PG TXT format as an OUTPUT file format.  I have not heard a
reasonable rational why PG REQUIRES me to submit BOTH an HTML AND a PG TXT
file if what I as a volunteer really want to submit is just an HTML file.
If I were allowed to just submit an HTML file then I could reasonably encode
MOST of what I as a transcriber would like to transcribe, and I could avoid
the abuse that I currently receive from Bowerbird when I don't put in the
extraneous marks and spaces and smiley faces not found in the author's work
but which Bowerbird would like to see in the PG TXT in order to support his
pet theories about how the input file format and the rendered file format
need to be one and the same thing. In turn Bowerbird could use his time and
energies in a positive manner transcribing my HTML input format file into
any particular flavor of PG TXT output file format that Bowerbird likes and
can and will in turn pat himself on the back for, rather than abusing me of
efforts that I didn't want to have to do in the first place.

>For PG to adopt such a scheme, however, would require that PG adopt a set
of 
Standards...

How about a VOLUNTARY set of "suggested" standards for HTML, such that when
a volunteer voluntarily codes to those HTML standards the results can be
translated and displayed on a larger class of machines successfully?
Certainly PG in practice already enforces a number of standards on submitted
input files which if you don't follow your files don't get accepted -- even
though those standards aren't really written down so one ends up having to
rework one's submissions not infrequently in order to get them accepted --
surprise!

>I have concluded that Project Gutenberg is impervious to improvement. 

I don't think its impervious to improvement, it's just that changes are very
slow to come and very hard won.  Certainly from my point of view the recent
decision to support, or at least partially support, EPUB and MOBI has made
my life much more enjoyable.

>I would suggest, rather, perfecting your HTML file, uploading 
it to the Internet Archive (http://www.archive.org/create/) and then 
posting a message here indicating where it can be found if any other 
volunteer wants to create a degraded version of your master copy.

Sigh -- I would hate to think that I have to "route around damage" -- again.



From prosfilaes at gmail.com  Wed Sep 23 13:34:03 2009
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 23 Sep 2009 16:34:03 -0400
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
In-Reply-To: <SNT120-DS21183B17C05A4B88EF15C7AEDB0@phx.gbl>
References: <be0.6442fdd2.37dc344c@aol.com>
	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>
	<4AB7D042.5010407@novomail.net>
	<SNT120-DS21183B17C05A4B88EF15C7AEDB0@phx.gbl>
Message-ID: <6d99d1fd0909231334j49f3f342u8052b9c48f629097@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:24 PM, James Adcock <jimad at msn.com> wrote:
> I could avoid
> the abuse that I currently receive from Bowerbird

That's like an abused woman saying that if she just had a better
dishwasher her husband would stop hitting her. Bowerbird will abuse
you no matter what. At some point, it's your fault for not putting him
in a killfile.

-- 
Kie ekzistas vivo, ekzistas espero.

From gbnewby at pglaf.org  Wed Sep 23 13:35:40 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Wed, 23 Sep 2009 13:35:40 -0700
Subject: [gutvol-d] Re: PG French text file #1500
In-Reply-To: <alpine.DEB.2.00.0909220218460.29510@mail.pglaf.org>
References: <39275292.87509.1253569246873.JavaMail.mail@webmail11>
	<20090922052946.BF28C10138@cardano.dm.unipi.it>
	<alpine.DEB.2.00.0909220218460.29510@mail.pglaf.org>
Message-ID: <20090923203540.GA8486@pglaf.org>

On Tue, Sep 22, 2009 at 02:20:15AM -0700, Michael S. Hart wrote:
> 
> 
> There are still 14 more French eBooks to go,
> so I should hope we can get this one done in
> time to be #1500, please give it a go.

Carlo: If this could be fast-tracked, it would be great.  We
would love to plan on Toqueville as French #1500.

Can you let Michael and I know if this seems likely, so we can
plan accordingly?

Thanks much.
  -- Greg

> 
> Thanks!!!
> 
> 
> Michael
> 
> 
> 
> 
> On Tue, 22 Sep 2009, Carlo Traverso wrote:
> 
> >
> > Currently Tocqueville is in proof at DP, in 4 parts between P2 and
> > P3. It might be fast-tracked if PG wants it, but dozens of french
> > projects might come before anyway.
> >
> > Carlo

From ajhaines at shaw.ca  Wed Sep 23 13:35:58 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Wed, 23 Sep 2009 13:35:58 -0700
Subject: [gutvol-d] Re: the wings of the dove -- 002
References: <d3a.4f385cf4.37e91136@aol.com>
	<SNT120-DS6EE88A290EC4E24BD90D0AEDB0@phx.gbl>
Message-ID: <DC3F8C6370EC4594A24B7CF43FD43634@alp2400>

Jim, I would suggest that if you're spitting blood by the time you finish a 
book that you're going at it too fast/forcefully.  Slow down--there are no 
deadlines at PG.

Re bowerbird - ignore him/it.  Few, if any, aspects of PG (or DP) satisfy 
him/it, while little, if anything, that him/it does, satisfies anyone else.

Al

----- Original Message ----- 
From: "Jim Adcock" <jimad at msn.com>
To: "'Project Gutenberg Volunteer Discussion'" <gutvol-d at lists.pglaf.org>
Sent: Wednesday, September 23, 2009 11:30 AM
Subject: [gutvol-d] Re: the wings of the dove -- 002


> Thank you Bowerbird, again, for making my points for me:
>
> 1) If I had submitted this book instead to DP there would have been a much
> larger number of punc errors introduced as "required" by the DP process.
>
> 2) We would all still be waiting for this book, because I prior submitted
> two books to DP after a considerable amount of work on my part and they 
> have
> still to see the light of day. Someone with a practical knowledge of 
> queuing
> theory needs to go over these issues with DP.
>
> 3) I know perfectly well that errors remain unseen, which is why I would
> like an input file format that easily allows another motivated volunteer 
> to
> pick up where I left off when my children start complaining that they are
> unfed and unclothed and "reality calls" -- besides which by the time I am
> "done" with a book like "Dove" I am splitting blood and ready to do
> something else for a while -- rather than listening to Bowerbird insult my
> efforts and insult my integrity simply because I do not support his 
> favored
> hack markup schemes -- which no one else wants to support either.
>
>
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d
> 



From jimad at msn.com  Wed Sep 23 13:42:14 2009
From: jimad at msn.com (James Adcock)
Date: Wed, 23 Sep 2009 13:42:14 -0700
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
In-Reply-To: <6d99d1fd0909211227q765dd187n77dc76cbf85d9dd6@mail.gmail.com>
References: <d45.52e15a49.37e5f9a7@aol.com>	<SNT120-DS2254E89C3E1A5A3A68136DAEDD0@phx.gbl>
	<6d99d1fd0909211227q765dd187n77dc76cbf85d9dd6@mail.gmail.com>
Message-ID: <SNT120-DS1914134D3E52CC23F5991EAEDB0@phx.gbl>

>Whatever the base file formats are,
Project Gutenberg, like most archives, needs to pick one or a small
set of them so that the people who use Project Gutenberg can know what
they need to read the files.

Again, this is confusing input file formats with output file formats.  PG could choose to allow HTML as an acceptable input file format because PG can easily write a tool to convert HTML to their choice of PG TXT file format, including standardizing on such issues as whether italics ought to be rendered in PG TXT files as *star* or +plus+ or _underscore_ or SHOUT or better yet maybe PG could allow these kinds of choices to be made by an output filter so that text readers for the blind could have something more compatible with their prosodic emphasis machines, or better yet maybe the output filters could actually implement some of the "proper" prosodic emphasis markings for the more popular blind reader machines in order to maximize their capabilities.

In my experience what happens is just the opposite of what you might expect -- rather the first time user of PG picks up a PG TXT file because they think that represents the "lowest common denominator" for their machine and so they think "it must surely work" and what they find instead is that what gets displayed on their machine is a total hash of line breaks in non-sensible locations, and random garbage marks, and then they conclude PG is archaic brain dead stuff by people who are clueless and they give up and go away.  Or alternatively they post stupid stuff on public forums like "gee I like all these free books from PG and I read them all the time even though they have these random line-breaks stuck in all over the place" -- which in turn makes the efforts of the PG volunteers look like clueless idiots.

There are other sites which take PG texts and do intelligent things like "tell me what kind of machine you are reading on and I will suggest which of the many file formats will probably display to your liking on your machine" which I think in practice tends to result in happier customers. Right now PG is still basically assuming that the average PG "customer" is a die-hard hacker running some flavor of a *nix machine in a college environment.  Which is probably [somewhat] true of the people submitting books, but not at all true of the people who would just like to read them.
 


From jimad at msn.com  Wed Sep 23 15:17:23 2009
From: jimad at msn.com (Jim Adcock)
Date: Wed, 23 Sep 2009 15:17:23 -0700
Subject: [gutvol-d] Re: a case of deliberate sabotage by a p.g. volunteer
In-Reply-To: <bd9.3f425e97.37e9442e@aol.com>
References: <bd9.3f425e97.37e9442e@aol.com>
Message-ID: <SNT120-DS94E28A29EE351B93101A7AEDB0@phx.gbl>

>how come the .txt file is missing the italics which are right there, big as day, in the .html version of the file?

Presumably I followed the instructions on the PG website by:

1) typing in "italic" in the big red "search site term" box

Then, 

2) typing in "HTML" in the big red "search site term" box, finding the "HTML FAQ" and following the suggestions there in H.12



From jimad at msn.com  Wed Sep 23 18:12:52 2009
From: jimad at msn.com (Jim Adcock)
Date: Wed, 23 Sep 2009 18:12:52 -0700
Subject: [gutvol-d] Re: Name lists and Big-endianism
In-Reply-To: <6d99d1fd0909231017v24382285iaa082094c1624821@mail.gmail.com>
References: <c8d.50ac845f.37e7fe18@aol.com>	<SNT120-DS162637A5C5691B09E89569AEDD0@phx.gbl>
	<6d99d1fd0909231017v24382285iaa082094c1624821@mail.gmail.com>
Message-ID: <SNT120-DS129BA5C659A368E0D10014AEDA0@phx.gbl>

>Let's consider it; why do you think the general audience will search
for Sun Tzu and not Tzu, Sun? A system that just gives an unsearchable
list of names and doesn't have Tzu, Sun, even if only as an alias, is
unusable, correct or not. 

I assure you that I and about a million other people have for our primary reading machines a machine which only provides a library of books sorted and listed by authorlastname and which does not in fact have a "search" capability on authornamepart and while I agree with you that I would prefer a machine with a stronger search capability the reason that we put up with this machine is that it is so many light years ahead of other machines that we might want to read on as to make that decision a "no brainer" -- even given the shortcomings of the user shell design.

In fact after "putting up with" computers and having to print out documents for the last 35 years of my life I now find that I almost never print out anything, and I almost never buy a book or magazine in print anymore.  And the machine goes with me everywhere and I read it every night in bed until I fall asleep.  So this is by far my most useful most favorite machine I have ever had in my life.  But then again, I do a LOT of reading!

A better counter question is why would PG WANT to implement a system that prevents easy and correct implementation of common e-book formats? -- EPUB and MOBI ?



From jimad at msn.com  Wed Sep 23 18:41:34 2009
From: jimad at msn.com (Jim Adcock)
Date: Wed, 23 Sep 2009 18:41:34 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful
	format	for eliciting beauty (and more)
In-Reply-To: <6d99d1fd0909231334j49f3f342u8052b9c48f629097@mail.gmail.com>
References: <be0.6442fdd2.37dc344c@aol.com>	<SNT120-DS116B72E3B00E699B5AA56EAEE60@phx.gbl>	<4AB7D042.5010407@novomail.net>	<SNT120-DS21183B17C05A4B88EF15C7AEDB0@phx.gbl>
	<6d99d1fd0909231334j49f3f342u8052b9c48f629097@mail.gmail.com>
Message-ID: <SNT120-DS227D63E004796B18322949AEDA0@phx.gbl>

LOL ok you win your point!  I will attempt to filter him out.




From Bowerbird at aol.com  Wed Sep 23 21:34:37 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 24 Sep 2009 00:34:37 EDT
Subject: [gutvol-d] Re: the wings of the dove -- 002
Message-ID: <cbe.48a9230a.37ec50dd@aol.com>

al said:
>    Re bowerbird - ignore him/it.? 

i'm male, al.

if you had been at the december 2003 meet-up,
where we celebrated the first 10,000 p.g. e-texts,
you would have met me, so you woulda known then.

but i'm sure you probably knew that anyway, and were
just using the "it" form as a mechanism of dehumanization.


>    Few, if any, aspects of PG (or DP) satisfy him/it, 

don't be ridiculous.   i love michael hart, the soul of p.g.,
the man who birthed the project, nurtured it to adulthood.
the man's a saint, and he's smart too, with his solid focus on
the text format as the backbone, to ensure a lifelong viability.

i love all the volunteers, who have generously donated so much
of their time and energy and money to digitize these old books,
and persisted in spite of the shitty schooling and tools they had,
working inside workflows that wasted their time on terrible design.

i love the world, who embraced the project gutenberg library early,
making it the premiere cyberlibrary, beating back half-assed efforts
by other people who were enamored of some gimmick or another,
whether in the form of proprietary formats or open-source snake-oil.

i love the faq, which lay down some good advice on the .txt format,
even if the whitewashers don't do any checks to ensure compliance.

i love david widger for all the hard work he's done over the years...

i love greg newby for offering webspace to anyone who needs it...

i love the d.p. people who give me support behind the enemy lines.

i love the d.p. people (lucy!) who support me in _front_ of those lines.

i love the d.p. people who keep working on improving the .html format.

i love thundergnat, for giving the d.p. people a tool they can use...

i love all the other programmers who said "phuck you" to d.p. because
programmers shouldn't hang out in a place where we ain't appreciated.

i love the guy who programmed "eucalyptus" and thereby proved that
you can create _beautiful_ e-books on the iphone from p.g. e-texts.

i love the guy who programmed "eucalyptus" because he left me room
to prove that you can make those iphone e-books _powerful_ as well...

i'm sure there's more, but that's probably enough off the noggin...


>    while little, if anything, that him/it does, satisfies anyone else.

"that him does" -- didn't think that one through, did you al?

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090924/90da3f84/attachment.html>

From Bowerbird at aol.com  Wed Sep 23 21:37:15 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 24 Sep 2009 00:37:15 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
Message-ID: <c3a.5c4accbc.37ec517b@aol.com>

jim said:
>    I have not heard a reasonable rational why PG REQUIRES me 
>    to submit BOTH an HTML AND a PG TXT file if what I as a volunteer 
>    really want to submit is just an HTML file.
>    If I were allowed to just submit an HTML file then I could reasonably 
>    encode MOST of what I as a transcriber would like to transcribe, 
>    and I could avoid the abuse that I currently receive from Bowerbird 
>    when I don't put in the extraneous marks and spaces and smiley faces 
>    not found in the author's work but which Bowerbird would like to see 
>    in the PG TXT in order to support his pet theories about how 
>    the input file format and the rendered file format need to be one 
>    and the same thing. In turn Bowerbird could use his time and
>    energies in a positive manner transcribing my HTML input format file 
>    into any particular flavor of PG TXT output file format that 
>    Bowerbird likes and can and will in turn pat himself on the back for, 
>    rather than abusing me of efforts that I didn't want to have to do 
>    in the first place.

i won't let you bait me into any more of this nonsense, jim.

everyone paying attention -- and probably even most of those
_not_ paying attention -- knows that i refuted your points deftly,
and completely, except for those which i myself have already made.

(but, um, gee, thanks for all your _support_ on those matters, jim;
having you agreeing with them really bolstered up their credibility.)

you come here looking for a master format.   i handed you one.
but because it doesn't look the way you thought it _would_ look,
you don't recognize that it's exactly what you were looking for...

there's a certain bit of humorous irony in all that...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090924/34f6e2a1/attachment-0001.html>

From Bowerbird at aol.com  Wed Sep 23 22:15:25 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 24 Sep 2009 01:15:25 EDT
Subject: [gutvol-d] let's wind this down
Message-ID: <c0a.6411121d.37ec5a6d@aol.com>

jim said:
>    1) If I had submitted this book instead to DP there
>    would have been a much larger number of punc 
>    errors introduced as "required" by the DP process.

what?   the d.p. process requires the introduction of errors?


>    2) We would all still be waiting for this book, 
>    because I prior submitted two books to DP after 
>    a considerable amount of work on my part and 
>    they have still to see the light of day. Someone
>    with a practical knowledge of queuing theory 
>    needs to go over these issues with DP.

you might want to discuss queuing theory in the d.p. forums.
i suppose they would get a lot of good out of that discussion.


>    3) I know perfectly well that errors remain unseen, 
>    which is why I would like an input file format that 
>    easily allows another motivated volunteer to
>    pick up where I left off when my children start complaining 
>    that they are unfed and unclothed and "reality calls" 

i suppose i've already told you that z.m.l. does just that.
i even mounted your very book, so that you could see it.
so i don't suppose it'd do any good to repeat it again now.


>    rather than listening to Bowerbird insult my efforts

i pointed out that your .txt version was missing the italics.
if you consider that to be an "insult", your skin is too thin...


>    and insult my integrity simply because 
>    I do not support his favored hack markup schemes 

no, you failed to support the project gutenberg standard,
which calls for italics to be marked in the .txt format...


>    Presumably I followed the instructions on the PG website

no, you most certainly failed to follow those instructions...

>    http://www.gutenberg.org/wiki/Gutenberg:Volunteers'_FAQ#V.94._What_sh
ould_I_do_with_italics.3F

you will see that it says:
>    Underscores are now the effective standard for italics in PG texts.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090924/bc86515a/attachment.html>

From traverso at posso.dm.unipi.it  Wed Sep 23 22:38:57 2009
From: traverso at posso.dm.unipi.it (Carlo Traverso)
Date: Thu, 24 Sep 2009 07:38:57 +0200 (CEST)
Subject: [gutvol-d] Re: PG French text file #1500
In-Reply-To: <20090923203540.GA8486@pglaf.org> (message from Greg Newby on
	Wed, 23 Sep 2009 13:35:40 -0700)
References: <39275292.87509.1253569246873.JavaMail.mail@webmail11>
	<20090922052946.BF28C10138@cardano.dm.unipi.it>
	<alpine.DEB.2.00.0909220218460.29510@mail.pglaf.org>
	<20090923203540.GA8486@pglaf.org>
Message-ID: <20090924053857.59CB010138@cardano.dm.unipi.it>


Yes, we plan to have the first volume complete in approximately one
week from now. The complete edition is in 4 volumes, the other three
volumes will appear subsequently, I hope in a few months (they have
passed P1, need P2 and F1, and will skip P3 and F2 that will be done
offline comparing with the wikimedia edition). I have cross-checked a
part after P2 with the wikimedia edition, and the comparison runs
smoothly, identifying a small set of remaining transcription errors
evenly split between the two. The result might be error-free.

The book is in the Pagnerre 12th edition 1848, and the first volume is
just a revision of the first 1835 edition (I take the informations from
http://www.loa.org/volume.jsp?RequestID=202&section=notes  )
so it is complete in itself. 

If possible, it would be nice to reserve 4 slots that would fit the
complete set.

The last edition published by Tocqueville is the 13th, that has an
additional appendix. The 12th and 13th editions are regarded as the
definitive editions. The 13th is available at the Internet Archive, I
will see what is reasonable to do to make the PG edition the
authoritative online edition. Maybe a transcribers note at the end,
with an analysis of the differences, and the additional material of
the 13th.

Carlo


>>>>> "Greg" == Greg Newby <gbnewby at pglaf.org> writes:

    Greg> On Tue, Sep 22, 2009 at 02:20:15AM -0700, Michael S. Hart
    Greg> wrote:
    >> 
    >> 
    >> There are still 14 more French eBooks to go, so I should hope
    >> we can get this one done in time to be #1500, please give it a
    >> go.

    Greg> Carlo: If this could be fast-tracked, it would be great.  We
    Greg> would love to plan on Toqueville as French #1500.

    Greg> Can you let Michael and I know if this seems likely, so we
    Greg> can plan accordingly?

    Greg> Thanks much.  -- Greg

    >>  Thanks!!!
    >> 
    >> 
    >> Michael
    >> 
    >> 
    >> 
    >> 
    >> On Tue, 22 Sep 2009, Carlo Traverso wrote:
    >> 
    >> >
    >> > Currently Tocqueville is in proof at DP, in 4 parts between
    >> P2 and > P3. It might be fast-tracked if PG wants it, but
    >> dozens of french > projects might come before anyway.
    >> >
    >> > Carlo






From schultzk at uni-trier.de  Wed Sep 23 23:43:17 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Thu, 24 Sep 2009 08:43:17 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity
	comes along ...
In-Reply-To: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
Message-ID: <D9ACC7DF-13A9-4441-A4B2-D41231F680B8@uni-trier.de>

Hi There,

Am 23.09.2009 um 14:37 schrieb Joshua Hutchinson:

> Keith,
>
> While I agree that Marcello's diplomacy is terrible (always has  
> been, doubt that's going to change! :) ... he's right and you're  
> wrong.
>
> He never claimed the DB has to reindex and he presented very real  
> reasons why your solution is terrible from an efficiency point of  
> view.
>
> Biggest problem (summary): Your solution does the work on the web  
> server, his solution does it on the DB server.
>
> Josh
>
	
In my original post I NEVER said where this code could be used whether  
on the web server or DB server.
Furthermore, I mentioned that the standard sort routines used in a DB  
server can be overiden and the supoopsed
code can be used. So, the question of efficiency is mute. My solution  
will work anywhere you want it to.

Another reason the the socalled efficency arguement is mute is that  
the web-server is calling the db-server
that is actually doing all the work.

As for Marcello attitude I personally could care less. All I wanted to  
do is help and pointed to a simple fact
that the sort routine for the data is easy enough to implement. It is  
not always good enough to use
just built-ins. Which I assume is the case.

The matter of fact remains that the proper sorting can be easily  
achieved anywhere in the system
with out any overhead. Position it where you want.

I have programmed just such a situation and had no overhead. the  
database that was acessed via
a web-server was set so that no new sorting or indexing was required  
when the db was called.
I do know what I am doing an what can be done. 
   

From marcello at perathoner.de  Thu Sep 24 04:00:06 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 24 Sep 2009 13:00:06 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity
 comes along ...
In-Reply-To: <D9ACC7DF-13A9-4441-A4B2-D41231F680B8@uni-trier.de>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
	<D9ACC7DF-13A9-4441-A4B2-D41231F680B8@uni-trier.de>
Message-ID: <4ABB5136.5020001@perathoner.de>

Keith J. Schultz wrote:

> In my original post I NEVER said where this code could be used whether 
> on the web server or DB server.

If you intended your code to be run on the database server, it would be 
an even more incredibly stupid thing to do.

If you want to influence the database server, then you must not write a 
routine that *compares* strings, but a routine that *transforms* strings:

If you wanted to sort like German phonebooks do, that is: to sort '?' as 
'oe', then you would write a routine that substitutes all '?'s in your 
input with 'oe's. Then you would feed the transformed string to the 
index table while feeding the original string to the data table. Voil?.

Of course you would have to transform all search terms in the same 
fashion too, because databases do most of the work on index tables and 
reach out to the data tables only when they really really really need to.


> Furthermore, I mentioned that the standard sort routines used in a DB 
> server can be overiden and the supoopsed code can be used.

How would you know? You don't even know which database we are using.


> My solution will work anywhere you want it to.

Your `solution? didn't even work on paper. I found 3 fat bugs just on a 
first eyeball revue.


> I have programmed just such a situation and had no overhead. the 
> database that was acessed via a web-server was set so that no new 
> sorting or indexing was required when the db was called.

You programmed a small in-house application that gets hit a dozen times 
a day. In your situation you can get away with any amount of programming 
sloppiness because the hardware is so much superior to the task.

I am running a site that gets more than 1 Megahits a day, serving more 
than 70,000 customers per day.

Failure to consider scalability issues is another telltale sign of the 
rookie.


-- 
Marcello Perathoner
webmaster at gutenberg.org

From schultzk at uni-trier.de  Fri Sep 25 04:27:00 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Fri, 25 Sep 2009 13:27:00 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity
	comes along ...
In-Reply-To: <4ABB5136.5020001@perathoner.de>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
	<D9ACC7DF-13A9-4441-A4B2-D41231F680B8@uni-trier.de>
	<4ABB5136.5020001@perathoner.de>
Message-ID: <9968FCBD-40FF-4E4A-BD37-E4A65C6D4750@uni-trier.de>

Hi Marcello,

	As you evidently do not know even the slightest about dbase systems I  
will
	stop responding to your arguments.

	I will repeat I was using pseudo-code so you could use what ever is  
approriate for the
	task. The basic algorithm wil work with any kind of data or  
structures for that
	matter. All that is need is an appropriate cardnal function.

	You have fail to realize this. Also, I do not what kind of server  
system you are using,
	but I have known system that can handle "?" for decades. Like I said  
you are not
	evidently qualified to partake in this discussion.

	regards
		Keith.

Am 24.09.2009 um 13:00 schrieb Marcello Perathoner:

> Keith J. Schultz wrote:
>
>> In my original post I NEVER said where this code could be used  
>> whether on the web server or DB server.
>
> If you intended your code to be run on the database server, it would  
> be an even more incredibly stupid thing to do.
>
> If you want to influence the database server, then you must not  
> write a routine that *compares* strings, but a routine that  
> *transforms* strings:
>
> If you wanted to sort like German phonebooks do, that is: to sort  
> '?' as 'oe', then you would write a routine that substitutes all  
> '?'s in your input with 'oe's. Then you would feed the transformed  
> string to the index table while feeding the original string to the  
> data table. Voil?.
>
> Of course you would have to transform all search terms in the same  
> fashion too, because databases do most of the work on index tables  
> and reach out to the data tables only when they really really really  
> need to.
>
>
>> Furthermore, I mentioned that the standard sort routines used in a  
>> DB server can be overiden and the supoopsed code can be used.
>
> How would you know? You don't even know which database we are using.
>
>
>> My solution will work anywhere you want it to.
>
> Your `solution? didn't even work on paper. I found 3 fat bugs just  
> on a first eyeball revue.
>
>
>> I have programmed just such a situation and had no overhead. the  
>> database that was acessed via a web-server was set so that no new  
>> sorting or indexing was required when the db was called.
>
> You programmed a small in-house application that gets hit a dozen  
> times a day. In your situation you can get away with any amount of  
> programming sloppiness because the hardware is so much superior to  
> the task.
>
> I am running a site that gets more than 1 Megahits a day, serving  
> more than 70,000 customers per day.
>
> Failure to consider scalability issues is another telltale sign of  
> the rookie.
>
>
> -- 
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From schultzk at uni-trier.de  Fri Sep 25 04:39:32 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Fri, 25 Sep 2009 13:39:32 +0200
Subject: [gutvol-d] Re: Once every 7 years a post of monumental stupidity
	comes along ...
In-Reply-To: <4ABB5136.5020001@perathoner.de>
References: <1895795374.134329.1253709442237.JavaMail.mail@webmail07>
	<D9ACC7DF-13A9-4441-A4B2-D41231F680B8@uni-trier.de>
	<4ABB5136.5020001@perathoner.de>
Message-ID: <6E86B2B6-36AB-424A-A286-26B2D6FA0803@uni-trier.de>

Hi Marcello,

	just you more after thought. I assume you have a database. If so just  
add another or more field
	for sorting and adjust the information from other fields. this is a  
one time trip and you have the
	sorting that you need, just use sort by functions. Of course I am  
assuming that your database is
	structured.

	regards
		Keith.

Am 24.09.2009 um 13:00 schrieb Marcello Perathoner:

> Keith J. Schultz wrote:
>
>> In my original post I NEVER said where this code could be used  
>> whether on the web server or DB server.
>
> If you intended your code to be run on the database server, it would  
> be an even more incredibly stupid thing to do.
>
> If you want to influence the database server, then you must not  
> write a routine that *compares* strings, but a routine that  
> *transforms* strings:
>
> If you wanted to sort like German phonebooks do, that is: to sort  
> '?' as 'oe', then you would write a routine that substitutes all  
> '?'s in your input with 'oe's. Then you would feed the transformed  
> string to the index table while feeding the original string to the  
> data table. Voil?.
>
> Of course you would have to transform all search terms in the same  
> fashion too, because databases do most of the work on index tables  
> and reach out to the data tables only when they really really really  
> need to.
>
>
>> Furthermore, I mentioned that the standard sort routines used in a  
>> DB server can be overiden and the supoopsed code can be used.
>
> How would you know? You don't even know which database we are using.
>
>
>> My solution will work anywhere you want it to.
>
> Your `solution? didn't even work on paper. I found 3 fat bugs just  
> on a first eyeball revue.
>
>
>> I have programmed just such a situation and had no overhead. the  
>> database that was acessed via a web-server was set so that no new  
>> sorting or indexing was required when the db was called.
>
> You programmed a small in-house application that gets hit a dozen  
> times a day. In your situation you can get away with any amount of  
> programming sloppiness because the hardware is so much superior to  
> the task.
>
> I am running a site that gets more than 1 Megahits a day, serving  
> more than 70,000 customers per day.
>
> Failure to consider scalability issues is another telltale sign of  
> the rookie.
>
>
> -- 
> Marcello Perathoner
> webmaster at gutenberg.org
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d


From jimad at msn.com  Fri Sep 25 10:36:27 2009
From: jimad at msn.com (James Adcock)
Date: Fri, 25 Sep 2009 10:36:27 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful
	format	for eliciting beauty (and more)
In-Reply-To: <c3a.5c4accbc.37ec517b@aol.com>
References: <c3a.5c4accbc.37ec517b@aol.com>
Message-ID: <SNT120-DS12BA9BF87D0F1885CA4860AED90@phx.gbl>

>but because it doesn't look the way you thought it _would_ look,
you don't recognize that it's exactly what you were looking for...



I reject it not only because its ugly, doesn't have any decent tools to
support it, isn't supported or advocated by anyone world-wide except an army
of one, and will not be used by the other volunteers in any case, but more
importantly because I find cases on a daily basis cases of things I need to
encode as a transcriber where I say "well obviously there would be no good
way to address *this* issue using Bowerbird's scheme."  And then, having
established one has to transcribe into an ugly format, which I certainly
think html, xml, and TEI are also, one comes rapidly to the conclusion that
there is no way that an input transcription format and an output rendered
file format *ought* to be one and the same thing because to do so needlessly
subjects the end reader to unnecessary ugliness. Not to mention that PG is
rendering to 80 different output file formats in any case so why *insist*
that there be only one input transcription format "holy grail" in the first
place?

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090925/f06f23c3/attachment-0001.html>

From jimad at msn.com  Fri Sep 25 11:31:21 2009
From: jimad at msn.com (James Adcock)
Date: Fri, 25 Sep 2009 11:31:21 -0700
Subject: [gutvol-d] Re: let's wind this down
In-Reply-To: <c0a.6411121d.37ec5a6d@aol.com>
References: <c0a.6411121d.37ec5a6d@aol.com>
Message-ID: <SNT120-DS1490DB2D66003F09257E03AED90@phx.gbl>

>what?  the d.p. process requires the introduction of errors?

Yes, in the encoding of m-dash, ellipses, etc.

 

>no, you most certainly failed to follow those instructions...

>
http://www.gutenberg.org/wiki/Gutenberg:Volunteers'_FAQ#V.94._What_should_I_
do_with_italics.3F

 

First of all, again, if this is important to PG then why do they not
properly index it to the PG site's search engine?


Secondly, you refuse to read the immediately preceding section FAQ#V.93
which makes it clear that different volunteers have different priorities
about what "plain text" means and how they will be willing to support it and
will be using different automatic conversion tools and that some of the
volunteers (read: me) will be paying no weight to the desire of other
volunteers to make tools to do "automatic prettyprinting" from the "plain
text" whereas other volunteers (yourself) are willing to insert "ugliness"
into the plain text (their words not mine) in order to better support
prettyprinters such as you are proposing.

 

Finally, you and others at PG are forgetting to heed the closing words given
there:

 

Getting a text on-line is the important thing; which choices you [meaning
me] make in doing so is a matter of detail. 

 

The choices *I* make as a volunteer are to put my time and effort into doing
ONE markup as well as I can namely HTML, and as little time and effort as
possible on TXT files -- because for all the arguments raised here I think
TXT is a loser and a no-win situation for the volunteer transcriber - no
matter HOW one makes the unhappy tradeoffs *required* by TXT someone will
end up unhappy and start "beefing" at you.  And the reason that PG is not
willing to provide an automatic tool to reduce HTML to TXT is because they
know that then THEY not the volunteer transcribers will be the unhappy
recipients of these kinds of diatribes.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090925/7826bfff/attachment.html>

From Bowerbird at aol.com  Fri Sep 25 14:31:51 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Sep 2009 17:31:51 EDT
Subject: [gutvol-d] Re: let's wind this down
Message-ID: <c4e.5a510a85.37ee90c7@aol.com>

jim said:
>    Yes, in the encoding of m-dash, ellipses, etc.

i was being sarcastic, jim.

just like you, i have railed about the way that d.p. 
(mis)handles things like em-dashes and ellipses.


>    First of all, again, if this is important to PG 
>    then why do they not properly index it 
>    to the PG site?s search engine?

i dunno.   you'll have to ask p.g., not me.

but i agree their instructions suck.   i get
the impression that they don't even want
individual digitizers to do books any more.

they want to channel the labor over to d.p.,
which would be fine, except the d.p. workflow
wastes _so_ much volunteer time and energy.


>    Secondly, you refuse to read the immediately 
>    preceding section FAQ#V.93 which makes it clear that 
>    different volunteers have different priorities about 
>    what ?plain text? means and how they will be willing 
>    to support it and will be using different automatic 
>    conversion tools and that some of the volunteers 
>    (read: me) will be paying no weight to the desire of 
>    other volunteers to make tools to do ?automatic 
>    prettyprinting? from the ?plain text? whereas other
>    volunteers (yourself) are willing to insert ?ugliness? 
>    into the plain text (their words not mine) in order to 
>    better support prettyprinters such as you are proposing.

you can try and do all the doubletalk that you want, jim,
but the fact remains that there is a policy, and it is clear:
>    Underscores are now the effective standard for italics in PG texts.

your .txt version failed to meet the standard, jim.   face it.


>    Finally, you and others at PG are forgetting to heed 
>    the closing words given there:
>    >    Getting a text on-line is the important thing; 
>    >    which choices you [meaning me] make in doing so 
>    >    is a matter of detail.

do you think this gives you the ok to ignore the italics?

not only did your .txt version fail to meet the standard,
but now you're telling us you don't have to meet that?

how about we get a ruling from the p.g. people on this?

are your digitizers free to ignore the italics if they like?

are your digitizers free to ignore any rule they dislike?


>    The choices *I* make as a volunteer are to put my 
>    time and effort into doing ONE markup as well as I can 
>    namely HTML, and as little time and effort as possible 
>    on TXT files

and because you're putting so little time and effort
into your .txt files, they are coming out as inferior.

again, how about a ruling from the p.g. powers-that-be?

are digitizers free to make the .txt files as bad as they choose?

i will keep asking this question until it is answered, so
don't think that you can ignore it and it will just go away.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090925/39508f6f/attachment.html>

From Bowerbird at aol.com  Fri Sep 25 14:41:36 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Fri, 25 Sep 2009 17:41:36 EDT
Subject: [gutvol-d] Re: why the plain-text format is the most useful format
	for eliciting beauty (and more)
Message-ID: <cc9.5be6529a.37ee9310@aol.com>

jim said:
>    I reject it not only because its ugly, 

it's not ugly.   besides, it's a _file-format_, which means
it's not even intended that you would look at it directly,
any more than you are intended to look at an .html file
directly, with its obtrusive angle-brackets.   you're mixed up.


>    doesn?t have any decent tools to support it

i have a whole slew of tools here, and am building more.
further, the format is so simple, authors can build tools too.


>    isn?t supported or advocated by anyone world-wide 
>    except an army of one, 

i'm not an army.   i'm just one.


>    and will not be used by the other volunteers in any case

won't be used by the d.p. people, that's for sure,
not if they know it's from me, because they are so 
stubborn they don't know what's good for them...

which tickles my funny-bone on a constant basis...


>    but more importantly because I find cases on a daily basis 
>    cases of things I need to encode as a transcriber where I say 
>    ?well obviously there would be no good way to address 
>    *this* issue using Bowerbird?s scheme.?? 

well, that doesn't surprise me on bit, jim, because
you don't know jack-shit about my little "scheme".

but i am being quite sincere when i tell you that i
would _love_ to hear about these so-called "cases."

you should be told that i have put out many calls
for such "cases", and nobody has ever been able to
meet the challenge.   so step up, jim, and be the first.


>    And then, having established one has to transcribe 
>    into an ugly format, which I certainly think html, xml, 
>    and TEI are also, one comes rapidly to the conclusion that 
>    there is no way that an input transcription format and 
>    an output rendered file format *ought* to be one and 
>    the same thing because to do so needlessly subjects 
>    the end reader to unnecessary ugliness. 

jim, you keep talking about "input" and "output", and
you're just confusing yourself with that terminology...


>    Not to mention that PG is rendering to 80 different 
>    output file formats in any case so why *insist* that there 
>    be only one input transcription format ?holy grail? 
>    in the first place?

the benefit of a "master" format is that you only have to
store and maintain that one format.   so it's cost-effective.

but thanks for playing, we'll have a consolation gift for you.

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090925/3e439739/attachment.html>

From ajhaines at shaw.ca  Fri Sep 25 15:35:31 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Fri, 25 Sep 2009 15:35:31 -0700
Subject: [gutvol-d] Re: let's wind this down
References: <c4e.5a510a85.37ee90c7@aol.com>
Message-ID: <71D2BEDB5309484AB6E9D5AD0316C65A@alp2400>

Wearing my Whitewasher hat (I've got a producer's hat around here somewhere, 
too <g>), I'll answer some of the issues raised below, but I won't argue 
about them.  I leave that to others.


Q1: are your digitizers free to ignore the italics if they like?

A1: They can, but they'll be referred to the PG Volunteers' FAQ and How-to 
article(s), and asked to make the necessary corrections.

Q2: are your digitizers free to ignore any rule they dislike?

A2: Some rules/principles can be ignored in specific cases, e.g. line 
lengths can exceed 75 characters for highly structured material such as 
tables, poetry, and such-like.  There may be other occasions for bending the 
rules, but A1 above should be kept in mind.

Q3: are digitizers free to make the .txt files as bad as they choose?

A3: See A1


It is to be hoped that submitters realize that a plain text file can and 
should carry almost as much information as any other format.  Obviously, 
they can't carry illustrations or typeface info, but they can certainly 
carry all the words (in the vast majority of books, the only things that 
really count) and most, if not all, of the standard and near-standard 
emphasis indicators (e.g. underscores for italics).



----- Original Message ----- 
From: Bowerbird at aol.com
To: gutvol-d at lists.pglaf.org ; Bowerbird at aol.com
Sent: Friday, September 25, 2009 2:31 PM
Subject: [gutvol-d] Re: let's wind this down


jim said:
>   Yes, in the encoding of m-dash, ellipses, etc.

i was being sarcastic, jim.

just like you, i have railed about the way that d.p.
(mis)handles things like em-dashes and ellipses.


>   First of all, again, if this is important to PG
>   then why do they not properly index it
>   to the PG site?s search engine?

i dunno.  you'll have to ask p.g., not me.

but i agree their instructions suck.  i get
the impression that they don't even want
individual digitizers to do books any more.

they want to channel the labor over to d.p.,
which would be fine, except the d.p. workflow
wastes _so_ much volunteer time and energy.


>   Secondly, you refuse to read the immediately
>   preceding section FAQ#V.93 which makes it clear that
>   different volunteers have different priorities about
>   what ?plain text? means and how they will be willing
>   to support it and will be using different automatic
>   conversion tools and that some of the volunteers
>   (read: me) will be paying no weight to the desire of
>   other volunteers to make tools to do ?automatic
>   prettyprinting? from the ?plain text? whereas other
>   volunteers (yourself) are willing to insert ?ugliness?
>   into the plain text (their words not mine) in order to
>   better support prettyprinters such as you are proposing.

you can try and do all the doubletalk that you want, jim,
but the fact remains that there is a policy, and it is clear:
>   Underscores are now the effective standard for italics in PG texts.

your .txt version failed to meet the standard, jim.  face it.


>   Finally, you and others at PG are forgetting to heed
>   the closing words given there:
>   >   Getting a text on-line is the important thing;
>   >   which choices you [meaning me] make in doing so
>   >   is a matter of detail.

do you think this gives you the ok to ignore the italics?

not only did your .txt version fail to meet the standard,
but now you're telling us you don't have to meet that?

how about we get a ruling from the p.g. people on this?

are your digitizers free to ignore the italics if they like?

are your digitizers free to ignore any rule they dislike?


>   The choices *I* make as a volunteer are to put my
>   time and effort into doing ONE markup as well as I can
>   namely HTML, and as little time and effort as possible
>   on TXT files

and because you're putting so little time and effort
into your .txt files, they are coming out as inferior.

again, how about a ruling from the p.g. powers-that-be?

are digitizers free to make the .txt files as bad as they choose?

i will keep asking this question until it is answered, so
don't think that you can ignore it and it will just go away.

-bowerbird




_______________________________________________
gutvol-d mailing list
gutvol-d at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d 



From jimad at msn.com  Fri Sep 25 16:20:07 2009
From: jimad at msn.com (Jim Adcock)
Date: Fri, 25 Sep 2009 16:20:07 -0700
Subject: [gutvol-d] Re: why the plain-text format is the most useful
	format	for eliciting beauty (and more)
In-Reply-To: <cc9.5be6529a.37ee9310@aol.com>
References: <cc9.5be6529a.37ee9310@aol.com>
Message-ID: <SNT120-DS15E583AEF42393BD1D78BFAED90@phx.gbl>

>i have a whole slew of tools here, and am building more.
further, the format is so simple, authors can build tools too.

Once you have the tools done, I will try them, in spite of the fact that what I find over and over again is that tools touted by people on DP and PG 1) fail to even install correctly, 2) and when I try them they really don't do anything useful to help me make books.

If the tools prove to be useful, then I will happily put up with an ugly file coding format.



From Bowerbird at aol.com  Mon Sep 28 12:06:47 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 28 Sep 2009 15:06:47 EDT
Subject: [gutvol-d] 2 weeks in, princeton students hate kindle
Message-ID: <c44.32986529.37f26347@aol.com>

princeton students have been using the kindle...

>    after two weeks of use in three classes, 
>    the Daily Princetonian reports many are 
>    "dissatisfied and uncomfortable" with their 
>    e-readers, with one student calling it 
>    "a poor excuse of an academic tool." 
>    Most of the criticisms center around 
>    the Kindle's weak annotation features, 
>    which make things like highlighting and 
>    margin notes almost impossible to use, 
>    but even a simple thing like the lack of 
>    true page numbers has caused problems, 
>    since allowing students to cite the Kindle's
>    location numbers in their papers is 
>    "meaningless for anyone working from 
>    analog books."

oops...

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090928/b449d2d7/attachment.html>

From hart at pobox.com  Mon Sep 28 12:41:41 2009
From: hart at pobox.com (Michael S. Hart)
Date: Mon, 28 Sep 2009 12:41:41 -0700 (PDT)
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle
In-Reply-To: <c44.32986529.37f26347@aol.com>
References: <c44.32986529.37f26347@aol.com>
Message-ID: <alpine.DEB.2.00.0909281241070.12259@mail.pglaf.org>

I tried out the new Sonys yesterday, can't say I liked them at all.


mh

From Bowerbird at aol.com  Mon Sep 28 12:49:56 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 28 Sep 2009 15:49:56 EDT
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle
Message-ID: <c3f.58942a87.37f26d64@aol.com>

michael said:
>    I tried out the new Sonys yesterday, can't say I liked them at all.

the sony is no better than the kindle, for school use.

but for your own use, michael, why didn't you like it?
(i'm not surprised by that, but just would like specifics.)

-bowerbird
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090928/acafb4c5/attachment.html>

From jimad at msn.com  Tue Sep 29 11:57:29 2009
From: jimad at msn.com (Jim Adcock)
Date: Tue, 29 Sep 2009 11:57:29 -0700
Subject: [gutvol-d] Re: 2 weeks in, princeton students hate kindle
In-Reply-To: <c3f.58942a87.37f26d64@aol.com>
References: <c3f.58942a87.37f26d64@aol.com>
Message-ID: <SNT120-DS11967DCF3C48BFEEDE97BAED50@phx.gbl>

Not sure which of the Kindles the Princeton students got, but if they got
the Kindle DX, then they could be reading documents in PDF mode, in which
case the pages look the same as paper pages, including page numbers.  Re
marking up pages, yes Kindle allows one to make annotations, but the ability
is pretty weak.  Page number complaints re Kindle could just as well be page
number complaints for PG, since we seldom keep the page numbers. Just heard
an NPR report on students complaining about buying paper books for college
-- $150 new, $100 used, and $75 if rented for one semester.



From gbnewby at pglaf.org  Tue Sep 29 16:58:32 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Tue, 29 Sep 2009 16:58:32 -0700
Subject: [gutvol-d] yolink add-on
Message-ID: <20090929235832.GA17153@pglaf.org>

Did I already forward this information? Sorry if so.  This
is a search add-on.  The info is provided by the producers.
I've tried it, and it worked well:


Current search tools don???t help you find and analyze information
quickly. yolink is a unique and powerful free browser add-on which
takes search to the next level:

??? cut down search time by as much as 90% by reducing clicks

??? analyze search results quickly ??? content is delivered to you with
         keywords highlighted

??? enhance your search ??? yolink shines at searching through links and
         electronic documents for multiple terms and relationships

??? understand information in context ??? unlike typical search results
         that present a couple of lines of information, yolink
         displays your keywords in its full context so you can
         identify and associate related information

??? unlock the written word ??? use yolink???s ability to search for and
         display multiple keywords in context to quickly analyze books
         for themes, relationships, and quotes

The Web is all about hyperlinks and digital content. Today???s search
engines return lists of links that you need to click-through to find
information. Below is a link to a short video using yolink on a
Gutenberg.org electronic book

  http://www.yolink.com/yolink/media/gutenburglarge.jsp

yolink also provides a hosted archiving and collaboration platform
with its Save & Share feature. yolink Save & Share allows you to
quickly organize and share information, including accessing the
information from mobile devices such as the iPhone.  Download yolink
today at www.yolink.com



From schultzk at uni-trier.de  Wed Sep 30 01:26:26 2009
From: schultzk at uni-trier.de (Keith J. Schultz)
Date: Wed, 30 Sep 2009 10:26:26 +0200
Subject: [gutvol-d] Re: 2 weeks in,
	princeton students hate kindle(slightly OT)
In-Reply-To: <c44.32986529.37f26347@aol.com>
References: <c44.32986529.37f26347@aol.com>
Message-ID: <FCB51057-DB22-4D3E-9FB5-ED54998B612A@uni-trier.de>

Hi All,

	As a acedemic I can understand the problem with
	cite free form texts without any true text pages.

	But there are wayws to get around this. One can use
	chapters. Also, one might add information on how
	one formated the text to make things easier. Another method
	could also, be to use lines numbers.

	Similar problems arise when citing texst on the web.
	Texts on the web give rise to another problem:
	sometimes texts are for one reason or another no
	longer on the web!!

	All this makes it difficult for acedemics to cite correctly.
	For students it can even degrade thier papers as
	the can not truely cite in a correct manner thereby get
	a poorer rating.

	It also, makes it harder for someone  to research a hypothesis.
	Also, a research may not have a Kindle (e.g) to check the
	cite and its ramifications. One can always use other sources
	but that carries other problems with it.

	I personally would not consider the Kindle a acedemic
	toll, but a reading aide and research tool. Just as I would
	do the same for a computer. Yes, a computer can be an
	acedemic research tool if use properly.
	
	The Kindle was not developed as a research tool.
	Just because students are allowed to use them does
	not make them a acedemic tool. The idea was to reduce
	cost for  the students. For truely acedemic work one would
	use other sources than e-book readers. The only real
	reason to use them in the acedemic field for research
	is if they are the only source to get the information.
	This goes just the same for texts produced by PG.
	These texts can only be a starting point not the end.


	regards
		Keith.


	
Am 28.09.2009 um 21:06 schrieb Bowerbird at aol.com:

> princeton students have been using the kindle...
>
> >   after two weeks of use in three classes,
> >   the Daily Princetonian reports many are
> >   "dissatisfied and uncomfortable" with their
> >   e-readers, with one student calling it
> >   "a poor excuse of an academic tool."
> >   Most of the criticisms center around
> >   the Kindle's weak annotation features,
> >   which make things like highlighting and
> >   margin notes almost impossible to use,
> >   but even a simple thing like the lack of
> >   true page numbers has caused problems,
> >   since allowing students to cite the Kindle's
> >   location numbers in their papers is
> >   "meaningless for anyone working from
> >   analog books."
>
> oops...
>
> -bowerbird
> _______________________________________________
> gutvol-d mailing list
> gutvol-d at lists.pglaf.org
> http://lists.pglaf.org/mailman/listinfo/gutvol-d

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090930/ea28b282/attachment.html>

From answerwitch at gmail.com  Wed Sep 30 06:59:11 2009
From: answerwitch at gmail.com (Mjit RaindancerStahl)
Date: Wed, 30 Sep 2009 09:59:11 -0400
Subject: [gutvol-d] Re: 2 weeks in,
	princeton students hate 	kindle(slightly OT)
In-Reply-To: <FCB51057-DB22-4D3E-9FB5-ED54998B612A@uni-trier.de>
References: <c44.32986529.37f26347@aol.com>
	<FCB51057-DB22-4D3E-9FB5-ED54998B612A@uni-trier.de>
Message-ID: <2f9d57a0909300659i6f1ef650h819f722da695c0@mail.gmail.com>

Per APA citation guidelines: if you can't cite the page number, you cite the
paragraph number.

They are right that a format specific reference point is useless outside the
format.



-- 
Mjit RaindancerStahl
answerwitch at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pglaf.org/mailman/private/gutvol-d/attachments/20090930/dcc9b2ae/attachment-0001.html>

From gbnewby at pglaf.org  Wed Sep 30 14:42:35 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Wed, 30 Sep 2009 14:42:35 -0700
Subject: [gutvol-d] Text to speech service
Message-ID: <20090930214235.GB30174@pglaf.org>

FYI.  This is not free software, but seems interesting for
folks looking to do online text to speech conversion:

 From: Joe messanella <j.messanella at gmail.com>
 To: gbnewby at pglaf.org
 Subject: Re: text to speech solution blurb

The web is essential for most of us , unfortunately 20% of the US
population have various reading issues!  2 years ago I experienced a
racing bike accident, yes, over the handle bars and onto my head.  For
6 weeks my balance was compromised, pronunciation was inconsistent, my
world - gray!  I had limited tolerance for "computers," 6 months later
the effects could still be felt.  Still, as unfortunate as that may
seem, I may now a better person for it.  At least my sense of purpose
has been renewed.  I will soon be working for ( www.voice-corp.com ), a
small company that offers inexpensive "service based" text to speech
solutions, web visitors just click on the listen icon and a player
reads the web text or downloads it in mp3 format.  I did not mention my
accident during my interview however! Would it matter?

  joe.messanella at voice-corp.com